* [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support
@ 2016-06-22 17:03 Morten Rasmussen
  2016-06-22 17:03 ` [PATCH v2 01/13] sched: Fix power to capacity renaming in comment Morten Rasmussen
                   ` (16 more replies)
  0 siblings, 17 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

Hi,

The scheduler currently does little to help performance on systems with
asymmetric compute capacities (read: ARM big.LITTLE). This series improves
the situation with a few tweaks, mainly to the task wake-up path, so that
compute capacity is considered at wake-up and not just whether a cpu is
idle on these systems. This gives us consistent, and potentially higher,
throughput in partially utilized scenarios. SMP behaviour and performance
should be unaffected.

Test 0:
	for i in `seq 1 10`; \
	       do sysbench --test=cpu --max-time=3 --num-threads=1 run; \
	       done \
	| awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
	       END {print "Average events: " sum/runs}'

Target: ARM TC2 (2xA15+3xA7)

	(Higher is better)
tip:	Average events: 146.9
patch:	Average events: 217.9

Test 1:
	perf stat --null --repeat 10 -- \
	perf bench sched messaging -g 50 -l 5000

Target: Intel IVB-EP (2*10*2)

tip:    4.861970420 seconds time elapsed ( +-  1.39% )
patch:  4.886204224 seconds time elapsed ( +-  0.75% )

Target: ARM TC2 A7-only (3xA7) (-l 1000)

tip:    61.485682596 seconds time elapsed ( +-  0.07% )
patch:  62.667950130 seconds time elapsed ( +-  0.36% )

More analysis:

Statistics from a mixed periodic task workload (rt-app) containing both
big and little tasks, single run on ARM TC2:

tu   = Task utilization big/little
pcpu = Previous cpu big/little
tcpu = This (waker) cpu big/little
dl   = New cpu is little
db   = New cpu is big
sis  = New cpu chosen by select_idle_sibling()
figc = New cpu chosen by find_idlest_*()
ww   = wake_wide(task) count for figc wakeups
bw   = sd_flag & SD_BALANCE_WAKE (non-fork/exec wake)
       for figc wakeups

case tu   pcpu tcpu   dl   db  sis figc   ww   bw
1    l    l    l     122   68   28  162  161  161
2    l    l    b      11    4    0   15   15   15
3    l    b    l       0  252    8  244  244  244
4    l    b    b      36 1928  711 1253 1016 1016
5    b    l    l       5   19    0   24   22   24
6    b    l    b       5    1    0    6    0    6
7    b    b    l       0   31    0   31   31   31
8    b    b    b       1  194  109   86   59   59
--------------------------------------------------
                     180 2497  856 1821
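
To make the columns concrete, reading case 4 as an example: little tasks
whose previous cpu and waker cpu are both big saw 36 + 1928 = 1964
wake-ups; 36 were placed on a little cpu and 1928 on a big cpu, with 711
placements coming from select_idle_sibling() and 1253 from
find_idlest_*(). Of the latter, 1016 had wake_wide() return true and 1016
were SD_BALANCE_WAKE (non-fork/exec) wake-ups.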

Cases 1-4 + 8 are fine to be served by select_idle_sibling() as both
this_cpu and prev_cpu are suitable cpus for the task. However, as the
figc column reveals, those cases are often served by find_idlest_*()
anyway due to wake_wide() sending the wakeup that way when
SD_BALANCE_WAKE is set on the sched_domains.

Pulling in the wakee_flip patch from v1 (dropped in v2) shifts a
significant share of the wakeups from figc to sis:

case tu   pcpu tcpu   dl   db  sis figc   ww   bw
1    l    l    l     537    8  537    8    6    6
2    l    l    b      49   11   32   28   28   28
3    l    b    l       4  323  322    5    5    5
4    l    b    b       1 1910 1209  702  458  456
5    b    l    l       0    5    0    5    1    5
6    b    l    b       0    0    0    0    0    0
7    b    b    l       0   32    0   32    2   32
8    b    b    b       0  198  168   30   13   13
--------------------------------------------------
                     591 2487 2268  810

Notes:

Active migration of tasks away from small capacity cpus isn't addressed
in this set although it is necessary for consistent throughput in other
scenarios on asymmetric cpu capacity systems.

The infrastructure to enable capacity awareness for arm64 is not provided here
but will be based on Juri's DT bindings patch set [1]. A combined preview
branch is available [2].

[1] https://lkml.org/lkml/2016/6/15/291
[2] git://linux-arm.org/linux-power.git capacity_awareness_v2_arm64_v1

Patch   1-3: Generic fixes and clean-ups.
Patch  4-11: Improve capacity awareness.
Patch 12-13: Arch features for arm to enable asymmetric capacity support.

v2:

- Dropped the patch ignoring wakee_flips for pid=0 for now, as we cannot
  distinguish cpu time spent processing irqs from idle time.

- Dropped disabling WAKE_AFFINE as suggested by Vincent to allow more
  scenarios to use fast-path (select_idle_sibling()). Asymmetric wake
  conditions adjusted accordingly.

- Changed use of new SD_ASYM_CPUCAPACITY slightly. Now enables
  SD_BALANCE_WAKE.

- Minor clean-ups and rebased to more recent tip/sched/core.

v1: https://lkml.org/lkml/2014/5/23/621

Dietmar Eggemann (1):
  sched: Store maximum per-cpu capacity in root domain

Morten Rasmussen (12):
  sched: Fix power to capacity renaming in comment
  sched/fair: Consistent use of prev_cpu in wakeup path
  sched/fair: Optimize find_idlest_cpu() when there is no choice
  sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems
  sched/fair: Let asymmetric cpu configurations balance at wake-up
  sched/fair: Compute task/cpu utilization at wake-up more correctly
  sched/fair: Consider spare capacity in find_idlest_group()
  sched: Add per-cpu max capacity to sched_group_capacity
  sched/fair: Avoid pulling tasks from non-overloaded higher capacity
    groups
  arm: Set SD_ASYM_CPUCAPACITY for big.LITTLE platforms
  arm: Update arch_scale_cpu_capacity() to reflect change to define

 arch/arm/include/asm/topology.h |   5 +
 arch/arm/kernel/topology.c      |  25 ++++-
 include/linux/sched.h           |   3 +-
 kernel/sched/core.c             |  21 +++-
 kernel/sched/fair.c             | 212 +++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h            |   5 +-
 6 files changed, 241 insertions(+), 30 deletions(-)

-- 
1.9.1

* [PATCH v2 01/13] sched: Fix power to capacity renaming in comment
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
@ 2016-06-22 17:03 ` Morten Rasmussen
  2016-08-10 18:03   ` [tip:sched/core] sched/core: " tip-bot for Morten Rasmussen
  2016-06-22 17:03 ` [PATCH v2 02/13] sched/fair: Consistent use of prev_cpu in wakeup path Morten Rasmussen
                   ` (15 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

It seems that this one escaped Nico's renaming of cpu_power to
cpu_capacity a while back.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index dee41bf59e6b..d31a3ca1cd64 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1020,7 +1020,7 @@ extern void wake_up_q(struct wake_q_head *head);
 #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
 #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
-#define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu power */
+#define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu capacity */
 #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */
-- 
1.9.1

* [PATCH v2 02/13] sched/fair: Consistent use of prev_cpu in wakeup path
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
  2016-06-22 17:03 ` [PATCH v2 01/13] sched: Fix power to capacity renaming in comment Morten Rasmussen
@ 2016-06-22 17:03 ` Morten Rasmussen
  2016-06-22 18:04   ` Rik van Riel
  2016-08-10 18:03   ` [tip:sched/core] sched/fair: Make the use of prev_cpu consistent in the " tip-bot for Morten Rasmussen
  2016-06-22 17:03 ` [PATCH v2 03/13] sched/fair: Optimize find_idlest_cpu() when there is no choice Morten Rasmussen
                   ` (14 subsequent siblings)
  16 siblings, 2 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen, Rik van Riel

In commit ac66f5477239 ("sched/numa: Introduce migrate_swap()")
select_task_rq() got a 'cpu' argument to enable overriding of prev_cpu
in special cases (NUMA task swapping). However, the
select_task_rq_fair() helper functions: wake_affine() and
select_idle_sibling(), still use task_cpu(p) directly to work out
prev_cpu which leads to inconsistencies.

This patch passes prev_cpu (potentially overridden by NUMA code) into
the helper functions to ensure prev_cpu is indeed the same cpu
everywhere in the wakeup path.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
cc: Rik van Riel <riel@redhat.com>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c6dd8bab010c..eec8e29104f9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -656,7 +656,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_SMP
-static int select_idle_sibling(struct task_struct *p, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
 
 /*
@@ -1483,7 +1483,8 @@ static void task_numa_compare(struct task_numa_env *env,
 	 * Call select_idle_sibling to maybe find a better one.
 	 */
 	if (!cur)
-		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
+						   env->dst_cpu);
 
 assign:
 	task_numa_assign(env, cur, imp);
@@ -4985,18 +4986,18 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
-static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
+static int wake_affine(struct sched_domain *sd, struct task_struct *p,
+		       int prev_cpu, int sync)
 {
 	s64 this_load, load;
 	s64 this_eff_load, prev_eff_load;
-	int idx, this_cpu, prev_cpu;
+	int idx, this_cpu;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
 
 	idx	  = sd->wake_idx;
 	this_cpu  = smp_processor_id();
-	prev_cpu  = task_cpu(p);
 	load	  = source_load(prev_cpu, idx);
 	this_load = target_load(this_cpu, idx);
 
@@ -5161,11 +5162,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target)
 {
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i = task_cpu(p);
 
 	if (idle_cpu(target))
 		return target;
@@ -5173,8 +5173,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	/*
 	 * If the prevous cpu is cache affine and idle, don't be stupid.
 	 */
-	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
-		return i;
+	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+		return prev;
 
 	/*
 	 * Otherwise, iterate the domains and find an eligible idle cpu.
@@ -5195,6 +5195,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
+			int i;
+
 			if (!cpumask_intersects(sched_group_cpus(sg),
 						tsk_cpus_allowed(p)))
 				goto next;
@@ -5303,13 +5305,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (affine_sd) {
 		sd = NULL; /* Prefer wake_affine over balance flags */
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+		if (cpu != prev_cpu && wake_affine(affine_sd, p, prev_cpu, sync))
 			new_cpu = cpu;
 	}
 
 	if (!sd) {
 		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
-			new_cpu = select_idle_sibling(p, new_cpu);
+			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
 	} else while (sd) {
 		struct sched_group *group;
-- 
1.9.1

* [PATCH v2 03/13] sched/fair: Optimize find_idlest_cpu() when there is no choice
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
  2016-06-22 17:03 ` [PATCH v2 01/13] sched: Fix power to capacity renaming in comment Morten Rasmussen
  2016-06-22 17:03 ` [PATCH v2 02/13] sched/fair: Consistent use of prev_cpu in wakeup path Morten Rasmussen
@ 2016-06-22 17:03 ` Morten Rasmussen
  2016-07-13 12:20   ` Vincent Guittot
  2016-08-10 18:03   ` [tip:sched/core] " tip-bot for Morten Rasmussen
  2016-06-22 17:03 ` [PATCH v2 04/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Morten Rasmussen
                   ` (13 subsequent siblings)
  16 siblings, 2 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

In the current find_idlest_group()/find_idlest_cpu() search we end up
calling find_idlest_cpu() in a sched_group containing only one cpu in
the end. Checking idle-states becomes pointless when there is no
alternative, so bail out instead.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eec8e29104f9..216db302e87d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5123,6 +5123,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	int shallowest_idle_cpu = -1;
 	int i;
 
+	/* Check if we have any choice */
+	if (group->group_weight == 1)
+		return cpumask_first(sched_group_cpus(group));
+
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
 		if (idle_cpu(i)) {
-- 
1.9.1

* [PATCH v2 04/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (2 preceding siblings ...)
  2016-06-22 17:03 ` [PATCH v2 03/13] sched/fair: Optimize find_idlest_cpu() when there is no choice Morten Rasmussen
@ 2016-06-22 17:03 ` Morten Rasmussen
  2016-07-11  9:55   ` Peter Zijlstra
  2016-06-22 17:03 ` [PATCH v2 05/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems Morten Rasmussen
                   ` (12 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

Add a topology flag to the sched_domain hierarchy indicating that
sched_groups at this sched_domain level have different per-cpu capacities
(e.g. big-only and little-only groups on big.LITTLE), or that groups in
sibling domains differ in capacity. In other words, the flag is set on
domains that do not span cpus of all available capacities, up to and
including the first level that spans cpus of all capacities available
system-wide. This information is currently only available by iterating
over the capacities of all the cpus at parent levels in the sched_domain
hierarchy.

SD 2      [  0      1      2      3]  SD_ASYM_CPUCAPACITY

SD 1      [  0      1] [   2      3]  SD_ASYM_CPUCAPACITY

cpu:         0      1      2      3
capacity:  756    756   1024   1024
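
For comparison, on a symmetric system where all cpus have capacity 1024,
no sched_domain level would satisfy the definition above and the flag
would not be set anywhere in the hierarchy.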

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h | 1 +
 kernel/sched/core.c   | 6 +++++-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d31a3ca1cd64..62c3584fc69a 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1020,6 +1020,7 @@ extern void wake_up_q(struct wake_q_head *head);
 #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
 #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
+#define SD_ASYM_CPUCAPACITY	0x0040  /* Groups have only a subset of cpu capacities */
 #define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu capacity */
 #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 385c947482e1..351609279341 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5677,6 +5677,7 @@ static int sd_degenerate(struct sched_domain *sd)
 			 SD_BALANCE_FORK |
 			 SD_BALANCE_EXEC |
 			 SD_SHARE_CPUCAPACITY |
+			 SD_ASYM_CPUCAPACITY |
 			 SD_SHARE_PKG_RESOURCES |
 			 SD_SHARE_POWERDOMAIN)) {
 		if (sd->groups != sd->groups->next)
@@ -5707,6 +5708,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_BALANCE_NEWIDLE |
 				SD_BALANCE_FORK |
 				SD_BALANCE_EXEC |
+				SD_ASYM_CPUCAPACITY |
 				SD_SHARE_CPUCAPACITY |
 				SD_SHARE_PKG_RESOURCES |
 				SD_PREFER_SIBLING |
@@ -6321,14 +6323,16 @@ static int sched_domains_curr_level;
  * SD_NUMA                - describes NUMA topologies
  * SD_SHARE_POWERDOMAIN   - describes shared power domain
  *
- * Odd one out:
+ * Odd ones out:
  * SD_ASYM_PACKING        - describes SMT quirks
+ * SD_ASYM_CPUCAPACITY    - describes mixed capacity topologies
  */
 #define TOPOLOGY_SD_FLAGS		\
 	(SD_SHARE_CPUCAPACITY |		\
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA |			\
 	 SD_ASYM_PACKING |		\
+	 SD_ASYM_CPUCAPACITY |		\
 	 SD_SHARE_POWERDOMAIN)
 
 static struct sched_domain *
-- 
1.9.1

* [PATCH v2 05/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (3 preceding siblings ...)
  2016-06-22 17:03 ` [PATCH v2 04/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Morten Rasmussen
@ 2016-06-22 17:03 ` Morten Rasmussen
  2016-07-11 10:04   ` Peter Zijlstra
  2016-06-22 17:03 ` [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain Morten Rasmussen
                   ` (11 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

A sched_domain with the SD_ASYM_CPUCAPACITY flag set indicates that
sched_groups at this level or below do not include cpus of all available
capacities (e.g. groups containing only little or only big cpus on
big.LITTLE systems). It is therefore necessary to put more effort into
finding an appropriate cpu at task wake-up by enabling balancing at
wake-up (SD_BALANCE_WAKE).

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 351609279341..fe39118ffdfb 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6397,6 +6397,9 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	 * Convert topological properties into behaviour.
 	 */
 
+	if (sd->flags & SD_ASYM_CPUCAPACITY)
+		sd->flags |= SD_BALANCE_WAKE;
+
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
 		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
-- 
1.9.1

* [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (4 preceding siblings ...)
  2016-06-22 17:03 ` [PATCH v2 05/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems Morten Rasmussen
@ 2016-06-22 17:03 ` Morten Rasmussen
  2016-07-11 10:18   ` Peter Zijlstra
  2016-07-13 12:40   ` Vincent Guittot
  2016-06-22 17:03 ` [PATCH v2 07/13] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
                   ` (10 subsequent siblings)
  16 siblings, 2 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

To be able to compare the capacity of the target cpu with the highest
available cpu capacity, store the maximum per-cpu capacity in the root
domain.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c  | 9 +++++++++
 kernel/sched/sched.h | 2 ++
 2 files changed, 11 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index fe39118ffdfb..5093765e9930 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6855,6 +6855,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 	enum s_alloc alloc_state;
 	struct sched_domain *sd;
 	struct s_data d;
+	struct rq *rq = NULL;
 	int i, ret = -ENOMEM;
 
 	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
@@ -6905,11 +6906,19 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 	/* Attach the domains */
 	rcu_read_lock();
 	for_each_cpu(i, cpu_map) {
+		rq = cpu_rq(i);
 		sd = *per_cpu_ptr(d.sd, i);
 		cpu_attach_domain(sd, d.rd, i);
+
+		if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity)
+			rq->rd->max_cpu_capacity = rq->cpu_capacity_orig;
 	}
 	rcu_read_unlock();
 
+	if (rq)
+		pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
+			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
+
 	ret = 0;
 error:
 	__free_domain_allocs(&d, alloc_state, cpu_map);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 72f1f3087b04..3e9904ef224f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -564,6 +564,8 @@ struct root_domain {
 	 */
 	cpumask_var_t rto_mask;
 	struct cpupri cpupri;
+
+	unsigned long max_cpu_capacity;
 };
 
 extern struct root_domain def_root_domain;
-- 
1.9.1

* [PATCH v2 07/13] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (5 preceding siblings ...)
  2016-06-22 17:03 ` [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain Morten Rasmussen
@ 2016-06-22 17:03 ` Morten Rasmussen
  2016-07-11 11:13   ` Peter Zijlstra
  2016-07-13 12:56   ` Vincent Guittot
  2016-06-22 17:03 ` [PATCH v2 08/13] sched/fair: Compute task/cpu utilization at wake-up more correctly Morten Rasmussen
                   ` (9 subsequent siblings)
  16 siblings, 2 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
configurations SD_WAKE_AFFINE is only desirable if the waking task's
compute demand (utilization) is suitable for all the cpu capacities
available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
balancing take over (find_idlest_{group, cpu}()).

This patch makes affine wake-ups conditional on whether both the waker
cpu and prev_cpu have sufficient capacity for the waking task.

It is assumed that the sched_group(s) containing the waker cpu and
prev_cpu only contain cpus with the same capacity (homogeneous).

Ideally, we shouldn't set 'want_affine' in the first place, but we don't
know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
traversing them.
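
As a worked example of the wake_cap() check below, using the 756/1024
capacities from the SD_ASYM_CPUCAPACITY example in patch 4,
capacity_margin = 1280, and made-up task utilizations: if either the
waker cpu or prev_cpu is little, min_cap = 756 and max_cap = 1024, so
max_cap - min_cap = 268 >= max_cap >> 3 = 128 and the task is checked.
A task with util_avg = 600 gives 756 * 1024 = 774144 >= 600 * 1280 =
768000, so wake_cap() returns 0 and an affine wake-up is still
considered. A task with util_avg = 700 gives 700 * 1280 = 896000 >
774144, so wake_cap() returns 1 and wakeup balancing
(find_idlest_{group,cpu}()) takes over. If both cpus are big,
max_cap - min_cap = 0 < 128 and wake_cap() returns 0 without looking at
the task.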

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
 1 file changed, 27 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 216db302e87d..dba02c7b57b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
 unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
 #endif
 
+/*
+ * The margin used when comparing utilization with cpu capacity:
+ * util * 1024 < capacity * margin
+ */
+unsigned int capacity_margin = 1280; /* ~20% */
+
 static inline void update_load_add(struct load_weight *lw, unsigned long inc)
 {
 	lw->weight += inc;
@@ -5260,6 +5266,25 @@ static int cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }
 
+static inline int task_util(struct task_struct *p)
+{
+	return p->se.avg.util_avg;
+}
+
+static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
+{
+	long min_cap, max_cap;
+
+	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
+	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
+
+	/* Minimum capacity is close to max, no need to abort wake_affine */
+	if (max_cap - min_cap < max_cap >> 3)
+		return 0;
+
+	return min_cap * 1024 < task_util(p) * capacity_margin;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5283,7 +5308,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
-		want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
+		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
+			      && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
 	}
 
 	rcu_read_lock();
-- 
1.9.1

* [PATCH v2 08/13] sched/fair: Compute task/cpu utilization at wake-up more correctly
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (6 preceding siblings ...)
  2016-06-22 17:03 ` [PATCH v2 07/13] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
@ 2016-06-22 17:03 ` Morten Rasmussen
  2016-06-22 17:03 ` [PATCH v2 09/13] sched/fair: Consider spare capacity in find_idlest_group() Morten Rasmussen
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

At task wake-up load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking which may have
been updated in the meantime. Basically, the task's own utilization
hasn't yet accounted for the sleep decay, while the cfs_rq may have
(partially). Estimating the cfs_rq utilization in case the task is
migrated at wake-up as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg
is therefore incorrect as the two load-tracking signals aren't time
synchronized (different last update).

To solve this problem, this patch introduces task_util_wake() which
computes the decayed task utilization based on the previous cpu's last
load-tracking update. It is done without having to take the rq lock,
similar to how it is done in remove_entity_load_avg().
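
As a rough worked example (using the standard PELT geometric series in
which a contribution halves every 32 periods of ~1ms, and LOAD_AVG_MAX =
47742 as noted in the comment below): a sleeping task whose se->avg was
last updated 32ms before prev_cpu's cfs_rq ends up with delta ~= 32
after the '>> 20' conversion from ns to ~1ms periods, so its util_sum is
roughly halved by decay_load() before being divided by LOAD_AVG_MAX. A
task that went to sleep with util_avg ~= 512 would thus be estimated at
util_avg ~= 256 at wake-up.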

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 69 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dba02c7b57b3..2874aeb08fb4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5271,6 +5271,75 @@ static inline int task_util(struct task_struct *p)
 	return p->se.avg.util_avg;
 }
 
+/*
+ * task_util_wake: Returns an updated estimate of the utilization contribution
+ * of a waking task. At wake-up the task blocked utilization contribution
+ * (cfs_rq->avg) may have decayed while the utilization tracking of the task
+ * (se->avg) hasn't yet.
+ * Note that this estimate isn't perfectly accurate as the 1ms boundaries used
+ * for updating util_avg in __update_load_avg() are not considered here. This
+ * results in an error of up to 1ms utilization decay/accumulation which leads
+ * to an absolute util_avg error margin of 1024*1024/LOAD_AVG_MAX ~= 22
+ * (for LOAD_AVG_MAX = 47742).
+ */
+static inline int task_util_wake(struct task_struct *p)
+{
+	struct cfs_rq *prev_cfs_rq = &task_rq(p)->cfs;
+	struct sched_avg *psa = &p->se.avg;
+	u64 cfs_rq_last_update, p_last_update, delta;
+	u32 util_decayed;
+
+	p_last_update = psa->last_update_time;
+
+	/*
+	 * Task on rq (exec()) should be load-tracking aligned already.
+	 * New tasks have no history and should use the init value.
+	 */
+	if (p->se.on_rq || !p_last_update)
+		return task_util(p);
+
+	cfs_rq_last_update = cfs_rq_last_update_time(prev_cfs_rq);
+	delta = cfs_rq_last_update - p_last_update;
+
+	if ((s64)delta <= 0)
+		return task_util(p);
+
+	delta >>= 20;
+
+	if (!delta)
+		return task_util(p);
+
+	util_decayed = decay_load((u64)psa->util_sum, delta);
+	util_decayed /= LOAD_AVG_MAX;
+
+	/*
+	 * psa->util_avg can be slightly out of date as it is only updated
+	 * when a 1ms boundary is crossed.
+	 * See 'decayed' in __update_load_avg()
+	 */
+	util_decayed = min_t(unsigned long, util_decayed, task_util(p));
+
+	return util_decayed;
+}
+
+/*
+ * cpu_util_wake: Compute cpu utilization with any contributions from
+ * the waking task p removed.
+ */
+static int cpu_util_wake(int cpu, struct task_struct *p)
+{
+	unsigned long util, capacity;
+
+	/* Task has no contribution or is new */
+	if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
+		return cpu_util(cpu);
+
+	capacity = capacity_orig_of(cpu);
+	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util_wake(p), 0);
+
+	return (util >= capacity) ? capacity : util;
+}
+
 static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 {
 	long min_cap, max_cap;
-- 
1.9.1

* [PATCH v2 09/13] sched/fair: Consider spare capacity in find_idlest_group()
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (7 preceding siblings ...)
  2016-06-22 17:03 ` [PATCH v2 08/13] sched/fair: Compute task/cpu utilization at wake-up more correctly Morten Rasmussen
@ 2016-06-22 17:03 ` Morten Rasmussen
  2016-06-22 17:03 ` [PATCH v2 10/13] sched: Add per-cpu max capacity to sched_group_capacity Morten Rasmussen
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

In low-utilization scenarios, comparing relative loads in
find_idlest_group() doesn't always lead to the optimal choice. Systems
with groups containing different numbers of cpus and/or cpus of different
compute capacity are significantly better off when considering spare
capacity rather than relative load in those scenarios.

In addition to the existing load-based search, an alternative
spare-capacity-based candidate sched_group is found and selected instead
if sufficient spare capacity exists. If not, existing behaviour is
preserved.
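
As a worked example of the selection criteria added below (assuming
sd->imbalance_pct = 125, i.e. imbalance = 112, and made-up numbers): for
a waking task with task_util = 200, a local group whose best cpu has 150
spare capacity and a remote group whose best cpu has 400 spare capacity,
this_spare = 150 > 200/2 but 112 * 150 = 16800 < 100 * 400 = 40000, so
NULL is not returned; most_spare = 400 > 200/2, so the remote group is
returned and the task is placed by spare capacity rather than by
relative load.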

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 46 +++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 41 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2874aeb08fb4..9c9b837742f6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5058,6 +5058,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return 1;
 }
 
+static inline int task_util(struct task_struct *p);
+static int cpu_util_wake(int cpu, struct task_struct *p);
+
+static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
+{
+	return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+}
+
 /*
  * find_idlest_group finds and returns the least busy CPU group within the
  * domain.
@@ -5067,7 +5075,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
+	struct sched_group *most_spare_sg = NULL;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long most_spare = 0, this_spare = 0;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -5075,7 +5085,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load;
+		unsigned long load, avg_load, spare_cap, max_spare_cap;
 		int local_group;
 		int i;
 
@@ -5087,8 +5097,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_cpus(group));
 
-		/* Tally up the load of all CPUs in the group */
+		/*
+		 * Tally up the load of all CPUs in the group and find
+		 * the group containing the cpu with most spare capacity.
+		 */
 		avg_load = 0;
+		max_spare_cap = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -5098,6 +5112,13 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 				load = target_load(i, load_idx);
 
 			avg_load += load;
+
+			spare_cap = capacity_spare_wake(i, p);
+
+			if (spare_cap > max_spare_cap &&
+			    spare_cap > capacity_of(i) >> 3) {
+				max_spare_cap = spare_cap;
+			}
 		}
 
 		/* Adjust by relative CPU capacity of the group */
@@ -5105,12 +5126,27 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		if (local_group) {
 			this_load = avg_load;
-		} else if (avg_load < min_load) {
-			min_load = avg_load;
-			idlest = group;
+			this_spare = max_spare_cap;
+		} else {
+			if (avg_load < min_load) {
+				min_load = avg_load;
+				idlest = group;
+			}
+
+			if (most_spare < max_spare_cap) {
+				most_spare = max_spare_cap;
+				most_spare_sg = group;
+			}
 		}
 	} while (group = group->next, group != sd->groups);
 
+	/* Found a significant amount of spare capacity. */
+	if (this_spare > task_util(p) / 2 &&
+	    imbalance*this_spare > 100*most_spare)
+		return NULL;
+	else if (most_spare > task_util(p) / 2)
+		return most_spare_sg;
+
 	if (!idlest || 100*this_load < imbalance*min_load)
 		return NULL;
 	return idlest;
-- 
1.9.1

* [PATCH v2 10/13] sched: Add per-cpu max capacity to sched_group_capacity
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (8 preceding siblings ...)
  2016-06-22 17:03 ` [PATCH v2 09/13] sched/fair: Consider spare capacity in find_idlest_group() Morten Rasmussen
@ 2016-06-22 17:03 ` Morten Rasmussen
  2016-06-22 17:03 ` [PATCH v2 11/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups Morten Rasmussen
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

struct sched_group_capacity currently represents the compute capacity
sum of all cpus in the sched_group. Unless it is divided by the
group_weight to get the average capacity per cpu, it hides differences in
cpu capacity for mixed-capacity systems (e.g. high RT/IRQ utilization or
ARM big.LITTLE). But even the average may not be sufficient if the group
covers cpus of different capacities. Instead, extend struct
sched_group_capacity to indicate the max per-cpu capacity in the group,
so that a suitable group for a given task utilization can easily be
found and cpus with reduced capacity can be avoided for tasks with high
utilization (not implemented by this patch).
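
For example, a group with two cpus of capacity 756 and two of capacity
1024 has sgc->capacity = 3560 and an average of 890 per cpu, neither of
which reveals that the group contains 1024-capacity cpus;
sgc->max_capacity = 1024 does.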

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c  |  3 ++-
 kernel/sched/fair.c  | 17 ++++++++++++-----
 kernel/sched/sched.h |  3 ++-
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 5093765e9930..06cd7e4a81a2 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5617,7 +5617,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 		printk(KERN_CONT " %*pbl",
 		       cpumask_pr_args(sched_group_cpus(group)));
 		if (group->sgc->capacity != SCHED_CAPACITY_SCALE) {
-			printk(KERN_CONT " (cpu_capacity = %d)",
+			printk(KERN_CONT " (cpu_capacity = %lu)",
 				group->sgc->capacity);
 		}
 
@@ -6086,6 +6086,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		 * die on a /0 trap.
 		 */
 		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+		sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
 
 		/*
 		 * Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9c9b837742f6..4d10d022006d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6625,13 +6625,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
 	cpu_rq(cpu)->cpu_capacity = capacity;
 	sdg->sgc->capacity = capacity;
+	sdg->sgc->max_capacity = capacity;
 }
 
 void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity;
+	unsigned long capacity, max_capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -6644,6 +6645,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 	}
 
 	capacity = 0;
+	max_capacity = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -6668,11 +6670,12 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 */
 			if (unlikely(!rq->sd)) {
 				capacity += capacity_of(cpu);
-				continue;
+			} else {
+				sgc = rq->sd->groups->sgc;
+				capacity += sgc->capacity;
 			}
 
-			sgc = rq->sd->groups->sgc;
-			capacity += sgc->capacity;
+			max_capacity = max(capacity, max_capacity);
 		}
 	} else  {
 		/*
@@ -6682,12 +6685,16 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 		group = child->groups;
 		do {
-			capacity += group->sgc->capacity;
+			struct sched_group_capacity *sgc = group->sgc;
+
+			capacity += sgc->capacity;
+			max_capacity = max(sgc->max_capacity, max_capacity);
 			group = group->next;
 		} while (group != child->groups);
 	}
 
 	sdg->sgc->capacity = capacity;
+	sdg->sgc->max_capacity = max_capacity;
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3e9904ef224f..0cdb52168984 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -868,7 +868,8 @@ struct sched_group_capacity {
 	 * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
 	 * for a single CPU.
 	 */
-	unsigned int capacity;
+	unsigned long capacity;
+	unsigned long max_capacity; /* Max per-cpu capacity in group */
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 	/*
-- 
1.9.1

* [PATCH v2 11/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (9 preceding siblings ...)
  2016-06-22 17:03 ` [PATCH v2 10/13] sched: Add per-cpu max capacity to sched_group_capacity Morten Rasmussen
@ 2016-06-22 17:03 ` Morten Rasmussen
  2016-06-23 21:20   ` Sai Gurrappadi
  2016-07-12 12:59   ` Peter Zijlstra
  2016-06-22 17:03 ` [PATCH v2 12/13] arm: Set SD_ASYM_CPUCAPACITY for big.LITTLE platforms Morten Rasmussen
                   ` (5 subsequent siblings)
  16 siblings, 2 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen

For asymmetric cpu capacity systems it is counter-productive for
throughput if low-capacity cpus are pulling tasks from non-overloaded
cpus with higher capacity. The assumption is that running on a cpu with
higher capacity is preferred over running alone in a group with lower
cpu capacity.

This patch rejects higher cpu capacity groups with one or fewer tasks
per cpu as a potential busiest group, which could otherwise lead to a
series of failing load-balancing attempts and eventually a forced
migration.
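
As a worked example with capacity_margin = 1280 and the 756/1024
capacities from patch 4: when a little cpu (local group max_capacity =
756) considers a big candidate group (max_capacity = 1024),
756 * 1280 = 967680 < 1024 * 1024 = 1048576, so
group_smaller_cpu_capacity(sds->local, sg) is true. If the big group
also has sum_nr_running <= group_weight, it is not picked as busiest and
the little cpu will not pull from it.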

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 24 ++++++++++++++++++++++++
 1 file changed, 24 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4d10d022006d..ca0048d95b3f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6789,6 +6789,17 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 	return false;
 }
 
+/*
+ * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-cpu capacity than sched_group ref.
+ */
+static inline bool
+group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+	return sg->sgc->max_capacity * capacity_margin <
+						ref->sgc->max_capacity * 1024;
+}
+
 static inline enum
 group_type group_classify(struct sched_group *group,
 			  struct sg_lb_stats *sgs)
@@ -6892,6 +6903,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->avg_load <= busiest->avg_load)
 		return false;
 
+	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
+		goto asym_packing;
+
+	/* Candidate sg has no more than one task per cpu and has
+	 * higher per-cpu capacity. Migrating tasks to less capable
+	 * cpus may harm throughput. Maximize throughput,
+	 * power/energy consequences are not considered.
+	 */
+	if (sgs->sum_nr_running <= sgs->group_weight &&
+	    group_smaller_cpu_capacity(sds->local, sg))
+		return false;
+
+asym_packing:
 	/* This is the busiest node in its class. */
 	if (!(env->sd->flags & SD_ASYM_PACKING))
 		return true;
-- 
1.9.1

* [PATCH v2 12/13] arm: Set SD_ASYM_CPUCAPACITY for big.LITTLE platforms
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (10 preceding siblings ...)
  2016-06-22 17:03 ` [PATCH v2 11/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups Morten Rasmussen
@ 2016-06-22 17:03 ` Morten Rasmussen
  2016-06-22 17:03 ` [PATCH v2 13/13] arm: Update arch_scale_cpu_capacity() to reflect change to define Morten Rasmussen
                   ` (4 subsequent siblings)
  16 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen, Russell King

Set the SD_ASYM_CPUCAPACITY flag for DIE and MC level sched_domains
on big.LITTLE systems.

cc: Russell King <linux@arm.linux.org.uk>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/arm/kernel/topology.c | 23 ++++++++++++++++++++---
 1 file changed, 20 insertions(+), 3 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index ec279d161b32..ab6d9c1835a4 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -78,6 +78,7 @@ static unsigned long *__cpu_capacity;
 #define cpu_capacity(cpu)	__cpu_capacity[cpu]
 
 static unsigned long middle_capacity = 1;
+static unsigned int big_little;
 
 /*
  * Iterate all CPUs' descriptor in DT and compute the efficiency
@@ -151,6 +152,8 @@ static void __init parse_dt_topology(void)
 		middle_capacity = ((max_capacity / 3)
 				>> (SCHED_CAPACITY_SHIFT-1)) + 1;
 
+	if (max_capacity && max_capacity != min_capacity)
+		big_little = 1;
 }
 
 /*
@@ -277,15 +280,29 @@ void store_cpu_topology(unsigned int cpuid)
 
 static inline int cpu_corepower_flags(void)
 {
-	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
+	int flags = SD_SHARE_PKG_RESOURCES | SD_SHARE_POWERDOMAIN;
+
+	return big_little ? SD_ASYM_CPUCAPACITY | flags : flags;
+}
+
+static inline int arm_cpu_core_flags(void)
+{
+	int flags = SD_SHARE_PKG_RESOURCES;
+
+	return big_little ? SD_ASYM_CPUCAPACITY | flags : flags;
+}
+
+static inline int arm_cpu_cpu_flags(void)
+{
+	return big_little ? SD_ASYM_CPUCAPACITY : 0;
 }
 
 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
 	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
-	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+	{ cpu_coregroup_mask, arm_cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ cpu_cpu_mask, arm_cpu_cpu_flags, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
 
-- 
1.9.1

* [PATCH v2 13/13] arm: Update arch_scale_cpu_capacity() to reflect change to define
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (11 preceding siblings ...)
  2016-06-22 17:03 ` [PATCH v2 12/13] arm: Set SD_ASYM_CPUCAPACITY for big.LITTLE platforms Morten Rasmussen
@ 2016-06-22 17:03 ` Morten Rasmussen
  2016-06-28 10:20 ` [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Koan-Sin Tan
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-22 17:03 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Morten Rasmussen, Russell King

arch_scale_cpu_capacity() is no longer a weak function but a #define
instead. Include the #define in topology.h.

cc: Russell King <linux@arm.linux.org.uk>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/arm/include/asm/topology.h | 5 +++++
 arch/arm/kernel/topology.c      | 2 +-
 2 files changed, 6 insertions(+), 1 deletion(-)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 370f7a732900..a09a10599316 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -24,6 +24,11 @@ void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
 
+#define arch_scale_cpu_capacity arm_arch_scale_cpu_capacity
+struct sched_domain;
+extern
+unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
+
 #else
 
 static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index ab6d9c1835a4..cd7060c9e67d 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;
 
-unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+unsigned long arm_arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
 	return per_cpu(cpu_scale, cpu);
 }
-- 
1.9.1

* Re: [PATCH v2 02/13] sched/fair: Consistent use of prev_cpu in wakeup path
  2016-06-22 17:03 ` [PATCH v2 02/13] sched/fair: Consistent use of prev_cpu in wakeup path Morten Rasmussen
@ 2016-06-22 18:04   ` Rik van Riel
  2016-06-23  9:56     ` Morten Rasmussen
  2016-08-10 18:03   ` [tip:sched/core] sched/fair: Make the use of prev_cpu consistent in the " tip-bot for Morten Rasmussen
  1 sibling, 1 reply; 64+ messages in thread
From: Rik van Riel @ 2016-06-22 18:04 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith, linux-kernel

On Wed, 2016-06-22 at 18:03 +0100, Morten Rasmussen wrote:
> In commit ac66f5477239 ("sched/numa: Introduce migrate_swap()")
> select_task_rq() got a 'cpu' argument to enable overriding of
> prev_cpu
> in special cases (NUMA task swapping). However, the
> select_task_rq_fair() helper functions: wake_affine() and
> select_idle_sibling(), still use task_cpu(p) directly to work out
> prev_cpu which leads to inconsistencies.
> 
> This patch passes prev_cpu (potentially overridden by NUMA code) into
> the helper functions to ensure prev_cpu is indeed the same cpu
> everywhere in the wakeup path.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> cc: Rik van Riel <riel@redhat.com>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 24 +++++++++++++-----------
>  1 file changed, 13 insertions(+), 11 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c6dd8bab010c..eec8e29104f9 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -656,7 +656,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq,
> struct sched_entity *se)
>  }
>  
>  #ifdef CONFIG_SMP
> -static int select_idle_sibling(struct task_struct *p, int cpu);
> +static int select_idle_sibling(struct task_struct *p, int prev_cpu,
> int cpu);
>  static unsigned long task_h_load(struct task_struct *p);
>  
>  /*
> @@ -1483,7 +1483,8 @@ static void task_numa_compare(struct
> task_numa_env *env,
>  	 * Call select_idle_sibling to maybe find a better one.
>  	 */
>  	if (!cur)
> -		env->dst_cpu = select_idle_sibling(env->p, env-
> >dst_cpu);
> +		env->dst_cpu = select_idle_sibling(env->p, env-
> >src_cpu,
> +						   env->dst_cpu);

It is worth remembering that "prev" will only
ever be returned by select_idle_sibling() if
it is part of the same NUMA node as target.

That means this patch does not change behaviour
of the NUMA balancing code, since that always
migrates between nodes.

Now lets look at try_to_wake_up(). It will pass
p->wake_cpu as the argument for "prev_cpu", which
again appears to be the same CPU number as that used
by the current code.

I have no objection to your patch, but must be
overlooking something, since I cannot find a change
in behaviour that your patch would create.

-- 
All rights reversed

* Re: [PATCH v2 02/13] sched/fair: Consistent use of prev_cpu in wakeup path
  2016-06-22 18:04   ` Rik van Riel
@ 2016-06-23  9:56     ` Morten Rasmussen
  2016-06-23 12:24       ` Rik van Riel
  0 siblings, 1 reply; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-23  9:56 UTC (permalink / raw)
  To: Rik van Riel
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	mgalbraith, linux-kernel

On Wed, Jun 22, 2016 at 02:04:11PM -0400, Rik van Riel wrote:
> On Wed, 2016-06-22 at 18:03 +0100, Morten Rasmussen wrote:
> > In commit ac66f5477239 ("sched/numa: Introduce migrate_swap()")
> > select_task_rq() got a 'cpu' argument to enable overriding of
> > prev_cpu
> > in special cases (NUMA task swapping). However, the
> > select_task_rq_fair() helper functions: wake_affine() and
> > select_idle_sibling(), still use task_cpu(p) directly to work out
> > prev_cpu which leads to inconsistencies.
> > 
> > This patch passes prev_cpu (potentially overridden by NUMA code) into
> > the helper functions to ensure prev_cpu is indeed the same cpu
> > everywhere in the wakeup path.
> > 
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> > cc: Rik van Riel <riel@redhat.com>
> > 
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 24 +++++++++++++-----------
> >  1 file changed, 13 insertions(+), 11 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index c6dd8bab010c..eec8e29104f9 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -656,7 +656,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq,
> > struct sched_entity *se)
> >  }
> >  
> >  #ifdef CONFIG_SMP
> > -static int select_idle_sibling(struct task_struct *p, int cpu);
> > +static int select_idle_sibling(struct task_struct *p, int prev_cpu,
> > int cpu);
> >  static unsigned long task_h_load(struct task_struct *p);
> >  
> >  /*
> > @@ -1483,7 +1483,8 @@ static void task_numa_compare(struct
> > task_numa_env *env,
> >  	 * Call select_idle_sibling to maybe find a better one.
> >  	 */
> >  	if (!cur)
> > -		env->dst_cpu = select_idle_sibling(env->p, env-
> > >dst_cpu);
> > +		env->dst_cpu = select_idle_sibling(env->p, env-
> > >src_cpu,
> > +						   env->dst_cpu);
> 
> It is worth remembering that "prev" will only
> ever be returned by select_idle_sibling() if
> it is part of the same NUMA node as target.
> 
> That means this patch does not change behaviour
> of the NUMA balancing code, since that always
> migrates between nodes.
> 
> Now lets look at try_to_wake_up(). It will pass
> p->wake_cpu as the argument for "prev_cpu", which
> again appears to be the same CPU number as that used
> by the current code.

IIUC, p->wake_cpu != task_cpu(p) if task_numa_migrate() decided to call
migrate_swap() on the task while it was sleeping intending it to swap
places with a task on a different NUMA node when it wakes up. Using
p->wake_cpu in select_idle_sibling() as "prev_cpu" when called through
try_to_wake_up()->select_task_rq() should only make a difference if the
target cpu happens to share cache with it and it is idle.

	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
		return prev;

The selection of the target cpu for select_idle_sibling() is also
slightly affected as wake_affine() currently compares task_cpu(p) and
smp_processor_id(), and then picks p->wake_cpu or smp_processor_id()
depending on the outcome. With this patch wake_affine() uses
p->wake_cpu instead of task_cpu(p) so we actually compare the candidates
we choose between.

I think that would lead to some minor changes in behaviour in a few
corner cases, but I mainly wrote the patch as I thought it was very
confusing that we could have different "prev_cpu"s in different parts of
the select_task_rq_fair() code path.

> 
> I have no objection to your patch, but must be
> overlooking something, since I cannot find a change
> in behaviour that your patch would create.

Thanks for confirming that it shouldn't change anything for NUMA load
balancing. That is what I hope for :-)

* Re: [PATCH v2 02/13] sched/fair: Consistent use of prev_cpu in wakeup path
  2016-06-23  9:56     ` Morten Rasmussen
@ 2016-06-23 12:24       ` Rik van Riel
  0 siblings, 0 replies; 64+ messages in thread
From: Rik van Riel @ 2016-06-23 12:24 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	mgalbraith, linux-kernel

[-- Attachment #1: Type: text/plain, Size: 410 bytes --]

On Thu, 2016-06-23 at 10:56 +0100, Morten Rasmussen wrote:
> 
> I think that would lead to some minor changes in behaviour in a few
> corner cases, but I mainly wrote the patch as I thought it was very
> confusing that we could have different "prev_cpu"s in different parts
> of
> the select_task_rq_fair() code path.
> 
I agree that cleaning up the confusion is good.

-- 
All Rights Reversed.


[-- Attachment #2: This is a digitally signed message part --]
[-- Type: application/pgp-signature, Size: 473 bytes --]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 11/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
  2016-06-22 17:03 ` [PATCH v2 11/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups Morten Rasmussen
@ 2016-06-23 21:20   ` Sai Gurrappadi
  2016-06-30  7:49     ` Morten Rasmussen
  2016-07-12 12:59   ` Peter Zijlstra
  1 sibling, 1 reply; 64+ messages in thread
From: Sai Gurrappadi @ 2016-06-23 21:20 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Peter Boonstoppel

Hi Morten,

On 06/22/2016 10:03 AM, Morten Rasmussen wrote:

[...]

>  
> +/*
> + * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
> + * per-cpu capacity than sched_group ref.
> + */
> +static inline bool
> +group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> +{
> +	return sg->sgc->max_capacity * capacity_margin <
> +						ref->sgc->max_capacity * 1024;
> +}
> +
>  static inline enum
>  group_type group_classify(struct sched_group *group,
>  			  struct sg_lb_stats *sgs)
> @@ -6892,6 +6903,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  	if (sgs->avg_load <= busiest->avg_load)
>  		return false;
>  
> +	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> +		goto asym_packing;
> +
> +	/* Candidate sg has no more than one task per cpu and has
> +	 * higher per-cpu capacity. Migrating tasks to less capable
> +	 * cpus may harm throughput. Maximize throughput,
> +	 * power/energy consequences are not considered.
> +	 */
> +	if (sgs->sum_nr_running <= sgs->group_weight &&
> +	    group_smaller_cpu_capacity(sds->local, sg))
> +		return false;
> +
> +asym_packing:

What about the case where IRQ/RT work reduces the capacity of some of these bigger CPUs? sgc->max_capacity might not necessarily capture that case.

Thanks,
-Sai

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (12 preceding siblings ...)
  2016-06-22 17:03 ` [PATCH v2 13/13] arm: Update arch_scale_cpu_capacity() to reflect change to define Morten Rasmussen
@ 2016-06-28 10:20 ` Koan-Sin Tan
  2016-06-30  7:53   ` Morten Rasmussen
  2016-07-08  7:35 ` KEITA KOBAYASHI
                   ` (2 subsequent siblings)
  16 siblings, 1 reply; 64+ messages in thread
From: Koan-Sin Tan @ 2016-06-28 10:20 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, peterz, mingo

Hi,

I tested these patches with patches from your preview branch [1] and some
platform specific patches on MediaTek MT8173 EVB (integrated branch at [2])

> Test 0:
> 	for i in `seq 1 10`; \
> 	       do sysbench --test=cpu --max-time=3 --num-threads=1 run; \
> 	       done \
> 	| awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
> 	       END {print "Average events: " sum/runs}'
> Target: ARM TC2 (2xA15+3xA7)
> 
> 	(Higher is better)
> tip:	Average events: 146.9
> patch:	Average events: 217.9

Target: MediaTek MT8173 EVB (2xA53+2xA72)

w/o capacity patches:
	274.7 (actually, it varies a lot between 194 and 329)
w/ capacity patches:
	328.7

> Test 1:
> 	perf stat --null --repeat 10 -- \
> 	perf bench sched messaging -g 50 -l 5000
> 
> Target: Intel IVB-EP (2*10*2)
> 
> tip:    4.861970420 seconds time elapsed ( +-  1.39% )
> patch:  4.886204224 seconds time elapsed ( +-  0.75% )
> 
> Target: ARM TC2 A7-only (3xA7) (-l 1000)
> 
> tip:    61.485682596 seconds time elapsed ( +-  0.07% )
> patch:  62.667950130 seconds time elapsed ( +-  0.36% )
> 

Target: MediaTek MT8173 EVB A53-only (2xA7) (-l 1000)

w/o capacity patches:
	72.663181227 seconds time elapsed ( +-  1.62% )
	71.502263666 seconds time elapsed ( +-  0.45% ) 
	73.707466212 seconds time elapsed ( +-  0.99% )
	73.623523997 seconds time elapsed ( +-  1.85% )
	74.150137066 seconds time elapsed ( +-  1.41% )

w/ capacity patches:
	70.360055056 seconds time elapsed ( +-  1.58% ) 
	69.884793481 seconds time elapsed ( +-  0.78% )
	73.395917166 seconds time elapsed ( +-  1.86% )
	70.082440427 seconds time elapsed ( +-  1.66% )
	69.365401758 seconds time elapsed ( +-  0.80% )

Tested-by: Koan-Sin Tan <freedom.tan@mediatek.com>

[1] git://linux-arm.org/linux-power.git capacity_awareness_v2_arm64_v1
[2] https://git.linaro.org/people/freedom.tan/linux-8173.git v4.7-rc5-mt8173-capacity

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 11/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
  2016-06-23 21:20   ` Sai Gurrappadi
@ 2016-06-30  7:49     ` Morten Rasmussen
  2016-07-14 16:39       ` Sai Gurrappadi
  0 siblings, 1 reply; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-30  7:49 UTC (permalink / raw)
  To: Sai Gurrappadi
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	mgalbraith, linux-kernel, Peter Boonstoppel

On Thu, Jun 23, 2016 at 02:20:48PM -0700, Sai Gurrappadi wrote:
> Hi Morten,
> 
> On 06/22/2016 10:03 AM, Morten Rasmussen wrote:
> 
> [...]
> 
> >  
> > +/*
> > + * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
> > + * per-cpu capacity than sched_group ref.
> > + */
> > +static inline bool
> > +group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> > +{
> > +	return sg->sgc->max_capacity * capacity_margin <
> > +						ref->sgc->max_capacity * 1024;
> > +}
> > +
> >  static inline enum
> >  group_type group_classify(struct sched_group *group,
> >  			  struct sg_lb_stats *sgs)
> > @@ -6892,6 +6903,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  	if (sgs->avg_load <= busiest->avg_load)
> >  		return false;
> >  
> > +	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> > +		goto asym_packing;
> > +
> > +	/* Candidate sg has no more than one task per cpu and has
> > +	 * higher per-cpu capacity. Migrating tasks to less capable
> > +	 * cpus may harm throughput. Maximize throughput,
> > +	 * power/energy consequences are not considered.
> > +	 */
> > +	if (sgs->sum_nr_running <= sgs->group_weight &&
> > +	    group_smaller_cpu_capacity(sds->local, sg))
> > +		return false;
> > +
> > +asym_packing:
> 
> What about the case where IRQ/RT work reduces the capacity of some of
> these bigger CPUs? sgc->max_capacity might not necessarily capture
> that case.

Right, we could possibly improve this by using min_capacity instead, but
we could end up allowing tasks to be pulled to lower capacity cpus just
because one big cpu has reduced capacity due to RT/IRQ pressure and
therefore has lowered the group's min_capacity.

Ideally we should check all the capacities, but that complicates things
a lot.

Would you prefer min_capacity instead, or attempts to consider all the
cpu capacities available in both groups?
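
For reference, a min_capacity variant would look roughly like this
(sketch only; sgc->min_capacity does not exist in this series and would
have to be added and maintained next to max_capacity):

	/* Hypothetical variant, assuming an sgc->min_capacity field existed. */
	static inline bool
	group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
	{
		return sg->sgc->min_capacity * capacity_margin <
						ref->sgc->min_capacity * 1024;
	}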

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support
  2016-06-28 10:20 ` [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Koan-Sin Tan
@ 2016-06-30  7:53   ` Morten Rasmussen
  0 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-06-30  7:53 UTC (permalink / raw)
  To: Koan-Sin Tan
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, peterz, mingo

On Tue, Jun 28, 2016 at 06:20:27PM +0800, Koan-Sin Tan wrote:
> Hi,
> 
> I tested these patches with patches from your preview branch [1] and some
> platform specific patches on MediaTek MT8173 EVB (integrated branch at [2])
> 
> > Test 0:
> > 	for i in `seq 1 10`; \
> > 	       do sysbench --test=cpu --max-time=3 --num-threads=1 run; \
> > 	       done \
> > 	| awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
> > 	       END {print "Average events: " sum/runs}'
> > Target: ARM TC2 (2xA15+3xA7)
> > 
> > 	(Higher is better)
> > tip:	Average events: 146.9
> > patch:	Average events: 217.9
> 
> Target: MediaTek MT8173 EVB (2xA53+2xA72)
> 
> w/o capacity patches:
> 	274.7 (actually, it varies a lot between 194 and 329)

The variation you mention is consistent with our observations on both
TC2 and Juno. Most times you get either big or little performance, and
occasionally a number in between.

> w/ capacity patches:
> 	328.7
> 
> > Test 1:
> > 	perf stat --null --repeat 10 -- \
> > 	perf bench sched messaging -g 50 -l 5000
> > 
> > Target: Intel IVB-EP (2*10*2)
> > 
> > tip:    4.861970420 seconds time elapsed ( +-  1.39% )
> > patch:  4.886204224 seconds time elapsed ( +-  0.75% )
> > 
> > Target: ARM TC2 A7-only (3xA7) (-l 1000)
> > 
> > tip:    61.485682596 seconds time elapsed ( +-  0.07% )
> > patch:  62.667950130 seconds time elapsed ( +-  0.36% )
> > 
> 
> Target: MediaTek MT8173 EVB A53-only (2xA7) (-l 1000)
> 
> w/o capacity patches:
> 	72.663181227 seconds time elapsed ( +-  1.62% )
> 	71.502263666 seconds time elapsed ( +-  0.45% ) 
> 	73.707466212 seconds time elapsed ( +-  0.99% )
> 	73.623523997 seconds time elapsed ( +-  1.85% )
> 	74.150137066 seconds time elapsed ( +-  1.41% )
> 
> w/ capacity patches:
> 	70.360055056 seconds time elapsed ( +-  1.58% ) 
> 	69.884793481 seconds time elapsed ( +-  0.78% )
> 	73.395917166 seconds time elapsed ( +-  1.86% )
> 	70.082440427 seconds time elapsed ( +-  1.66% )
> 	69.365401758 seconds time elapsed ( +-  0.80% )
> 
> Tested-by: Koan-Sin Tan <freedom.tan@mediatek.com>

Thanks for testing.

Morten

^ permalink raw reply	[flat|nested] 64+ messages in thread

* RE: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (13 preceding siblings ...)
  2016-06-28 10:20 ` [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Koan-Sin Tan
@ 2016-07-08  7:35 ` KEITA KOBAYASHI
  2016-07-08  8:18   ` Morten Rasmussen
  2016-07-11  8:33 ` Morten Rasmussen
  2016-07-13 12:06 ` Vincent Guittot
  16 siblings, 1 reply; 64+ messages in thread
From: KEITA KOBAYASHI @ 2016-07-08  7:35 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel, Gaku Inami, Yoshihiro Shimoda, KEITA KOBAYASHI

Hi,

I tested these patches on Renesas SoC r8a7790(CA15*4 + CA7*4)
and your preview branch[1] on Renesas SoC r8a7795(CA57*4 + CA53*4).

> Test 0:
> 	for i in `seq 1 10`; \
> 	       do sysbench --test=cpu --max-time=3 --num-threads=1 run;
> \
> 	       done \
> 	| awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
> 	       END {print "Average events: " sum/runs}'
> 
> Target: ARM TC2 (2xA15+3xA7)
> 
> 	(Higher is better)
> tip:	Average events: 146.9
> patch:	Average events: 217.9
> 
Target: Renesas SoC r8a7790(CA15*4 + CA7*4)
  w/  capacity patches: Average events: 200.2
  w/o capacity patches: Average events: 144.4

Target: Renesas SoC r8a7795(CA57*4 + CA53*4)
  w/  capacity patches : 3587.7
  w/o capacity patches : 2327.8

> Test 1:
> 	perf stat --null --repeat 10 -- \
> 	perf bench sched messaging -g 50 -l 5000
> 
> Target: Intel IVB-EP (2*10*2)
> 
> tip:    4.861970420 seconds time elapsed ( +-  1.39% )
> patch:  4.886204224 seconds time elapsed ( +-  0.75% )
> 
> Target: ARM TC2 A7-only (3xA7) (-l 1000)
> 
> tip:    61.485682596 seconds time elapsed ( +-  0.07% )
> patch:  62.667950130 seconds time elapsed ( +-  0.36% )
> 
Target: Renesas SoC r8a7790(CA15*4) (-l 1000)
  w/  capacity patches: 38.955532040 seconds time elapsed ( +-  0.12% )
  w/o capacity patches: 39.424945580 seconds time elapsed ( +-  0.10% )

Target: Renesas SoC r8a7795(CA57*4) (-l 1000)
  w/  capacity patches : 29.804292200 seconds time elapsed ( +-  0.37% )
  w/o capacity patches : 29.838826790 seconds time elapsed ( +-  0.40% )

Tested-by: Keita Kobayashi <keita.kobayashi.ym@renesas.com>

[1] git://linux-arm.org/linux-power.git capacity_awareness_v2_arm64_v1

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support
  2016-07-08  7:35 ` KEITA KOBAYASHI
@ 2016-07-08  8:18   ` Morten Rasmussen
  0 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-08  8:18 UTC (permalink / raw)
  To: KEITA KOBAYASHI
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	mgalbraith, linux-kernel, Gaku Inami, Yoshihiro Shimoda

On Fri, Jul 08, 2016 at 07:35:56AM +0000, KEITA KOBAYASHI wrote:
> Hi,
> 
> I tested these patches on Renesas SoC r8a7790(CA15*4 + CA7*4)
> and your preview branch[1] on Renesas SoC r8a7795(CA57*4 + CA53*4).
> 
> > Test 0:
> > 	for i in `seq 1 10`; \
> > 	       do sysbench --test=cpu --max-time=3 --num-threads=1 run;
> > \
> > 	       done \
> > 	| awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
> > 	       END {print "Average events: " sum/runs}'
> > 
> > Target: ARM TC2 (2xA15+3xA7)
> > 
> > 	(Higher is better)
> > tip:	Average events: 146.9
> > patch:	Average events: 217.9
> > 
> Target: Renesas SoC r8a7790(CA15*4 + CA7*4)
>   w/  capacity patches: Average events: 200.2
>   w/o capacity patches: Average events: 144.4
> 
> Target: Renesas SoC r8a7795(CA57*4 + CA53*4)
>   w/  capacity patches : 3587.7
>   w/o capacity patches : 2327.8
> 
> > Test 1:
> > 	perf stat --null --repeat 10 -- \
> > 	perf bench sched messaging -g 50 -l 5000
> > 
> > Target: Intel IVB-EP (2*10*2)
> > 
> > tip:    4.861970420 seconds time elapsed ( +-  1.39% )
> > patch:  4.886204224 seconds time elapsed ( +-  0.75% )
> > 
> > Target: ARM TC2 A7-only (3xA7) (-l 1000)
> > 
> > tip:    61.485682596 seconds time elapsed ( +-  0.07% )
> > patch:  62.667950130 seconds time elapsed ( +-  0.36% )
> > 
> Target: Renesas SoC r8a7790(CA15*4) (-l 1000)
>   w/  capacity patches: 38.955532040 seconds time elapsed ( +-  0.12% )
>   w/o capacity patches: 39.424945580 seconds time elapsed ( +-  0.10% )
> 
> Target: Renesas SoC r8a7795(CA57*4) (-l 1000)
>   w/  capacity patches : 29.804292200 seconds time elapsed ( +-  0.37% )
>   w/o capacity patches : 29.838826790 seconds time elapsed ( +-  0.40% )
> 
> Tested-by: Keita Kobayashi <keita.kobayashi.ym@renesas.com>

Thank you for testing and sharing your test results. They seem to show a
significant improvement in throughput, which is in line with the
measurements we have for the ARM dev boards and the MediaTek SoC.

Thanks,
Morten

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (14 preceding siblings ...)
  2016-07-08  7:35 ` KEITA KOBAYASHI
@ 2016-07-11  8:33 ` Morten Rasmussen
  2016-07-11 12:44   ` Vincent Guittot
  2016-07-12 13:25   ` Peter Zijlstra
  2016-07-13 12:06 ` Vincent Guittot
  16 siblings, 2 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-11  8:33 UTC (permalink / raw)
  To: peterz, vincent.guittot
  Cc: dietmar.eggemann, yuyang.du, mgalbraith, mingo, linux-kernel

On Wed, Jun 22, 2016 at 06:03:11PM +0100, Morten Rasmussen wrote:
> Hi,
> 
> The scheduler is currently not doing much to help performance on systems with
> asymmetric compute capacities (read ARM big.LITTLE). This series improves the
> situation with a few tweaks mainly to the task wake-up path that considers
> compute capacity at wake-up and not just whether a cpu is idle for these
> systems. This gives us consistent, and potentially higher, throughput in
> partially utilized scenarios. SMP behaviour and performance should be
> unaffected.

Peter, Vincent: Any chance you have time for another look?

Thanks,
Morten

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 04/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  2016-06-22 17:03 ` [PATCH v2 04/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Morten Rasmussen
@ 2016-07-11  9:55   ` Peter Zijlstra
  2016-07-11 10:42     ` Morten Rasmussen
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2016-07-11  9:55 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Wed, Jun 22, 2016 at 06:03:15PM +0100, Morten Rasmussen wrote:
> Add a topology flag to the sched_domain hierarchy indicating
> sched_groups at this sched_domain level having different per cpu
> capacity (e.g. big.LITTLE big-only and little-only groups) or groups in
> sibling domains with different capacity.

>                                          IOW, domains not spanning cpus
> of all available capacities up until and including the first level
> spanning cpus of all capacities available system wide.

This ^, which would be SD1 below, why? I would expect only SD2 to be
tagged, since that does indeed have asymmetric capacities.

> 
> SD 2      [  0      1      2      3]  SD_ASYM_CPUCAPACITY
> 
> SD 1      [  0      1] [   2      3]  SD_ASYM_CPUCAPACITY
>
> cpu:         0      1      2      3
> capacity:  756    756   1024   1024
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems
  2016-06-22 17:03 ` [PATCH v2 05/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems Morten Rasmussen
@ 2016-07-11 10:04   ` Peter Zijlstra
  2016-07-11 10:37     ` Morten Rasmussen
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2016-07-11 10:04 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Wed, Jun 22, 2016 at 06:03:16PM +0100, Morten Rasmussen wrote:
> Systems with the SD_ASYM_CPUCAPACITY flag set indicate that sched_groups
> at this level or below do not include cpus of all capacities available
> (e.g. group containing little-only or big-only cpus in big.LITTLE
> systems). It is therefore necessary to put in more effort in finding an
> appropriate cpu at task wake-up by enabling balancing at wake-up
> (SD_BALANCE_WAKE).

> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6397,6 +6397,9 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
>  	 * Convert topological properties into behaviour.
>  	 */
>  
> +	if (sd->flags & SD_ASYM_CPUCAPACITY)
> +		sd->flags |= SD_BALANCE_WAKE;
> +

So I'm a bit confused on the exact requirements for this; as also per
the previous patch.

Should all sched domains get BALANCE_WAKE if one (typically the top)
domain has ASYM_CAP set?

The previous patch set it on the actual asym one and one below that, but
what if there's more levels below that? Imagine ARM gaining SMT or
somesuch. Should not then that level also get BALANCE_WAKE in order to
'correctly' place light/heavy tasks?

IOW, are you trying to fudge the behaviour semantics by creating 'weird'
ASYM_CAP rules instead of having a more complex behaviour rule here?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-06-22 17:03 ` [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain Morten Rasmussen
@ 2016-07-11 10:18   ` Peter Zijlstra
  2016-07-11 16:16     ` Dietmar Eggemann
  2016-07-13 12:40   ` Vincent Guittot
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2016-07-11 10:18 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Wed, Jun 22, 2016 at 06:03:17PM +0100, Morten Rasmussen wrote:
> @@ -6905,11 +6906,19 @@ static int build_sched_domains(const struct cpumask *cpu_map,
>  	/* Attach the domains */
>  	rcu_read_lock();
>  	for_each_cpu(i, cpu_map) {
> +		rq = cpu_rq(i);
>  		sd = *per_cpu_ptr(d.sd, i);
>  		cpu_attach_domain(sd, d.rd, i);
> +
> +		if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity)
> +			rq->rd->max_cpu_capacity = rq->cpu_capacity_orig;
>  	}

Should you not set that _before_ cpu_attach_domain(), such that the
state is up-to-date when its published?

Also, since its lockless, should we not use {READ,WRITE}_ONCE() with it?

>  	rcu_read_unlock();
>  
> +	if (rq)
> +		pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
> +			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
> +

While a single statement, it is multi line, please add brackets.

>  	ret = 0;
>  error:

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems
  2016-07-11 10:04   ` Peter Zijlstra
@ 2016-07-11 10:37     ` Morten Rasmussen
  2016-07-11 11:04       ` Morten Rasmussen
  0 siblings, 1 reply; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-11 10:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Mon, Jul 11, 2016 at 12:04:49PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 22, 2016 at 06:03:16PM +0100, Morten Rasmussen wrote:
> > Systems with the SD_ASYM_CPUCAPACITY flag set indicate that sched_groups
> > at this level or below do not include cpus of all capacities available
> > (e.g. group containing little-only or big-only cpus in big.LITTLE
> > systems). It is therefore necessary to put in more effort in finding an
> > appropriate cpu at task wake-up by enabling balancing at wake-up
> > (SD_BALANCE_WAKE).
> 
> > --- a/kernel/sched/core.c
> > +++ b/kernel/sched/core.c
> > @@ -6397,6 +6397,9 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
> >  	 * Convert topological properties into behaviour.
> >  	 */
> >  
> > +	if (sd->flags & SD_ASYM_CPUCAPACITY)
> > +		sd->flags |= SD_BALANCE_WAKE;
> > +
> 
> So I'm a bit confused on the exact requirements for this; as also per
> the previous patch.
> 
> Should all sched domains get BALANCE_WAKE if one (typically the top)
> domain has ASYM_CAP set?
> 
> The previous patch set it on the actual asym one and one below that, but
> what if there's more levels below that? Imagine ARM gaining SMT or
> somesuch. Should not then that level also get BALANCE_WAKE in order to
> 'correctly' place light/heavy tasks?
> 
> IOW, are you trying to fudge the behaviour semantics by creating 'weird'
> ASYM_CAP rules instead of having a more complex behaviour rule here?

That is one possible way of describing it :-)

The proposed semantic is to set ASYM_CAP at all levels starting from the
bottom up until you have sched_groups containing all types of cpus
available in the system, or reach the top level.

The fundamental reason for this weird semantics is that we somehow need
to know at the lower levels, which may be capacity symmetric, if we need
to consider balancing at a higher level to see the asymmetry or not.

If the flag isn't set bottom up we need some other way of knowing if the
system is asymmetric, or we would have to go look for the flag further
up the sched_domain hierarchy each time.

I'm not saying this is the perfect solution, I'm happy to discuss
alternatives.

The example in the previous patch has the flag set on both levels, as we
have two clusters of different cpus and therefore have to go to the top
to 'see' all the types of cpus we have in the system.

If you add SMT, you would add a third level at the bottom with
ASYM_CAP set as well as you still have to balance at top level to have
the full range of choice of cpu type.

Should someone build a system with multiple big.LITTLE cluster pairs and
essentially add another sched_domain level on top, then that level
should _not_ have the ASYM_CAP flag set. The sched_groups at this level
would span both big and little cpus of the cluster pair so there is
little reason to expand the search scope at wake-up further.
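
As a made-up illustration (hypothetical system: cpus 0-3 little, cpus
4-7 big, two hardware threads per core), the flag would be set like
this:

DIE   [0 1 2 3 4 5 6 7]          SD_ASYM_CPUCAPACITY (first level spanning all capacities)
MC    [0 1 2 3] [4 5 6 7]        SD_ASYM_CPUCAPACITY
SMT   [0 1] [2 3] [4 5] [6 7]    SD_ASYM_CPUCAPACITY

An extra level on top spanning several such big.LITTLE pairs would not
have it set.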

I hope that makes sense.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 04/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
  2016-07-11  9:55   ` Peter Zijlstra
@ 2016-07-11 10:42     ` Morten Rasmussen
  0 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-11 10:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Mon, Jul 11, 2016 at 11:55:23AM +0200, Peter Zijlstra wrote:
> On Wed, Jun 22, 2016 at 06:03:15PM +0100, Morten Rasmussen wrote:
> > Add a topology flag to the sched_domain hierarchy indicating
> > sched_groups at this sched_domain level having different per cpu
> > capacity (e.g. big.LITTLE big-only and little-only groups) or groups in
> > sibling domains with different capacity.
> 
> >                                          IOW, domains not spanning cpus
> > of all available capacities up until and including the first level
> > spanning cpus of all capacities available system wide.
> 
> This ^, which would be SD1 below, why? I would expect only SD2 to be
> tagged, since that does indeed have asymmetric capacities.

That would make most sense, but it complicates how we can use the flag.
See reply to your related question for patch #5.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems
  2016-07-11 10:37     ` Morten Rasmussen
@ 2016-07-11 11:04       ` Morten Rasmussen
  2016-07-11 11:24         ` Peter Zijlstra
  0 siblings, 1 reply; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-11 11:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Mon, Jul 11, 2016 at 11:37:18AM +0100, Morten Rasmussen wrote:
> On Mon, Jul 11, 2016 at 12:04:49PM +0200, Peter Zijlstra wrote:
> > On Wed, Jun 22, 2016 at 06:03:16PM +0100, Morten Rasmussen wrote:
> > > Systems with the SD_ASYM_CPUCAPACITY flag set indicate that sched_groups
> > > at this level or below do not include cpus of all capacities available
> > > (e.g. group containing little-only or big-only cpus in big.LITTLE
> > > systems). It is therefore necessary to put in more effort in finding an
> > > appropriate cpu at task wake-up by enabling balancing at wake-up
> > > (SD_BALANCE_WAKE).
> > 
> > > --- a/kernel/sched/core.c
> > > +++ b/kernel/sched/core.c
> > > @@ -6397,6 +6397,9 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
> > >  	 * Convert topological properties into behaviour.
> > >  	 */
> > >  
> > > +	if (sd->flags & SD_ASYM_CPUCAPACITY)
> > > +		sd->flags |= SD_BALANCE_WAKE;
> > > +
> > 
> > So I'm a bit confused on the exact requirements for this; as also per
> > the previous patch.
> > 
> > Should all sched domains get BALANCE_WAKE if one (typically the top)
> > domain has ASYM_CAP set?
> > 
> > The previous patch set it on the actual asym one and one below that, but
> > what if there's more levels below that? Imagine ARM gaining SMT or
> > somesuch. Should not then that level also get BALANCE_WAKE in order to
> > 'correctly' place light/heavy tasks?
> > 
> > IOW, are you trying to fudge the behaviour semantics by creating 'weird'
> > ASYM_CAP rules instead of having a more complex behaviour rule here?
> 
> That is one possible way of describing it :-)
> 
> The proposed semantic is to set ASYM_CAP at all levels starting from the
> bottom up until you have sched_groups containing all types of cpus
> available in the system, or reach the top level.
> 
> The fundamental reason for this weird semantics is that we somehow need
> to know at the lower levels, which may be capacity symmetric, if we need
> to consider balancing at a higher level to see the asymmetry or not.
> 
> If the flag isn't set bottom up we need some other way of knowing if the
> system is asymmetric, or we would have to go look for the flag further
> up the sched_domain hierarchy each time.
> 
> I'm not saying this is the perfect solution, I'm happy to discuss
> alternatives.

One alternative to setting ASYM_CAP bottom up would be to set it only
where the asymmetry can be observed, and instead come up with a more
complicated way of setting BALANCE_WAKE bottom up until and including
the first level having the ASYM_CAP.

I looked at it briefly and realized that I couldn't find a clean way of
implementing it, as I don't think we have visibility of which flags will
be set at higher levels in the sched_domain hierarchy when the lower
levels are initialized. IOW, we have behavioural flag settings that
depend on topology flag settings at a different level.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 07/13] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-06-22 17:03 ` [PATCH v2 07/13] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
@ 2016-07-11 11:13   ` Peter Zijlstra
  2016-07-11 12:32     ` Morten Rasmussen
  2016-07-13 12:56   ` Vincent Guittot
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2016-07-11 11:13 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Wed, Jun 22, 2016 at 06:03:18PM +0100, Morten Rasmussen wrote:
> Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> configurations SD_WAKE_AFFINE is only desirable if the waking task's
> compute demand (utilization) is suitable for all the cpu capacities
> available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> balancing take over (find_idlest_{group, cpu}()).

I think I tripped over this one the last time around, and I'm not sure
this Changelog is any clearer.

This is about the case where the waking cpu and prev_cpu are both in the
'wrong' cluster, right?

> This patch makes affine wake-ups conditional on whether both the waker
> cpu and prev_cpu has sufficient capacity for the waking task, or not.
> 
> It is assumed that the sched_group(s) containing the waker cpu and
> prev_cpu only contain cpu with the same capacity (homogeneous).

> 
> Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> traversing them.

Is this again more fallout from that weird ASYM_CAP thing?

> +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> +{
> +	long min_cap, max_cap;
> +
> +	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
> +	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
> +
> +	/* Minimum capacity is close to max, no need to abort wake_affine */
> +	if (max_cap - min_cap < max_cap >> 3)
> +		return 0;
> +
> +	return min_cap * 1024 < task_util(p) * capacity_margin;
> +}

I'm most puzzled by these inequalities, how, why ?

I would figure you'd compare task_util to the current remaining util of
the small group, and if it fits, place it there. This seems to do
something entirely different.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems
  2016-07-11 11:04       ` Morten Rasmussen
@ 2016-07-11 11:24         ` Peter Zijlstra
  2016-07-12 14:26           ` Morten Rasmussen
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2016-07-11 11:24 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Mon, Jul 11, 2016 at 12:04:58PM +0100, Morten Rasmussen wrote:

> One alternative to setting ASYM_CAP bottom up would be to set it only
> where the asymmetry can be observed, and instead come up with a more
> complicated way of setting BALANCE_WAKE bottom up until and including
> the first level having the ASYM_CAP.

Right, that is what I was thinking.

> I looked at it briefly and realized that I couldn't find a clean way of
> implementing it, as I don't think we have visibility of which flags will
> be set at higher levels in the sched_domain hierarchy when the lower
> levels are initialized. IOW, we have behavioural flag settings that
> depend on topology flag settings at a different level.

Looks doable if we pass @child into sd_init() in build_sched_domain().
Then we could simply do:

	*sd = (struct sched_domain){
		/* ... */
		.child = child,
	};

	if (sd->flags & ASYM_CAP) {
		struct sched_domain *t = sd;
		while (t) {
			t->sd_flags |= BALANCE_WAKE;
			t = t->child;
		}
	}

Or something like that.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 07/13] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-07-11 11:13   ` Peter Zijlstra
@ 2016-07-11 12:32     ` Morten Rasmussen
  0 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-11 12:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Mon, Jul 11, 2016 at 01:13:44PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 22, 2016 at 06:03:18PM +0100, Morten Rasmussen wrote:
> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
> > compute demand (utilization) is suitable for all the cpu capacities
> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> > balancing take over (find_idlest_{group, cpu}()).
> 
> I think I tripped over this one the last time around, and I'm not sure
> this Changelog is any clearer.
> 
> This is about the case where the waking cpu and prev_cpu are both in the
> 'wrong' cluster, right?

Almost :-) It is the cases where the waking cpu _or_ prev_cpu (or both)
doesn't 'fit' the task ('wrong' cluster). select_idle_sibling() doesn't
consider capacity, so we can't let it choose between the waking cpu and
prev_cpu if one of them isn't a good choice.

Bringing in the table from the cover letter:

tu   = Task utilization big/little
pcpu = Previous cpu big/little
tcpu = This (waker) cpu big/little
dl   = New cpu is little
db   = New cpu is big
sis  = New cpu chosen by select_idle_sibling()
figc = New cpu chosen by find_idlest_*()
ww   = wake_wide(task) count for figc wakeups
bw   = sd_flag & SD_BALANCE_WAKE (non-fork/exec wake)
       for figc wakeups

case tu   pcpu tcpu   dl   db  sis figc   ww   bw
1    l    l    l     122   68   28  162  161  161
2    l    l    b      11    4    0   15   15   15
3    l    b    l       0  252    8  244  244  244
4    l    b    b      36 1928  711 1253 1016 1016
5    b    l    l       5   19    0   24   22   24
6    b    l    b       5    1    0    6    0    6
7    b    b    l       0   31    0   31   31   31
8    b    b    b       1  194  109   86   59   59
--------------------------------------------------
                     180 2497  856 1821

In cases 5, 6, and 7 prev_cpu or waking cpu is little and the task only
fits on big. This is why we have zeros in the 'sis' column for those
three cases. In all other cases we don't care, from a capacity point of
view, whether select_idle_sibling() chooses prev_cpu or the waking cpu.

I will try to rewrite the commit message to makes this clearer.

> 
> > This patch makes affine wake-ups conditional on whether both the waker
> > cpu and prev_cpu has sufficient capacity for the waking task, or not.
> > 
> > It is assumed that the sched_group(s) containing the waker cpu and
> > prev_cpu only contain cpu with the same capacity (homogeneous).
> 
> > 
> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> > traversing them.
> 
> Is this again more fallout from that weird ASYM_CAP thing?

I think this is a left-over comment from v1 that shouldn't have
survived into v2. The flag game was more complicated in v1 as it
required disabling SD_WAKE_AFFINE. That is all gone now thanks to
Vincent's suggestions in the v1 review.

> 
> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > +{
> > +	long min_cap, max_cap;
> > +
> > +	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
> > +	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
> > +
> > +	/* Minimum capacity is close to max, no need to abort wake_affine */
> > +	if (max_cap - min_cap < max_cap >> 3)
> > +		return 0;
> > +
> > +	return min_cap * 1024 < task_util(p) * capacity_margin;
> > +}
> 
> I'm most puzzled by these inequalities, how, why ?
> 
> I would figure you'd compare task_util to the current remaining util of
> the small group, and if it fits, place it there. This seems to do
> something entirely different.

Right, I only compare task utilization to max capacity here and
completely ignore whether any of those cpus are already fully or
partially utilized. Available (remaining) capacity is considered in
find_idlest_group() when that route is taken.

wake_cap() is meant as a first check to see if we should worry about cpu
types (max capacities) at all: if not, go select_idle_sibling(); if we
should, go find_idlest_{group,cpu}() and look more into the details.

Finding the best target cpu quickly becomes messy. Comparing group
utilization to group capacity can be very misleading. The ARM Juno
platform has four little cpus and two big cpus, so cluster group
capacities are roughly comparable. We have to iterate over all the cpus
to find the one with the most spare capacity, i.e. the best fit. So I
thought that would fit better with the existing code in find_idlest_*().
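
For the overall flow, something like the sketch below (simplified and
not the actual select_task_rq_fair(); find_idlest_path() is just a
placeholder for the existing find_idlest_{group,cpu}() loop):

	/* Simplified sketch of the wake-up path; not the real kernel code. */
	static int select_task_rq_sketch(struct task_struct *p, int prev_cpu,
					 int cpu, int sd_flag)
	{
		int want_affine = 0;

		if (sd_flag & SD_BALANCE_WAKE)
			/* Fast path only if both candidates can fit the task. */
			want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
				      cpumask_test_cpu(cpu, tsk_cpus_allowed(p));

		if (want_affine)
			return select_idle_sibling(p, prev_cpu, cpu);

		/* Slow path: placeholder for the find_idlest_{group,cpu}() loop. */
		return find_idlest_path(p, prev_cpu, sd_flag);
	}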

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support
  2016-07-11  8:33 ` Morten Rasmussen
@ 2016-07-11 12:44   ` Vincent Guittot
  2016-07-12 13:25   ` Peter Zijlstra
  1 sibling, 0 replies; 64+ messages in thread
From: Vincent Guittot @ 2016-07-11 12:44 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Dietmar Eggemann, Yuyang Du, mgalbraith, mingo,
	linux-kernel

Hi Morten,

On 11 July 2016 at 10:33, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, Jun 22, 2016 at 06:03:11PM +0100, Morten Rasmussen wrote:
>> Hi,
>>
>> The scheduler is currently not doing much to help performance on systems with
>> asymmetric compute capacities (read ARM big.LITTLE). This series improves the
>> situation with a few tweaks mainly to the task wake-up path that considers
>> compute capacity at wake-up and not just whether a cpu is idle for these
>> systems. This gives us consistent, and potentially higher, throughput in
>> partially utilized scenarios. SMP behaviour and performance should be
>> unaffected.
>
> Peter, Vincent: Any chance you have time for another look?

I will have a look but have only partial network access for now

Vincent
>
> Thanks,
> Morten

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-11 10:18   ` Peter Zijlstra
@ 2016-07-11 16:16     ` Dietmar Eggemann
  2016-07-12 11:42       ` Peter Zijlstra
  0 siblings, 1 reply; 64+ messages in thread
From: Dietmar Eggemann @ 2016-07-11 16:16 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: mingo, yuyang.du, vincent.guittot, mgalbraith, linux-kernel

On 11/07/16 11:18, Peter Zijlstra wrote:
> On Wed, Jun 22, 2016 at 06:03:17PM +0100, Morten Rasmussen wrote:
>> @@ -6905,11 +6906,19 @@ static int build_sched_domains(const struct cpumask *cpu_map,
>>  	/* Attach the domains */
>>  	rcu_read_lock();
>>  	for_each_cpu(i, cpu_map) {
>> +		rq = cpu_rq(i);
>>  		sd = *per_cpu_ptr(d.sd, i);
>>  		cpu_attach_domain(sd, d.rd, i);
>> +
>> +		if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity)
>> +			rq->rd->max_cpu_capacity = rq->cpu_capacity_orig;
>>  	}
> 
> Should you not set that _before_ cpu_attach_domain(), such that the
> state is up-to-date when its published?

yes, much better.

> Also, since its lockless, should we not use {READ,WRITE}_ONCE() with it?

You mean for rq->rd->max_cpu_capacity ? IMHO, there is a data dependency
between the read and the write and the code only runs on one cpu.

I assume here that this is related to item 2 'Overlapping loads and
stores within a particular CPU ...' in GUARANTEES of
doc/Documentation/memory-barriers.txt.

Do I miss something?

>>  	rcu_read_unlock();
>>  
>> +	if (rq)
>> +		pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
>> +			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
>> +
> 
> While a single statement, it is multi line, please add brackets.

OK.

> 
>>  	ret = 0;
>>  error:

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-11 16:16     ` Dietmar Eggemann
@ 2016-07-12 11:42       ` Peter Zijlstra
  2016-07-13 11:18         ` Dietmar Eggemann
  0 siblings, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2016-07-12 11:42 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Morten Rasmussen, mingo, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Mon, Jul 11, 2016 at 05:16:06PM +0100, Dietmar Eggemann wrote:
> On 11/07/16 11:18, Peter Zijlstra wrote:
> > On Wed, Jun 22, 2016 at 06:03:17PM +0100, Morten Rasmussen wrote:
> >> @@ -6905,11 +6906,19 @@ static int build_sched_domains(const struct cpumask *cpu_map,
> >>  	/* Attach the domains */
> >>  	rcu_read_lock();
> >>  	for_each_cpu(i, cpu_map) {
> >> +		rq = cpu_rq(i);
> >>  		sd = *per_cpu_ptr(d.sd, i);
> >>  		cpu_attach_domain(sd, d.rd, i);
> >> +
> >> +		if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity)
> >> +			rq->rd->max_cpu_capacity = rq->cpu_capacity_orig;
> >>  	}
> > 
> > Should you not set that _before_ cpu_attach_domain(), such that the
> > state is up-to-date when its published?
> 
> yes, much better.
> 
> > Also, since its lockless, should we not use {READ,WRITE}_ONCE() with it?
> 
> You mean for rq->rd->max_cpu_capacity ? IMHO, there is a data dependency
> between the read and the write and the code only runs on one cpu.
> 
> I assume here that this is related to item 2 'Overlapping loads and
> stores within a particular CPU ...' in GUARANTEES of
> doc/Documentation/memory-barriers.txt.
> 
> Do I miss something?

Well, the value 'rd->max_cpu_capacity' is read by all CPUs attached to
the root_domain, right? So CPUs already attached can observe this change
when we update the value, we want them to observe either the old or the
new max value, not a random mix of bytes.

{READ,WRITE}_ONCE() ensure whole word load/store, iow they avoid
load/store-tearing.
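
Something like the below would do (sketch only; it also folds in the
earlier point about updating before cpu_attach_domain(), and has to use
d.rd directly since rq->rd is not attached yet at that point):

	for_each_cpu(i, cpu_map) {
		rq = cpu_rq(i);
		sd = *per_cpu_ptr(d.sd, i);

		/* Publish the new max before the domain is attached. */
		if (rq->cpu_capacity_orig > READ_ONCE(d.rd->max_cpu_capacity))
			WRITE_ONCE(d.rd->max_cpu_capacity, rq->cpu_capacity_orig);

		cpu_attach_domain(sd, d.rd, i);
	}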

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 11/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
  2016-06-22 17:03 ` [PATCH v2 11/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups Morten Rasmussen
  2016-06-23 21:20   ` Sai Gurrappadi
@ 2016-07-12 12:59   ` Peter Zijlstra
  2016-07-12 14:34     ` Morten Rasmussen
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2016-07-12 12:59 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Wed, Jun 22, 2016 at 06:03:22PM +0100, Morten Rasmussen wrote:
> @@ -6892,6 +6903,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  	if (sgs->avg_load <= busiest->avg_load)
>  		return false;
>  
> +	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> +		goto asym_packing;

Does this rely on the 'funny' ASYM_CAP semantics?

> +
> +	/* Candidate sg has no more than one task per cpu and has

Tssk, borken comment style.

> +	 * higher per-cpu capacity. Migrating tasks to less capable
> +	 * cpus may harm throughput. Maximize throughput,
> +	 * power/energy consequences are not considered.
> +	 */
> +	if (sgs->sum_nr_running <= sgs->group_weight &&
> +	    group_smaller_cpu_capacity(sds->local, sg))
> +		return false;
> +
> +asym_packing:
>  	/* This is the busiest node in its class. */
>  	if (!(env->sd->flags & SD_ASYM_PACKING))
>  		return true;
> -- 
> 1.9.1
> 

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support
  2016-07-11  8:33 ` Morten Rasmussen
  2016-07-11 12:44   ` Vincent Guittot
@ 2016-07-12 13:25   ` Peter Zijlstra
  2016-07-12 14:39     ` Morten Rasmussen
  1 sibling, 1 reply; 64+ messages in thread
From: Peter Zijlstra @ 2016-07-12 13:25 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, mgalbraith, mingo,
	linux-kernel

On Mon, Jul 11, 2016 at 09:33:54AM +0100, Morten Rasmussen wrote:
> On Wed, Jun 22, 2016 at 06:03:11PM +0100, Morten Rasmussen wrote:
> > Hi,
> > 
> > The scheduler is currently not doing much to help performance on systems with
> > asymmetric compute capacities (read ARM big.LITTLE). This series improves the
> > situation with a few tweaks mainly to the task wake-up path that considers
> > compute capacity at wake-up and not just whether a cpu is idle for these
> > systems. This gives us consistent, and potentially higher, throughput in
> > partially utilized scenarios. SMP behaviour and performance should be
> > unaffected.
> 
> Peter, Vincent: Any chance you have time for another look?

So no fundamental objections, just these few issues that need sorting.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 05/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems
  2016-07-11 11:24         ` Peter Zijlstra
@ 2016-07-12 14:26           ` Morten Rasmussen
  0 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-12 14:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Mon, Jul 11, 2016 at 01:24:04PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 11, 2016 at 12:04:58PM +0100, Morten Rasmussen wrote:
> 
> > One alternative to setting ASYM_CAP bottom up would be to set it only
> > where the asymmetry can be observed, and instead come up with a more
> > complicated way of setting BALANCE_WAKE bottom up until and including
> > the first level having the ASYM_CAP.
> 
> Right, that is what I was thinking.
> 
> > I looked at it briefly and realized that I couldn't find a clean way of
> > implementing it, as I don't think we have visibility of which flags will
> > be set at higher levels in the sched_domain hierarchy when the lower
> > levels are initialized. IOW, we have behavioural flag settings that
> > depend on topology flag settings at a different level.
> 
> Looks doable if we pass @child into sd_init() in build_sched_domain().
> Then we could simply do:
> 
> 	*sd = (struct sched_domain){
> 		/* ... */
> 		.child = child,
> 	};
> 
> 	if (sd->flags & ASYM_CAP) {
> 		struct sched_domain *t = sd;
> 		while (t) {
> 			t->sd_flags |= BALANCE_WAKE;
> 			t = t->child;
> 		}
> 	}
> 
> Or something like that.

It appears to be working fine. I will roll it into v3 along with the
simpler and more sane ASYM_CAP semantics :)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 11/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
  2016-07-12 12:59   ` Peter Zijlstra
@ 2016-07-12 14:34     ` Morten Rasmussen
  0 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-12 14:34 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On Tue, Jul 12, 2016 at 02:59:53PM +0200, Peter Zijlstra wrote:
> On Wed, Jun 22, 2016 at 06:03:22PM +0100, Morten Rasmussen wrote:
> > @@ -6892,6 +6903,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >  	if (sgs->avg_load <= busiest->avg_load)
> >  		return false;
> >  
> > +	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> > +		goto asym_packing;
> 
> Does this rely on the 'funny' ASYM_CAP semantics?

No, I would actually prefer 'sane' ASYM_CAP semantics. With 'funny'
semantics we ended up doing capacity checks inside domains with similar
cpus at the lower domain levels, which is pointless and pure overhead.
With 'sane' semantics, we only do the check for the domain level where
asymmetric capacities are actually observed.

> 
> > +
> > +	/* Candidate sg has no more than one task per cpu and has
> 
> Tssk, borken comment style.

Yes. I will fix that in v3.
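
Presumably just restyling it along these lines (no functional change):

	/*
	 * Candidate sg has no more than one task per cpu and has higher
	 * per-cpu capacity. Migrating tasks to less capable cpus may harm
	 * throughput. Maximize throughput; power/energy consequences are
	 * not considered.
	 */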

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support
  2016-07-12 13:25   ` Peter Zijlstra
@ 2016-07-12 14:39     ` Morten Rasmussen
  0 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-12 14:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, mgalbraith, mingo,
	linux-kernel

On Tue, Jul 12, 2016 at 03:25:48PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 11, 2016 at 09:33:54AM +0100, Morten Rasmussen wrote:
> > On Wed, Jun 22, 2016 at 06:03:11PM +0100, Morten Rasmussen wrote:
> > > Hi,
> > > 
> > > The scheduler is currently not doing much to help performance on systems with
> > > asymmetric compute capacities (read ARM big.LITTLE). This series improves the
> > > situation with a few tweaks mainly to the task wake-up path that considers
> > > compute capacity at wake-up and not just whether a cpu is idle for these
> > > systems. This gives us consistent, and potentially higher, throughput in
> > > partially utilized scenarios. SMP behaviour and performance should be
> > > unaffected.
> > 
> > Peter, Vincent: Any chance you have time for another look?
> 
> So no fundamental objections, just these few issues that need sorting.

Thanks for the review. I will put together a v3.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-12 11:42       ` Peter Zijlstra
@ 2016-07-13 11:18         ` Dietmar Eggemann
  0 siblings, 0 replies; 64+ messages in thread
From: Dietmar Eggemann @ 2016-07-13 11:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, mingo, yuyang.du, vincent.guittot, mgalbraith,
	linux-kernel

On 12/07/16 12:42, Peter Zijlstra wrote:
> On Mon, Jul 11, 2016 at 05:16:06PM +0100, Dietmar Eggemann wrote:
>> On 11/07/16 11:18, Peter Zijlstra wrote:
>>> On Wed, Jun 22, 2016 at 06:03:17PM +0100, Morten Rasmussen wrote:
>>>> @@ -6905,11 +6906,19 @@ static int build_sched_domains(const struct cpumask *cpu_map,
>>>>  	/* Attach the domains */
>>>>  	rcu_read_lock();
>>>>  	for_each_cpu(i, cpu_map) {
>>>> +		rq = cpu_rq(i);
>>>>  		sd = *per_cpu_ptr(d.sd, i);
>>>>  		cpu_attach_domain(sd, d.rd, i);
>>>> +
>>>> +		if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity)
>>>> +			rq->rd->max_cpu_capacity = rq->cpu_capacity_orig;
>>>>  	}
>>>
>>> Should you not set that _before_ cpu_attach_domain(), such that the
>>> state is up-to-date when its published?
>>
>> yes, much better.
>>
>>> Also, since its lockless, should we not use {READ,WRITE}_ONCE() with it?
>>
>> You mean for rq->rd->max_cpu_capacity ? IMHO, there is a data dependency
>> between the read and the write and the code only runs on one cpu.
>>
>> I assume here that this is related to item 2 'Overlapping loads and
>> stores within a particular CPU ...' in GUARANTEES of
>> doc/Documentation/memory-barriers.txt.
>>
>> Do I miss something?
> 
> Well, the value 'rd->max_cpu_capacity' is read by all CPUs attached to
> the root_domain, right? So CPUs already attached can observe this change
> when we update the value, we want them to observe either the old or the
> new max value, not a random mix of bytes.
> 
> {READ,WRITE}_ONCE() ensure whole word load/store, iow they avoid
> load/store-tearing.
> 

OK, thanks, will add them.

So this maps to the point '(*) For aligned memory locations whose size
allows them to be accessed with a single memory-reference instruction,
prevents "load tearing" and "store tearing," ...' under 'The READ_ONCE()
and WRITE_ONCE() functions can prevent any number of optimizations ...'
section in the 'COMPILER BARRIER' paragraph in
Documentation/memory-barriers.txt.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support
  2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (15 preceding siblings ...)
  2016-07-11  8:33 ` Morten Rasmussen
@ 2016-07-13 12:06 ` Vincent Guittot
  2016-07-13 15:54   ` Morten Rasmussen
  16 siblings, 1 reply; 64+ messages in thread
From: Vincent Guittot @ 2016-07-13 12:06 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

Hi Morten,

On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> Hi,
>
> The scheduler is currently not doing much to help performance on systems with
> asymmetric compute capacities (read ARM big.LITTLE). This series improves the
> situation with a few tweaks mainly to the task wake-up path that considers
> compute capacity at wake-up and not just whether a cpu is idle for these
> systems. This gives us consistent, and potentially higher, throughput in
> partially utilized scenarios. SMP behaviour and performance should be
> unaffected.
>
> Test 0:
>         for i in `seq 1 10`; \
>                do sysbench --test=cpu --max-time=3 --num-threads=1 run; \
>                done \
>         | awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
>                END {print "Average events: " sum/runs}'
>
> Target: ARM TC2 (2xA15+3xA7)
>
>         (Higher is better)
> tip:    Average events: 146.9
> patch:  Average events: 217.9
>
> Test 1:
>         perf stat --null --repeat 10 -- \
>         perf bench sched messaging -g 50 -l 5000
>
> Target: Intel IVB-EP (2*10*2)
>
> tip:    4.861970420 seconds time elapsed ( +-  1.39% )
> patch:  4.886204224 seconds time elapsed ( +-  0.75% )
>
> Target: ARM TC2 A7-only (3xA7) (-l 1000)
>
> tip:    61.485682596 seconds time elapsed ( +-  0.07% )
> patch:  62.667950130 seconds time elapsed ( +-  0.36% )
>
> More analysis:
>
> Statistics from mixed periodic task workload (rt-app) containing both
> big and little task, single run on ARM TC2:
>
> tu   = Task utilization big/little
> pcpu = Previous cpu big/little
> tcpu = This (waker) cpu big/little
> dl   = New cpu is little
> db   = New cpu is big
> sis  = New cpu chosen by select_idle_sibling()
> figc = New cpu chosen by find_idlest_*()
> ww   = wake_wide(task) count for figc wakeups
> bw   = sd_flag & SD_BALANCE_WAKE (non-fork/exec wake)
>        for figc wakeups
>
> case tu   pcpu tcpu   dl   db  sis figc   ww   bw
> 1    l    l    l     122   68   28  162  161  161
> 2    l    l    b      11    4    0   15   15   15
> 3    l    b    l       0  252    8  244  244  244
> 4    l    b    b      36 1928  711 1253 1016 1016
> 5    b    l    l       5   19    0   24   22   24
> 6    b    l    b       5    1    0    6    0    6
> 7    b    b    l       0   31    0   31   31   31
> 8    b    b    b       1  194  109   86   59   59
> --------------------------------------------------
>                      180 2497  856 1821

I'm not sure how to interpret all these statistics.

>
> Cases 1-4 + 8 are fine to be served by select_idle_sibling() as both
> this_cpu and prev_cpu are suitable cpus for the task. However, as the
> figc column reveals, those cases are often served by find_idlest_*()
> anyway due to wake_wide() sending the wakeup that way when
> SD_BALANCE_WAKE is set on the sched_domains.
>
> Pulling in the wakee_flip patch (dropped in v2) from v1 shifts a
> significant share of the wakeups to sis from figc:
>
> case tu   pcpu tcpu   dl   db  sis figc   ww   bw
> 1    l    l    l     537    8  537    8    6    6
> 2    l    l    b      49   11   32   28   28   28
> 3    l    b    l       4  323  322    5    5    5
> 4    l    b    b       1 1910 1209  702  458  456
> 5    b    l    l       0    5    0    5    1    5
> 6    b    l    b       0    0    0    0    0    0
> 7    b    b    l       0   32    0   32    2   32
> 8    b    b    b       0  198  168   30   13   13
> --------------------------------------------------
>                      591 2487 2268  810
>
> Notes:
>
> Active migration of tasks away from small capacity cpus isn't addressed
> in this set although it is necessary for consistent throughput in other
> scenarios on asymmetric cpu capacity systems.
>
> The infrastructure to enable capacity awareness for arm64 is not provided here
> but will be based on Juri's DT bindings patch set [1]. A combined preview
> branch is available [2].
>
> [1] https://lkml.org/lkml/2016/6/15/291
> [2] git://linux-arm.org/linux-power.git capacity_awareness_v2_arm64_v1
>
> Patch   1-3: Generic fixes and clean-ups.
> Patch  4-11: Improve capacity awareness.
> Patch 11-12: Arch features for arm to enable asymmetric capacity support.
>
> v2:
>
> - Dropped patch ignoring wakee_flips for pid=0 for now as we can not
>   distinguish cpu time processing irqs from idle time.
>
> - Dropped disabling WAKE_AFFINE as suggested by Vincent to allow more
>   scenarios to use fast-path (select_idle_sibling()). Asymmetric wake
>   conditions adjusted accordingly.
>
> - Changed use of new SD_ASYM_CPUCAPACITY slightly. Now enables
>   SD_BALANCE_WAKE.
>
> - Minor clean-ups and rebased to more recent tip/sched/core.
>
> v1: https://lkml.org/lkml/2014/5/23/621
>
> Dietmar Eggemann (1):
>   sched: Store maximum per-cpu capacity in root domain
>
> Morten Rasmussen (12):
>   sched: Fix power to capacity renaming in comment
>   sched/fair: Consistent use of prev_cpu in wakeup path
>   sched/fair: Optimize find_idlest_cpu() when there is no choice
>   sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag
>   sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems
>   sched/fair: Let asymmetric cpu configurations balance at wake-up
>   sched/fair: Compute task/cpu utilization at wake-up more correctly
>   sched/fair: Consider spare capacity in find_idlest_group()
>   sched: Add per-cpu max capacity to sched_group_capacity
>   sched/fair: Avoid pulling tasks from non-overloaded higher capacity
>     groups
>   arm: Set SD_ASYM_CPUCAPACITY for big.LITTLE platforms
>   arm: Update arch_scale_cpu_capacity() to reflect change to define
>
>  arch/arm/include/asm/topology.h |   5 +
>  arch/arm/kernel/topology.c      |  25 ++++-
>  include/linux/sched.h           |   3 +-
>  kernel/sched/core.c             |  21 +++-
>  kernel/sched/fair.c             | 212 +++++++++++++++++++++++++++++++++++-----
>  kernel/sched/sched.h            |   5 +-
>  6 files changed, 241 insertions(+), 30 deletions(-)
>
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 03/13] sched/fair: Optimize find_idlest_cpu() when there is no choice
  2016-06-22 17:03 ` [PATCH v2 03/13] sched/fair: Optimize find_idlest_cpu() when there is no choice Morten Rasmussen
@ 2016-07-13 12:20   ` Vincent Guittot
  2016-08-10 18:03   ` [tip:sched/core] " tip-bot for Morten Rasmussen
  1 sibling, 0 replies; 64+ messages in thread
From: Vincent Guittot @ 2016-07-13 12:20 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> In the current find_idlest_group()/find_idlest_cpu() search we end up
> calling find_idlest_cpu() in a sched_group containing only one cpu in
> the end. Checking idle-states becomes pointless when there is no
> alternative, so bail out instead.
>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index eec8e29104f9..216db302e87d 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5123,6 +5123,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
>         int shallowest_idle_cpu = -1;
>         int i;
>
> +       /* Check if we have any choice */
> +       if (group->group_weight == 1)
> +               return cpumask_first(sched_group_cpus(group));
> +
>         /* Traverse only the allowed CPUs */
>         for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
>                 if (idle_cpu(i)) {

FWIW, you can add my Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-06-22 17:03 ` [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain Morten Rasmussen
  2016-07-11 10:18   ` Peter Zijlstra
@ 2016-07-13 12:40   ` Vincent Guittot
  2016-07-13 13:48     ` Dietmar Eggemann
  1 sibling, 1 reply; 64+ messages in thread
From: Vincent Guittot @ 2016-07-13 12:40 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>
> To be able to compare the capacity of the target cpu with the highest
> available cpu capacity, store the maximum per-cpu capacity in the root
> domain.

I thought that the capacities of all CPUs were built so that the highest
CPU capacity in the system is 1024 for a big.LITTLE system. So
this patch doesn't seem necessary for a big.LITTLE system

>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/core.c  | 9 +++++++++
>  kernel/sched/sched.h | 2 ++
>  2 files changed, 11 insertions(+)
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index fe39118ffdfb..5093765e9930 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6855,6 +6855,7 @@ static int build_sched_domains(const struct cpumask *cpu_map,
>         enum s_alloc alloc_state;
>         struct sched_domain *sd;
>         struct s_data d;
> +       struct rq *rq = NULL;
>         int i, ret = -ENOMEM;
>
>         alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
> @@ -6905,11 +6906,19 @@ static int build_sched_domains(const struct cpumask *cpu_map,
>         /* Attach the domains */
>         rcu_read_lock();
>         for_each_cpu(i, cpu_map) {
> +               rq = cpu_rq(i);
>                 sd = *per_cpu_ptr(d.sd, i);
>                 cpu_attach_domain(sd, d.rd, i);
> +
> +               if (rq->cpu_capacity_orig > rq->rd->max_cpu_capacity)
> +                       rq->rd->max_cpu_capacity = rq->cpu_capacity_orig;
>         }
>         rcu_read_unlock();
>
> +       if (rq)
> +               pr_info("span: %*pbl (max cpu_capacity = %lu)\n",
> +                       cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
> +
>         ret = 0;
>  error:
>         __free_domain_allocs(&d, alloc_state, cpu_map);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 72f1f3087b04..3e9904ef224f 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -564,6 +564,8 @@ struct root_domain {
>          */
>         cpumask_var_t rto_mask;
>         struct cpupri cpupri;
> +
> +       unsigned long max_cpu_capacity;
>  };
>
>  extern struct root_domain def_root_domain;
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 07/13] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-06-22 17:03 ` [PATCH v2 07/13] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
  2016-07-11 11:13   ` Peter Zijlstra
@ 2016-07-13 12:56   ` Vincent Guittot
  2016-07-13 16:14     ` Morten Rasmussen
  1 sibling, 1 reply; 64+ messages in thread
From: Vincent Guittot @ 2016-07-13 12:56 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> configurations SD_WAKE_AFFINE is only desirable if the waking task's
> compute demand (utilization) is suitable for all the cpu capacities
> available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup

instead of "suitable for all the cpu capacities available within the
SD_WAKE_AFFINE sched_domain", should it be "suitable for local cpu and
prev cpu" because you only check the capacity of these 2 CPUs.

Other than this comment for the commit message, the patch looks good to me
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

> balancing take over (find_idlest_{group, cpu}()).
>
> This patch makes affine wake-ups conditional on whether both the waker
> cpu and prev_cpu has sufficient capacity for the waking task, or not.
>
> It is assumed that the sched_group(s) containing the waker cpu and
> prev_cpu only contain cpu with the same capacity (homogeneous).
>
> Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> traversing them.
>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
>  1 file changed, 27 insertions(+), 1 deletion(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 216db302e87d..dba02c7b57b3 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
>  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
>  #endif
>
> +/*
> + * The margin used when comparing utilization with cpu capacity:
> + * util * 1024 < capacity * margin
> + */
> +unsigned int capacity_margin = 1280; /* ~20% */
> +
>  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
>  {
>         lw->weight += inc;
> @@ -5260,6 +5266,25 @@ static int cpu_util(int cpu)
>         return (util >= capacity) ? capacity : util;
>  }
>
> +static inline int task_util(struct task_struct *p)
> +{
> +       return p->se.avg.util_avg;
> +}
> +
> +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> +{
> +       long min_cap, max_cap;
> +
> +       min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
> +       max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
> +
> +       /* Minimum capacity is close to max, no need to abort wake_affine */
> +       if (max_cap - min_cap < max_cap >> 3)
> +               return 0;
> +
> +       return min_cap * 1024 < task_util(p) * capacity_margin;
> +}
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -5283,7 +5308,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>
>         if (sd_flag & SD_BALANCE_WAKE) {
>                 record_wakee(p);
> -               want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> +               want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> +                             && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>         }
>
>         rcu_read_lock();
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-13 12:40   ` Vincent Guittot
@ 2016-07-13 13:48     ` Dietmar Eggemann
  2016-07-13 16:37       ` Morten Rasmussen
  0 siblings, 1 reply; 64+ messages in thread
From: Dietmar Eggemann @ 2016-07-13 13:48 UTC (permalink / raw)
  To: Vincent Guittot, Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Yuyang Du, mgalbraith, linux-kernel

On 13/07/16 13:40, Vincent Guittot wrote:
> On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>
>> To be able to compare the capacity of the target cpu with the highest
>> available cpu capacity, store the maximum per-cpu capacity in the root
>> domain.
> 
> I thought that the capacities of all CPUs were built so that the highest
> CPU capacity in the system is 1024 for a big.LITTLE system. So
> this patch doesn't seem necessary for a big.LITTLE system

The asymmetric cpu capacity support currently only has an effect on arm
big.LITTLE (32bit) using the existing 'struct cpu_efficiency
table_efficiency[]' based approach.

So e.g. on TC2 we have 1441 for highest capacity.

[    0.041007] SMP: Total of 5 processors activated (240.00 BogoMIPS).
[    0.041024] CPU: All CPU(s) started in SVC mode.
[    0.041103] CPU0 attaching sched-domain:
[    0.041119]  domain 0: span 0-1 level MC
[    0.041141]   groups: 0 (cpu_capacity = 1441) 1 (cpu_capacity = 1441)
[    0.041179]   domain 1: span 0-4 level DIE
[    0.041199]    groups: 0-1 (cpu_capacity = 2882) 2-4 (cpu_capacity = 1818)
[    0.041245] CPU1 attaching sched-domain:
[    0.041260]  domain 0: span 0-1 level MC
[    0.041279]   groups: 1 (cpu_capacity = 1441) 0 (cpu_capacity = 1441)
[    0.041315]   domain 1: span 0-4 level DIE
[    0.041334]    groups: 0-1 (cpu_capacity = 2882) 2-4 (cpu_capacity = 1818)
[    0.041376] CPU2 attaching sched-domain:
[    0.041391]  domain 0: span 2-4 level MC
[    0.041409]   groups: 2 (cpu_capacity = 606) 3 (cpu_capacity = 606) 4 (cpu_capacity = 606)
[    0.041460]   domain 1: span 0-4 level DIE
[    0.041479]    groups: 2-4 (cpu_capacity = 1818) 0-1 (cpu_capacity = 2882)
..
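
In other words, with this approach the per-cpu capacities on TC2 come out as
1441 (A15) and 606 (A7), and the DIE level group capacities above are simply
the sums: 2 * 1441 = 2882 and 3 * 606 = 1818.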

[...]

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support
  2016-07-13 12:06 ` Vincent Guittot
@ 2016-07-13 15:54   ` Morten Rasmussen
  0 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-13 15:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On Wed, Jul 13, 2016 at 02:06:17PM +0200, Vincent Guittot wrote:
> Hi Morten,
> 
> On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > Hi,
> >
> > The scheduler is currently not doing much to help performance on systems with
> > asymmetric compute capacities (read ARM big.LITTLE). This series improves the
> > situation with a few tweaks mainly to the task wake-up path that considers
> > compute capacity at wake-up and not just whether a cpu is idle for these
> > systems. This gives us consistent, and potentially higher, throughput in
> > partially utilized scenarios. SMP behaviour and performance should be
> > unaffected.
> >
> > Test 0:
> >         for i in `seq 1 10`; \
> >                do sysbench --test=cpu --max-time=3 --num-threads=1 run; \
> >                done \
> >         | awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
> >                END {print "Average events: " sum/runs}'
> >
> > Target: ARM TC2 (2xA15+3xA7)
> >
> >         (Higher is better)
> > tip:    Average events: 146.9
> > patch:  Average events: 217.9
> >
> > Test 1:
> >         perf stat --null --repeat 10 -- \
> >         perf bench sched messaging -g 50 -l 5000
> >
> > Target: Intel IVB-EP (2*10*2)
> >
> > tip:    4.861970420 seconds time elapsed ( +-  1.39% )
> > patch:  4.886204224 seconds time elapsed ( +-  0.75% )
> >
> > Target: ARM TC2 A7-only (3xA7) (-l 1000)
> >
> > tip:    61.485682596 seconds time elapsed ( +-  0.07% )
> > patch:  62.667950130 seconds time elapsed ( +-  0.36% )
> >
> > More analysis:
> >
> > Statistics from mixed periodic task workload (rt-app) containing both
> > big and little task, single run on ARM TC2:
> >
> > tu   = Task utilization big/little
> > pcpu = Previous cpu big/little
> > tcpu = This (waker) cpu big/little
> > dl   = New cpu is little
> > db   = New cpu is big
> > sis  = New cpu chosen by select_idle_sibling()
> > figc = New cpu chosen by find_idlest_*()
> > ww   = wake_wide(task) count for figc wakeups
> > bw   = sd_flag & SD_BALANCE_WAKE (non-fork/exec wake)
> >        for figc wakeups
> >
> > case tu   pcpu tcpu   dl   db  sis figc   ww   bw
> > 1    l    l    l     122   68   28  162  161  161
> > 2    l    l    b      11    4    0   15   15   15
> > 3    l    b    l       0  252    8  244  244  244
> > 4    l    b    b      36 1928  711 1253 1016 1016
> > 5    b    l    l       5   19    0   24   22   24
> > 6    b    l    b       5    1    0    6    0    6
> > 7    b    b    l       0   31    0   31   31   31
> > 8    b    b    b       1  194  109   86   59   59
> > --------------------------------------------------
> >                      180 2497  856 1821
> 
> I'm not sure how to interpret all these statistics

Thanks for looking into the details. Let me provide a bit more context.

After our discussion around v1 I wanted to understand how the patches
work with different combinations of task utilization, prev_cpu, and
waking cpu. IIRC, the outcome of our discussion was that tasks with
utilization too high to fit little cpus should go on big cpus, tasks
small enough to fit anywhere can go anywhere. For the latter we don't
want to spend too much time placing them as they essentially don't care
so they can be placed using select_idle_sibling().

So, I created a workload with rt-app with a number of different periodic
tasks with different periods and busy times. I traced all wake-ups and
put them into eight categories depending on the wake-up scenario, i.e.
task utilization, prev_cpu, and waking cpu (tu, pcpu, and tcpu).

The next two columns (dl, db) show the number of wake-ups that ended up
on a little or big cpu. If we take case 1 as an example, we had 190
wake-ups in total where a little task last ran on a little cpu and was
woken up by a little cpu. 122 of the wake-ups ended up selecting a
little cpu again, while in 68 cases it went to a big cpu. That is fine
according to our scheduling policy above.

The sis and figc columns show the split between wake-ups being handled
by select_idle_sibling() versus find_idlest_*(). Coming back to case 1,
28 wake-ups were handled by the former and 162 by the latter. We can't
really say exactly how many of the select_idle_sibling() wake-ups ended
up on big or little, but based on the numbers it is clear that in some
cases find_idlest_*() chose a big cpu for a little task (68 > 28).

The last two columns, ww and bw, try to explain why we have so many
wake-ups handled by find_idlest_*() for cases (1-4, and 8) where we
could have used select_idle_sibling(). The bw number is the number of
find_idlest_*() wake-ups that were passed the SD_BALANCE_WAKE flag (i.e.
non-FORK and non-EXEC wake-ups). FORK and EXEC wake-ups always take the
find_idlest_*() route, so we should ignore those. For case 1 it turned
out that only one of the 162 figc wake-ups was a FORK/EXEC wake-up. So
something else must have caused those wake-ups to not go via
select_idle_sibling(). The ww column explains why as it shows how many
of the figc wake-ups where wake_wide() returned true and therefore
disabled want_affine. Because we have enabled SD_BALANCE_WAKE on the
sched_domains, !wake_affine wake-ups no longer end up using
select_idle_sibling() anyway, but end up using find_idlest_*().

Thinking more about it, should we force those tasks to use
select_idle_sibling() anyway? Something like the below could do it I
think:

@@ -5444,7 +5444,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
                        new_cpu = cpu;
        }
 
-       if (!sd) {
+       if (!sd || (!wake_cap(p, cpu, prev_cpu) && (sd_flag & SD_BALANCE_WAKE))) {
                if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
                        new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);

Ideally, cases 5-7 should be handled by find_idlest_*(), which seems to
hold true in the table above, and cases 1-4, and 8 should be handled by
select_idle_sibling(), which isn't always the case due to wake_wide().

Thoughts?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 07/13] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-07-13 12:56   ` Vincent Guittot
@ 2016-07-13 16:14     ` Morten Rasmussen
  2016-07-14 13:45       ` Vincent Guittot
  0 siblings, 1 reply; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-13 16:14 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On Wed, Jul 13, 2016 at 02:56:41PM +0200, Vincent Guittot wrote:
> On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
> > compute demand (utilization) is suitable for all the cpu capacities
> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> 
> instead of "suitable for all the cpu capacities available within the
> SD_WAKE_AFFINE sched_domain", should it be "suitable for local cpu and
> prev cpu" because you only check the capacity of these 2 CPUs.

Good point. I currently make the implicit assumption that capacity of local cpu
and prev cpu represent the capacity for all cpus in their SD_WAKE_AFFINE
domains. It breaks if you should choose to have SD_WAKE_AFFINE on a
domain that spans both little and big cpus, as if local/prev cpu happens
to be big we assume that they are all big and let select_idle_sibling()
handle the task placement even for big tasks if local/prev cpu are both
big.

I don't see why anybody would want that kind of setup, but I think the
assumption should still be written down somewhere, either here or in a
comment in wake_cap() or both.
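
For concreteness, a worked example of the wake_cap() check, assuming the TC2
capacities quoted elsewhere in this thread (capacity_orig roughly 1441 for an
A15 and 606 for an A7) and that rd->max_cpu_capacity is 1441:

    min_cap = 606, max_cap = 1441
    max_cap - min_cap = 835  >  (max_cap >> 3) = 180
        -> capacities are asymmetric, so go on to the margin check
    min_cap * 1024 = 620544  <  task_util(p) * 1280
        -> true, i.e. skip wake_affine, once task_util(p) exceeds ~485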

The next paragraph in the commit message mentions that we actually only
check waker cpu and prev_cpu capacity. Would it be more clear if we
extend that to something like:

    This patch makes affine wake-ups conditional on whether both the waker
    cpu and prev_cpu have sufficient capacity for the waking task, or
    not, assuming that the cpu capacities within an SD_WAKE_AFFINE
    domain are homogeneous.

Thoughts?

> 
> Other than this comment for the commit message, the patch looks good to me
> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>

Thanks,
Morten


> 
> > balancing take over (find_idlest_{group, cpu}()).
> >
> > This patch makes affine wake-ups conditional on whether both the waker
> > cpu and prev_cpu has sufficient capacity for the waking task, or not.
> >
> > It is assumed that the sched_group(s) containing the waker cpu and
> > prev_cpu only contain cpu with the same capacity (homogeneous).
> >
> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
> > traversing them.
> >
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
> >  1 file changed, 27 insertions(+), 1 deletion(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 216db302e87d..dba02c7b57b3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
> >  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
> >  #endif
> >
> > +/*
> > + * The margin used when comparing utilization with cpu capacity:
> > + * util * 1024 < capacity * margin
> > + */
> > +unsigned int capacity_margin = 1280; /* ~20% */
> > +
> >  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
> >  {
> >         lw->weight += inc;
> > @@ -5260,6 +5266,25 @@ static int cpu_util(int cpu)
> >         return (util >= capacity) ? capacity : util;
> >  }
> >
> > +static inline int task_util(struct task_struct *p)
> > +{
> > +       return p->se.avg.util_avg;
> > +}
> > +
> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > +{
> > +       long min_cap, max_cap;
> > +
> > +       min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
> > +       max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
> > +
> > +       /* Minimum capacity is close to max, no need to abort wake_affine */
> > +       if (max_cap - min_cap < max_cap >> 3)
> > +               return 0;
> > +
> > +       return min_cap * 1024 < task_util(p) * capacity_margin;
> > +}
> > +
> >  /*
> >   * select_task_rq_fair: Select target runqueue for the waking task in domains
> >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> > @@ -5283,7 +5308,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >
> >         if (sd_flag & SD_BALANCE_WAKE) {
> >                 record_wakee(p);
> > -               want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> > +               want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> > +                             && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
> >         }
> >
> >         rcu_read_lock();
> > --
> > 1.9.1
> >

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-13 13:48     ` Dietmar Eggemann
@ 2016-07-13 16:37       ` Morten Rasmussen
  2016-07-14 13:25         ` Vincent Guittot
  0 siblings, 1 reply; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-13 16:37 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Peter Zijlstra, mingo, Yuyang Du, mgalbraith,
	linux-kernel

On Wed, Jul 13, 2016 at 02:48:24PM +0100, Dietmar Eggemann wrote:
> On 13/07/16 13:40, Vincent Guittot wrote:
> > On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >>
> >> To be able to compare the capacity of the target cpu with the highest
> >> available cpu capacity, store the maximum per-cpu capacity in the root
> >> domain.
> > 
> > I thought that the capacities of all CPUs were built so that the highest
> > CPU capacity in the system is 1024 for a big.LITTLE system. So
> > this patch doesn't seem necessary for a big.LITTLE system
> 
> The asymmetric cpu capacity support currently only has an effect on arm
> big.LITTLE (32bit) using the existing 'struct cpu_efficiency
> table_efficiency[]' based approach.

True for this patch set, but longer term and if you use the preview
branch mentioned in the cover letter Vincent is right. The idea is that
the highest capacity anywhere should be 1024.

If we fix the arch/arm/kernel/topology.c code at the same time we could
kill this patch.

However, even further down the road we might need it (or something
similar) anyway due to the thermal framework. At some point we would
like to adjust the max capacity based on any OPP constraints imposed by the
thermal framework. In extreme cases big cpus might be capped so hard
that they effectively have smaller capacity than little. I don't think
it makes sense to re-normalize everything to the highest available
capacity to ensure that there is always a cpu with capacity = 1024 in
the system, instead we must be able to cope with scenarios where max
capacity is smaller than 1024.

Also, for SMT max capacity is less than 1024 already. No?
But we may be able to cater for this in wake_cap() somehow. I can have a
look if Vincent doesn't like this patch.
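
For reference, the default arch_scale_cpu_capacity() roughly does:

	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
		return sd->smt_gain / sd->span_weight;

	return SCHED_CAPACITY_SCALE;

so with the default smt_gain of 1178 each thread of a two-way SMT core ends
up at about 589, well below 1024.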

Cheers,
Morten

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-13 16:37       ` Morten Rasmussen
@ 2016-07-14 13:25         ` Vincent Guittot
  2016-07-14 15:15           ` Morten Rasmussen
  0 siblings, 1 reply; 64+ messages in thread
From: Vincent Guittot @ 2016-07-14 13:25 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Dietmar Eggemann, Peter Zijlstra, mingo, Yuyang Du, mgalbraith,
	linux-kernel

On 13 July 2016 at 18:37, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, Jul 13, 2016 at 02:48:24PM +0100, Dietmar Eggemann wrote:
>> On 13/07/16 13:40, Vincent Guittot wrote:
>> > On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> >> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> >>
>> >> To be able to compare the capacity of the target cpu with the highest
>> >> available cpu capacity, store the maximum per-cpu capacity in the root
>> >> domain.
>> >
>> > I thought that the capacities of all CPUs were built so that the highest
>> > CPU capacity in the system is 1024 for a big.LITTLE system. So
>> > this patch doesn't seem necessary for a big.LITTLE system
>>
>> The asymmetric cpu capacity support currently only has an effect on arm
>> big.LITTLE (32bit) using the existing 'struct cpu_efficiency
>> table_efficiency[]' based approach.
>
> True for this patch set, but longer term and if you use the preview
> branch mentioned in the cover letter Vincent is right. The idea is that
> the highest capacity anywhere should be 1024.
>
> If we fix the arch/arm/kernel/topology.c code at the same time we could
> kill this patch.
>
> However, even further down the road we might need it (or something
> similar) anyway due to the thermal framework. At some point we would
> like to adjust the max capacity based on any OPP constraints imposed by the
> thermal framework. In extreme cases big cpus might be capped so hard
> that they effectively have smaller capacity than little. I don't think
> it makes sense to re-normalize everything to the highest available
> capacity to ensure that there is always a cpu with capacity = 1024 in
> the system, instead we must be able to cope with scenarios where max
> capacity is smaller than 1024.

Yes we will have to find a solution for thermal mitigation but I
don't know if a rd->max_cpu_capacity would be the best solution
>
> Also, for SMT max capacity is less than 1024 already. No?

Yes, it is. I haven't looked in detail but I think that we could use
a capacity of 1024 for SMT with changes that have been done on how to
evaluate if a sched_group is overloaded or not.

> But we may be able to cater for this in wake_cap() somehow. I can have a
> look if Vincent doesn't like this patch.

IMO, rd->max_cpu_capacity field doesn't seem to be required for now .

Vincent

>
> Cheers,
> Morten

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 07/13] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-07-13 16:14     ` Morten Rasmussen
@ 2016-07-14 13:45       ` Vincent Guittot
  2016-07-15  8:37         ` Morten Rasmussen
  0 siblings, 1 reply; 64+ messages in thread
From: Vincent Guittot @ 2016-07-14 13:45 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On 13 July 2016 at 18:14, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, Jul 13, 2016 at 02:56:41PM +0200, Vincent Guittot wrote:
>> On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
>> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
>> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
>> > compute demand (utilization) is suitable for all the cpu capacities
>> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
>>
>> instead of "suitable for all the cpu capacities available within the
>> SD_WAKE_AFFINE sched_domain", should it be "suitable for local cpu and
>> prev cpu" because you only check the capacity of these 2 CPUs.
>
> Good point. I currently make the implicit assumption that capacity of local cpu
> and prev cpu represent the capacity for all cpus in their SD_WAKE_AFFINE
> domains. It breaks if you should choose to have SD_WAKE_AFFINE on a
> domain that spans both little and big cpus, as if local/prev cpu happens
> to be big we assume that they are all big and let select_idle_sibling()
> handle the task placement even for big tasks if local/prev cpu are both
> big.

Isn't the sd_llc used in select_idle_sibling and not the
SD_WAKE_AFFINE domain? So if CPUs in the sd_llc are homogeneous, we are
safe.

>
> I don't see why anybody would want that kind of setup, but I think the
> assumption should still be written down somewhere, either here or in a
> comment in wake_cap() or both.
>
> The next paragraph in the commit message mentions that we actually only
> check waker cpu and prev_cpu capacity. Would it be more clear if we
> extend that to something like:
>
>     This patch makes affine wake-ups conditional on whether both the waker
>     cpu and prev_cpu have sufficient capacity for the waking task, or
>     not, assuming that the cpu capacities within an SD_WAKE_AFFINE
>     domain are homogeneous.
>
> Thoughts?
>
>>
>> Other than this comment for the commit message, the patch looks good to me
>> Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> Thanks,
> Morten
>
>
>>
>> > balancing take over (find_idlest_{group, cpu}()).
>> >
>> > This patch makes affine wake-ups conditional on whether both the waker
>> > cpu and prev_cpu has sufficient capacity for the waking task, or not.
>> >
>> > It is assumed that the sched_group(s) containing the waker cpu and
>> > prev_cpu only contain cpu with the same capacity (homogeneous).
>> >
>> > Ideally, we shouldn't set 'want_affine' in the first place, but we don't
>> > know if SD_BALANCE_WAKE is enabled on the sched_domain(s) until we start
>> > traversing them.
>> >
>> > cc: Ingo Molnar <mingo@redhat.com>
>> > cc: Peter Zijlstra <peterz@infradead.org>
>> >
>> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> > ---
>> >  kernel/sched/fair.c | 28 +++++++++++++++++++++++++++-
>> >  1 file changed, 27 insertions(+), 1 deletion(-)
>> >
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 216db302e87d..dba02c7b57b3 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -114,6 +114,12 @@ unsigned int __read_mostly sysctl_sched_shares_window = 10000000UL;
>> >  unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
>> >  #endif
>> >
>> > +/*
>> > + * The margin used when comparing utilization with cpu capacity:
>> > + * util * 1024 < capacity * margin
>> > + */
>> > +unsigned int capacity_margin = 1280; /* ~20% */
>> > +
>> >  static inline void update_load_add(struct load_weight *lw, unsigned long inc)
>> >  {
>> >         lw->weight += inc;
>> > @@ -5260,6 +5266,25 @@ static int cpu_util(int cpu)
>> >         return (util >= capacity) ? capacity : util;
>> >  }
>> >
>> > +static inline int task_util(struct task_struct *p)
>> > +{
>> > +       return p->se.avg.util_avg;
>> > +}
>> > +
>> > +static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>> > +{
>> > +       long min_cap, max_cap;
>> > +
>> > +       min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
>> > +       max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
>> > +
>> > +       /* Minimum capacity is close to max, no need to abort wake_affine */
>> > +       if (max_cap - min_cap < max_cap >> 3)
>> > +               return 0;
>> > +
>> > +       return min_cap * 1024 < task_util(p) * capacity_margin;
>> > +}
>> > +
>> >  /*
>> >   * select_task_rq_fair: Select target runqueue for the waking task in domains
>> >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>> > @@ -5283,7 +5308,8 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>> >
>> >         if (sd_flag & SD_BALANCE_WAKE) {
>> >                 record_wakee(p);
>> > -               want_affine = !wake_wide(p) && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>> > +               want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
>> > +                             && cpumask_test_cpu(cpu, tsk_cpus_allowed(p));
>> >         }
>> >
>> >         rcu_read_lock();
>> > --
>> > 1.9.1
>> >

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-14 13:25         ` Vincent Guittot
@ 2016-07-14 15:15           ` Morten Rasmussen
  2016-07-15 11:46             ` Morten Rasmussen
  0 siblings, 1 reply; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-14 15:15 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Peter Zijlstra, mingo, Yuyang Du, mgalbraith,
	linux-kernel

On Thu, Jul 14, 2016 at 03:25:36PM +0200, Vincent Guittot wrote:
> On 13 July 2016 at 18:37, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Wed, Jul 13, 2016 at 02:48:24PM +0100, Dietmar Eggemann wrote:
> >> On 13/07/16 13:40, Vincent Guittot wrote:
> >> > On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> >> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >> >>
> >> >> To be able to compare the capacity of the target cpu with the highest
> >> >> available cpu capacity, store the maximum per-cpu capacity in the root
> >> >> domain.
> >> >
> >> > I thought that the capacities of all CPUs were built so that the highest
> >> > CPU capacity in the system is 1024 for a big.LITTLE system. So
> >> > this patch doesn't seem necessary for a big.LITTLE system
> >>
> >> The asymmetric cpu capacity support currently only has an effect on arm
> >> big.LITTLE (32bit) using the existing 'struct cpu_efficiency
> >> table_efficiency[]' based approach.
> >
> > True for this patch set, but longer term and if you use the preview
> > branch mentioned in the cover letter Vincent is right. The idea is that
> > the highest capacity anywhere should be 1024.
> >
> > If we fix the arch/arm/kernel/topology.c code at the same time we could
> > kill this patch.
> >
> > However, even further down the road we might need it (or something
> > similar) anyway due to the thermal framework. At some point we would
> > like to adjust the max capacity based on any OPP constraints imposed by the
> > thermal framework. In extreme cases big cpus might be capped so hard
> > that they effectively have smaller capacity than little. I don't think
> > it makes sense to re-normalize everything to the highest available
> > capacity to ensure that there is always a cpu with capacity = 1024 in
> > the system, instead we must be able to cope with scenarios where max
> > capacity is smaller than 1024.
> 
> Yes we will have to find a solution for thermal mitigation but I
> don't know if a rd->max_cpu_capacity would be the best solution

Agreed, I'm pretty sure that the current form isn't sufficient.

> >
> > Also, for SMT max capacity is less than 1024 already. No?
> 
> Yes, it is. I haven't looked in detail but I think that we could use
> a capacity of 1024 for SMT with changes that have been done on how to
> evaluate if a sched_group is overloaded or not.

Changing SMT is a bit more invasive than I had hoped for this patch
set. I will see if we can make it work with the current SMT capacities.

> 
> > But we may be able to cater for this in wake_cap() somehow. I can have a
> > look if Vincent doesn't like this patch.
> 
> IMO, rd->max_cpu_capacity field doesn't seem to be required for now .

No problem. I will try to get rid of it. I will drop the "arm:" patches
as well as they would have to be extended to guarantee a max capacity of
1024 and we most likely will have to change it again when Juri's DT
solution hopefully gets merged.

Morten

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 11/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
  2016-06-30  7:49     ` Morten Rasmussen
@ 2016-07-14 16:39       ` Sai Gurrappadi
  2016-07-15  8:39         ` Morten Rasmussen
  0 siblings, 1 reply; 64+ messages in thread
From: Sai Gurrappadi @ 2016-07-14 16:39 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	mgalbraith, linux-kernel, Peter Boonstoppel

On 06/30/2016 12:49 AM, Morten Rasmussen wrote:
> On Thu, Jun 23, 2016 at 02:20:48PM -0700, Sai Gurrappadi wrote:
>> Hi Morten,
>>
>> On 06/22/2016 10:03 AM, Morten Rasmussen wrote:
>>
>> [...]
>>
>>>  
>>> +/*
>>> + * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
>>> + * per-cpu capacity than sched_group ref.
>>> + */
>>> +static inline bool
>>> +group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
>>> +{
>>> +	return sg->sgc->max_capacity * capacity_margin <
>>> +						ref->sgc->max_capacity * 1024;
>>> +}
>>> +
>>>  static inline enum
>>>  group_type group_classify(struct sched_group *group,
>>>  			  struct sg_lb_stats *sgs)
>>> @@ -6892,6 +6903,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>  	if (sgs->avg_load <= busiest->avg_load)
>>>  		return false;
>>>  
>>> +	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
>>> +		goto asym_packing;
>>> +
>>> +	/* Candidate sg has no more than one task per cpu and has
>>> +	 * higher per-cpu capacity. Migrating tasks to less capable
>>> +	 * cpus may harm throughput. Maximize throughput,
>>> +	 * power/energy consequences are not considered.
>>> +	 */
>>> +	if (sgs->sum_nr_running <= sgs->group_weight &&
>>> +	    group_smaller_cpu_capacity(sds->local, sg))
>>> +		return false;
>>> +
>>> +asym_packing:
>>
>> What about the case where IRQ/RT work reduces the capacity of some of
>> these bigger CPUs? sgc->max_capacity might not necessarily capture
>> that case.
> 
> Right, we could possibly improve this by using min_capacity instead, but
> we could end up allowing tasks to be pulled to lower capacity cpus just
> because one big cpu has reduced capacity due to RT/IRQ pressure and
> therefore has lowered the groups min_capacity.
> 
> Ideally we should check all the capacities, but that complicates things
> a lot.
> 
> Would you prefer min_capacity instead, or attempts to consider all the
> cpu capacities available in both groups?
> 

min_capacity as a start works I think given that we are only trying to make existing LB better, not necessarily optimizing for every case. Might have to revisit this anyways for thermals etc.

Thanks,
-Sai

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 07/13] sched/fair: Let asymmetric cpu configurations balance at wake-up
  2016-07-14 13:45       ` Vincent Guittot
@ 2016-07-15  8:37         ` Morten Rasmussen
  0 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-15  8:37 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du, mgalbraith,
	linux-kernel

On Thu, Jul 14, 2016 at 03:45:17PM +0200, Vincent Guittot wrote:
> On 13 July 2016 at 18:14, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Wed, Jul 13, 2016 at 02:56:41PM +0200, Vincent Guittot wrote:
> >> On 22 June 2016 at 19:03, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> > Currently, SD_WAKE_AFFINE always takes priority over wakeup balancing if
> >> > SD_BALANCE_WAKE is set on the sched_domains. For asymmetric
> >> > configurations SD_WAKE_AFFINE is only desirable if the waking task's
> >> > compute demand (utilization) is suitable for all the cpu capacities
> >> > available within the SD_WAKE_AFFINE sched_domain. If not, let wakeup
> >>
> >> instead of "suitable for all the cpu capacities available within the
> >> SD_WAKE_AFFINE sched_domain", should it be "suitable for local cpu and
> >> prev cpu" because you only check the capacity of these 2 CPUs.
> >
> > Good point. I currently make the implicit assumption that capacity of local cpu
> > and prev cpu represent the capacity for all cpus in their SD_WAKE_AFFINE
> > domains. It breaks if you should choose to have SD_WAKE_AFFINE on a
> > domain that spans both little and big cpus, as if local/prev cpu happens
> > to be big we assume that they are all big and let select_idle_sibling()
> > handle the task placement even for big tasks if local/prev cpu are both
> > big.
> 
> Isn't the sd_llc used in select_idle_sibling and not the
> SD_WAKE_AFFINE domain? So if CPUs in the sd_llc are homogeneous, we are
> safe.

Yes, I confused myself (again) with SD_WAKE_AFFINE and sd_llc in the
above. It should have been sd_llc instead of SD_WAKE_AFFINE. I will fix
the commit message to be correct.

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 11/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
  2016-07-14 16:39       ` Sai Gurrappadi
@ 2016-07-15  8:39         ` Morten Rasmussen
  0 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-15  8:39 UTC (permalink / raw)
  To: Sai Gurrappadi
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	mgalbraith, linux-kernel, Peter Boonstoppel

On Thu, Jul 14, 2016 at 09:39:23AM -0700, Sai Gurrappadi wrote:
> On 06/30/2016 12:49 AM, Morten Rasmussen wrote:
> > On Thu, Jun 23, 2016 at 02:20:48PM -0700, Sai Gurrappadi wrote:
> >> Hi Morten,
> >>
> >> On 06/22/2016 10:03 AM, Morten Rasmussen wrote:
> >>
> >> [...]
> >>
> >>>  
> >>> +/*
> >>> + * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
> >>> + * per-cpu capacity than sched_group ref.
> >>> + */
> >>> +static inline bool
> >>> +group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
> >>> +{
> >>> +	return sg->sgc->max_capacity * capacity_margin <
> >>> +						ref->sgc->max_capacity * 1024;
> >>> +}
> >>> +
> >>>  static inline enum
> >>>  group_type group_classify(struct sched_group *group,
> >>>  			  struct sg_lb_stats *sgs)
> >>> @@ -6892,6 +6903,19 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >>>  	if (sgs->avg_load <= busiest->avg_load)
> >>>  		return false;
> >>>  
> >>> +	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> >>> +		goto asym_packing;
> >>> +
> >>> +	/* Candidate sg has no more than one task per cpu and has
> >>> +	 * higher per-cpu capacity. Migrating tasks to less capable
> >>> +	 * cpus may harm throughput. Maximize throughput,
> >>> +	 * power/energy consequences are not considered.
> >>> +	 */
> >>> +	if (sgs->sum_nr_running <= sgs->group_weight &&
> >>> +	    group_smaller_cpu_capacity(sds->local, sg))
> >>> +		return false;
> >>> +
> >>> +asym_packing:
> >>
> >> What about the case where IRQ/RT work reduces the capacity of some of
> >> these bigger CPUs? sgc->max_capacity might not necessarily capture
> >> that case.
> > 
> > Right, we could possibly improve this by using min_capacity instead, but
> > we could end up allowing tasks to be pulled to lower capacity cpus just
> > because one big cpu has reduced capacity due to RT/IRQ pressure and
> > therefore has lowered the groups min_capacity.
> > 
> > Ideally we should check all the capacities, but that complicates things
> > a lot.
> > 
> > Would you prefer min_capacity instead, or attempts to consider all the
> > cpu capacities available in both groups?
> > 
> 
> min_capacity as a start works I think given that we are only trying to
> make existing LB better, not necessarily optimizing for every case.
> Might have to revisit this anyways for thermals etc.

Agreed, I will make it min_capacity instead of max_capacity in v3.
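
I.e., assuming sched_group_capacity grows a min_capacity field maintained
alongside max_capacity, the helper would become something like:

	return sg->sgc->min_capacity * capacity_margin <
				ref->sgc->min_capacity * 1024;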

Thanks,
Morten

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-14 15:15           ` Morten Rasmussen
@ 2016-07-15 11:46             ` Morten Rasmussen
  2016-07-15 13:39               ` Vincent Guittot
  0 siblings, 1 reply; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-15 11:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Peter Zijlstra, mingo, Yuyang Du, mgalbraith,
	linux-kernel

On Thu, Jul 14, 2016 at 04:15:20PM +0100, Morten Rasmussen wrote:
> On Thu, Jul 14, 2016 at 03:25:36PM +0200, Vincent Guittot wrote:
> > On 13 July 2016 at 18:37, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > > Also, for SMT max capacity is less than 1024 already. No?
> > 
> > Yes, it is. I haven't looked in detail but I think that we could use
> > a capacity of 1024 for SMT with changes that have been done on how to
> > evaluate if a sched_group is overloaded or not.
> 
> Changing SMT is a bit more invasive than I had hoped for this patch
> set. I will see if we can make it work with the current SMT capacities.
> 
> > 
> > > But we may be able to cater for this in wake_cap() somehow. I can have a
> > > look if Vincent doesn't like this patch.
> > 
> > IMO, rd->max_cpu_capacity field doesn't seem to be required for now .
> 
> No problem. I will try to get rid of it. I will drop the "arm:" patches
> as well as they would have to be extended to guarantee a max capacity of
> 1024 and we most likely will have to change it again when Juri's DT
> solution hopefully gets merged.

I have had a closer look at wake_cap() again. Getting rid of
rd->max_cpu_capacity isn't as easy as I thought.

The fundamental problem is that all we have in wake_cap() is the waking
cpu and previous cpu ids which isn't sufficient to determine whether we
have an asymmetric capacity system or not. A capacity <1024 can either be a
little cpu or an SMT thread. We need a third piece of information, which
can be either the highest cpu capacity available in the system, or a
flag/variable/function telling us whether we are on an SMT system.

I see the following solutions to the problem:

1. Have a system-wide max_cpu_capacity (as proposed in this patch) which
can let us detect SMT systems as max_cpu_capacity < 1024 implies SMT.

2. Change SMT thread capacity to 1024 so we implicitly know that max
capacity is always 1024. As said above, this is a very invasive change
as it would mean that we no longer distinguish between SMP and SMT.
smt_gain and SD_SHARE_CPUCAPACITY would no longer have any effect and
can be ripped out. I would prefer not to create a dependency on such a
massive change. We can do the experiment afterwards if needed.

3. Detect SMT in wake_cap(). This requires access to the sched_domain
hierarchy as the SD_SHARE_CPUCAPACITY is the only way to detect SMT,
AFAIK, apart from looping through the capacities of all cpus in the
system basically computing max_cpu_capacity each time.
wake_cap() is currently called before rcu_read_lock() that gives us
access to the sched_domain hierarchy. I would have to postpone the
wake_cap() call to being inside the lock and introduce another lookup in
the sched_domain hierarchy which would be executed on every wake-up on
all systems. IMHO, that is a bit ugly.

I don't really like any of the solutions, but of those three I would go
for the current solution (1) as it is very minimal both in the amount of
code touched/affected and overhead. We can kill it later if we have a
better one, no problem for me.
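
For illustration only, a rough sketch of what solution 3 could look like,
assuming wake_cap() were moved inside the rcu_read_lock() section (this is a
hypothetical helper, not proposed code):

	/* Hypothetical: tell SMT threads apart from genuinely small cpus.
	 * Must be called under rcu_read_lock().
	 */
	static bool cpu_is_smt(int cpu)
	{
		struct sched_domain *sd;

		for_each_domain(cpu, sd) {
			if (sd->flags & SD_SHARE_CPUCAPACITY)
				return true;
		}

		return false;
	}

with wake_cap() returning 0 early when cpu_is_smt() is true, so that a
capacity below 1024 on an SMT thread is not mistaken for asymmetry.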

Do you see any alternatives?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-15 11:46             ` Morten Rasmussen
@ 2016-07-15 13:39               ` Vincent Guittot
  2016-07-15 16:02                 ` Morten Rasmussen
  0 siblings, 1 reply; 64+ messages in thread
From: Vincent Guittot @ 2016-07-15 13:39 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Dietmar Eggemann, Peter Zijlstra, mingo, Yuyang Du, mgalbraith,
	linux-kernel

On 15 July 2016 at 13:46, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Thu, Jul 14, 2016 at 04:15:20PM +0100, Morten Rasmussen wrote:
>> On Thu, Jul 14, 2016 at 03:25:36PM +0200, Vincent Guittot wrote:
>> > On 13 July 2016 at 18:37, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > > Also, for SMT max capacity is less than 1024 already. No?
>> >
>> > Yes, it is. I haven't looked in detail but I think that we could use
>> > a capacity of 1024 for SMT with changes that have been done on how to
>> > evaluate if a sched_group is overloaded or not.
>>
>> Changing SMT is a bit more invasive than I had hoped for this patch
>> set. I will see if we can make it work with the current SMT capacities.
>>
>> >
>> > > But we may be able to cater for this in wake_cap() somehow. I can have a
>> > > look if Vincent doesn't like this patch.
>> >
>> > IMO, rd->max_cpu_capacity field doesn't seem to be required for now .
>>
>> No problem. I will try to get rid of it. I will drop the "arm:" patches
>> as well as they would have to be extended to guarantee a max capacity of
>> 1024 and we most likely will have to change it again when Juri's DT
>> solution hopefully gets merged.
>
> I have had a closer look at wake_cap() again. Getting rid of
> rd->max_cpu_capacity isn't as easy as I thought.
>
> The fundamental problem is that all we have in wake_cap() is the waking
> cpu and previous cpu ids which isn't sufficient to determine whether we
> have an asymmetric capacity system or not. A capacity <1024 can either be a
> little cpu or an SMT thread. We need a third piece of information, which
> can be either the highest cpu capacity available in the system, or a
> flag/variable/function telling us whether we are on an SMT system.
>
> I see the following solutions to the problem:
>
> 1. Have a system-wide max_cpu_capacity (as proposed in this patch) which
> can let us detect SMT systems as max_cpu_capacity < 1024 implies SMT.
>
> 2. Change SMT thread capacity to 1024 so we implicitly know that max
> capacity is always 1024. As said above, this is a very invasive change
> as it would mean that we no longer distinguish between SMP and SMT.
> smt_gain and SD_SHARE_CPUCAPACITY would no longer have any effect and
> can be ripped out. I would prefer not to create a dependency on such a
> massive change. We can do the experiment afterwards if needed.
>
> 3. Detect SMT in wake_cap(). This requires access to the sched_domain
> hierarchy as the SD_SHARE_CPUCAPACITY is the only way to detect SMT,
> AFAIK, apart from looping through the capacities of all cpus in the
> system basically computing max_cpu_capacity each time.
> wake_cap() is currently called before rcu_read_lock() that gives us
> access to the sched_domain hierarchy. I would have to postpone the
> wake_cap() call to being inside the lock and introduce another lookup in
> the sched_domain hierarchy which would be executed on every wake-up on
> all systems. IMHO, that is a bit ugly.
>
> I don't really like any of the solutions, but of those three I would go
> for the current solution (1) as it is very minimal both in the amount of
> code touched/affected and overhead. We can kill it later if we have a
> better one, no problem for me.

I had solution 2 in mind. I haven't looked deeply at the impact but I
thought that the main remaining blocking point is in
update_numa_stats where it uses the fact that the capacity is less than
1024 at SMT level to compute task_capacity and sets has_free_capacity
only if we have less than 1 task per core.
smt_gain would not be used anymore.

>
> Do you see any alternatives?

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-15 13:39               ` Vincent Guittot
@ 2016-07-15 16:02                 ` Morten Rasmussen
  2016-07-18 12:48                   ` Vincent Guittot
  0 siblings, 1 reply; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-15 16:02 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Peter Zijlstra, mingo, Yuyang Du, mgalbraith,
	linux-kernel

On Fri, Jul 15, 2016 at 03:39:05PM +0200, Vincent Guittot wrote:
> On 15 July 2016 at 13:46, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Thu, Jul 14, 2016 at 04:15:20PM +0100, Morten Rasmussen wrote:
> >> On Thu, Jul 14, 2016 at 03:25:36PM +0200, Vincent Guittot wrote:
> >> > On 13 July 2016 at 18:37, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> > > Also, for SMT max capacity is less than 1024 already. No?
> >> >
> >> > Yes, it is. I haven't looked in detail but I think that we could use
> >> > a capacity of 1024 for SMT with changes that have been done on how to
> >> > evaluate if a sched_group is overloaded or not.
> >>
> >> Changing SMT is a bit more invasive than I had hoped for this patch
> >> set. I will see if we can make it work with the current SMT capacities.
> >>
> >> >
> >> > > But we may be able to cater for this in wake_cap() somehow. I can have a
> >> > > look if Vincent doesn't like this patch.
> >> >
> >> > IMO, rd->max_cpu_capacity field doesn't seem to be required for now .
> >>
> >> No problem. I will try to get rid of it. I will drop the "arm:" patches
> >> as well as they would have to be extended to guarantee a max capacity of
> >> 1024 and we most likely will have to change it again when Juri's DT
> >> solution hopefully gets merged.
> >
> > I have had a closer look at wake_cap() again. Getting rid of
> > rd->max_cpu_capacity isn't as easy as I thought.
> >
> > The fundamental problem is that all we have in wake_cap() is the waking
> > cpu and previous cpu ids which isn't sufficient to determine whether we
> > have an asymmetric capacity system or not. A capacity <1024 can either a
> > little cpu or an SMT thread. We need a third piece of information, which
> > can be either the highest cpu capacity available in the cpu, or a
> > flag/variable/function telling us whether we are on an SMT system.
> >
> > I see the following solutions to the problem:
> >
> > 1. Have a system-wide max_cpu_capacity (as proposed in this patch) which
> > can let us detect SMT systems as max_cpu_capacity < 1024 implies SMT.
> >
> > 2. Change SMT thread capacity to 1024 so we implicitly know that max
> > capacity is always 1024. As said above, this is a very invasive change
> > as it would mean that we no longer distinguish between SMP and SMT.
> > smt_gain and SD_SHARE_CPUCAPACITY would no longer have any effect and
> > can be ripped out. I would prefer not create a dependency on such a
> > massive change. We can do the experiment afterwards if needed.
> >
> > 3. Detect SMT in wake_cap(). This requires access to the sched_domain
> > hierarchy as the SD_SHARE_CPUCAPACITY is the only way to detect SMT,
> > AFAIK, apart from looping through the capacities of all cpus in the
> > system basically computing max_cpu_capacity each time.
> > wake_cap() is currently called before rcu_read_lock() that gives us
> > access to the sched_domain hierarchy. I would have to postpone the
> > wake_cap() call to being inside the lock and introduce another lookup in
> > the sched_domain hierarchy which would be executed on every wake-up on
> > all systems. IMHO, that is a bit ugly.
> >
> > I don't really like any of the solutions, but of those three I would go
> > for the current solution (1) as it is very minimal both in the amount of
> > code touched/affected and overhead. We can kill it later if we have a
> > better one, no problem for me.
> 
> I had solution 2 in mind. I haven't looked deeply the impact but I
> thought that the main remaining blocking  point is in
> update_numa_stats where it use the fact that the capacity is less than
> 1024 vat SMT level to compute task_capacity and  set has_free_capacity
> only if we have less than 1 task per core.
> smt_gain would not be used anymore

Aren't the group capacities then also smaller, and don't they hence
influence load-balancing decisions?

I was hoping that we could decouple a full audit of the load-balance
code from this relatively simple patch set by staying with solution 1
for now. I worry that changing the SMT capacity can turn into a major
task. Just proving that there are no regressions, even though we expect
there shouldn't be any, is a lot of work.

I'm happy to look at the SMT stuff; it has been on my list of
outstanding issues for a very long time, but I would prefer to break it
into multiple independent patch sets to keep them focused. I haven't
had much luck with massive complicated patch sets so far ;-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-15 16:02                 ` Morten Rasmussen
@ 2016-07-18 12:48                   ` Vincent Guittot
  2016-07-18 15:11                     ` Morten Rasmussen
  0 siblings, 1 reply; 64+ messages in thread
From: Vincent Guittot @ 2016-07-18 12:48 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Dietmar Eggemann, Peter Zijlstra, mingo, Yuyang Du, mgalbraith,
	linux-kernel

On 15 July 2016 at 18:02, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Fri, Jul 15, 2016 at 03:39:05PM +0200, Vincent Guittot wrote:
>> On 15 July 2016 at 13:46, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Thu, Jul 14, 2016 at 04:15:20PM +0100, Morten Rasmussen wrote:
>> >> On Thu, Jul 14, 2016 at 03:25:36PM +0200, Vincent Guittot wrote:
>> >> > On 13 July 2016 at 18:37, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> >> > > Also, for SMT max capacity is less than 1024 already. No?
>> >> >
>> >> > Yes, it is. I haven't looked in details but i think that we could use
>> >> > a capacity of 1024 for SMT with changes that have been done on how to
>> >> > evaluate if a sched_group is overloaded or not.
>> >>
>> >> Changing SMT is a bit more invasive that I had hoped for for this patch
>> >> set. I will see if we can make it work with the current SMT capacities.
>> >>
>> >> >
>> >> > > But we may be able to cater for this in wake_cap() somehow. I can have a
>> >> > > look if Vincent doesn't like this patch.
>> >> >
>> >> > IMO, rd->max_cpu_capacity field doesn't seem to be required for now .
>> >>
>> >> No problem. I will try to get rid of it. I will drop the "arm:" patches
>> >> as well as they would have to be extended to guarantee a max capacity of
>> >> 1024 and we most likely will have to change it again when Juri's DT
>> >> solution hopefully gets merged.
>> >
>> > I have had a closer look at wake_cap() again. Getting rid of
>> > rd->max_cpu_capacity isn't as easy as I thought.
>> >
>> > The fundamental problem is that all we have in wake_cap() is the waking
>> > cpu and previous cpu ids which isn't sufficient to determine whether we
>> > have an asymmetric capacity system or not. A capacity <1024 can either a
>> > little cpu or an SMT thread. We need a third piece of information, which
>> > can be either the highest cpu capacity available in the cpu, or a
>> > flag/variable/function telling us whether we are on an SMT system.
>> >
>> > I see the following solutions to the problem:
>> >
>> > 1. Have a system-wide max_cpu_capacity (as proposed in this patch) which
>> > can let us detect SMT systems as max_cpu_capacity < 1024 implies SMT.
>> >
>> > 2. Change SMT thread capacity to 1024 so we implicitly know that max
>> > capacity is always 1024. As said above, this is a very invasive change
>> > as it would mean that we no longer distinguish between SMP and SMT.
>> > smt_gain and SD_SHARE_CPUCAPACITY would no longer have any effect and
>> > can be ripped out. I would prefer not create a dependency on such a
>> > massive change. We can do the experiment afterwards if needed.
>> >
>> > 3. Detect SMT in wake_cap(). This requires access to the sched_domain
>> > hierarchy as the SD_SHARE_CPUCAPACITY is the only way to detect SMT,
>> > AFAIK, apart from looping through the capacities of all cpus in the
>> > system basically computing max_cpu_capacity each time.
>> > wake_cap() is currently called before rcu_read_lock() that gives us
>> > access to the sched_domain hierarchy. I would have to postpone the
>> > wake_cap() call to being inside the lock and introduce another lookup in
>> > the sched_domain hierarchy which would be executed on every wake-up on
>> > all systems. IMHO, that is a bit ugly.
>> >
>> > I don't really like any of the solutions, but of those three I would go
>> > for the current solution (1) as it is very minimal both in the amount of
>> > code touched/affected and overhead. We can kill it later if we have a
>> > better one, no problem for me.
>>
>> I had solution 2 in mind. I haven't looked deeply the impact but I
>> thought that the main remaining blocking  point is in
>> update_numa_stats where it use the fact that the capacity is less than
>> 1024 vat SMT level to compute task_capacity and  set has_free_capacity
>> only if we have less than 1 task per core.
>> smt_gain would not be used anymore
>
> Isn't group capacities of also smaller and hence influence load
> balancing decisions?

It should not, because the capacity is now only used to compare groups
with each other and no longer against the 1024 value.
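
As an illustration, the overload check now looks roughly like this - a
sketch following group_is_overloaded() in kernel/sched/fair.c, exact
details may differ:

        static inline bool
        group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
        {
                if (sgs->sum_nr_running <= sgs->group_weight)
                        return false;

                /*
                 * Compare the group's utilization against its own
                 * capacity (scaled by imbalance_pct), not against the
                 * absolute 1024 value.
                 */
                if ((sgs->group_capacity * 100) <
                                (sgs->group_util * env->sd->imbalance_pct))
                        return true;

                return false;
        }

so a relative comparison like this should not care whether SMT threads
report 1024 or something smaller.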

>
> I was hoping that we could decouple a full audit of the load-balance
> code from this relatively simple patch set by staying with 1 for now. I
> worry that the changing SMT capacity can turn into a major task. Just
> proving that there is no regressions even if we know it should be, is a
> lot of work.

Yes, you are probably right on that point

>
> I'm happy to look at the SMT stuff it has been on my list of outstanding
> issues for a very long time, but I would prefer to break it into
> multiple independent patch sets to keep them focused. I haven't had a
> much luck with massive complicated patch sets so far ;-)

^ permalink raw reply	[flat|nested] 64+ messages in thread

* Re: [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain
  2016-07-18 12:48                   ` Vincent Guittot
@ 2016-07-18 15:11                     ` Morten Rasmussen
  0 siblings, 0 replies; 64+ messages in thread
From: Morten Rasmussen @ 2016-07-18 15:11 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Peter Zijlstra, mingo, Yuyang Du, mgalbraith,
	linux-kernel

On Mon, Jul 18, 2016 at 02:48:42PM +0200, Vincent Guittot wrote:
> On 15 July 2016 at 18:02, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Fri, Jul 15, 2016 at 03:39:05PM +0200, Vincent Guittot wrote:
> >> On 15 July 2016 at 13:46, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> > On Thu, Jul 14, 2016 at 04:15:20PM +0100, Morten Rasmussen wrote:
> >> >> On Thu, Jul 14, 2016 at 03:25:36PM +0200, Vincent Guittot wrote:
> >> >> > On 13 July 2016 at 18:37, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> >> > > Also, for SMT max capacity is less than 1024 already. No?
> >> >> >
> >> >> > Yes, it is. I haven't looked in details but i think that we could use
> >> >> > a capacity of 1024 for SMT with changes that have been done on how to
> >> >> > evaluate if a sched_group is overloaded or not.
> >> >>
> >> >> Changing SMT is a bit more invasive that I had hoped for for this patch
> >> >> set. I will see if we can make it work with the current SMT capacities.
> >> >>
> >> >> >
> >> >> > > But we may be able to cater for this in wake_cap() somehow. I can have a
> >> >> > > look if Vincent doesn't like this patch.
> >> >> >
> >> >> > IMO, rd->max_cpu_capacity field doesn't seem to be required for now .
> >> >>
> >> >> No problem. I will try to get rid of it. I will drop the "arm:" patches
> >> >> as well as they would have to be extended to guarantee a max capacity of
> >> >> 1024 and we most likely will have to change it again when Juri's DT
> >> >> solution hopefully gets merged.
> >> >
> >> > I have had a closer look at wake_cap() again. Getting rid of
> >> > rd->max_cpu_capacity isn't as easy as I thought.
> >> >
> >> > The fundamental problem is that all we have in wake_cap() is the waking
> >> > cpu and previous cpu ids which isn't sufficient to determine whether we
> >> > have an asymmetric capacity system or not. A capacity <1024 can either a
> >> > little cpu or an SMT thread. We need a third piece of information, which
> >> > can be either the highest cpu capacity available in the cpu, or a
> >> > flag/variable/function telling us whether we are on an SMT system.
> >> >
> >> > I see the following solutions to the problem:
> >> >
> >> > 1. Have a system-wide max_cpu_capacity (as proposed in this patch) which
> >> > can let us detect SMT systems as max_cpu_capacity < 1024 implies SMT.
> >> >
> >> > 2. Change SMT thread capacity to 1024 so we implicitly know that max
> >> > capacity is always 1024. As said above, this is a very invasive change
> >> > as it would mean that we no longer distinguish between SMP and SMT.
> >> > smt_gain and SD_SHARE_CPUCAPACITY would no longer have any effect and
> >> > can be ripped out. I would prefer not create a dependency on such a
> >> > massive change. We can do the experiment afterwards if needed.
> >> >
> >> > 3. Detect SMT in wake_cap(). This requires access to the sched_domain
> >> > hierarchy as the SD_SHARE_CPUCAPACITY is the only way to detect SMT,
> >> > AFAIK, apart from looping through the capacities of all cpus in the
> >> > system basically computing max_cpu_capacity each time.
> >> > wake_cap() is currently called before rcu_read_lock() that gives us
> >> > access to the sched_domain hierarchy. I would have to postpone the
> >> > wake_cap() call to being inside the lock and introduce another lookup in
> >> > the sched_domain hierarchy which would be executed on every wake-up on
> >> > all systems. IMHO, that is a bit ugly.
> >> >
> >> > I don't really like any of the solutions, but of those three I would go
> >> > for the current solution (1) as it is very minimal both in the amount of
> >> > code touched/affected and overhead. We can kill it later if we have a
> >> > better one, no problem for me.
> >>
> >> I had solution 2 in mind. I haven't looked deeply the impact but I
> >> thought that the main remaining blocking  point is in
> >> update_numa_stats where it use the fact that the capacity is less than
> >> 1024 vat SMT level to compute task_capacity and  set has_free_capacity
> >> only if we have less than 1 task per core.
> >> smt_gain would not be used anymore
> >
> > Isn't group capacities of also smaller and hence influence load
> > balancing decisions?
> 
> It should not because the capacity is now only used to compare groups
> together and no more with the 1024 value

You may very well be right. It is definitely worth a look; a lot of code
can be ripped out if we can move SMT threads to the default capacity.

> 
> >
> > I was hoping that we could decouple a full audit of the load-balance
> > code from this relatively simple patch set by staying with 1 for now. I
> > worry that the changing SMT capacity can turn into a major task. Just
> > proving that there is no regressions even if we know it should be, is a
> > lot of work.
> 
> Yes, you are probably right on that point

I will put together v3 still containing solution 1.
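
For reference, with solution 1 the wake_cap() check would be along these
lines - a sketch of the approach using capacity_orig_of(), task_util()
and the capacity_margin from this series, not necessarily the exact v3
code:

        static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
        {
                long min_cap, max_cap;

                min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
                max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;

                /* Capacities are (nearly) symmetric: SMP and SMT bail out here. */
                if (max_cap - min_cap < max_cap >> 3)
                        return 0;

                /* Task does not fit on the smaller of the two candidate cpus. */
                return min_cap * 1024 < task_util(p) * capacity_margin;
        }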

Thanks,
Morten

^ permalink raw reply	[flat|nested] 64+ messages in thread

* [tip:sched/core] sched/core: Fix power to capacity renaming in comment
  2016-06-22 17:03 ` [PATCH v2 01/13] sched: Fix power to capacity renaming in comment Morten Rasmussen
@ 2016-08-10 18:03   ` tip-bot for Morten Rasmussen
  0 siblings, 0 replies; 64+ messages in thread
From: tip-bot for Morten Rasmussen @ 2016-08-10 18:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, tglx, hpa, peterz, torvalds, efault, linux-kernel,
	morten.rasmussen

Commit-ID:  bd425d4bfc7a1a6064dbbadfbac9c7eec0e426ec
Gitweb:     http://git.kernel.org/tip/bd425d4bfc7a1a6064dbbadfbac9c7eec0e426ec
Author:     Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate: Wed, 22 Jun 2016 18:03:12 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 10 Aug 2016 14:03:32 +0200

sched/core: Fix power to capacity renaming in comment

It seems that this one escaped Nico's renaming of cpu_power to
cpu_capacity a while back.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: linux-kernel@vger.kernel.org
Cc: mgalbraith@suse.de
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1466615004-3503-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62c68e5..f3db596 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1022,7 +1022,7 @@ extern void wake_up_q(struct wake_q_head *head);
 #define SD_BALANCE_FORK		0x0008	/* Balance on fork, clone */
 #define SD_BALANCE_WAKE		0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE		0x0020	/* Wake task to waking CPU */
-#define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu power */
+#define SD_SHARE_CPUCAPACITY	0x0080	/* Domain members share cpu capacity */
 #define SD_SHARE_POWERDOMAIN	0x0100	/* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES	0x0200	/* Domain members share cpu pkg resources */
 #define SD_SERIALIZE		0x0400	/* Only a single load balancing instance */

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [tip:sched/core] sched/fair: Make the use of prev_cpu consistent in the wakeup path
  2016-06-22 17:03 ` [PATCH v2 02/13] sched/fair: Consistent use of prev_cpu in wakeup path Morten Rasmussen
  2016-06-22 18:04   ` Rik van Riel
@ 2016-08-10 18:03   ` tip-bot for Morten Rasmussen
  1 sibling, 0 replies; 64+ messages in thread
From: tip-bot for Morten Rasmussen @ 2016-08-10 18:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, torvalds, peterz, hpa, linux-kernel, mingo, efault, riel,
	morten.rasmussen, tglx

Commit-ID:  772bd008cd9a1d4e8ce566f2edcc61d1c28fcbe5
Gitweb:     http://git.kernel.org/tip/772bd008cd9a1d4e8ce566f2edcc61d1c28fcbe5
Author:     Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate: Wed, 22 Jun 2016 18:03:13 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 10 Aug 2016 14:03:32 +0200

sched/fair: Make the use of prev_cpu consistent in the wakeup path

In commit:

  ac66f5477239 ("sched/numa: Introduce migrate_swap()")

select_task_rq() got a 'cpu' argument to enable overriding of prev_cpu
in special cases (NUMA task swapping).

However, the select_task_rq_fair() helper functions: wake_affine() and
select_idle_sibling(), still use task_cpu(p) directly to work out
prev_cpu, which leads to inconsistencies.

This patch passes prev_cpu (potentially overridden by NUMA code) into
the helper functions to ensure prev_cpu is indeed the same CPU
everywhere in the wakeup path.

cc: Ingo Molnar <mingo@redhat.com>
cc: Rik van Riel <riel@redhat.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Rik van Riel <riel@redhat.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: linux-kernel@vger.kernel.org
Cc: mgalbraith@suse.de
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1466615004-3503-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 24 +++++++++++++-----------
 1 file changed, 13 insertions(+), 11 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9f9a4e5..d819da6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -656,7 +656,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 }
 
 #ifdef CONFIG_SMP
-static int select_idle_sibling(struct task_struct *p, int cpu);
+static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
 
 /*
@@ -1512,7 +1512,8 @@ balance:
 	 * Call select_idle_sibling to maybe find a better one.
 	 */
 	if (!cur)
-		env->dst_cpu = select_idle_sibling(env->p, env->dst_cpu);
+		env->dst_cpu = select_idle_sibling(env->p, env->src_cpu,
+						   env->dst_cpu);
 
 assign:
 	task_numa_assign(env, cur, imp);
@@ -5101,18 +5102,18 @@ static int wake_wide(struct task_struct *p)
 	return 1;
 }
 
-static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
+static int wake_affine(struct sched_domain *sd, struct task_struct *p,
+		       int prev_cpu, int sync)
 {
 	s64 this_load, load;
 	s64 this_eff_load, prev_eff_load;
-	int idx, this_cpu, prev_cpu;
+	int idx, this_cpu;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
 
 	idx	  = sd->wake_idx;
 	this_cpu  = smp_processor_id();
-	prev_cpu  = task_cpu(p);
 	load	  = source_load(prev_cpu, idx);
 	this_load = target_load(this_cpu, idx);
 
@@ -5277,11 +5278,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 /*
  * Try and locate an idle CPU in the sched_domain.
  */
-static int select_idle_sibling(struct task_struct *p, int target)
+static int select_idle_sibling(struct task_struct *p, int prev, int target)
 {
 	struct sched_domain *sd;
 	struct sched_group *sg;
-	int i = task_cpu(p);
 
 	if (idle_cpu(target))
 		return target;
@@ -5289,8 +5289,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	/*
 	 * If the prevous cpu is cache affine and idle, don't be stupid.
 	 */
-	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
-		return i;
+	if (prev != target && cpus_share_cache(prev, target) && idle_cpu(prev))
+		return prev;
 
 	/*
 	 * Otherwise, iterate the domains and find an eligible idle cpu.
@@ -5311,6 +5311,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	for_each_lower_domain(sd) {
 		sg = sd->groups;
 		do {
+			int i;
+
 			if (!cpumask_intersects(sched_group_cpus(sg),
 						tsk_cpus_allowed(p)))
 				goto next;
@@ -5419,13 +5421,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 
 	if (affine_sd) {
 		sd = NULL; /* Prefer wake_affine over balance flags */
-		if (cpu != prev_cpu && wake_affine(affine_sd, p, sync))
+		if (cpu != prev_cpu && wake_affine(affine_sd, p, prev_cpu, sync))
 			new_cpu = cpu;
 	}
 
 	if (!sd) {
 		if (sd_flag & SD_BALANCE_WAKE) /* XXX always ? */
-			new_cpu = select_idle_sibling(p, new_cpu);
+			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
 
 	} else while (sd) {
 		struct sched_group *group;

^ permalink raw reply related	[flat|nested] 64+ messages in thread

* [tip:sched/core] sched/fair: Optimize find_idlest_cpu() when there is no choice
  2016-06-22 17:03 ` [PATCH v2 03/13] sched/fair: Optimize find_idlest_cpu() when there is no choice Morten Rasmussen
  2016-07-13 12:20   ` Vincent Guittot
@ 2016-08-10 18:03   ` tip-bot for Morten Rasmussen
  1 sibling, 0 replies; 64+ messages in thread
From: tip-bot for Morten Rasmussen @ 2016-08-10 18:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: tglx, hpa, efault, torvalds, peterz, morten.rasmussen, mingo,
	linux-kernel

Commit-ID:  eaecf41f5abf80b63c8e025fcb9ee4aa203c3038
Gitweb:     http://git.kernel.org/tip/eaecf41f5abf80b63c8e025fcb9ee4aa203c3038
Author:     Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate: Wed, 22 Jun 2016 18:03:14 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Wed, 10 Aug 2016 14:03:32 +0200

sched/fair: Optimize find_idlest_cpu() when there is no choice

In the current find_idlest_group()/find_idlest_cpu() search we can end
up calling find_idlest_cpu() in a sched_group containing only one CPU.
Checking idle-states becomes pointless when there is no alternative, so
bail out instead.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: dietmar.eggemann@arm.com
Cc: linux-kernel@vger.kernel.org
Cc: mgalbraith@suse.de
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1466615004-3503-4-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d819da6..acdc351 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5239,6 +5239,10 @@ find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 	int shallowest_idle_cpu = -1;
 	int i;
 
+	/* Check if we have any choice: */
+	if (group->group_weight == 1)
+		return cpumask_first(sched_group_cpus(group));
+
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
 		if (idle_cpu(i)) {

^ permalink raw reply related	[flat|nested] 64+ messages in thread

end of thread, other threads:[~2016-08-10 21:24 UTC | newest]

Thread overview: 64+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2016-06-22 17:03 [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
2016-06-22 17:03 ` [PATCH v2 01/13] sched: Fix power to capacity renaming in comment Morten Rasmussen
2016-08-10 18:03   ` [tip:sched/core] sched/core: " tip-bot for Morten Rasmussen
2016-06-22 17:03 ` [PATCH v2 02/13] sched/fair: Consistent use of prev_cpu in wakeup path Morten Rasmussen
2016-06-22 18:04   ` Rik van Riel
2016-06-23  9:56     ` Morten Rasmussen
2016-06-23 12:24       ` Rik van Riel
2016-08-10 18:03   ` [tip:sched/core] sched/fair: Make the use of prev_cpu consistent in the " tip-bot for Morten Rasmussen
2016-06-22 17:03 ` [PATCH v2 03/13] sched/fair: Optimize find_idlest_cpu() when there is no choice Morten Rasmussen
2016-07-13 12:20   ` Vincent Guittot
2016-08-10 18:03   ` [tip:sched/core] " tip-bot for Morten Rasmussen
2016-06-22 17:03 ` [PATCH v2 04/13] sched: Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag Morten Rasmussen
2016-07-11  9:55   ` Peter Zijlstra
2016-07-11 10:42     ` Morten Rasmussen
2016-06-22 17:03 ` [PATCH v2 05/13] sched: Enable SD_BALANCE_WAKE for asymmetric capacity systems Morten Rasmussen
2016-07-11 10:04   ` Peter Zijlstra
2016-07-11 10:37     ` Morten Rasmussen
2016-07-11 11:04       ` Morten Rasmussen
2016-07-11 11:24         ` Peter Zijlstra
2016-07-12 14:26           ` Morten Rasmussen
2016-06-22 17:03 ` [PATCH v2 06/13] sched: Store maximum per-cpu capacity in root domain Morten Rasmussen
2016-07-11 10:18   ` Peter Zijlstra
2016-07-11 16:16     ` Dietmar Eggemann
2016-07-12 11:42       ` Peter Zijlstra
2016-07-13 11:18         ` Dietmar Eggemann
2016-07-13 12:40   ` Vincent Guittot
2016-07-13 13:48     ` Dietmar Eggemann
2016-07-13 16:37       ` Morten Rasmussen
2016-07-14 13:25         ` Vincent Guittot
2016-07-14 15:15           ` Morten Rasmussen
2016-07-15 11:46             ` Morten Rasmussen
2016-07-15 13:39               ` Vincent Guittot
2016-07-15 16:02                 ` Morten Rasmussen
2016-07-18 12:48                   ` Vincent Guittot
2016-07-18 15:11                     ` Morten Rasmussen
2016-06-22 17:03 ` [PATCH v2 07/13] sched/fair: Let asymmetric cpu configurations balance at wake-up Morten Rasmussen
2016-07-11 11:13   ` Peter Zijlstra
2016-07-11 12:32     ` Morten Rasmussen
2016-07-13 12:56   ` Vincent Guittot
2016-07-13 16:14     ` Morten Rasmussen
2016-07-14 13:45       ` Vincent Guittot
2016-07-15  8:37         ` Morten Rasmussen
2016-06-22 17:03 ` [PATCH v2 08/13] sched/fair: Compute task/cpu utilization at wake-up more correctly Morten Rasmussen
2016-06-22 17:03 ` [PATCH v2 09/13] sched/fair: Consider spare capacity in find_idlest_group() Morten Rasmussen
2016-06-22 17:03 ` [PATCH v2 10/13] sched: Add per-cpu max capacity to sched_group_capacity Morten Rasmussen
2016-06-22 17:03 ` [PATCH v2 11/13] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups Morten Rasmussen
2016-06-23 21:20   ` Sai Gurrappadi
2016-06-30  7:49     ` Morten Rasmussen
2016-07-14 16:39       ` Sai Gurrappadi
2016-07-15  8:39         ` Morten Rasmussen
2016-07-12 12:59   ` Peter Zijlstra
2016-07-12 14:34     ` Morten Rasmussen
2016-06-22 17:03 ` [PATCH v2 12/13] arm: Set SD_ASYM_CPUCAPACITY for big.LITTLE platforms Morten Rasmussen
2016-06-22 17:03 ` [PATCH v2 13/13] arm: Update arch_scale_cpu_capacity() to reflect change to define Morten Rasmussen
2016-06-28 10:20 ` [PATCH v2 00/13] sched: Clean-ups and asymmetric cpu capacity support Koan-Sin Tan
2016-06-30  7:53   ` Morten Rasmussen
2016-07-08  7:35 ` KEITA KOBAYASHI
2016-07-08  8:18   ` Morten Rasmussen
2016-07-11  8:33 ` Morten Rasmussen
2016-07-11 12:44   ` Vincent Guittot
2016-07-12 13:25   ` Peter Zijlstra
2016-07-12 14:39     ` Morten Rasmussen
2016-07-13 12:06 ` Vincent Guittot
2016-07-13 15:54   ` Morten Rasmussen
