LKML Archive on lore.kernel.org
 help / Atom feed
* [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
@ 2018-07-04 10:17 Morten Rasmussen
  2018-07-04 10:17 ` [PATCHv4 01/12] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
                   ` (14 more replies)
  0 siblings, 15 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:17 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
for performance that cpu intensive tasks are aggressively migrated to
high capacity cpus as soon as those become available. The capacity
awareness tweaks already in the wake-up path can't handle this as such
tasks might run or be runnable forever. If they happen to be placed on a
low capacity cpu from the beginning they are stuck there forever while
high capacity cpus may have become available in the meantime.

To address this issue this patch set introduces a new "misfit"
load-balancing scenario in periodic/nohz/newly idle balance which tweaks
the load-balance conditions to ignore load per capacity in certain
cases. Since misfit tasks are commonly running alone on a cpu, more
aggressive active load-balancing is needed too.

The fundamental idea of this patch set has been in Android kernels for a
long time and is absolutely essential for consistent performance on
asymmetric cpu capacity systems.

The patches have been tested on:
   1. Arm Juno (r0): 2+4 Cortex A57/A53
   2. Hikey960: 4+4 Cortex A73/A53

Test case:
Big cpus are always kept busy. Pin a shorter running sysbench tasks to
big cpus, while creating a longer running set of unpinned sysbench
tasks.

    REQUESTS=1000
    BIGS="1 2"
    LITTLES="0 3 4 5"
 
    # Don't care about the score for those, just keep the bigs busy
    for i in $BIGS; do
        taskset -c $i sysbench --max-requests=$((REQUESTS / 4)) \
        --test=cpu  run &>/dev/null &
    done
 
    for i in $LITTLES; do
        sysbench --max-requests=$REQUESTS --test=cpu run \
	| grep "total time:" &
    done
 
    wait

Results:
Single runs with completion time of each task
Juno (tip)
    total time:                          1.2608s
    total time:                          1.2995s
    total time:                          1.5954s
    total time:                          1.7463s

Juno (misfit)
    total time:                          1.2575s
    total time:                          1.3004s
    total time:                          1.5860s
    total time:                          1.5871s

Hikey960 (tip)
    total time:                          1.7431s
    total time:                          2.2914s
    total time:                          2.5976s
    total time:                          1.7280s

Hikey960 (misfit)
    total time:                          1.7866s
    total time:                          1.7513s
    total time:                          1.6918s
    total time:                          1.6965s

10 run summary (tracking longest running task for each run)
	Juno		Hikey960
	avg	max	avg	max
tip     1.7465  1.7469  2.5997  2.6131 
misfit  1.6016  1.6192  1.8506  1.9666

Changelog:
v4
- Added check for empty cpu_map in sd_init().
- Added patch to disable SD_ASYM_CPUCAPACITY for root_domains that don't
  observe capacity asymmetry if the system as a whole is asymmetric.
- Added patch to disable SD_PREFER_SIBLING on the sched_domain level below
  SD_ASYM_CPUCAPACITY.
- Rebased against tip/sched/core.
- Fixed uninitialised variable introduced in update_sd_lb_stats.
- Added patch to do a slight variable initialisation cleanup in update_sd_lb_stats.
- Removed superfluous type changes for temp variables assigned to root_domain->overload.
- Reworded commit for the patch setting rq->rd->overload when misfit.
- v3 Tested-by: Gaku Inami <gaku.inami.xh@renesas.com>

v3
- Fixed locking around static_key.
- Changed group per-cpu capacity comparison to be based on max rather
  than min capacity.
- Added patch to prevent occasional pointless high->low capacity
  migrations.
- Changed type of group_misfit_task_load and misfit_task_load to
  unsigned long.
- Changed fbq() to pick the cpu with highest misfit_task_load rather
  than breaking when the first is found.
- Rebased against tip/sched/core.
- v2 Tested-by: Gaku Inami <gaku.inami.xh@renesas.com>

v2
- Removed redudant condition in static_key enablement.
- Fixed logic flaw in patch #2 reported by Yi Yao <yi.yao@intel.com>
- Dropped patch #4 as although the patch seems to make sense no benefit
  has been proven.
- Dropped root_domain->overload renaming
- Changed type of root_domain->overload to int
- Wrapped accesses of rq->rd->overload with READ/WRITE_ONCE
- v1 Tested-by: Gaku Inami <gaku.inami.xh@renesas.com>

Chris Redpath (1):
  sched/fair: Don't move tasks to lower capacity cpus unless necessary

Morten Rasmussen (6):
  sched: Add static_key for asymmetric cpu capacity optimizations
  sched/fair: Add group_misfit_task load-balance type
  sched: Add sched_group per-cpu max capacity
  sched/fair: Consider misfit tasks when load-balancing
  sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without
    asymmetry
  sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity
    domains

Valentin Schneider (5):
  sched/fair: Kick nohz balance if rq->misfit_task_load
  sched/fair: Change prefer_sibling type to bool
  sched: Change root_domain->overload type to int
  sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
  sched/fair: Set rq->rd->overload when misfit

 kernel/sched/fair.c     | 161 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h    |  16 +++--
 kernel/sched/topology.c |  53 ++++++++++++++--
 3 files changed, 199 insertions(+), 31 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCHv4 01/12] sched: Add static_key for asymmetric cpu capacity optimizations
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
@ 2018-07-04 10:17 ` Morten Rasmussen
  2018-07-31 10:59   ` Peter Zijlstra
  2018-07-04 10:17 ` [PATCHv4 02/12] sched/fair: Add group_misfit_task load-balance type Morten Rasmussen
                   ` (13 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:17 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

The existing asymmetric cpu capacity code should cause minimal overhead
for others. Putting it behind a static_key, it has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c     |  3 +++
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c | 19 +++++++++++++++++++
 3 files changed, 23 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 321cd5dcf2e8..85fb7e8ff5c8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6583,6 +6583,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 {
 	long min_cap, max_cap;
 
+	if (!static_branch_unlikely(&sched_asym_cpucapacity))
+		return 0;
+
 	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
 	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c7742dcc136c..35ce218f0157 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1160,6 +1160,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+extern struct static_key_false sched_asym_cpucapacity;
 
 struct sched_group_capacity {
 	atomic_t		ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 05a831427bc7..0cfdeff669fe 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -425,6 +426,21 @@ static void update_top_cache_domain(int cpu)
 	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
 }
 
+static void update_asym_cpucapacity(int cpu)
+{
+	int enable = false;
+
+	rcu_read_lock();
+	if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
+		enable = true;
+	rcu_read_unlock();
+
+	if (enable) {
+		/* This expects to be hotplug-safe */
+		static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
+	}
+}
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
  * hold the hotplug lock.
@@ -1707,6 +1723,9 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	}
 	rcu_read_unlock();
 
+	if (!cpumask_empty(cpu_map))
+		update_asym_cpucapacity(cpumask_first(cpu_map));
+
 	if (rq && sched_debug_enabled) {
 		pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
 			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
-- 
2.7.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCHv4 02/12] sched/fair: Add group_misfit_task load-balance type
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
  2018-07-04 10:17 ` [PATCHv4 01/12] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
@ 2018-07-04 10:17 ` Morten Rasmussen
  2018-07-04 10:17 ` [PATCHv4 03/12] sched: Add sched_group per-cpu max capacity Morten Rasmussen
                   ` (12 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:17 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

To maximize throughput in systems with asymmetric cpu capacities (e.g.
ARM big.LITTLE) load-balancing has to consider task and cpu utilization
as well as per-cpu compute capacity when load-balancing in addition to
the current average load based load-balancing policy. Tasks with high
utilization that are scheduled on a lower capacity cpu need to be
identified and migrated to a higher capacity cpu if possible to maximize
throughput.

To implement this additional policy an additional group_type
(load-balance scenario) is added: group_misfit_task. This represents
scenarios where a sched_group has one or more tasks that are not
suitable for its per-cpu capacity. group_misfit_task is only considered
if the system is not overloaded or imbalanced (group_imbalanced or
group_overloaded).

Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each cpu is responsible for tracking misfit tasks themselves and update
the rq->misfit_task flag. This means checking task utilization when
tasks are scheduled and on sched_tick.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 54 ++++++++++++++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h |  2 ++
 2 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 85fb7e8ff5c8..e05e5202a1d2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -697,6 +697,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
+static unsigned long capacity_of(int cpu);
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
 void init_entity_runnable_average(struct sched_entity *se)
@@ -1448,7 +1449,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 static unsigned long weighted_cpuload(struct rq *rq);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);
 
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
@@ -4035,6 +4035,29 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 	WRITE_ONCE(p->se.avg.util_est, ue);
 }
 
+static inline int task_fits_capacity(struct task_struct *p, long capacity)
+{
+	return capacity * 1024 > task_util_est(p) * capacity_margin;
+}
+
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
+{
+	if (!static_branch_unlikely(&sched_asym_cpucapacity))
+		return;
+
+	if (!p) {
+		rq->misfit_task_load = 0;
+		return;
+	}
+
+	if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+		rq->misfit_task_load = 0;
+		return;
+	}
+
+	rq->misfit_task_load = task_h_load(p);
+}
+
 #else /* CONFIG_SMP */
 
 static inline int
@@ -4070,6 +4093,7 @@ util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
 static inline void
 util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
 		 bool task_sleep) {}
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
 
 #endif /* CONFIG_SMP */
 
@@ -6596,7 +6620,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	/* Bring task utilization in sync with prev_cpu */
 	sync_entity_load_avg(&p->se);
 
-	return min_cap * 1024 < task_util(p) * capacity_margin;
+	return !task_fits_capacity(p, min_cap);
 }
 
 /*
@@ -7013,9 +7037,12 @@ done: __maybe_unused;
 	if (hrtick_enabled(rq))
 		hrtick_start_fair(rq, p);
 
+	update_misfit_status(p, rq);
+
 	return p;
 
 idle:
+	update_misfit_status(NULL, rq);
 	new_tasks = idle_balance(rq, rf);
 
 	/*
@@ -7221,6 +7248,13 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 enum fbq_type { regular, remote, all };
 
+enum group_type {
+	group_other = 0,
+	group_misfit_task,
+	group_imbalanced,
+	group_overloaded,
+};
+
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
 #define LBF_DST_PINNED  0x04
@@ -7762,12 +7796,6 @@ static unsigned long task_h_load(struct task_struct *p)
 
 /********** Helpers for find_busiest_group ************************/
 
-enum group_type {
-	group_other = 0,
-	group_imbalanced,
-	group_overloaded,
-};
-
 /*
  * sg_lb_stats - stats of a sched_group required for load_balancing
  */
@@ -7783,6 +7811,7 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	int group_no_capacity;
+	unsigned long group_misfit_task_load; /* A cpu has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -8082,6 +8111,9 @@ group_type group_classify(struct sched_group *group,
 	if (sg_imbalanced(group))
 		return group_imbalanced;
 
+	if (sgs->group_misfit_task_load)
+		return group_misfit_task;
+
 	return group_other;
 }
 
@@ -8156,6 +8188,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		 */
 		if (!nr_running && idle_cpu(i))
 			sgs->idle_cpus++;
+
+		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+		    sgs->group_misfit_task_load < rq->misfit_task_load)
+			sgs->group_misfit_task_load = rq->misfit_task_load;
 	}
 
 	/* Adjust by relative CPU capacity of the group */
@@ -9937,6 +9973,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
+
+	update_misfit_status(curr, rq);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 35ce218f0157..3376bacab712 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -822,6 +822,8 @@ struct rq {
 
 	unsigned char		idle_balance;
 
+	unsigned long		misfit_task_load;
+
 	/* For active balancing */
 	int			active_balance;
 	int			push_cpu;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCHv4 03/12] sched: Add sched_group per-cpu max capacity
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
  2018-07-04 10:17 ` [PATCHv4 01/12] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
  2018-07-04 10:17 ` [PATCHv4 02/12] sched/fair: Add group_misfit_task load-balance type Morten Rasmussen
@ 2018-07-04 10:17 ` Morten Rasmussen
  2018-07-04 10:17 ` [PATCHv4 04/12] sched/fair: Consider misfit tasks when load-balancing Morten Rasmussen
                   ` (11 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:17 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

The current sg->min_capacity tracks the lowest per-cpu compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for tracking if a sched_group
needs offloading to another sched_group for some scenarios, e.g. a
sched_group with multiple cpus if only one is under heavy pressure.
Tracking maximum capacity isn't perfect either but a better choice for
some situations as it indicates that the sched_group definitely compute
capacity constrained either due to rt/irq pressure on all cpus or
asymmetric cpu capacities (e.g. big.LITTLE).

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c     | 24 ++++++++++++++++++++----
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c |  2 ++
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e05e5202a1d2..09ede4321a3d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7927,13 +7927,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	cpu_rq(cpu)->cpu_capacity = capacity;
 	sdg->sgc->capacity = capacity;
 	sdg->sgc->min_capacity = capacity;
+	sdg->sgc->max_capacity = capacity;
 }
 
 void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity, min_capacity;
+	unsigned long capacity, min_capacity, max_capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -7947,6 +7948,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 	capacity = 0;
 	min_capacity = ULONG_MAX;
+	max_capacity = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -7977,6 +7979,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			}
 
 			min_capacity = min(capacity, min_capacity);
+			max_capacity = max(capacity, max_capacity);
 		}
 	} else  {
 		/*
@@ -7990,12 +7993,14 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 			capacity += sgc->capacity;
 			min_capacity = min(sgc->min_capacity, min_capacity);
+			max_capacity = max(sgc->max_capacity, max_capacity);
 			group = group->next;
 		} while (group != child->groups);
 	}
 
 	sdg->sgc->capacity = capacity;
 	sdg->sgc->min_capacity = min_capacity;
+	sdg->sgc->max_capacity = max_capacity;
 }
 
 /*
@@ -8091,16 +8096,27 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 }
 
 /*
- * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
  * per-CPU capacity than sched_group ref.
  */
 static inline bool
-group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
 	return sg->sgc->min_capacity * capacity_margin <
 						ref->sgc->min_capacity * 1024;
 }
 
+/*
+ * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-CPU capacity_orig than sched_group ref.
+ */
+static inline bool
+group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+	return sg->sgc->max_capacity * capacity_margin <
+						ref->sgc->max_capacity * 1024;
+}
+
 static inline enum
 group_type group_classify(struct sched_group *group,
 			  struct sg_lb_stats *sgs)
@@ -8246,7 +8262,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * power/energy consequences are not considered.
 	 */
 	if (sgs->sum_nr_running <= sgs->group_weight &&
-	    group_smaller_cpu_capacity(sds->local, sg))
+	    group_smaller_min_cpu_capacity(sds->local, sg))
 		return false;
 
 asym_packing:
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3376bacab712..6c39a07e8a68 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1172,6 +1172,7 @@ struct sched_group_capacity {
 	 */
 	unsigned long		capacity;
 	unsigned long		min_capacity;		/* Min per-CPU capacity in group */
+	unsigned long		max_capacity;		/* Max per-CPU capacity in group */
 	unsigned long		next_update;
 	int			imbalance;		/* XXX unrelated to capacity but shared group state */
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 0cfdeff669fe..71330e0e41db 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -708,6 +708,7 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 	sg_span = sched_group_span(sg);
 	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
 	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
 }
 
 static int
@@ -867,6 +868,7 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
 
 	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));
 	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
 
 	return sg;
 }
-- 
2.7.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCHv4 04/12] sched/fair: Consider misfit tasks when load-balancing
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (2 preceding siblings ...)
  2018-07-04 10:17 ` [PATCHv4 03/12] sched: Add sched_group per-cpu max capacity Morten Rasmussen
@ 2018-07-04 10:17 ` Morten Rasmussen
  2018-07-04 10:17 ` [PATCHv4 05/12] sched/fair: Kick nohz balance if rq->misfit_task_load Morten Rasmussen
                   ` (10 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:17 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

On asymmetric cpu capacity systems load intensive tasks can end up on
cpus that don't suit their compute demand.  In this scenarios 'misfit'
tasks should be migrated to cpus with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.

Misfit balancing only makes sense between a source group of lower
per-cpu capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.

The modifications are:

1. Only pick a group containing misfit tasks as the busiest group if the
   destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
   load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
   to be pulled ignoring average load.
4. Pick the cpu with the highest misfit load as the source cpu.
5. If the misfit task is alone on the source cpu, go for active
   balancing.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09ede4321a3d..6e885d92fad2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7285,6 +7285,7 @@ struct lb_env {
 	unsigned int		loop_max;
 
 	enum fbq_type		fbq_type;
+	enum group_type		src_grp_type;
 	struct list_head	tasks;
 };
 
@@ -8243,6 +8244,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 {
 	struct sg_lb_stats *busiest = &sds->busiest_stat;
 
+	/*
+	 * Don't try to pull misfit tasks we can't help.
+	 * We can use max_capacity here as reduction in capacity on some
+	 * cpus in the group should either be possible to resolve
+	 * internally or be covered by avg_load imbalance (eventually).
+	 */
+	if (sgs->group_type == group_misfit_task &&
+	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+	     !group_has_capacity(env, &sds->local_stat)))
+		return false;
+
 	if (sgs->group_type > busiest->group_type)
 		return true;
 
@@ -8265,6 +8277,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	    group_smaller_min_cpu_capacity(sds->local, sg))
 		return false;
 
+	/*
+	 * If we have more than one misfit sg go with the biggest misfit.
+	 */
+	if (sgs->group_type == group_misfit_task &&
+	    sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+		return false;
+
 asym_packing:
 	/* This is the busiest node in its class. */
 	if (!(env->sd->flags & SD_ASYM_PACKING))
@@ -8562,8 +8581,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * factors in sg capacity and sgs with smaller group_type are
 	 * skipped when updating the busiest sg:
 	 */
-	if (busiest->avg_load <= sds->avg_load ||
-	    local->avg_load >= sds->avg_load) {
+	if (busiest->group_type != group_misfit_task &&
+	    (busiest->avg_load <= sds->avg_load ||
+	     local->avg_load >= sds->avg_load)) {
 		env->imbalance = 0;
 		return fix_small_imbalance(env, sds);
 	}
@@ -8597,6 +8617,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		(sds->avg_load - local->avg_load) * local->group_capacity
 	) / SCHED_CAPACITY_SCALE;
 
+	/* Boost imbalance to allow misfit task to be balanced. */
+	if (busiest->group_type == group_misfit_task) {
+		env->imbalance = max_t(long, env->imbalance,
+				       busiest->group_misfit_task_load);
+	}
+
 	/*
 	 * if *imbalance is less than the average load per runnable task
 	 * there is no guarantee that any tasks will be moved so we'll have
@@ -8663,6 +8689,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	    busiest->group_no_capacity)
 		goto force_balance;
 
+	/* Misfit tasks should be dealt with regardless of the avg load */
+	if (busiest->group_type == group_misfit_task)
+		goto force_balance;
+
 	/*
 	 * If the local group is busier than the selected busiest group
 	 * don't try and pull any tasks.
@@ -8700,6 +8730,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
+	env->src_grp_type = busiest->group_type;
 	calculate_imbalance(env, &sds);
 	return sds.busiest;
 
@@ -8747,6 +8778,19 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		if (rt > env->fbq_type)
 			continue;
 
+		/*
+		 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
+		 * seek the "biggest" misfit task.
+		 */
+		if (env->src_grp_type == group_misfit_task) {
+			if (rq->misfit_task_load > busiest_load) {
+				busiest_load = rq->misfit_task_load;
+				busiest = rq;
+			}
+
+			continue;
+		}
+
 		capacity = capacity_of(i);
 
 		wl = weighted_cpuload(rq);
@@ -8816,6 +8860,9 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}
 
+	if (env->src_grp_type == group_misfit_task)
+		return 1;
+
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCHv4 05/12] sched/fair: Kick nohz balance if rq->misfit_task_load
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (3 preceding siblings ...)
  2018-07-04 10:17 ` [PATCHv4 04/12] sched/fair: Consider misfit tasks when load-balancing Morten Rasmussen
@ 2018-07-04 10:17 ` Morten Rasmussen
  2018-07-04 10:17 ` [PATCHv4 06/12] sched/fair: Change prefer_sibling type to bool Morten Rasmussen
                   ` (9 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:17 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

From: Valentin Schneider <valentin.schneider@arm.com>

There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:

* rq->nr_running >=2: Not relevant here because we are running a
misfit task, it needs to be migrated regardless and potentially through
active balance.
* sds->nr_busy_cpus > 1: If there is only the misfit task being run
on a group of low capacity cpus, this will be evaluated to False.
* rq->cfs.h_nr_running >=1 && check_cpu_capacity(): Not relevant here,
misfit task needs to be migrated regardless of rt/IRQ pressure

As such, this commit adds an rq->misfit_task_load condition to trigger a
nohz kick.

The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan <leo.yan@linaro.org>, and a similar patch was submitted for
the Android Common Kernel - see [1].

[1]: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6e885d92fad2..acec93e1dc51 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9505,7 +9505,7 @@ static void nohz_balancer_kick(struct rq *rq)
 	if (time_before(now, nohz.next_balance))
 		goto out;
 
-	if (rq->nr_running >= 2) {
+	if (rq->nr_running >= 2 || rq->misfit_task_load) {
 		flags = NOHZ_KICK_MASK;
 		goto out;
 	}
-- 
2.7.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCHv4 06/12] sched/fair: Change prefer_sibling type to bool
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (4 preceding siblings ...)
  2018-07-04 10:17 ` [PATCHv4 05/12] sched/fair: Kick nohz balance if rq->misfit_task_load Morten Rasmussen
@ 2018-07-04 10:17 ` Morten Rasmussen
  2018-07-04 10:17 ` [PATCHv4 07/12] sched: Change root_domain->overload type to int Morten Rasmussen
                   ` (8 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:17 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

From: Valentin Schneider <valentin.schneider@arm.com>

This variable is entirely local to update_sd_lb_stats, so we can
safely change its type and slightly clean up its initialisation.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 6 ++----
 1 file changed, 2 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index acec93e1dc51..ee26eeb188ef 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8352,11 +8352,9 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
-	int load_idx, prefer_sibling = 0;
+	int load_idx;
 	bool overload = false;
-
-	if (child && child->flags & SD_PREFER_SIBLING)
-		prefer_sibling = 1;
+	bool prefer_sibling = child && child->flags & SD_PREFER_SIBLING;
 
 #ifdef CONFIG_NO_HZ_COMMON
 	if (env->idle == CPU_NEWLY_IDLE && READ_ONCE(nohz.has_blocked))
-- 
2.7.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCHv4 07/12] sched: Change root_domain->overload type to int
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (5 preceding siblings ...)
  2018-07-04 10:17 ` [PATCHv4 06/12] sched/fair: Change prefer_sibling type to bool Morten Rasmussen
@ 2018-07-04 10:17 ` Morten Rasmussen
  2018-07-04 10:17 ` [PATCHv4 08/12] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE Morten Rasmussen
                   ` (7 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:17 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

From: Valentin Schneider <valentin.schneider@arm.com>

sizeof(_Bool) is implementation defined, so let's just go with 'int' as
is done for other structures e.g. sched_domain_shared->has_idle_cores.

The local 'overload' variable used in update_sd_lb_stats can remain
bool, as it won't impact any struct layout and can be assigned to the
root_domain field.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/sched.h | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6c39a07e8a68..648224b23287 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -696,7 +696,7 @@ struct root_domain {
 	cpumask_var_t		online;
 
 	/* Indicate more than one runnable task for any CPU */
-	bool			overload;
+	int			overload;
 
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
@@ -1673,7 +1673,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 	if (prev_nr < 2 && rq->nr_running >= 2) {
 #ifdef CONFIG_SMP
 		if (!rq->rd->overload)
-			rq->rd->overload = true;
+			rq->rd->overload = 1;
 #endif
 	}
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCHv4 08/12] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (6 preceding siblings ...)
  2018-07-04 10:17 ` [PATCHv4 07/12] sched: Change root_domain->overload type to int Morten Rasmussen
@ 2018-07-04 10:17 ` Morten Rasmussen
  2018-07-04 10:17 ` [PATCHv4 09/12] sched/fair: Set rq->rd->overload when misfit Morten Rasmussen
                   ` (6 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:17 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

From: Valentin Schneider <valentin.schneider@arm.com>

This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE are added to make sure nothing terribly wrong
can happen because of the compiler.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 6 +++---
 kernel/sched/sched.h | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ee26eeb188ef..d0641ba7bea1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8428,8 +8428,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 	if (!env->sd->parent) {
 		/* update overload indicator if we are at root domain */
-		if (env->dst_rq->rd->overload != overload)
-			env->dst_rq->rd->overload = overload;
+		if (READ_ONCE(env->dst_rq->rd->overload) != overload)
+			WRITE_ONCE(env->dst_rq->rd->overload, overload);
 	}
 }
 
@@ -9872,7 +9872,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 	rq_unpin_lock(this_rq, rf);
 
 	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !this_rq->rd->overload) {
+	    !READ_ONCE(this_rq->rd->overload)) {
 
 		rcu_read_lock();
 		sd = rcu_dereference_check_sched_domain(this_rq->sd);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 648224b23287..e1dc85d1bfdd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1672,8 +1672,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 
 	if (prev_nr < 2 && rq->nr_running >= 2) {
 #ifdef CONFIG_SMP
-		if (!rq->rd->overload)
-			rq->rd->overload = 1;
+		if (!READ_ONCE(rq->rd->overload))
+			WRITE_ONCE(rq->rd->overload, 1);
 #endif
 	}
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCHv4 09/12] sched/fair: Set rq->rd->overload when misfit
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (7 preceding siblings ...)
  2018-07-04 10:17 ` [PATCHv4 08/12] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE Morten Rasmussen
@ 2018-07-04 10:17 ` Morten Rasmussen
  2018-07-04 10:17 ` [PATCHv4 10/12] sched/fair: Don't move tasks to lower capacity cpus unless necessary Morten Rasmussen
                   ` (5 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:17 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

From: Valentin Schneider <valentin.schneider@arm.com>

Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.

A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.

They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs are less
performing than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.

This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:

a,b,c,d are our CPU-hogging tasks
_ signifies idling

LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
---------|-------------
  big_0  | c c c c a a
  big_1  | d d d d b b
		  ^
		  ^
    Tasks end on the big CPUs, idle balance happens
    and the misfit tasks are pulled straight away

This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.

As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 6 ++++--
 kernel/sched/sched.h | 6 +++++-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d0641ba7bea1..de84f5a9a65a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8163,7 +8163,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
  * @load_idx: Load index of sched_domain of this_cpu for load calc.
  * @local_group: Does group contain this_cpu.
  * @sgs: variable to hold the statistics for this group.
- * @overload: Indicate more than one runnable task for any CPU.
+ * @overload: Indicate pullable load (e.g. >1 runnable task).
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
@@ -8207,8 +8207,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->idle_cpus++;
 
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
-		    sgs->group_misfit_task_load < rq->misfit_task_load)
+		    sgs->group_misfit_task_load < rq->misfit_task_load) {
 			sgs->group_misfit_task_load = rq->misfit_task_load;
+			*overload = 1;
+		}
 	}
 
 	/* Adjust by relative CPU capacity of the group */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e1dc85d1bfdd..377545b5aa15 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -695,7 +695,11 @@ struct root_domain {
 	cpumask_var_t		span;
 	cpumask_var_t		online;
 
-	/* Indicate more than one runnable task for any CPU */
+	/*
+	 * Indicate pullable load on at least one CPU, e.g:
+	 * - More than one runnable task
+	 * - Running task is misfit
+	 */
 	int			overload;
 
 	/*
-- 
2.7.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCHv4 10/12] sched/fair: Don't move tasks to lower capacity cpus unless necessary
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (8 preceding siblings ...)
  2018-07-04 10:17 ` [PATCHv4 09/12] sched/fair: Set rq->rd->overload when misfit Morten Rasmussen
@ 2018-07-04 10:17 ` Morten Rasmussen
  2018-07-04 10:17 ` [PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry Morten Rasmussen
                   ` (4 subsequent siblings)
  14 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:17 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Chris Redpath, Morten Rasmussen

From: Chris Redpath <chris.redpath@arm.com>

When lower capacity CPUs are load balancing and considering to pull
something from a higher capacity group, we should not pull tasks from a
cpu with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balance in
the higher capacity group would have already made any possible moves to
resolve imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index de84f5a9a65a..06beefa02420 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8793,6 +8793,17 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 
 		capacity = capacity_of(i);
 
+		/*
+		 * For ASYM_CPUCAPACITY domains, don't pick a cpu that could
+		 * eventually lead to active_balancing high->low capacity.
+		 * Higher per-cpu capacity is considered better than balancing
+		 * average load.
+		 */
+		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+			capacity_of(env->dst_cpu) < capacity &&
+			rq->nr_running == 1)
+			continue;
+
 		wl = weighted_cpuload(rq);
 
 		/*
-- 
2.7.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (9 preceding siblings ...)
  2018-07-04 10:17 ` [PATCHv4 10/12] sched/fair: Don't move tasks to lower capacity cpus unless necessary Morten Rasmussen
@ 2018-07-04 10:17 ` Morten Rasmussen
  2018-07-05 13:31   ` Quentin Perret
  2018-07-04 10:17 ` [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains Morten Rasmussen
                   ` (3 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:17 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

When hotplugging cpus out or creating exclusive cpusets (disabling
sched_load_balance) systems which were asymmetric at boot might become
symmetric. In this case leaving the flag set might lead to suboptimal
scheduling decisions.

The arch-code proving the flag doesn't have visibility of the cpuset
configuration so it must either be told by passing a cpumask or the
generic topology code has to verify if the flag should still be set
when taking the actual sched_domain_span() into account. This patch
implements the latter approach.

We need to detect capacity based on calling arch_scale_cpu_capacity()
directly as rq->cpu_capacity_orig hasn't been set yet early in the boot
process.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/topology.c | 20 ++++++++++++++++++++
 1 file changed, 20 insertions(+)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 71330e0e41db..29c186961345 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1160,6 +1160,26 @@ sd_init(struct sched_domain_topology_level *tl,
 	sd_id = cpumask_first(sched_domain_span(sd));
 
 	/*
+	 * Check if cpu_map eclipses cpu capacity asymmetry.
+	 */
+
+	if (sd->flags & SD_ASYM_CPUCAPACITY) {
+		int i;
+		bool disable = true;
+		long capacity = arch_scale_cpu_capacity(NULL, sd_id);
+
+		for_each_cpu(i, sched_domain_span(sd)) {
+			if (capacity != arch_scale_cpu_capacity(NULL, i)) {
+				disable = false;
+				break;
+			}
+		}
+
+		if (disable)
+			sd->flags &= ~SD_ASYM_CPUCAPACITY;
+	}
+
+	/*
 	 * Convert topological properties into behaviour.
 	 */
 
-- 
2.7.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (10 preceding siblings ...)
  2018-07-04 10:17 ` [PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry Morten Rasmussen
@ 2018-07-04 10:17 ` Morten Rasmussen
  2018-07-06 10:18   ` Vincent Guittot
  2018-07-06 10:18 ` [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Vincent Guittot
                   ` (2 subsequent siblings)
  14 siblings, 1 reply; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:17 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

The 'prefer sibling' sched_domain flag is intended to encourage
spreading tasks to sibling sched_domain to take advantage of more caches
and core for SMT systems. It has recently been changed to be on all
non-NUMA topology level. However, spreading across domains with cpu
capacity asymmetry isn't desirable, e.g. spreading from high capacity to
low capacity cpus even if high capacity cpus aren't overutilized might
give access to more cache but the cpu will be slower and possibly lead
to worse overall throughput.

To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
level immediately below SD_ASYM_CPUCAPACITY.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/topology.c | 12 ++++++++----
 1 file changed, 8 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 29c186961345..00c7a08c7f77 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1140,7 +1140,7 @@ sd_init(struct sched_domain_topology_level *tl,
 					| 0*SD_SHARE_CPUCAPACITY
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 0*SD_SERIALIZE
-					| 0*SD_PREFER_SIBLING
+					| 1*SD_PREFER_SIBLING
 					| 0*SD_NUMA
 					| sd_flags
 					,
@@ -1186,17 +1186,21 @@ sd_init(struct sched_domain_topology_level *tl,
 	if (sd->flags & SD_ASYM_CPUCAPACITY) {
 		struct sched_domain *t = sd;
 
+		/*
+		 * Don't attempt to spread across cpus of different capacities.
+		 */
+		if (sd->child)
+			sd->child->flags &= ~SD_PREFER_SIBLING;
+
 		for_each_lower_domain(t)
 			t->flags |= SD_BALANCE_WAKE;
 	}
 
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
-		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
 
 	} else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
-		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 117;
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
@@ -1207,6 +1211,7 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->busy_idx = 3;
 		sd->idle_idx = 2;
 
+		sd->flags &= ~SD_PREFER_SIBLING;
 		sd->flags |= SD_SERIALIZE;
 		if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
 			sd->flags &= ~(SD_BALANCE_EXEC |
@@ -1216,7 +1221,6 @@ sd_init(struct sched_domain_topology_level *tl,
 
 #endif
 	} else {
-		sd->flags |= SD_PREFER_SIBLING;
 		sd->cache_nice_tries = 1;
 		sd->busy_idx = 2;
 		sd->idle_idx = 1;
-- 
2.7.4


^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry
  2018-07-04 10:17 ` [PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry Morten Rasmussen
@ 2018-07-05 13:31   ` Quentin Perret
  2018-07-05 14:13     ` Morten Rasmussen
  0 siblings, 1 reply; 37+ messages in thread
From: Quentin Perret @ 2018-07-05 13:31 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, valentin.schneider, dietmar.eggemann,
	vincent.guittot, gaku.inami.xh, linux-kernel

Hi Morten,

On Wednesday 04 Jul 2018 at 11:17:49 (+0100), Morten Rasmussen wrote:
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 71330e0e41db..29c186961345 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1160,6 +1160,26 @@ sd_init(struct sched_domain_topology_level *tl,
>  	sd_id = cpumask_first(sched_domain_span(sd));
>  
>  	/*
> +	 * Check if cpu_map eclipses cpu capacity asymmetry.
> +	 */
> +
> +	if (sd->flags & SD_ASYM_CPUCAPACITY) {
> +		int i;
> +		bool disable = true;
> +		long capacity = arch_scale_cpu_capacity(NULL, sd_id);
> +
> +		for_each_cpu(i, sched_domain_span(sd)) {
> +			if (capacity != arch_scale_cpu_capacity(NULL, i)) {
> +				disable = false;
> +				break;
> +			}
> +		}
> +
> +		if (disable)
> +			sd->flags &= ~SD_ASYM_CPUCAPACITY;
> +	}
> +
> +	/*
>  	 * Convert topological properties into behaviour.
>  	 */

If SD_ASYM_CPUCAPACITY means that some CPUs have different
arch_scale_cpu_capacity() values, we could also automatically _set_
the flag in sd_init() no ? Why should we let the arch set it and just
correct it later ?

I understand the moment at which we know the capacities of CPUs varies
from arch to arch, but the arch code could just call
rebuild_sched_domain when the capacities of CPUs change and let the
scheduler detect things automatically. I mean, even if the arch code
sets the flag in its topology level table, it will have to rebuild
the sched domains anyway ...

What do you think ?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry
  2018-07-05 13:31   ` Quentin Perret
@ 2018-07-05 14:13     ` Morten Rasmussen
  2018-07-05 15:03       ` Quentin Perret
  0 siblings, 1 reply; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-05 14:13 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, mingo, valentin.schneider, dietmar.eggemann,
	vincent.guittot, gaku.inami.xh, linux-kernel

On Thu, Jul 05, 2018 at 02:31:43PM +0100, Quentin Perret wrote:
> Hi Morten,
> 
> On Wednesday 04 Jul 2018 at 11:17:49 (+0100), Morten Rasmussen wrote:
> > diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> > index 71330e0e41db..29c186961345 100644
> > --- a/kernel/sched/topology.c
> > +++ b/kernel/sched/topology.c
> > @@ -1160,6 +1160,26 @@ sd_init(struct sched_domain_topology_level *tl,
> >  	sd_id = cpumask_first(sched_domain_span(sd));
> >  
> >  	/*
> > +	 * Check if cpu_map eclipses cpu capacity asymmetry.
> > +	 */
> > +
> > +	if (sd->flags & SD_ASYM_CPUCAPACITY) {
> > +		int i;
> > +		bool disable = true;
> > +		long capacity = arch_scale_cpu_capacity(NULL, sd_id);
> > +
> > +		for_each_cpu(i, sched_domain_span(sd)) {
> > +			if (capacity != arch_scale_cpu_capacity(NULL, i)) {
> > +				disable = false;
> > +				break;
> > +			}
> > +		}
> > +
> > +		if (disable)
> > +			sd->flags &= ~SD_ASYM_CPUCAPACITY;
> > +	}
> > +
> > +	/*
> >  	 * Convert topological properties into behaviour.
> >  	 */
> 
> If SD_ASYM_CPUCAPACITY means that some CPUs have different
> arch_scale_cpu_capacity() values, we could also automatically _set_
> the flag in sd_init() no ? Why should we let the arch set it and just
> correct it later ?
> 
> I understand the moment at which we know the capacities of CPUs varies
> from arch to arch, but the arch code could just call
> rebuild_sched_domain when the capacities of CPUs change and let the
> scheduler detect things automatically. I mean, even if the arch code
> sets the flag in its topology level table, it will have to rebuild
> the sched domains anyway ...
> 
> What do you think ?

We could as well set the flag here so the architecture doesn't have to
do it. It is a bit more complicated though for few reasons:

1. Detecting when to disable the flag is a lot simpler than checking
which level is should be set on. You basically have to work you way up
from the lowest topology level until you get to a level spanning all the
capacities available in the system to figure out where the flag should
be set. I don't think this fits easily with how we build the
sched_domain hierarchy. It can of course be done.

2. As you say, we still need the arch code (or cpufreq?) to rebuild the
whole thing once we know that the capacities have been determined. That
currently implies implementing arch_update_cpu_topology() which is
arch-specific. So we would need some arch code to make rebuild happen at
the right point in time. If the rebuild should be triggering the rebuild
we need another way to force a full rebuild. This can also be done.

3. Detecting the flag in generic kernel/sched/* code means that all
architectures will pay the for the overhead when building/rebuilding the
sched_domain hierarchy, and all architectures that sets the cpu
capacities to asymmetric will set the flag whether they like it or not.
I'm not sure if this is a problem.

In the end it is really about how much of this we want in generic code
and how much we hide in arch/, and if we dare to touch the sched_domain
build code ;-)

Morten

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry
  2018-07-05 14:13     ` Morten Rasmussen
@ 2018-07-05 15:03       ` Quentin Perret
  2018-07-20 13:54         ` Morten Rasmussen
  0 siblings, 1 reply; 37+ messages in thread
From: Quentin Perret @ 2018-07-05 15:03 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, valentin.schneider, dietmar.eggemann,
	vincent.guittot, gaku.inami.xh, linux-kernel

On Thursday 05 Jul 2018 at 15:13:49 (+0100), Morten Rasmussen wrote:
> On Thu, Jul 05, 2018 at 02:31:43PM +0100, Quentin Perret wrote:
> > If SD_ASYM_CPUCAPACITY means that some CPUs have different
> > arch_scale_cpu_capacity() values, we could also automatically _set_
> > the flag in sd_init() no ? Why should we let the arch set it and just
> > correct it later ?
> > 
> > I understand the moment at which we know the capacities of CPUs varies
> > from arch to arch, but the arch code could just call
> > rebuild_sched_domain when the capacities of CPUs change and let the
> > scheduler detect things automatically. I mean, even if the arch code
> > sets the flag in its topology level table, it will have to rebuild
> > the sched domains anyway ...
> > 
> > What do you think ?
> 
> We could as well set the flag here so the architecture doesn't have to
> do it. It is a bit more complicated though for few reasons:
> 
> 1. Detecting when to disable the flag is a lot simpler than checking
> which level is should be set on. You basically have to work you way up
> from the lowest topology level until you get to a level spanning all the
> capacities available in the system to figure out where the flag should
> be set. I don't think this fits easily with how we build the
> sched_domain hierarchy. It can of course be done.

Ah, that is something I missed. I see in 1f6e6c7cb9bc ("sched/core:
Introduce SD_ASYM_CPUCAPACITY sched_domain topology flag") that this
flag should be set only at the _lowest_ level at which there is
asymmetry. I had the wrong impression that the flag was supposed to be
set at _all_ level where there is some asymmetry. And actually having
'some asymmetry' isn't enough, we want to see the full range of CPU
capacities. Hmmm, that is indeed more complex than I thought ... :/

> 
> 2. As you say, we still need the arch code (or cpufreq?) to rebuild the
> whole thing once we know that the capacities have been determined. That
> currently implies implementing arch_update_cpu_topology() which is
> arch-specific. So we would need some arch code to make rebuild happen at
> the right point in time. If the rebuild should be triggering the rebuild
> we need another way to force a full rebuild. This can also be done.

Yeah, with just this patch the arch code will have to:
   1. update the arch_scale_cpu_capacity() of the CPUs;
   2. detect the asymmetry to set the flag in the topology table;
   3. rebuild the sched domains to let the scheduler know about the new
      topology information.

By doing what I suggest we would just save 2 from the arch side and do
it in the scheduler. So actually, I really start to wonder if it's worth
it given the added complexity ...

> 3. Detecting the flag in generic kernel/sched/* code means that all
> architectures will pay the for the overhead when building/rebuilding the
> sched_domain hierarchy, and all architectures that sets the cpu
> capacities to asymmetric will set the flag whether they like it or not.
> I'm not sure if this is a problem.

That is true as well ...

> 
> In the end it is really about how much of this we want in generic code
> and how much we hide in arch/, and if we dare to touch the sched_domain
> build code ;-)

Right so you can argue that the arch code is here to give you a
system-level information, and that if the scheduler wants to virtually
split that system, then it's its job to make sure that happens properly.
That is exactly what your patch does (IIUC), and I now think that this
is a very sensible middle-ground option. But this is debatable so I'm
interested to see what others think :-)

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
  2018-07-04 10:17 ` [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains Morten Rasmussen
@ 2018-07-06 10:18   ` Vincent Guittot
  2018-07-06 14:31     ` Morten Rasmussen
  0 siblings, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2018-07-06 10:18 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Dietmar Eggemann, gaku.inami.xh, linux-kernel

On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>
> The 'prefer sibling' sched_domain flag is intended to encourage
> spreading tasks to sibling sched_domain to take advantage of more caches
> and core for SMT systems. It has recently been changed to be on all
> non-NUMA topology level. However, spreading across domains with cpu
> capacity asymmetry isn't desirable, e.g. spreading from high capacity to
> low capacity cpus even if high capacity cpus aren't overutilized might
> give access to more cache but the cpu will be slower and possibly lead
> to worse overall throughput.
>
> To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
> level immediately below SD_ASYM_CPUCAPACITY.

This makes sense. Nevertheless, this patch also raises a scheduling
problem and break the 1 task per CPU policy that is enforced by
SD_PREFER_SIBLING. When running the tests of your cover letter, 1 long
running task is often co scheduled on a big core whereas short pinned
tasks are still running and a little core is idle which is not an
optimal scheduling decision

>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/topology.c | 12 ++++++++----
>  1 file changed, 8 insertions(+), 4 deletions(-)
>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 29c186961345..00c7a08c7f77 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1140,7 +1140,7 @@ sd_init(struct sched_domain_topology_level *tl,
>                                         | 0*SD_SHARE_CPUCAPACITY
>                                         | 0*SD_SHARE_PKG_RESOURCES
>                                         | 0*SD_SERIALIZE
> -                                       | 0*SD_PREFER_SIBLING
> +                                       | 1*SD_PREFER_SIBLING
>                                         | 0*SD_NUMA
>                                         | sd_flags
>                                         ,
> @@ -1186,17 +1186,21 @@ sd_init(struct sched_domain_topology_level *tl,
>         if (sd->flags & SD_ASYM_CPUCAPACITY) {
>                 struct sched_domain *t = sd;
>
> +               /*
> +                * Don't attempt to spread across cpus of different capacities.
> +                */
> +               if (sd->child)
> +                       sd->child->flags &= ~SD_PREFER_SIBLING;
> +
>                 for_each_lower_domain(t)
>                         t->flags |= SD_BALANCE_WAKE;
>         }
>
>         if (sd->flags & SD_SHARE_CPUCAPACITY) {
> -               sd->flags |= SD_PREFER_SIBLING;
>                 sd->imbalance_pct = 110;
>                 sd->smt_gain = 1178; /* ~15% */
>
>         } else if (sd->flags & SD_SHARE_PKG_RESOURCES) {
> -               sd->flags |= SD_PREFER_SIBLING;
>                 sd->imbalance_pct = 117;
>                 sd->cache_nice_tries = 1;
>                 sd->busy_idx = 2;
> @@ -1207,6 +1211,7 @@ sd_init(struct sched_domain_topology_level *tl,
>                 sd->busy_idx = 3;
>                 sd->idle_idx = 2;
>
> +               sd->flags &= ~SD_PREFER_SIBLING;
>                 sd->flags |= SD_SERIALIZE;
>                 if (sched_domains_numa_distance[tl->numa_level] > RECLAIM_DISTANCE) {
>                         sd->flags &= ~(SD_BALANCE_EXEC |
> @@ -1216,7 +1221,6 @@ sd_init(struct sched_domain_topology_level *tl,
>
>  #endif
>         } else {
> -               sd->flags |= SD_PREFER_SIBLING;
>                 sd->cache_nice_tries = 1;
>                 sd->busy_idx = 2;
>                 sd->idle_idx = 1;
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (11 preceding siblings ...)
  2018-07-04 10:17 ` [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains Morten Rasmussen
@ 2018-07-06 10:18 ` Vincent Guittot
  2018-07-09 15:08   ` Morten Rasmussen
  2018-07-31 12:00 ` Peter Zijlstra
  2018-08-17  1:57 ` Joel Fernandes
  14 siblings, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2018-07-06 10:18 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Dietmar Eggemann, gaku.inami.xh, linux-kernel

Hi Morten,

On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>
> On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
> for performance that cpu intensive tasks are aggressively migrated to
> high capacity cpus as soon as those become available. The capacity
> awareness tweaks already in the wake-up path can't handle this as such
> tasks might run or be runnable forever. If they happen to be placed on a
> low capacity cpu from the beginning they are stuck there forever while
> high capacity cpus may have become available in the meantime.
>
> To address this issue this patch set introduces a new "misfit"
> load-balancing scenario in periodic/nohz/newly idle balance which tweaks
> the load-balance conditions to ignore load per capacity in certain
> cases. Since misfit tasks are commonly running alone on a cpu, more
> aggressive active load-balancing is needed too.
>
> The fundamental idea of this patch set has been in Android kernels for a
> long time and is absolutely essential for consistent performance on
> asymmetric cpu capacity systems.
>

As already said , I'm not convinced by the proposal which seems quite
complex and also adds some kind of arbitrary and fixed power
management policy by deciding which tasks can or not go on big cores
whereas there are other frameworks to take such decision like EAS or
cgroups. Furthermore, there is already something similar in the kernel
with SD_ASYM_PACKING and IMO, it would be better to improve this
feature (if needed) instead of adding a new one which often do similar
things.
I have rerun your tests and got same results than misfit task patchset
on my hikey960 with SD_ASYM_PACKING feature for legacy b.L topology
and fake dynamiQ topology. And it give better performance when the
pinned tasks are short and scheduler has to wait for the task to
increase their utilization before getting a chance to migrate on big
core.
Then, I have tested SD_ASYM_PACKING with EAS patchset and they work
together for b/L and dynamiQ topology

> The patches have been tested on:
>    1. Arm Juno (r0): 2+4 Cortex A57/A53
>    2. Hikey960: 4+4 Cortex A73/A53
>
> Test case:
> Big cpus are always kept busy. Pin a shorter running sysbench tasks to
> big cpus, while creating a longer running set of unpinned sysbench
> tasks.
>
>     REQUESTS=1000
>     BIGS="1 2"
>     LITTLES="0 3 4 5"
>
>     # Don't care about the score for those, just keep the bigs busy
>     for i in $BIGS; do
>         taskset -c $i sysbench --max-requests=$((REQUESTS / 4)) \
>         --test=cpu  run &>/dev/null &
>     done
>
>     for i in $LITTLES; do
>         sysbench --max-requests=$REQUESTS --test=cpu run \
>         | grep "total time:" &
>     done
>
>     wait
>
> Results:
> Single runs with completion time of each task
> Juno (tip)
>     total time:                          1.2608s
>     total time:                          1.2995s
>     total time:                          1.5954s
>     total time:                          1.7463s
>
> Juno (misfit)
>     total time:                          1.2575s
>     total time:                          1.3004s
>     total time:                          1.5860s
>     total time:                          1.5871s
>
> Hikey960 (tip)
>     total time:                          1.7431s
>     total time:                          2.2914s
>     total time:                          2.5976s
>     total time:                          1.7280s
>
> Hikey960 (misfit)
>     total time:                          1.7866s
>     total time:                          1.7513s
>     total time:                          1.6918s
>     total time:                          1.6965s
>
> 10 run summary (tracking longest running task for each run)
>         Juno            Hikey960
>         avg     max     avg     max
> tip     1.7465  1.7469  2.5997  2.6131
> misfit  1.6016  1.6192  1.8506  1.9666
>
> Changelog:
> v4
> - Added check for empty cpu_map in sd_init().
> - Added patch to disable SD_ASYM_CPUCAPACITY for root_domains that don't
>   observe capacity asymmetry if the system as a whole is asymmetric.
> - Added patch to disable SD_PREFER_SIBLING on the sched_domain level below
>   SD_ASYM_CPUCAPACITY.
> - Rebased against tip/sched/core.
> - Fixed uninitialised variable introduced in update_sd_lb_stats.
> - Added patch to do a slight variable initialisation cleanup in update_sd_lb_stats.
> - Removed superfluous type changes for temp variables assigned to root_domain->overload.
> - Reworded commit for the patch setting rq->rd->overload when misfit.
> - v3 Tested-by: Gaku Inami <gaku.inami.xh@renesas.com>
>
> v3
> - Fixed locking around static_key.
> - Changed group per-cpu capacity comparison to be based on max rather
>   than min capacity.
> - Added patch to prevent occasional pointless high->low capacity
>   migrations.
> - Changed type of group_misfit_task_load and misfit_task_load to
>   unsigned long.
> - Changed fbq() to pick the cpu with highest misfit_task_load rather
>   than breaking when the first is found.
> - Rebased against tip/sched/core.
> - v2 Tested-by: Gaku Inami <gaku.inami.xh@renesas.com>
>
> v2
> - Removed redudant condition in static_key enablement.
> - Fixed logic flaw in patch #2 reported by Yi Yao <yi.yao@intel.com>
> - Dropped patch #4 as although the patch seems to make sense no benefit
>   has been proven.
> - Dropped root_domain->overload renaming
> - Changed type of root_domain->overload to int
> - Wrapped accesses of rq->rd->overload with READ/WRITE_ONCE
> - v1 Tested-by: Gaku Inami <gaku.inami.xh@renesas.com>
>
> Chris Redpath (1):
>   sched/fair: Don't move tasks to lower capacity cpus unless necessary
>
> Morten Rasmussen (6):
>   sched: Add static_key for asymmetric cpu capacity optimizations
>   sched/fair: Add group_misfit_task load-balance type
>   sched: Add sched_group per-cpu max capacity
>   sched/fair: Consider misfit tasks when load-balancing
>   sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without
>     asymmetry
>   sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity
>     domains
>
> Valentin Schneider (5):
>   sched/fair: Kick nohz balance if rq->misfit_task_load
>   sched/fair: Change prefer_sibling type to bool
>   sched: Change root_domain->overload type to int
>   sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
>   sched/fair: Set rq->rd->overload when misfit
>
>  kernel/sched/fair.c     | 161 +++++++++++++++++++++++++++++++++++++++++-------
>  kernel/sched/sched.h    |  16 +++--
>  kernel/sched/topology.c |  53 ++++++++++++++--
>  3 files changed, 199 insertions(+), 31 deletions(-)
>
> --
> 2.7.4
>

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
  2018-07-06 10:18   ` Vincent Guittot
@ 2018-07-06 14:31     ` Morten Rasmussen
  2018-07-31 12:17       ` Vincent Guittot
  0 siblings, 1 reply; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-06 14:31 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Dietmar Eggemann, gaku.inami.xh, linux-kernel

On Fri, Jul 06, 2018 at 12:18:17PM +0200, Vincent Guittot wrote:
> On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >
> > The 'prefer sibling' sched_domain flag is intended to encourage
> > spreading tasks to sibling sched_domain to take advantage of more caches
> > and core for SMT systems. It has recently been changed to be on all
> > non-NUMA topology level. However, spreading across domains with cpu
> > capacity asymmetry isn't desirable, e.g. spreading from high capacity to
> > low capacity cpus even if high capacity cpus aren't overutilized might
> > give access to more cache but the cpu will be slower and possibly lead
> > to worse overall throughput.
> >
> > To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
> > level immediately below SD_ASYM_CPUCAPACITY.
> 
> This makes sense. Nevertheless, this patch also raises a scheduling
> problem and break the 1 task per CPU policy that is enforced by
> SD_PREFER_SIBLING.

Scheduling one task per cpu when n_task == n_cpus on asymmetric
topologies is generally broken already and this patch set doesn't fix
that problem.

SD_PREFER_SIBLING might seem to help in very specific cases:
n_litte_cpus == n_big_cpus. In that case the little group might
classified as overloaded. It doesn't guarantee that anything gets pulled
as the grp_load/grp_capacity in the imbalance calculation on some system
still says the little cpus are more loaded than the bigs despite one of
them being idle. That depends on the little cpu capacities.

On systems where n_little_cpus != n_big_cpus SD_PREFER_SIBLING is broken
as it assumes the group_weight to be the same. This is the case on Juno
and several other platforms.

IMHO, SD_PREFER_SIBLING isn't the solution to this problem. It might
help for a limited subset of topologies/capacities but the right
solution is to change the imbalance calculation. As the name says, it is
meant to spread tasks and does so unconditionally. For asymmetric
systems we would like to consider cpu capacity before migrating tasks.

> When running the tests of your cover letter, 1 long
> running task is often co scheduled on a big core whereas short pinned
> tasks are still running and a little core is idle which is not an
> optimal scheduling decision

This can easily happen with SD_PREFER_SIBLING enabled too so I wouldn't
say that this patch breaks anything that isn't broken already. In fact
we this happening with and without this patch applied.

Morten

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-07-06 10:18 ` [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Vincent Guittot
@ 2018-07-09 15:08   ` Morten Rasmussen
  2018-07-26 17:14     ` Valentin Schneider
  2018-07-31 12:11     ` Vincent Guittot
  0 siblings, 2 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-09 15:08 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Dietmar Eggemann, gaku.inami.xh, linux-kernel

On Fri, Jul 06, 2018 at 12:18:27PM +0200, Vincent Guittot wrote:
> Hi Morten,
> 
> On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >
> > On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
> > for performance that cpu intensive tasks are aggressively migrated to
> > high capacity cpus as soon as those become available. The capacity
> > awareness tweaks already in the wake-up path can't handle this as such
> > tasks might run or be runnable forever. If they happen to be placed on a
> > low capacity cpu from the beginning they are stuck there forever while
> > high capacity cpus may have become available in the meantime.
> >
> > To address this issue this patch set introduces a new "misfit"
> > load-balancing scenario in periodic/nohz/newly idle balance which tweaks
> > the load-balance conditions to ignore load per capacity in certain
> > cases. Since misfit tasks are commonly running alone on a cpu, more
> > aggressive active load-balancing is needed too.
> >
> > The fundamental idea of this patch set has been in Android kernels for a
> > long time and is absolutely essential for consistent performance on
> > asymmetric cpu capacity systems.
> >
> 
> As already said , I'm not convinced by the proposal which seems quite
> complex and also adds some kind of arbitrary and fixed power
> management policy by deciding which tasks can or not go on big cores
> whereas there are other frameworks to take such decision like EAS or
> cgroups.

The misfit patches are a crucial part of the EAS solution but they also
make sense for some users on their own without an energy model. This is
why they are posted separately.

We have already discussed at length why the patches are needed and why
the look like they do here in this thread:

https://lore.kernel.org/lkml/CAKfTPtD4skW_3SAk--vBEC5-m1Ua48bjOQYS0pDqW3nPSpsENg@mail.gmail.com/

> Furthermore, there is already something similar in the kernel
> with SD_ASYM_PACKING and IMO, it would be better to improve this
> feature (if needed) instead of adding a new one which often do similar
> things.

As said in the previous thread, while it might look similar it isn't.
SD_ASYM_PACKING isn't utilization-based which is the key metric used for
EAS, schedutil, util_est, and util_clamp. SD_ASYM_PACKING serves a
different purpose (see previous thread for details).

> I have rerun your tests and got same results than misfit task patchset
> on my hikey960 with SD_ASYM_PACKING feature for legacy b.L topology
> and fake dynamiQ topology. And it give better performance when the
> pinned tasks are short and scheduler has to wait for the task to
> increase their utilization before getting a chance to migrate on big
> core.

Right, the test cases are quite simple and could be served better by
SD_ASYM_PACKING. As we already discussed in that thread, that is due to
the PELT lag but this the cost we have to pay if we don't have
additional information about the requirements of the task and we don't
want to default to big-first with all its implications.

We have covered all this in the thread in early April.

> Then, I have tested SD_ASYM_PACKING with EAS patchset and they work
> together for b/L and dynamiQ topology

Could you provide some more details about your evaluation? It probably
works well for some use-cases but it isn't really designed for what we
need for EAS.

Morten

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry
  2018-07-05 15:03       ` Quentin Perret
@ 2018-07-20 13:54         ` Morten Rasmussen
  0 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-20 13:54 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, mingo, valentin.schneider, dietmar.eggemann,
	vincent.guittot, gaku.inami.xh, linux-kernel

On Thu, Jul 05, 2018 at 04:03:11PM +0100, Quentin Perret wrote:
> On Thursday 05 Jul 2018 at 15:13:49 (+0100), Morten Rasmussen wrote:
> > 3. Detecting the flag in generic kernel/sched/* code means that all
> > architectures will pay the for the overhead when building/rebuilding the
> > sched_domain hierarchy, and all architectures that sets the cpu
> > capacities to asymmetric will set the flag whether they like it or not.
> > I'm not sure if this is a problem.
> 
> That is true as well ...
> 
> > 
> > In the end it is really about how much of this we want in generic code
> > and how much we hide in arch/, and if we dare to touch the sched_domain
> > build code ;-)
> 
> Right so you can argue that the arch code is here to give you a
> system-level information, and that if the scheduler wants to virtually
> split that system, then it's its job to make sure that happens properly.
> That is exactly what your patch does (IIUC), and I now think that this
> is a very sensible middle-ground option. But this is debatable so I'm
> interested to see what others think :-)

I went ahead an hacked up some patches that sets the flag automatically
as part of the sched_domain build process. I posted them so people can
have a look: 1532093554-30504-1-git-send-email-morten.rasmussen@arm.com

With those patches this patch has to be reverted/dropped.

Morten

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-07-09 15:08   ` Morten Rasmussen
@ 2018-07-26 17:14     ` Valentin Schneider
  2018-07-30 14:30       ` Dietmar Eggemann
  2018-07-31 12:11     ` Vincent Guittot
  1 sibling, 1 reply; 37+ messages in thread
From: Valentin Schneider @ 2018-07-26 17:14 UTC (permalink / raw)
  To: Morten Rasmussen, Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, gaku.inami.xh,
	linux-kernel

Hi,

On 09/07/18 16:08, Morten Rasmussen wrote:
> On Fri, Jul 06, 2018 at 12:18:27PM +0200, Vincent Guittot wrote:
>> Hi Morten,
>>
>> On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>>> [...]
>> As already said , I'm not convinced by the proposal which seems quite
>> complex and also adds some kind of arbitrary and fixed power
>> management policy by deciding which tasks can or not go on big cores
>> whereas there are other frameworks to take such decision like EAS or
>> cgroups.
> 
> The misfit patches are a crucial part of the EAS solution but they also
> make sense for some users on their own without an energy model. This is
> why they are posted separately.
> 
> We have already discussed at length why the patches are needed and why
> the look like they do here in this thread:
> 
> https://lore.kernel.org/lkml/CAKfTPtD4skW_3SAk--vBEC5-m1Ua48bjOQYS0pDqW3nPSpsENg@mail.gmail.com/
> 
>> Furthermore, there is already something similar in the kernel
>> with SD_ASYM_PACKING and IMO, it would be better to improve this
>> feature (if needed) instead of adding a new one which often do similar
>> things.
> 
> As said in the previous thread, while it might look similar it isn't.
> SD_ASYM_PACKING isn't utilization-based which is the key metric used for
> EAS, schedutil, util_est, and util_clamp. SD_ASYM_PACKING serves a
> different purpose (see previous thread for details).
> 
>> I have rerun your tests and got same results than misfit task patchset
>> on my hikey960 with SD_ASYM_PACKING feature for legacy b.L topology
>> and fake dynamiQ topology. And it give better performance when the
>> pinned tasks are short and scheduler has to wait for the task to
>> increase their utilization before getting a chance to migrate on big
>> core.
> 
> Right, the test cases are quite simple and could be served better by
> SD_ASYM_PACKING. As we already discussed in that thread, that is due to
> the PELT lag but this the cost we have to pay if we don't have
> additional information about the requirements of the task and we don't
> want to default to big-first with all its implications.
> 

I played around with SD_ASYM_PACKING & lmbench on my HiKey960, and I think I
can bring a few more arguments to the table.



In terms of setup, I took Vincent's approach ([1]) which is to define
CPU priority as CPU capacity. As for the sched_domain flags, I initially
added SD_ASYM_PACKING to the DIE sched_domain, since it's the only level where
we want to do any ASYM_PACKING.

That causes a problem with nohz kicks, because the per-CPU sd_asym cache 
(which is used to determine if we can pack stuff into nohz CPUs) is defined
as:

    highest_flag_domain(cpu, SD_ASYM_PACKING);

which returns NULL because the first domain (MC) doesn't have it. That makes
sense to me as AFAICT that feature is mostly used to pack stuff on SMT levels.
It is only set at MC level in Vincent's patch, but that doesn't work for
regular big.LITTLE, so I had to set it for both MC and DIE. This does add a few
useless ops though, since all of the CPUs in a given MC domain have the same
priority.



With that out of the way, I did some lmbench runs:
> lat_mem_rd 10 1024

With ASYM_PACKING, I still see lmbench tasks remaining on LITTLE CPUs while
bigs are free, because ASYM_PACKING only does explicit active balancing on
CPU_NEWLY_IDLE balancing - otherwise it'll rely on the nr_balance_failed counter.

However, that counter can be reset before it reaches the threshold at which
active balance is done, which can lead to huge upmigration delays (almost a
full second). I also see the same kind of issues on Juno r0.

This could be resolved by extending ASYM_PACKING active balancing to 
non NEWLY_IDLE cases, but then we'd be thrashing everything. That's another
argument for basing upmigration on task load-tracking signals, as we can
determine which tasks need active balancing much faster than the
nr_balance_failed counter way while not active balancing the world.

---

[1]: https://lore.kernel.org/lkml/1522223215-23524-1-git-send-email-vincent.guittot@linaro.org/

---
lmbench results are meant to be plotted, I've added some pointers to show the
big obvious anomalies. I don't really care for the actual scores, as the 
resulting traces are interesting on their own, but I've included them for
the sake of completeness.

(lat_mem_rd 10 1024) with ASYM_PACKING:

0.00098 1.275
0.00195 1.274
0.00293 1.274
0.00391 1.274
0.00586 1.274
0.00781 1.275
0.00977 1.274
0.01172 1.275
0.01367 1.274
0.01562 1.274
0.01758 1.274
0.01953 1.274
0.02148 1.274
0.02344 1.274
0.02539 1.274
0.02734 1.275
0.0293 1.275
0.03125 1.275
0.03516 1.275
0.03906 1.275
0.04297 1.274
0.04688 1.275
0.05078 1.275
0.05469 1.275
0.05859 1.275
0.0625 1.275
0.07031 3.153
0.07812 4.035
0.08594 4.164
0.09375 4.237
0.10156 4.172
0.10938 4.1
0.11719 4.121
0.125 4.171
0.14062 4.073
0.15625 4.051
0.17188 4.026
0.1875 4.002
0.20312 3.973
0.21875 3.948
0.23438 3.927
0.25 3.904
0.28125 3.869
0.3125 3.86
0.34375 3.824
0.375 3.803
0.40625 3.798
0.4375 3.768
0.46875 3.784
0.5 3.753
0.5625 3.73
0.625 3.739
0.6875 3.703
0.75 3.69
0.8125 3.693
0.875 3.679
0.9375 3.686
1.0 3.664
1.125 3.656
1.25 3.658
1.375 3.638
1.5 3.635
1.625 3.628
1.75 4.274
1.875 4.579
2.0 4.651
2.25 5.313
2.5 6.314
2.75 7.585
3.0 8.457
3.25 9.045
3.5 9.532
3.75 9.909
4.0 148.66   <-----
4.5 10.191
5.0 10.222
5.5 10.208
6.0 10.21
6.5 10.21
7.0 10.199
7.5 10.203
8.0 154.354   <-----
9.0 10.163
10.0 10.138

(lat_mem_rd 10 1024) with misfit patches:

0.00098 1.273
0.00195 1.273
0.00293 1.273
0.00391 1.273
0.00586 1.274
0.00781 1.273
0.00977 1.273
0.01172 1.273
0.01367 1.273
0.01562 1.273
0.01758 1.273
0.01953 1.273
0.02148 1.273
0.02344 1.274
0.02539 1.273
0.02734 1.273
0.0293 1.273
0.03125 1.273
0.03516 1.273
0.03906 1.273
0.04297 1.273
0.04688 1.274
0.05078 1.274
0.05469 1.274
0.05859 1.274
0.0625 1.274
0.07031 3.323
0.07812 4.074
0.08594 4.171
0.09375 4.254
0.10156 4.166
0.10938 4.084
0.11719 4.088
0.125 4.112
0.14062 4.127
0.15625 4.132
0.17188 4.132
0.1875 4.131
0.20312 4.187
0.21875 4.17
0.23438 4.153
0.25 4.138
0.28125 4.102
0.3125 4.081
0.34375 4.075
0.375 4.011
0.40625 4.033
0.4375 4.021
0.46875 3.937
0.5 3.99
0.5625 3.901
0.625 3.995
0.6875 3.89
0.75 3.863
0.8125 3.903
0.875 3.883
0.9375 3.82
1.0 3.945
1.125 3.85
1.25 3.884
1.375 3.833
1.5 4.89
1.625 4.834
1.75 5.041
1.875 5.054
2.0 5.38
2.25 5.752
2.5 6.805
2.75 7.516
3.0 8.545
3.25 9.115
3.5 9.525
3.75 9.871
4.0 10.017
4.5 10.177
5.0 10.201
5.5 10.209
6.0 10.204
6.5 10.18
7.0 10.19
7.5 10.171
8.0 10.166
9.0 10.164
10.0 10.166

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-07-26 17:14     ` Valentin Schneider
@ 2018-07-30 14:30       ` Dietmar Eggemann
  2018-07-31 12:13         ` Vincent Guittot
  0 siblings, 1 reply; 37+ messages in thread
From: Dietmar Eggemann @ 2018-07-30 14:30 UTC (permalink / raw)
  To: Valentin Schneider, Morten Rasmussen, Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, gaku.inami.xh, linux-kernel

On 07/26/2018 07:14 PM, Valentin Schneider wrote:
> Hi,
> 
> On 09/07/18 16:08, Morten Rasmussen wrote:
>> On Fri, Jul 06, 2018 at 12:18:27PM +0200, Vincent Guittot wrote:
>>> Hi Morten,
>>>
>>> On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <morten.rasmussen@arm.com> wrote:

[...]

> With that out of the way, I did some lmbench runs:
>> lat_mem_rd 10 1024
> 
> With ASYM_PACKING, I still see lmbench tasks remaining on LITTLE CPUs while
> bigs are free, because ASYM_PACKING only does explicit active balancing on
> CPU_NEWLY_IDLE balancing - otherwise it'll rely on the nr_balance_failed counter.
> 
> However, that counter can be reset before it reaches the threshold at which
> active balance is done, which can lead to huge upmigration delays (almost a
> full second). I also see the same kind of issues on Juno r0.
> 
> This could be resolved by extending ASYM_PACKING active balancing to
> non NEWLY_IDLE cases, but then we'd be thrashing everything. That's another
> argument for basing upmigration on task load-tracking signals, as we can
> determine which tasks need active balancing much faster than the
> nr_balance_failed counter way while not active balancing the world.

The task layout of the test looks like n=85 always running tasks (each 
for ~ 1.25ms on big or little) and they all get created and run one 
after the other. So on a big cpu, their util values go from 512 to 1024 
and from 223 to 446 on little cpu (Juno board). Latter thanks to 
Quentin's 'sched/fair: Fix util_avg of new tasks for asymmetric systems'.

root@juno:~# cat /sys/devices/system/cpu/cpu[01]/cpu_capacity
446
1024

> (lat_mem_rd 10 1024) with ASYM_PACKING:

...
> 4.0 148.66   <-----
> 4.5 10.191
...
> 7.5 10.203
> 8.0 154.354   <-----

I ran the test affine to big, little and all cpus on tip/sched/core w/o 
ASYM_PACKING or Misfit:

cputype:     big  little     all
cpumask:    0x06    0x39    0xff

mem size   <---- latency  ---->

  0.00098   3.668   3.595   3.669
  0.00195   3.668   3.594   3.594
  0.00293   3.668   3.593   3.595
  0.00391   3.669   3.596   3.595
  ...
  3.75000  58.687 121.934 122.293
  4.00000  57.054 121.771 120.489
  4.50000  56.914 121.851  56.729
  5.00000  57.347 121.777  56.975
  5.50000  57.705 121.738  68.981
  6.00000  57.935 121.728  57.542
  6.50000  58.119 121.694 121.799
  7.00000  58.194 121.502  57.844
  7.50000  58.258 121.684  58.050
  8.00000  58.293 121.725  58.030
  9.00000  58.309 121.793  58.188
10.00000  58.561 122.252 122.078

There is no diff between big and little cpus with small memory sizes, 
just with the MB range.
If I look into the trace for 'all' it turns out that their are cases in 
which, even if the task only run for ~15% of the time on big, the 
latency value is printed as when it was running affine to big. So using 
the latency value as an indicator where the task was scheduled is IMHO 
not really possible.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 01/12] sched: Add static_key for asymmetric cpu capacity optimizations
  2018-07-04 10:17 ` [PATCHv4 01/12] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
@ 2018-07-31 10:59   ` Peter Zijlstra
  2018-08-02 15:15     ` Morten Rasmussen
  0 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2018-07-31 10:59 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel


Combined with that SD_ASYM.. rework I ended up with the below.

Holler if you want it changed :-)


---


Subject: sched: Add static_key for asymmetric cpu capacity optimizations
From: Morten Rasmussen <morten.rasmussen@arm.com>
Date: Wed, 4 Jul 2018 11:17:39 +0100

The existing asymmetric cpu capacity code should cause minimal overhead
for others. Putting it behind a static_key, it has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Cc: valentin.schneider@arm.com
Cc: mingo@redhat.com
Cc: vincent.guittot@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: gaku.inami.xh@renesas.com
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Link: http://lkml.kernel.org/r/1530699470-29808-2-git-send-email-morten.rasmussen@arm.com
---
 kernel/sched/fair.c     |    3 +++
 kernel/sched/sched.h    |    1 +
 kernel/sched/topology.c |    9 ++++++++-
 3 files changed, 12 insertions(+), 1 deletion(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6186,6 +6186,9 @@ static int wake_cap(struct task_struct *
 {
 	long min_cap, max_cap;
 
+	if (!static_branch_unlikely(&sched_asym_cpucapacity))
+		return 0;
+
 	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
 	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
 
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1185,6 +1185,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+extern struct static_key_false sched_asym_cpucapacity;
 
 struct sched_group_capacity {
 	atomic_t		ref;
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -1708,6 +1709,7 @@ build_sched_domains(const struct cpumask
 	struct rq *rq = NULL;
 	int i, ret = -ENOMEM;
 	struct sched_domain_topology_level *tl_asym;
+	bool has_asym = false;
 
 	alloc_state = __visit_domain_allocation_hell(&d, cpu_map);
 	if (alloc_state != sa_rootdomain)
@@ -1723,8 +1725,10 @@ build_sched_domains(const struct cpumask
 		for_each_sd_topology(tl) {
 			int dflags = 0;
 
-			if (tl == tl_asym)
+			if (tl == tl_asym) {
 				dflags |= SD_ASYM_CPUCAPACITY;
+				has_asym = true;
+			}
 
 			sd = build_sched_domain(tl, cpu_map, attr, sd, dflags, i);
 
@@ -1776,6 +1780,9 @@ build_sched_domains(const struct cpumask
 	}
 	rcu_read_unlock();
 
+	if (has_asym)
+		static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
+
 	if (rq && sched_debug_enabled) {
 		pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
 			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (12 preceding siblings ...)
  2018-07-06 10:18 ` [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Vincent Guittot
@ 2018-07-31 12:00 ` Peter Zijlstra
  2018-07-31 12:10   ` Valentin Schneider
  2018-08-17  1:57 ` Joel Fernandes
  14 siblings, 1 reply; 37+ messages in thread
From: Peter Zijlstra @ 2018-07-31 12:00 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel



Aside from the first patch, which I posted the change on, I've picked up
until 10. I think that other SD_ASYM patch-set replaces 11 and 12,
right?

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-07-31 12:00 ` Peter Zijlstra
@ 2018-07-31 12:10   ` Valentin Schneider
  2018-07-31 12:50     ` Morten Rasmussen
  0 siblings, 1 reply; 37+ messages in thread
From: Valentin Schneider @ 2018-07-31 12:10 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: mingo, dietmar.eggemann, vincent.guittot, gaku.inami.xh, linux-kernel

Hi Peter,

On 31/07/18 13:00, Peter Zijlstra wrote:
> 
> 
> Aside from the first patch, which I posted the change on, I've picked up
> until 10. I think that other SD_ASYM patch-set replaces 11 and 12,
> right?
>
11 is no longer needed, but AFAICT we still need 12 - we don't want
PREFER_SIBLING to interfere with asymmetric systems.

Also, I've been playing around with a patch to modify 5 (the nohz kicks) that
first scans the capacity of nohz CPUs before doing any kick (pretty much like
what asym packing is doing with nohz CPU priority), do you want me to toss
it your way? I haven't done run any tests with it yet...

Cheers,
Valentin

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-07-09 15:08   ` Morten Rasmussen
  2018-07-26 17:14     ` Valentin Schneider
@ 2018-07-31 12:11     ` Vincent Guittot
  1 sibling, 0 replies; 37+ messages in thread
From: Vincent Guittot @ 2018-07-31 12:11 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Dietmar Eggemann, gaku.inami.xh, linux-kernel

On Mon, 9 Jul 2018 at 17:08, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>
> On Fri, Jul 06, 2018 at 12:18:27PM +0200, Vincent Guittot wrote:
> > Hi Morten,
> >
> > On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > >
> > > On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
> > > for performance that cpu intensive tasks are aggressively migrated to
> > > high capacity cpus as soon as those become available. The capacity
> > > awareness tweaks already in the wake-up path can't handle this as such
> > > tasks might run or be runnable forever. If they happen to be placed on a
> > > low capacity cpu from the beginning they are stuck there forever while
> > > high capacity cpus may have become available in the meantime.
> > >
> > > To address this issue this patch set introduces a new "misfit"
> > > load-balancing scenario in periodic/nohz/newly idle balance which tweaks
> > > the load-balance conditions to ignore load per capacity in certain
> > > cases. Since misfit tasks are commonly running alone on a cpu, more
> > > aggressive active load-balancing is needed too.
> > >
> > > The fundamental idea of this patch set has been in Android kernels for a
> > > long time and is absolutely essential for consistent performance on
> > > asymmetric cpu capacity systems.
> > >
> >
> > As already said , I'm not convinced by the proposal which seems quite
> > complex and also adds some kind of arbitrary and fixed power
> > management policy by deciding which tasks can or not go on big cores
> > whereas there are other frameworks to take such decision like EAS or
> > cgroups.
>
> The misfit patches are a crucial part of the EAS solution but they also

EAS needs  the scheduler to move long running task on big core
(especially when overloaded), and the misfit task is just one proposal

> make sense for some users on their own without an energy model. This is
> why they are posted separately.
>
> We have already discussed at length why the patches are needed and why
> the look like they do here in this thread:
>
> https://lore.kernel.org/lkml/CAKfTPtD4skW_3SAk--vBEC5-m1Ua48bjOQYS0pDqW3nPSpsENg@mail.gmail.com/
>
> > Furthermore, there is already something similar in the kernel
> > with SD_ASYM_PACKING and IMO, it would be better to improve this
> > feature (if needed) instead of adding a new one which often do similar
> > things.
>
> As said in the previous thread, while it might look similar it isn't.
> SD_ASYM_PACKING isn't utilization-based which is the key metric used for
> EAS, schedutil, util_est, and util_clamp. SD_ASYM_PACKING serves a
> different purpose (see previous thread for details).
>
> > I have rerun your tests and got same results than misfit task patchset
> > on my hikey960 with SD_ASYM_PACKING feature for legacy b.L topology
> > and fake dynamiQ topology. And it give better performance when the
> > pinned tasks are short and scheduler has to wait for the task to
> > increase their utilization before getting a chance to migrate on big
> > core.
>
> Right, the test cases are quite simple and could be served better by
> SD_ASYM_PACKING. As we already discussed in that thread, that is due to
> the PELT lag but this the cost we have to pay if we don't have
> additional information about the requirements of the task and we don't
> want to default to big-first with all its implications.
>
> We have covered all this in the thread in early April.
>
> > Then, I have tested SD_ASYM_PACKING with EAS patchset and they work
> > together for b/L and dynamiQ topology
>
> Could you provide some more details about your evaluation? It probably
> works well for some use-cases but it isn't really designed for what we
> need for EAS.
>
> Morten

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-07-30 14:30       ` Dietmar Eggemann
@ 2018-07-31 12:13         ` Vincent Guittot
  2018-07-31 12:14           ` Dietmar Eggemann
  0 siblings, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2018-07-31 12:13 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Valentin Schneider, Morten Rasmussen, Peter Zijlstra,
	Ingo Molnar, gaku.inami.xh, linux-kernel

On Mon, 30 Jul 2018 at 16:30, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 07/26/2018 07:14 PM, Valentin Schneider wrote:
> > Hi,
> >
> > On 09/07/18 16:08, Morten Rasmussen wrote:
> >> On Fri, Jul 06, 2018 at 12:18:27PM +0200, Vincent Guittot wrote:
> >>> Hi Morten,
> >>>
> >>> On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>
> [...]
>
> > With that out of the way, I did some lmbench runs:
> >> lat_mem_rd 10 1024
> >
> > With ASYM_PACKING, I still see lmbench tasks remaining on LITTLE CPUs while
> > bigs are free, because ASYM_PACKING only does explicit active balancing on
> > CPU_NEWLY_IDLE balancing - otherwise it'll rely on the nr_balance_failed counter.
> >
> > However, that counter can be reset before it reaches the threshold at which
> > active balance is done, which can lead to huge upmigration delays (almost a
> > full second). I also see the same kind of issues on Juno r0.
> >
> > This could be resolved by extending ASYM_PACKING active balancing to
> > non NEWLY_IDLE cases, but then we'd be thrashing everything. That's another
> > argument for basing upmigration on task load-tracking signals, as we can
> > determine which tasks need active balancing much faster than the
> > nr_balance_failed counter way while not active balancing the world.
>
> The task layout of the test looks like n=85 always running tasks (each
> for ~ 1.25ms on big or little) and they all get created and run one

How mistfit task can make a difference for a benchmark which uses 1.25ms tasks ?

> after the other. So on a big cpu, their util values go from 512 to 1024
> and from 223 to 446 on little cpu (Juno board). Latter thanks to
> Quentin's 'sched/fair: Fix util_avg of new tasks for asymmetric systems'.
>
> root@juno:~# cat /sys/devices/system/cpu/cpu[01]/cpu_capacity
> 446
> 1024
>
> > (lat_mem_rd 10 1024) with ASYM_PACKING:
>
> ...
> > 4.0 148.66   <-----
> > 4.5 10.191
> ...
> > 7.5 10.203
> > 8.0 154.354   <-----
>
> I ran the test affine to big, little and all cpus on tip/sched/core w/o
> ASYM_PACKING or Misfit:
>
> cputype:     big  little     all
> cpumask:    0x06    0x39    0xff
>
> mem size   <---- latency  ---->
>
>   0.00098   3.668   3.595   3.669
>   0.00195   3.668   3.594   3.594
>   0.00293   3.668   3.593   3.595
>   0.00391   3.669   3.596   3.595
>   ...
>   3.75000  58.687 121.934 122.293
>   4.00000  57.054 121.771 120.489
>   4.50000  56.914 121.851  56.729
>   5.00000  57.347 121.777  56.975
>   5.50000  57.705 121.738  68.981
>   6.00000  57.935 121.728  57.542
>   6.50000  58.119 121.694 121.799
>   7.00000  58.194 121.502  57.844
>   7.50000  58.258 121.684  58.050
>   8.00000  58.293 121.725  58.030
>   9.00000  58.309 121.793  58.188
> 10.00000  58.561 122.252 122.078
>
> There is no diff between big and little cpus with small memory sizes,
> just with the MB range.
> If I look into the trace for 'all' it turns out that their are cases in
> which, even if the task only run for ~15% of the time on big, the
> latency value is printed as when it was running affine to big. So using
> the latency value as an indicator where the task was scheduled is IMHO
> not really possible.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-07-31 12:13         ` Vincent Guittot
@ 2018-07-31 12:14           ` Dietmar Eggemann
  0 siblings, 0 replies; 37+ messages in thread
From: Dietmar Eggemann @ 2018-07-31 12:14 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Valentin Schneider, Morten Rasmussen, Peter Zijlstra,
	Ingo Molnar, gaku.inami.xh, linux-kernel

On 07/31/2018 02:13 PM, Vincent Guittot wrote:
> On Mon, 30 Jul 2018 at 16:30, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 07/26/2018 07:14 PM, Valentin Schneider wrote:

[...]

>> The task layout of the test looks like n=85 always running tasks (each
>> for ~ 1.25ms on big or little) and they all get created and run one
> 
> How mistfit task can make a difference for a benchmark which uses 1.25ms tasks ?

Ah, sorry! This was a typo. It should be ~ 1.25s.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
  2018-07-06 14:31     ` Morten Rasmussen
@ 2018-07-31 12:17       ` Vincent Guittot
  2018-07-31 12:33         ` Valentin Schneider
  0 siblings, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2018-07-31 12:17 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, Valentin Schneider,
	Dietmar Eggemann, gaku.inami.xh, linux-kernel

On Fri, 6 Jul 2018 at 16:31, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>
> On Fri, Jul 06, 2018 at 12:18:17PM +0200, Vincent Guittot wrote:
> > On Wed, 4 Jul 2018 at 12:18, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > >
> > > The 'prefer sibling' sched_domain flag is intended to encourage
> > > spreading tasks to sibling sched_domain to take advantage of more caches
> > > and core for SMT systems. It has recently been changed to be on all
> > > non-NUMA topology level. However, spreading across domains with cpu
> > > capacity asymmetry isn't desirable, e.g. spreading from high capacity to
> > > low capacity cpus even if high capacity cpus aren't overutilized might
> > > give access to more cache but the cpu will be slower and possibly lead
> > > to worse overall throughput.
> > >
> > > To prevent this, we need to remove SD_PREFER_SIBLING on the sched_domain
> > > level immediately below SD_ASYM_CPUCAPACITY.
> >
> > This makes sense. Nevertheless, this patch also raises a scheduling
> > problem and break the 1 task per CPU policy that is enforced by
> > SD_PREFER_SIBLING.
>
> Scheduling one task per cpu when n_task == n_cpus on asymmetric
> topologies is generally broken already and this patch set doesn't fix
> that problem.
>
> SD_PREFER_SIBLING might seem to help in very specific cases:
> n_litte_cpus == n_big_cpus. In that case the little group might
> classified as overloaded. It doesn't guarantee that anything gets pulled
> as the grp_load/grp_capacity in the imbalance calculation on some system
> still says the little cpus are more loaded than the bigs despite one of
> them being idle. That depends on the little cpu capacities.
>
> On systems where n_little_cpus != n_big_cpus SD_PREFER_SIBLING is broken
> as it assumes the group_weight to be the same. This is the case on Juno
> and several other platforms.
>
> IMHO, SD_PREFER_SIBLING isn't the solution to this problem. It might

I agree but this patchset creates a regression in the scheduling behavior

> help for a limited subset of topologies/capacities but the right
> solution is to change the imbalance calculation. As the name says, it is

Yes that what does the prototype that I came with.

> meant to spread tasks and does so unconditionally. For asymmetric
> systems we would like to consider cpu capacity before migrating tasks.
>
> > When running the tests of your cover letter, 1 long
> > running task is often co scheduled on a big core whereas short pinned
> > tasks are still running and a little core is idle which is not an
> > optimal scheduling decision
>
> This can easily happen with SD_PREFER_SIBLING enabled too so I wouldn't
> say that this patch breaks anything that isn't broken already. In fact
> we this happening with and without this patch applied.

At least for the use case  above, this doesn't happen when
SD_PREFER_SIBLING is set

>
> Morten

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
  2018-07-31 12:17       ` Vincent Guittot
@ 2018-07-31 12:33         ` Valentin Schneider
  2018-08-06 10:20           ` Vincent Guittot
  0 siblings, 1 reply; 37+ messages in thread
From: Valentin Schneider @ 2018-07-31 12:33 UTC (permalink / raw)
  To: Vincent Guittot, Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, Dietmar Eggemann, gaku.inami.xh,
	linux-kernel

Hi,

On 31/07/18 13:17, Vincent Guittot wrote:
> On Fri, 6 Jul 2018 at 16:31, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>>
>> On Fri, Jul 06, 2018 at 12:18:17PM +0200, Vincent Guittot wrote:
>>> [...]
>>
>> Scheduling one task per cpu when n_task == n_cpus on asymmetric
>> topologies is generally broken already and this patch set doesn't fix
>> that problem.
>>
>> SD_PREFER_SIBLING might seem to help in very specific cases:
>> n_litte_cpus == n_big_cpus. In that case the little group might
>> classified as overloaded. It doesn't guarantee that anything gets pulled
>> as the grp_load/grp_capacity in the imbalance calculation on some system
>> still says the little cpus are more loaded than the bigs despite one of
>> them being idle. That depends on the little cpu capacities.
>>
>> On systems where n_little_cpus != n_big_cpus SD_PREFER_SIBLING is broken
>> as it assumes the group_weight to be the same. This is the case on Juno
>> and several other platforms.
>>
>> IMHO, SD_PREFER_SIBLING isn't the solution to this problem. It might
> 
> I agree but this patchset creates a regression in the scheduling behavior
> 
>> help for a limited subset of topologies/capacities but the right
>> solution is to change the imbalance calculation. As the name says, it is
> 
> Yes that what does the prototype that I came with.
> 
>> meant to spread tasks and does so unconditionally. For asymmetric
>> systems we would like to consider cpu capacity before migrating tasks.
>>
>>> When running the tests of your cover letter, 1 long
>>> running task is often co scheduled on a big core whereas short pinned
>>> tasks are still running and a little core is idle which is not an
>>> optimal scheduling decision
>>
>> This can easily happen with SD_PREFER_SIBLING enabled too so I wouldn't
>> say that this patch breaks anything that isn't broken already. In fact
>> we this happening with and without this patch applied.
> 
> At least for the use case  above, this doesn't happen when
> SD_PREFER_SIBLING is set
> 

On my HiKey960 I can see coscheduling on a big CPU while a LITTLE is free
with **and** without SD_PREFER_SIBLING. Having it set only means that in
some cases the imbalance will be re-classified as group_overloaded instead
of group_misfit_task, so we'll skip the misfit logic when we shouldn't (this
happens on Juno for instance).

It does nothing for the "1 task per CPU" problem that Morten described above.
When you have this little amount of tasks, load isn't very relevant, but it
skews the load-balancer into thinking the LITTLE CPUs are more busy than
the bigs even though there's an idle one in the lot.

>>
>> Morten

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-07-31 12:10   ` Valentin Schneider
@ 2018-07-31 12:50     ` Morten Rasmussen
  0 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-07-31 12:50 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peter Zijlstra, mingo, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel

On Tue, Jul 31, 2018 at 01:10:49PM +0100, Valentin Schneider wrote:
> Hi Peter,
> 
> On 31/07/18 13:00, Peter Zijlstra wrote:
> > 
> > 
> > Aside from the first patch, which I posted the change on, I've picked up
> > until 10. I think that other SD_ASYM patch-set replaces 11 and 12,
> > right?
> >
> 11 is no longer needed, but AFAICT we still need 12 - we don't want
> PREFER_SIBLING to interfere with asymmetric systems.

Yes, we still want patch 12 if possible.

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 01/12] sched: Add static_key for asymmetric cpu capacity optimizations
  2018-07-31 10:59   ` Peter Zijlstra
@ 2018-08-02 15:15     ` Morten Rasmussen
  0 siblings, 0 replies; 37+ messages in thread
From: Morten Rasmussen @ 2018-08-02 15:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel

On Tue, Jul 31, 2018 at 12:59:16PM +0200, Peter Zijlstra wrote:
> 
> Combined with that SD_ASYM.. rework I ended up with the below.
> 
> Holler if you want it changed :-)

Looks good to me.

Thanks,
Morten

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
  2018-07-31 12:33         ` Valentin Schneider
@ 2018-08-06 10:20           ` Vincent Guittot
  2018-08-06 10:53             ` Valentin Schneider
  0 siblings, 1 reply; 37+ messages in thread
From: Vincent Guittot @ 2018-08-06 10:20 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Morten Rasmussen, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	gaku.inami.xh, linux-kernel

Hi Valentin,

On Tue, 31 Jul 2018 at 14:33, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Hi,
>
> On 31/07/18 13:17, Vincent Guittot wrote:
> > On Fri, 6 Jul 2018 at 16:31, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >>
> >> On Fri, Jul 06, 2018 at 12:18:17PM +0200, Vincent Guittot wrote:
> >>> [...]
> >>
> >> Scheduling one task per cpu when n_task == n_cpus on asymmetric
> >> topologies is generally broken already and this patch set doesn't fix
> >> that problem.
> >>
> >> SD_PREFER_SIBLING might seem to help in very specific cases:
> >> n_litte_cpus == n_big_cpus. In that case the little group might
> >> classified as overloaded. It doesn't guarantee that anything gets pulled
> >> as the grp_load/grp_capacity in the imbalance calculation on some system
> >> still says the little cpus are more loaded than the bigs despite one of
> >> them being idle. That depends on the little cpu capacities.
> >>
> >> On systems where n_little_cpus != n_big_cpus SD_PREFER_SIBLING is broken
> >> as it assumes the group_weight to be the same. This is the case on Juno
> >> and several other platforms.
> >>
> >> IMHO, SD_PREFER_SIBLING isn't the solution to this problem. It might
> >
> > I agree but this patchset creates a regression in the scheduling behavior
> >
> >> help for a limited subset of topologies/capacities but the right
> >> solution is to change the imbalance calculation. As the name says, it is
> >
> > Yes that what does the prototype that I came with.
> >
> >> meant to spread tasks and does so unconditionally. For asymmetric
> >> systems we would like to consider cpu capacity before migrating tasks.
> >>
> >>> When running the tests of your cover letter, 1 long
> >>> running task is often co scheduled on a big core whereas short pinned
> >>> tasks are still running and a little core is idle which is not an
> >>> optimal scheduling decision
> >>
> >> This can easily happen with SD_PREFER_SIBLING enabled too so I wouldn't
> >> say that this patch breaks anything that isn't broken already. In fact
> >> we this happening with and without this patch applied.
> >
> > At least for the use case  above, this doesn't happen when
> > SD_PREFER_SIBLING is set
> >
>
> On my HiKey960 I can see coscheduling on a big CPU while a LITTLE is free
> with **and** without SD_PREFER_SIBLING. Having it set only means that in
> some cases the imbalance will be re-classified as group_overloaded instead
> of group_misfit_task, so we'll skip the misfit logic when we shouldn't (this
> happens on Juno for instance).

Can you give more details about your test case ?

>
> It does nothing for the "1 task per CPU" problem that Morten described above.
> When you have this little amount of tasks, load isn't very relevant, but it
> skews the load-balancer into thinking the LITTLE CPUs are more busy than
> the bigs even though there's an idle one in the lot.
>
> >>
> >> Morten

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
  2018-08-06 10:20           ` Vincent Guittot
@ 2018-08-06 10:53             ` Valentin Schneider
  2018-08-10  9:30               ` Vincent Guittot
  0 siblings, 1 reply; 37+ messages in thread
From: Valentin Schneider @ 2018-08-06 10:53 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	gaku.inami.xh, linux-kernel

Hi,

On 06/08/18 11:20, Vincent Guittot wrote:
> Hi Valentin,
> 
> On Tue, 31 Jul 2018 at 14:33, Valentin Schneider
> <valentin.schneider@arm.com> wrote:
>>
>> Hi,
>>
>> On 31/07/18 13:17, Vincent Guittot wrote:
>> [...]
>>>>
>>>> This can easily happen with SD_PREFER_SIBLING enabled too so I wouldn't
>>>> say that this patch breaks anything that isn't broken already. In fact
>>>> we this happening with and without this patch applied.
>>>
>>> At least for the use case  above, this doesn't happen when
>>> SD_PREFER_SIBLING is set
>>>
>>
>> On my HiKey960 I can see coscheduling on a big CPU while a LITTLE is free
>> with **and** without SD_PREFER_SIBLING. Having it set only means that in
>> some cases the imbalance will be re-classified as group_overloaded instead
>> of group_misfit_task, so we'll skip the misfit logic when we shouldn't (this
>> happens on Juno for instance).
> 
> Can you give more details about your test case ?
> 

I've been running the same test case as presented in the cover letter on
my HiKey960 but with sched_switch tracing and with no tasksets. I've just
re-run the testcase with tasksets and I get similar results (i.e. a big
with coscheduling while a LITTLE is free) with or without the flag.

>>
>> It does nothing for the "1 task per CPU" problem that Morten described above.
>> When you have this little amount of tasks, load isn't very relevant, but it
>> skews the load-balancer into thinking the LITTLE CPUs are more busy than
>> the bigs even though there's an idle one in the lot.
>>
>>>>
>>>> Morten

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains
  2018-08-06 10:53             ` Valentin Schneider
@ 2018-08-10  9:30               ` Vincent Guittot
  0 siblings, 0 replies; 37+ messages in thread
From: Vincent Guittot @ 2018-08-10  9:30 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Morten Rasmussen, Peter Zijlstra, Ingo Molnar, Dietmar Eggemann,
	gaku.inami.xh, linux-kernel

On Mon, 6 Aug 2018 at 12:53, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Hi,
>
> On 06/08/18 11:20, Vincent Guittot wrote:
> > Hi Valentin,
> >
> > On Tue, 31 Jul 2018 at 14:33, Valentin Schneider
> > <valentin.schneider@arm.com> wrote:
> >>
> >> Hi,
> >>
> >> On 31/07/18 13:17, Vincent Guittot wrote:
> >> [...]
> >>>>
> >>>> This can easily happen with SD_PREFER_SIBLING enabled too so I wouldn't
> >>>> say that this patch breaks anything that isn't broken already. In fact
> >>>> we this happening with and without this patch applied.
> >>>
> >>> At least for the use case  above, this doesn't happen when
> >>> SD_PREFER_SIBLING is set
> >>>
> >>
> >> On my HiKey960 I can see coscheduling on a big CPU while a LITTLE is free
> >> with **and** without SD_PREFER_SIBLING. Having it set only means that in
> >> some cases the imbalance will be re-classified as group_overloaded instead
> >> of group_misfit_task, so we'll skip the misfit logic when we shouldn't (this
> >> happens on Juno for instance).
> >
> > Can you give more details about your test case ?
> >
>
> I've been running the same test case as presented in the cover letter on
> my HiKey960 but with sched_switch tracing and with no tasksets. I've just
> re-run the testcase with tasksets and I get similar results (i.e. a big
> with coscheduling while a LITTLE is free) with or without the flag.
>
> >>
> >> It does nothing for the "1 task per CPU" problem that Morten described above.
> >> When you have this little amount of tasks, load isn't very relevant, but it
> >> skews the load-balancer into thinking the LITTLE CPUs are more busy than
> >> the bigs even though there's an idle one in the lot.

ok. I meant  that without misfit patch, SD_PREFER_SIBLING ensures 1
task per CPU and no co scheduling. But co scheduling appears with
misfit task patchset

Vincent
> >>
> >>>>
> >>>> Morten

^ permalink raw reply	[flat|nested] 37+ messages in thread

* Re: [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (13 preceding siblings ...)
  2018-07-31 12:00 ` Peter Zijlstra
@ 2018-08-17  1:57 ` Joel Fernandes
  14 siblings, 0 replies; 37+ messages in thread
From: Joel Fernandes @ 2018-08-17  1:57 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, valentin.schneider, dietmar.eggemann,
	vincent.guittot, gaku.inami.xh, linux-kernel, tkjos

On Wed, Jul 04, 2018 at 11:17:38AM +0100, Morten Rasmussen wrote:
> On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
> for performance that cpu intensive tasks are aggressively migrated to
> high capacity cpus as soon as those become available. The capacity
> awareness tweaks already in the wake-up path can't handle this as such
> tasks might run or be runnable forever. If they happen to be placed on a
> low capacity cpu from the beginning they are stuck there forever while
> high capacity cpus may have become available in the meantime.
> 
> To address this issue this patch set introduces a new "misfit"
> load-balancing scenario in periodic/nohz/newly idle balance which tweaks
> the load-balance conditions to ignore load per capacity in certain
> cases. Since misfit tasks are commonly running alone on a cpu, more
> aggressive active load-balancing is needed too.
> 
> The fundamental idea of this patch set has been in Android kernels for a
> long time and is absolutely essential for consistent performance on
> asymmetric cpu capacity systems.
> 
> The patches have been tested on:
>    1. Arm Juno (r0): 2+4 Cortex A57/A53
>    2. Hikey960: 4+4 Cortex A73/A53
> 

Thanks for posting these, we have been carrying it in the Android Common
Kernel for some time. They have been useful for Android systems.

Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org>

thanks,

 - Joel


^ permalink raw reply	[flat|nested] 37+ messages in thread

end of thread, back to index

Thread overview: 37+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-07-04 10:17 [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
2018-07-04 10:17 ` [PATCHv4 01/12] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
2018-07-31 10:59   ` Peter Zijlstra
2018-08-02 15:15     ` Morten Rasmussen
2018-07-04 10:17 ` [PATCHv4 02/12] sched/fair: Add group_misfit_task load-balance type Morten Rasmussen
2018-07-04 10:17 ` [PATCHv4 03/12] sched: Add sched_group per-cpu max capacity Morten Rasmussen
2018-07-04 10:17 ` [PATCHv4 04/12] sched/fair: Consider misfit tasks when load-balancing Morten Rasmussen
2018-07-04 10:17 ` [PATCHv4 05/12] sched/fair: Kick nohz balance if rq->misfit_task_load Morten Rasmussen
2018-07-04 10:17 ` [PATCHv4 06/12] sched/fair: Change prefer_sibling type to bool Morten Rasmussen
2018-07-04 10:17 ` [PATCHv4 07/12] sched: Change root_domain->overload type to int Morten Rasmussen
2018-07-04 10:17 ` [PATCHv4 08/12] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE Morten Rasmussen
2018-07-04 10:17 ` [PATCHv4 09/12] sched/fair: Set rq->rd->overload when misfit Morten Rasmussen
2018-07-04 10:17 ` [PATCHv4 10/12] sched/fair: Don't move tasks to lower capacity cpus unless necessary Morten Rasmussen
2018-07-04 10:17 ` [PATCHv4 11/12] sched/core: Disable SD_ASYM_CPUCAPACITY for root_domains without asymmetry Morten Rasmussen
2018-07-05 13:31   ` Quentin Perret
2018-07-05 14:13     ` Morten Rasmussen
2018-07-05 15:03       ` Quentin Perret
2018-07-20 13:54         ` Morten Rasmussen
2018-07-04 10:17 ` [PATCHv4 12/12] sched/core: Disable SD_PREFER_SIBLING on asymmetric cpu capacity domains Morten Rasmussen
2018-07-06 10:18   ` Vincent Guittot
2018-07-06 14:31     ` Morten Rasmussen
2018-07-31 12:17       ` Vincent Guittot
2018-07-31 12:33         ` Valentin Schneider
2018-08-06 10:20           ` Vincent Guittot
2018-08-06 10:53             ` Valentin Schneider
2018-08-10  9:30               ` Vincent Guittot
2018-07-06 10:18 ` [PATCHv4 00/12] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Vincent Guittot
2018-07-09 15:08   ` Morten Rasmussen
2018-07-26 17:14     ` Valentin Schneider
2018-07-30 14:30       ` Dietmar Eggemann
2018-07-31 12:13         ` Vincent Guittot
2018-07-31 12:14           ` Dietmar Eggemann
2018-07-31 12:11     ` Vincent Guittot
2018-07-31 12:00 ` Peter Zijlstra
2018-07-31 12:10   ` Valentin Schneider
2018-07-31 12:50     ` Morten Rasmussen
2018-08-17  1:57 ` Joel Fernandes

LKML Archive on lore.kernel.org

Archives are clonable:
	git clone --mirror https://lore.kernel.org/lkml/0 lkml/git/0.git
	git clone --mirror https://lore.kernel.org/lkml/1 lkml/git/1.git
	git clone --mirror https://lore.kernel.org/lkml/2 lkml/git/2.git
	git clone --mirror https://lore.kernel.org/lkml/3 lkml/git/3.git
	git clone --mirror https://lore.kernel.org/lkml/4 lkml/git/4.git
	git clone --mirror https://lore.kernel.org/lkml/5 lkml/git/5.git
	git clone --mirror https://lore.kernel.org/lkml/6 lkml/git/6.git

	# If you have public-inbox 1.1+ installed, you may
	# initialize and index your mirror using the following commands:
	public-inbox-init -V2 lkml lkml/ https://lore.kernel.org/lkml \
		linux-kernel@vger.kernel.org linux-kernel@archiver.kernel.org
	public-inbox-index lkml


Newsgroup available over NNTP:
	nntp://nntp.lore.kernel.org/org.kernel.vger.linux-kernel


AGPL code for this site: git clone https://public-inbox.org/ public-inbox