* [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
@ 2018-06-20  9:05 Morten Rasmussen
  2018-06-20  9:05 ` [PATCHv3 1/9] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
                   ` (9 more replies)
  0 siblings, 10 replies; 18+ messages in thread
From: Morten Rasmussen @ 2018-06-20  9:05 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

On asymmetric cpu capacity systems (e.g. Arm big.LITTLE) it is crucial
for performance that cpu intensive tasks are aggressively migrated to
high capacity cpus as soon as those become available. The capacity
awareness tweaks already in the wake-up path can't handle this as such
tasks might run or be runnable forever without ever passing through the
wake-up path again. If they happen to be placed on a low capacity cpu
from the beginning they are stuck there forever while high capacity
cpus may have become available in the meantime.

To address this issue this patch set introduces a new "misfit"
load-balancing scenario in periodic/nohz/newly idle balance which tweaks
the load-balance conditions to ignore load per capacity in certain
cases. Since misfit tasks are commonly running alone on a cpu, more
aggressive active load-balancing is needed too.

The fundamental idea of this patch set has been in Android kernels for a
long time and is absolutely essential for consistent performance on
asymmetric cpu capacity systems.

The patches have been tested on:
   1. Arm Juno (r0): 2+4 Cortex A57/A53
   2. Hikey960: 4+4 Cortex A73/A53

Test case:
Big cpus are always kept busy. Pin a set of shorter running sysbench
tasks to the big cpus, while creating a longer running set of unpinned
sysbench tasks.

    REQUESTS=1000
    BIGS="1 2"
    LITTLES="0 3 4 5"
 
    # Don't care about the score for those, just keep the bigs busy
    for i in $BIGS; do
        taskset -c $i sysbench --max-requests=$((REQUESTS / 4)) \
        --test=cpu  run &>/dev/null &
    done
 
    for i in $LITTLES; do
        sysbench --max-requests=$REQUESTS --test=cpu run \
	| grep "total time:" &
    done
 
    wait

Results:
Single runs with completion time of each task
Juno (tip)
    total time:                          1.2608s
    total time:                          1.2995s
    total time:                          1.5954s
    total time:                          1.7463s

Juno (misfit)
    total time:                          1.2575s
    total time:                          1.3004s
    total time:                          1.5860s
    total time:                          1.5871s

Hikey960 (tip)
    total time:                          1.7431s
    total time:                          2.2914s
    total time:                          2.5976s
    total time:                          1.7280s

Hikey960 (misfit)
    total time:                          1.7866s
    total time:                          1.7513s
    total time:                          1.6918s
    total time:                          1.6965s

10 run summary (tracking longest running task for each run)
	Juno		Hikey960
	avg	max	avg	max
tip     1.7465  1.7469  2.5997  2.6131 
misfit  1.6016  1.6192  1.8506  1.9666

Changelog:
v3
- Fixed locking around static_key.
- Changed group per-cpu capacity comparison to be based on max rather
  than min capacity.
- Added patch to prevent occasional pointless high->low capacity
  migrations.
- Changed type of group_misfit_task_load and misfit_task_load to
  unsigned long.
- Changed fbq() to pick the cpu with highest misfit_task_load rather
  than breaking when the first is found.
- Rebased against tip/sched/core.
- v2 Tested-by: Gaku Inami <gaku.inami.xh@renesas.com>

v2
- Removed redundant condition in static_key enablement.
- Fixed logic flaw in patch #2 reported by Yi Yao <yi.yao@intel.com>
- Dropped patch #4; although the patch seems to make sense, no benefit
  has been proven.
- Dropped root_domain->overload renaming
- Changed type of root_domain->overload to int
- Wrapped accesses of rq->rd->overload with READ/WRITE_ONCE
- v1 Tested-by: Gaku Inami <gaku.inami.xh@renesas.com>

Chris Redpath (1):
  sched/fair: Don't move tasks to lower capacity cpus unless necessary

Morten Rasmussen (4):
  sched: Add static_key for asymmetric cpu capacity optimizations
  sched/fair: Add group_misfit_task load-balance type
  sched: Add sched_group per-cpu max capacity
  sched/fair: Consider misfit tasks when load-balancing

Valentin Schneider (4):
  sched/fair: Kick nohz balance if rq->misfit_task
  sched: Change root_domain->overload type to int
  sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
  sched/fair: Set sd->overload when misfit

 kernel/sched/fair.c     | 162 +++++++++++++++++++++++++++++++++++++++++-------
 kernel/sched/sched.h    |  16 +++--
 kernel/sched/topology.c |  20 ++++++
 3 files changed, 171 insertions(+), 27 deletions(-)

-- 
2.7.4


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCHv3 1/9] sched: Add static_key for asymmetric cpu capacity optimizations
  2018-06-20  9:05 [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
@ 2018-06-20  9:05 ` Morten Rasmussen
  2018-06-22  8:22   ` Quentin Perret
  2018-06-20  9:05 ` [PATCHv3 2/9] sched/fair: Add group_misfit_task load-balance type Morten Rasmussen
                   ` (8 subsequent siblings)
  9 siblings, 1 reply; 18+ messages in thread
From: Morten Rasmussen @ 2018-06-20  9:05 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

The existing asymmetric cpu capacity code should cause minimal overhead
for others. Putting it behind a static_key, as has been done for SMT
optimizations, would make it easier to extend and improve without
causing harm to others moving forward.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c     |  3 +++
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c | 18 ++++++++++++++++++
 3 files changed, 22 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e497c05aab7f..6116d1b7e441 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6585,6 +6585,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 {
 	long min_cap, max_cap;
 
+	if (!static_branch_unlikely(&sched_asym_cpucapacity))
+		return 0;
+
 	min_cap = min(capacity_orig_of(prev_cpu), capacity_orig_of(cpu));
 	max_cap = cpu_rq(cpu)->rd->max_cpu_capacity;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 67702b4d9ac7..0fbfbcf5c551 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1153,6 +1153,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+extern struct static_key_false sched_asym_cpucapacity;
 
 struct sched_group_capacity {
 	atomic_t		ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 61a1125c1ae4..edc87e35fc75 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -398,6 +398,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_STATIC_KEY_FALSE(sched_asym_cpucapacity);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -425,6 +426,21 @@ static void update_top_cache_domain(int cpu)
 	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
 }
 
+static void update_asym_cpucapacity(int cpu)
+{
+	int enable = false;
+
+	rcu_read_lock();
+	if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
+		enable = true;
+	rcu_read_unlock();
+
+	if (enable) {
+		/* This expects to be hotplug-safe */
+		static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
+	}
+}
+
 /*
  * Attach the domain 'sd' to 'cpu' as its base domain. Callers must
  * hold the hotplug lock.
@@ -1707,6 +1723,8 @@ build_sched_domains(const struct cpumask *cpu_map, struct sched_domain_attr *att
 	}
 	rcu_read_unlock();
 
+	update_asym_cpucapacity(cpumask_first(cpu_map));
+
 	if (rq && sched_debug_enabled) {
 		pr_info("root domain span: %*pbl (max cpu_capacity = %lu)\n",
 			cpumask_pr_args(cpu_map), rq->rd->max_cpu_capacity);
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCHv3 2/9] sched/fair: Add group_misfit_task load-balance type
  2018-06-20  9:05 [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
  2018-06-20  9:05 ` [PATCHv3 1/9] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
@ 2018-06-20  9:05 ` Morten Rasmussen
  2018-06-20  9:05 ` [PATCHv3 3/9] sched: Add sched_group per-cpu max capacity Morten Rasmussen
                   ` (7 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Morten Rasmussen @ 2018-06-20  9:05 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

To maximize throughput in systems with asymmetric cpu capacities (e.g.
ARM big.LITTLE), load-balancing has to consider task and cpu utilization
as well as per-cpu compute capacity in addition to the current average
load based load-balancing policy. Tasks with high utilization that are
scheduled on a lower capacity cpu need to be identified and migrated to
a higher capacity cpu if possible to maximize throughput.

To implement this policy, an additional group_type (load-balance
scenario) is added: group_misfit_task. This represents scenarios where
a sched_group has one or more tasks that are not suitable for its
per-cpu capacity. group_misfit_task is only considered if the system is
not overloaded or imbalanced (group_imbalanced or group_overloaded).

Identifying misfit tasks requires the rq lock to be held. To avoid
taking remote rq locks to examine source sched_groups for misfit tasks,
each cpu is responsible for tracking its own misfit tasks and updating
rq->misfit_task_load. This means checking task utilization when tasks
are scheduled and on each sched_tick.
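
For illustration, here is a minimal self-contained sketch of the fit
check introduced below, assuming the existing capacity_margin default
of 1280 (~20% headroom). The helper name and the example capacity value
are made up; the real implementation is task_fits_capacity() in the
diff:

    /* Illustrative sketch only; mirrors the check added below. */
    static unsigned int capacity_margin = 1280; /* ~20% margin */

    static inline int fits_capacity(unsigned long task_util,
                                    unsigned long capacity)
    {
            /* capacity is in SCHED_CAPACITY_SCALE (1024) units */
            return capacity * 1024 > task_util * capacity_margin;
    }

    /*
     * Example: a cpu with capacity 446 fits tasks with utilization up
     * to ~356 (446 * 1024 / 1280); anything above that is flagged as
     * misfit and rq->misfit_task_load is set to the task's
     * task_h_load().
     */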

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 54 ++++++++++++++++++++++++++++++++++++++++++++--------
 kernel/sched/sched.h |  2 ++
 2 files changed, 48 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6116d1b7e441..ed7eb2ac068f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -697,6 +697,7 @@ static u64 sched_vslice(struct cfs_rq *cfs_rq, struct sched_entity *se)
 
 static int select_idle_sibling(struct task_struct *p, int prev_cpu, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
+static unsigned long capacity_of(int cpu);
 
 /* Give new sched_entity start runnable values to heavy its load in infant time */
 void init_entity_runnable_average(struct sched_entity *se)
@@ -1448,7 +1449,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 static unsigned long weighted_cpuload(struct rq *rq);
 static unsigned long source_load(int cpu, int type);
 static unsigned long target_load(int cpu, int type);
-static unsigned long capacity_of(int cpu);
 
 /* Cached statistics for all CPUs within a node */
 struct numa_stats {
@@ -4043,6 +4043,29 @@ util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p, bool task_sleep)
 	WRITE_ONCE(p->se.avg.util_est, ue);
 }
 
+static inline int task_fits_capacity(struct task_struct *p, long capacity)
+{
+	return capacity * 1024 > task_util_est(p) * capacity_margin;
+}
+
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq)
+{
+	if (!static_branch_unlikely(&sched_asym_cpucapacity))
+		return;
+
+	if (!p) {
+		rq->misfit_task_load = 0;
+		return;
+	}
+
+	if (task_fits_capacity(p, capacity_of(cpu_of(rq)))) {
+		rq->misfit_task_load = 0;
+		return;
+	}
+
+	rq->misfit_task_load = task_h_load(p);
+}
+
 #else /* CONFIG_SMP */
 
 static inline int
@@ -4078,6 +4101,7 @@ util_est_enqueue(struct cfs_rq *cfs_rq, struct task_struct *p) {}
 static inline void
 util_est_dequeue(struct cfs_rq *cfs_rq, struct task_struct *p,
 		 bool task_sleep) {}
+static inline void update_misfit_status(struct task_struct *p, struct rq *rq) {}
 
 #endif /* CONFIG_SMP */
 
@@ -6598,7 +6622,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	/* Bring task utilization in sync with prev_cpu */
 	sync_entity_load_avg(&p->se);
 
-	return min_cap * 1024 < task_util(p) * capacity_margin;
+	return !task_fits_capacity(p, min_cap);
 }
 
 /*
@@ -7015,9 +7039,12 @@ done: __maybe_unused;
 	if (hrtick_enabled(rq))
 		hrtick_start_fair(rq, p);
 
+	update_misfit_status(p, rq);
+
 	return p;
 
 idle:
+	update_misfit_status(NULL, rq);
 	new_tasks = idle_balance(rq, rf);
 
 	/*
@@ -7223,6 +7250,13 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
 
 enum fbq_type { regular, remote, all };
 
+enum group_type {
+	group_other = 0,
+	group_misfit_task,
+	group_imbalanced,
+	group_overloaded,
+};
+
 #define LBF_ALL_PINNED	0x01
 #define LBF_NEED_BREAK	0x02
 #define LBF_DST_PINNED  0x04
@@ -7764,12 +7798,6 @@ static unsigned long task_h_load(struct task_struct *p)
 
 /********** Helpers for find_busiest_group ************************/
 
-enum group_type {
-	group_other = 0,
-	group_imbalanced,
-	group_overloaded,
-};
-
 /*
  * sg_lb_stats - stats of a sched_group required for load_balancing
  */
@@ -7785,6 +7813,7 @@ struct sg_lb_stats {
 	unsigned int group_weight;
 	enum group_type group_type;
 	int group_no_capacity;
+	unsigned long group_misfit_task_load; /* A cpu has a task too big for its capacity */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -8084,6 +8113,9 @@ group_type group_classify(struct sched_group *group,
 	if (sg_imbalanced(group))
 		return group_imbalanced;
 
+	if (sgs->group_misfit_task_load)
+		return group_misfit_task;
+
 	return group_other;
 }
 
@@ -8158,6 +8190,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		 */
 		if (!nr_running && idle_cpu(i))
 			sgs->idle_cpus++;
+
+		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+		    sgs->group_misfit_task_load < rq->misfit_task_load)
+			sgs->group_misfit_task_load = rq->misfit_task_load;
 	}
 
 	/* Adjust by relative CPU capacity of the group */
@@ -9939,6 +9975,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
+
+	update_misfit_status(curr, rq);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0fbfbcf5c551..273d07dedc90 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -815,6 +815,8 @@ struct rq {
 
 	unsigned char		idle_balance;
 
+	unsigned long		misfit_task_load;
+
 	/* For active balancing */
 	int			active_balance;
 	int			push_cpu;
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCHv3 3/9] sched: Add sched_group per-cpu max capacity
  2018-06-20  9:05 [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
  2018-06-20  9:05 ` [PATCHv3 1/9] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
  2018-06-20  9:05 ` [PATCHv3 2/9] sched/fair: Add group_misfit_task load-balance type Morten Rasmussen
@ 2018-06-20  9:05 ` Morten Rasmussen
  2018-06-20  9:05 ` [PATCHv3 4/9] sched/fair: Consider misfit tasks when load-balancing Morten Rasmussen
                   ` (6 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Morten Rasmussen @ 2018-06-20  9:05 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

The current sg->min_capacity tracks the lowest per-cpu compute capacity
available in the sched_group when rt/irq pressure is taken into account.
Minimum capacity isn't the ideal metric for deciding if a sched_group
needs offloading to another sched_group in some scenarios, e.g. a
sched_group with multiple cpus where only one is under heavy pressure.
Tracking maximum capacity isn't perfect either, but it is a better
choice for some situations as it indicates that the sched_group is
definitely compute capacity constrained, either due to rt/irq pressure
on all cpus or asymmetric cpu capacities (e.g. big.LITTLE).
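
As a hypothetical illustration (the capacity numbers are made up),
consider a sched_group of two big cpus where one cpu has lost half of
its capacity to rt/irq pressure:

    big group:  cpu0 capacity = 1024, cpu1 capacity = 512 (pressured)

    min_capacity = 512   -> makes the whole group look capacity
                            constrained, even though load on the
                            pressured cpu could be moved to cpu0 within
                            the group
    max_capacity = 1024  -> only drops if all cpus in the group are
                            under pressure, or if the group genuinely
                            consists of lower capacity cpus (e.g. the
                            little cluster in big.LITTLE)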

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c     | 24 ++++++++++++++++++++----
 kernel/sched/sched.h    |  1 +
 kernel/sched/topology.c |  2 ++
 3 files changed, 23 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ed7eb2ac068f..6af3354e9e26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7929,13 +7929,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	cpu_rq(cpu)->cpu_capacity = capacity;
 	sdg->sgc->capacity = capacity;
 	sdg->sgc->min_capacity = capacity;
+	sdg->sgc->max_capacity = capacity;
 }
 
 void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity, min_capacity;
+	unsigned long capacity, min_capacity, max_capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -7949,6 +7950,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 	capacity = 0;
 	min_capacity = ULONG_MAX;
+	max_capacity = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -7979,6 +7981,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			}
 
 			min_capacity = min(capacity, min_capacity);
+			max_capacity = max(capacity, max_capacity);
 		}
 	} else  {
 		/*
@@ -7992,12 +7995,14 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 			capacity += sgc->capacity;
 			min_capacity = min(sgc->min_capacity, min_capacity);
+			max_capacity = max(sgc->max_capacity, max_capacity);
 			group = group->next;
 		} while (group != child->groups);
 	}
 
 	sdg->sgc->capacity = capacity;
 	sdg->sgc->min_capacity = min_capacity;
+	sdg->sgc->max_capacity = max_capacity;
 }
 
 /*
@@ -8093,16 +8098,27 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 }
 
 /*
- * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * group_smaller_min_cpu_capacity: Returns true if sched_group sg has smaller
  * per-CPU capacity than sched_group ref.
  */
 static inline bool
-group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+group_smaller_min_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
 {
 	return sg->sgc->min_capacity * capacity_margin <
 						ref->sgc->min_capacity * 1024;
 }
 
+/*
+ * group_smaller_max_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-CPU capacity_orig than sched_group ref.
+ */
+static inline bool
+group_smaller_max_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+	return sg->sgc->max_capacity * capacity_margin <
+						ref->sgc->max_capacity * 1024;
+}
+
 static inline enum
 group_type group_classify(struct sched_group *group,
 			  struct sg_lb_stats *sgs)
@@ -8248,7 +8264,7 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	 * power/energy consequences are not considered.
 	 */
 	if (sgs->sum_nr_running <= sgs->group_weight &&
-	    group_smaller_cpu_capacity(sds->local, sg))
+	    group_smaller_min_cpu_capacity(sds->local, sg))
 		return false;
 
 asym_packing:
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 273d07dedc90..5ed67122cf59 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1165,6 +1165,7 @@ struct sched_group_capacity {
 	 */
 	unsigned long		capacity;
 	unsigned long		min_capacity;		/* Min per-CPU capacity in group */
+	unsigned long		max_capacity;		/* Max per-CPU capacity in group */
 	unsigned long		next_update;
 	int			imbalance;		/* XXX unrelated to capacity but shared group state */
 
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index edc87e35fc75..f32bf3a998b1 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -708,6 +708,7 @@ static void init_overlap_sched_group(struct sched_domain *sd,
 	sg_span = sched_group_span(sg);
 	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
 	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
 }
 
 static int
@@ -867,6 +868,7 @@ static struct sched_group *get_group(int cpu, struct sd_data *sdd)
 
 	sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sched_group_span(sg));
 	sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
+	sg->sgc->max_capacity = SCHED_CAPACITY_SCALE;
 
 	return sg;
 }
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCHv3 4/9] sched/fair: Consider misfit tasks when load-balancing
  2018-06-20  9:05 [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (2 preceding siblings ...)
  2018-06-20  9:05 ` [PATCHv3 3/9] sched: Add sched_group per-cpu max capacity Morten Rasmussen
@ 2018-06-20  9:05 ` Morten Rasmussen
  2018-06-20  9:05 ` [PATCHv3 5/9] sched/fair: Kick nohz balance if rq->misfit_task Morten Rasmussen
                   ` (5 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Morten Rasmussen @ 2018-06-20  9:05 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

On asymmetric cpu capacity systems load intensive tasks can end up on
cpus that don't suit their compute demand. In these scenarios 'misfit'
tasks should be migrated to cpus with higher compute capacity to ensure
better throughput. group_misfit_task indicates this scenario, but tweaks
to the load-balance code are needed to make the migrations happen.

Misfit balancing only makes sense between a source group of lower
per-cpu capacity and destination group of higher compute capacity.
Otherwise, misfit balancing is ignored. group_misfit_task has lowest
priority so any imbalance due to overload is dealt with first.

The modifications are:

1. Only pick a group containing misfit tasks as the busiest group if the
   destination group has higher capacity and has spare capacity.
2. When the busiest group is a 'misfit' group, skip the usual average
   load and group capacity checks.
3. Set the imbalance for 'misfit' balancing sufficiently high for a task
   to be pulled ignoring average load.
4. Pick the cpu with the highest misfit load as the source cpu.
5. If the misfit task is alone on the source cpu, go for active
   balancing.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 49 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6af3354e9e26..11f3efa299ca 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7287,6 +7287,7 @@ struct lb_env {
 	unsigned int		loop_max;
 
 	enum fbq_type		fbq_type;
+	enum group_type		src_grp_type;
 	struct list_head	tasks;
 };
 
@@ -8245,6 +8246,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 {
 	struct sg_lb_stats *busiest = &sds->busiest_stat;
 
+	/*
+	 * Don't try to pull misfit tasks we can't help.
+	 * We can use max_capacity here as reduction in capacity on some
+	 * cpus in the group should either be possible to resolve
+	 * internally or be covered by avg_load imbalance (eventually).
+	 */
+	if (sgs->group_type == group_misfit_task &&
+	    (!group_smaller_max_cpu_capacity(sg, sds->local) ||
+	     !group_has_capacity(env, &sds->local_stat)))
+		return false;
+
 	if (sgs->group_type > busiest->group_type)
 		return true;
 
@@ -8267,6 +8279,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	    group_smaller_min_cpu_capacity(sds->local, sg))
 		return false;
 
+	/*
+	 * If we have more than one misfit sg go with the biggest misfit.
+	 */
+	if (sgs->group_type == group_misfit_task &&
+	    sgs->group_misfit_task_load < busiest->group_misfit_task_load)
+		return false;
+
 asym_packing:
 	/* This is the busiest node in its class. */
 	if (!(env->sd->flags & SD_ASYM_PACKING))
@@ -8564,8 +8583,9 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 * factors in sg capacity and sgs with smaller group_type are
 	 * skipped when updating the busiest sg:
 	 */
-	if (busiest->avg_load <= sds->avg_load ||
-	    local->avg_load >= sds->avg_load) {
+	if (busiest->group_type != group_misfit_task &&
+	    (busiest->avg_load <= sds->avg_load ||
+	     local->avg_load >= sds->avg_load)) {
 		env->imbalance = 0;
 		return fix_small_imbalance(env, sds);
 	}
@@ -8599,6 +8619,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		(sds->avg_load - local->avg_load) * local->group_capacity
 	) / SCHED_CAPACITY_SCALE;
 
+	/* Boost imbalance to allow misfit task to be balanced. */
+	if (busiest->group_type == group_misfit_task) {
+		env->imbalance = max_t(long, env->imbalance,
+				       busiest->group_misfit_task_load);
+	}
+
 	/*
 	 * if *imbalance is less than the average load per runnable task
 	 * there is no guarantee that any tasks will be moved so we'll have
@@ -8665,6 +8691,10 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	    busiest->group_no_capacity)
 		goto force_balance;
 
+	/* Misfit tasks should be dealt with regardless of the avg load */
+	if (busiest->group_type == group_misfit_task)
+		goto force_balance;
+
 	/*
 	 * If the local group is busier than the selected busiest group
 	 * don't try and pull any tasks.
@@ -8702,6 +8732,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 
 force_balance:
 	/* Looks like there is an imbalance. Compute it */
+	env->src_grp_type = busiest->group_type;
 	calculate_imbalance(env, &sds);
 	return sds.busiest;
 
@@ -8749,6 +8780,19 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		if (rt > env->fbq_type)
 			continue;
 
+		/*
+		 * For ASYM_CPUCAPACITY domains with misfit tasks we simply
+		 * seek the "biggest" misfit task.
+		 */
+		if (env->src_grp_type == group_misfit_task) {
+			if (rq->misfit_task_load > busiest_load) {
+				busiest_load = rq->misfit_task_load;
+				busiest = rq;
+			}
+
+			continue;
+		}
+
 		capacity = capacity_of(i);
 
 		wl = weighted_cpuload(rq);
@@ -8818,6 +8862,9 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}
 
+	if (env->src_grp_type == group_misfit_task)
+		return 1;
+
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCHv3 5/9] sched/fair: Kick nohz balance if rq->misfit_task
  2018-06-20  9:05 [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (3 preceding siblings ...)
  2018-06-20  9:05 ` [PATCHv3 4/9] sched/fair: Consider misfit tasks when load-balancing Morten Rasmussen
@ 2018-06-20  9:05 ` Morten Rasmussen
  2018-06-20  9:05 ` [PATCHv3 6/9] sched: Change root_domain->overload type to int Morten Rasmussen
                   ` (4 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Morten Rasmussen @ 2018-06-20  9:05 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

From: Valentin Schneider <valentin.schneider@arm.com>

There already are a few conditions in nohz_kick_needed() to ensure
a nohz kick is triggered, but they are not enough for some misfit
task scenarios. Excluding asym packing, those are:

* rq->nr_running >= 2: Not relevant here because we are running a
misfit task which needs to be migrated regardless, potentially through
active balance.
* sds->nr_busy_cpus > 1: If the misfit task is the only task running
on a group of low capacity cpus, this will evaluate to false.
* rq->cfs.h_nr_running >= 1 && check_cpu_capacity(): Not relevant here,
the misfit task needs to be migrated regardless of rt/IRQ pressure.

As such, this commit adds an rq->misfit_task condition to trigger a
nohz kick.

The idea to kick a nohz balance for misfit tasks originally came from
Leo Yan <leo.yan@linaro.org>, and a similar patch was submitted for
the Android Common Kernel - see [1].

[1]: https://lists.linaro.org/pipermail/eas-dev/2016-September/000551.html

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 11f3efa299ca..b29d53d1e22a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -9507,7 +9507,7 @@ static void nohz_balancer_kick(struct rq *rq)
 	if (time_before(now, nohz.next_balance))
 		goto out;
 
-	if (rq->nr_running >= 2) {
+	if (rq->nr_running >= 2 || rq->misfit_task_load) {
 		flags = NOHZ_KICK_MASK;
 		goto out;
 	}
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCHv3 6/9] sched: Change root_domain->overload type to int
  2018-06-20  9:05 [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (4 preceding siblings ...)
  2018-06-20  9:05 ` [PATCHv3 5/9] sched/fair: Kick nohz balance if rq->misfit_task Morten Rasmussen
@ 2018-06-20  9:05 ` Morten Rasmussen
  2018-06-20  9:05 ` [PATCHv3 7/9] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE Morten Rasmussen
                   ` (3 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Morten Rasmussen @ 2018-06-20  9:05 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

From: Valentin Schneider <valentin.schneider@arm.com>

sizeof(_Bool) is implementation defined, so let's just go with 'int' as
is done for other structures e.g. sched_domain_shared->has_idle_cores.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 7 +++----
 kernel/sched/sched.h | 4 ++--
 2 files changed, 5 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b29d53d1e22a..545fa3600894 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8170,7 +8170,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs,
-			bool *overload)
+			int *overload)
 {
 	unsigned long load;
 	int i, nr_running;
@@ -8195,7 +8195,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		nr_running = rq->nr_running;
 		if (nr_running > 1)
-			*overload = true;
+			*overload = 1;
 
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
@@ -8354,8 +8354,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
-	int load_idx, prefer_sibling = 0;
-	bool overload = false;
+	int load_idx, prefer_sibling, overload = 0;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5ed67122cf59..f945edadd2c5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -689,7 +689,7 @@ struct root_domain {
 	cpumask_var_t		online;
 
 	/* Indicate more than one runnable task for any CPU */
-	bool			overload;
+	int			overload;
 
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
@@ -1666,7 +1666,7 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 	if (prev_nr < 2 && rq->nr_running >= 2) {
 #ifdef CONFIG_SMP
 		if (!rq->rd->overload)
-			rq->rd->overload = true;
+			rq->rd->overload = 1;
 #endif
 	}
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCHv3 7/9] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE
  2018-06-20  9:05 [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (5 preceding siblings ...)
  2018-06-20  9:05 ` [PATCHv3 6/9] sched: Change root_domain->overload type to int Morten Rasmussen
@ 2018-06-20  9:05 ` Morten Rasmussen
  2018-06-20  9:05 ` [PATCHv3 8/9] sched/fair: Set sd->overload when misfit Morten Rasmussen
                   ` (2 subsequent siblings)
  9 siblings, 0 replies; 18+ messages in thread
From: Morten Rasmussen @ 2018-06-20  9:05 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

From: Valentin Schneider <valentin.schneider@arm.com>

This variable can be read and set locklessly within update_sd_lb_stats().
As such, READ/WRITE_ONCE are added to make sure nothing terribly wrong
can happen because of the compiler.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 6 +++---
 kernel/sched/sched.h | 4 ++--
 2 files changed, 5 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 545fa3600894..9dd147c71dd0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8431,8 +8431,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 	if (!env->sd->parent) {
 		/* update overload indicator if we are at root domain */
-		if (env->dst_rq->rd->overload != overload)
-			env->dst_rq->rd->overload = overload;
+		if (READ_ONCE(env->dst_rq->rd->overload) != overload)
+			WRITE_ONCE(env->dst_rq->rd->overload, overload);
 	}
 }
 
@@ -9875,7 +9875,7 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 	rq_unpin_lock(this_rq, rf);
 
 	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !this_rq->rd->overload) {
+	    !READ_ONCE(this_rq->rd->overload)) {
 
 		rcu_read_lock();
 		sd = rcu_dereference_check_sched_domain(this_rq->sd);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f945edadd2c5..791d9badad31 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1665,8 +1665,8 @@ static inline void add_nr_running(struct rq *rq, unsigned count)
 
 	if (prev_nr < 2 && rq->nr_running >= 2) {
 #ifdef CONFIG_SMP
-		if (!rq->rd->overload)
-			rq->rd->overload = 1;
+		if (!READ_ONCE(rq->rd->overload))
+			WRITE_ONCE(rq->rd->overload, 1);
 #endif
 	}
 
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCHv3 8/9] sched/fair: Set sd->overload when misfit
  2018-06-20  9:05 [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (6 preceding siblings ...)
  2018-06-20  9:05 ` [PATCHv3 7/9] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE Morten Rasmussen
@ 2018-06-20  9:05 ` Morten Rasmussen
  2018-06-20  9:05 ` [PATCHv3 9/9] sched/fair: Don't move tasks to lower capacity cpus unless necessary Morten Rasmussen
  2018-07-03  2:28 ` [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Gaku Inami
  9 siblings, 0 replies; 18+ messages in thread
From: Morten Rasmussen @ 2018-06-20  9:05 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Morten Rasmussen

From: Valentin Schneider <valentin.schneider@arm.com>

Idle balance is a great opportunity to pull a misfit task. However,
there are scenarios where misfit tasks are present but idle balance is
prevented by the overload flag.

A good example of this is a workload of n identical tasks. Let's suppose
we have a 2+2 Arm big.LITTLE system. We then spawn 4 fairly
CPU-intensive tasks - for the sake of simplicity let's say they are just
CPU hogs, even when running on big CPUs.

They are identical tasks, so on an SMP system they should all end at
(roughly) the same time. However, in our case the LITTLE CPUs have lower
performance than the big CPUs, so tasks running on the LITTLEs will have
a longer completion time.

This means that the big CPUs will complete their work earlier, at which
point they should pull the tasks from the LITTLEs. What we want to
happen is summarized as follows:

a,b,c,d are our CPU-hogging tasks
_ signifies idling

LITTLE_0 | a a a a _ _
LITTLE_1 | b b b b _ _
---------|-------------
  big_0  | c c c c a a
  big_1  | d d d d b b
                  ^
                  ^
    Tasks end on the big CPUs, idle balance happens
    and the misfit tasks are pulled straight away

This however won't happen, because currently the overload flag is only
set when there is any CPU that has more than one runnable task - which
may very well not be the case here if our CPU-hogging workload is all
there is to run.

As such, this commit sets the overload flag in update_sg_lb_stats when
a group is flagged as having a misfit task.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 6 ++++--
 kernel/sched/sched.h | 6 +++++-
 2 files changed, 9 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9dd147c71dd0..aac7b9155bc4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8165,7 +8165,7 @@ static bool update_nohz_stats(struct rq *rq, bool force)
  * @load_idx: Load index of sched_domain of this_cpu for load calc.
  * @local_group: Does group contain this_cpu.
  * @sgs: variable to hold the statistics for this group.
- * @overload: Indicate more than one runnable task for any CPU.
+ * @overload: Indicate pullable load (e.g. >1 runnable task).
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
@@ -8209,8 +8209,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->idle_cpus++;
 
 		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
-		    sgs->group_misfit_task_load < rq->misfit_task_load)
+		    sgs->group_misfit_task_load < rq->misfit_task_load) {
 			sgs->group_misfit_task_load = rq->misfit_task_load;
+			*overload = 1;
+		}
 	}
 
 	/* Adjust by relative CPU capacity of the group */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 791d9badad31..6dfab618e457 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -688,7 +688,11 @@ struct root_domain {
 	cpumask_var_t		span;
 	cpumask_var_t		online;
 
-	/* Indicate more than one runnable task for any CPU */
+	/*
+	 * Indicate pullable load on at least one CPU, e.g:
+	 * - More than one runnable task
+	 * - Running task is misfit
+	 */
 	int			overload;
 
 	/*
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCHv3 9/9] sched/fair: Don't move tasks to lower capacity cpus unless necessary
  2018-06-20  9:05 [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (7 preceding siblings ...)
  2018-06-20  9:05 ` [PATCHv3 8/9] sched/fair: Set sd->overload when misfit Morten Rasmussen
@ 2018-06-20  9:05 ` Morten Rasmussen
  2018-07-03  2:28 ` [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Gaku Inami
  9 siblings, 0 replies; 18+ messages in thread
From: Morten Rasmussen @ 2018-06-20  9:05 UTC (permalink / raw)
  To: peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot,
	gaku.inami.xh, linux-kernel, Chris Redpath, Morten Rasmussen

From: Chris Redpath <chris.redpath@arm.com>

When lower capacity CPUs are load balancing and considering pulling
something from a higher capacity group, we should not pull tasks from a
cpu with only one task running as this is guaranteed to impede progress
for that task. If there is more than one task running, load balancing in
the higher capacity group would already have made any possible moves to
resolve the imbalance and we should make better use of system compute
capacity by moving a task if we still have more than one running.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Chris Redpath <chris.redpath@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index aac7b9155bc4..5d612d3906d5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8796,6 +8796,17 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 
 		capacity = capacity_of(i);
 
+		/*
+		 * For ASYM_CPUCAPACITY domains, don't pick a cpu that could
+		 * eventually lead to active_balancing high->low capacity.
+		 * Higher per-cpu capacity is considered better than balancing
+		 * average load.
+		 */
+		if (env->sd->flags & SD_ASYM_CPUCAPACITY &&
+			capacity_of(env->dst_cpu) < capacity &&
+			rq->nr_running == 1)
+			continue;
+
 		wl = weighted_cpuload(rq);
 
 		/*
-- 
2.7.4


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCHv3 1/9] sched: Add static_key for asymmetric cpu capacity optimizations
  2018-06-20  9:05 ` [PATCHv3 1/9] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
@ 2018-06-22  8:22   ` Quentin Perret
  2018-06-22 14:36     ` Morten Rasmussen
  0 siblings, 1 reply; 18+ messages in thread
From: Quentin Perret @ 2018-06-22  8:22 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, valentin.schneider, dietmar.eggemann,
	vincent.guittot, gaku.inami.xh, linux-kernel

Hi Morten,

On Wednesday 20 Jun 2018 at 10:05:41 (+0100), Morten Rasmussen wrote:
> +static void update_asym_cpucapacity(int cpu)
> +{
> +	int enable = false;
> +
> +	rcu_read_lock();
> +	if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
> +		enable = true;
> +	rcu_read_unlock();
> +
> +	if (enable) {
> +		/* This expects to be hotplug-safe */
> +		static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
> +	}
> +}

What would happen if you hotplugged an entire cluster ? You'd lose the
SD_ASYM_CPUCAPACITY flag but keep the static key is that right ? Should
we care ?

And also, Peter mentioned an issue with the EAS patches with multiple
root_domains. Does that apply here as well ? What if you had a
configuration with big and little CPUs in different root_domains for ex?

Should we disable the static key in the above cases ?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCHv3 1/9] sched: Add static_key for asymmetric cpu capacity optimizations
  2018-06-22  8:22   ` Quentin Perret
@ 2018-06-22 14:36     ` Morten Rasmussen
  2018-06-27 15:41       ` Dietmar Eggemann
  0 siblings, 1 reply; 18+ messages in thread
From: Morten Rasmussen @ 2018-06-22 14:36 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, mingo, valentin.schneider, dietmar.eggemann,
	vincent.guittot, gaku.inami.xh, linux-kernel

On Fri, Jun 22, 2018 at 09:22:22AM +0100, Quentin Perret wrote:
> Hi Morten,
> 
> On Wednesday 20 Jun 2018 at 10:05:41 (+0100), Morten Rasmussen wrote:
> > +static void update_asym_cpucapacity(int cpu)
> > +{
> > +	int enable = false;
> > +
> > +	rcu_read_lock();
> > +	if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
> > +		enable = true;
> > +	rcu_read_unlock();
> > +
> > +	if (enable) {
> > +		/* This expects to be hotplug-safe */
> > +		static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
> > +	}
> > +}
> 
> What would happen if you hotplugged an entire cluster ? You'd lose the
> SD_ASYM_CPUCAPACITY flag but keep the static key is that right ? Should
> we care ?

I don't think we should care. The static key enables additional checks
and tweaks but AFAICT none of them requires the SD_ASYM_CPUCAPACITY to
be set and they should all have no effect if that is the case. I
added the static key just to avoid the overhead on systems where they would
have no effect. At least that is the intention, I could of course have broken
things by mistake.

> And also, Peter mentioned an issue with the EAS patches with multiple
> root_domains. Does that apply here as well ? What if you had a
> configuration with big and little CPUs in different root_domains for ex?
> 
> Should we disable the static key in the above cases ?

Exclusive cpusets are more tricky as the flags will be the same for
sched_domains at the same level. So we can't set the flag correctly if
someone configures the exclusive cpusets such that you have one
root_domain spanning big and a subset of little, and one spanning the
remaining little cpus if all topology levels are preserved. If we
imagine a three cluster system where 0-3 and 4-7 are little clusters, and
8-11 is a big cluster with cpusets configured as 0-5 and 6-11. The first
set should _not_ have SD_ASYM_CPUCAPACITY set, while the second should.

I'm tempted to say we shouldn't care in this situation. Setting the
flags correctly in the three cluster example would require knowledge
about the cpuset configuration which we don't have in the arch code so
SD_ASYM_CPUCAPACITY flag detection would have to be done by the
sched_domain build code. However, not setting the flag according to the
actual members of the exclusive cpuset means that homogeneous
sched_domains might have SD_ASYM_CPUCAPACITY set enabling potentially
wrong scheduling decisions.

We can actually end up with this problem just by hotplugging too. If you
unplug the entire big cluster in the three cluster example above, you
preserve DIE level which would have SD_ASYM_CPUCAPACITY set even though
we only have little cpus left.

As I see it, we have two choices: 1) Set the flags correctly for
exclusive cpusets which means some additional "fun" in the sched_domain
hierarchy set up, or 2) ignore it and make sure that setting
SD_ASYM_CPUCAPACITY on homogeneous sched_domains works fairly okay. The
latter seems easier.

Morten

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCHv3 1/9] sched: Add static_key for asymmetric cpu capacity optimizations
  2018-06-22 14:36     ` Morten Rasmussen
@ 2018-06-27 15:41       ` Dietmar Eggemann
  2018-06-28  8:48         ` Morten Rasmussen
  0 siblings, 1 reply; 18+ messages in thread
From: Dietmar Eggemann @ 2018-06-27 15:41 UTC (permalink / raw)
  To: Morten Rasmussen, Quentin Perret
  Cc: peterz, mingo, valentin.schneider, vincent.guittot,
	gaku.inami.xh, linux-kernel

On 06/22/2018 04:36 PM, Morten Rasmussen wrote:
> On Fri, Jun 22, 2018 at 09:22:22AM +0100, Quentin Perret wrote:
>> Hi Morten,
>>
>> On Wednesday 20 Jun 2018 at 10:05:41 (+0100), Morten Rasmussen wrote:
>>> +static void update_asym_cpucapacity(int cpu)
>>> +{
>>> +	int enable = false;
>>> +
>>> +	rcu_read_lock();
>>> +	if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
>>> +		enable = true;
>>> +	rcu_read_unlock();
>>> +
>>> +	if (enable) {
>>> +		/* This expects to be hotplug-safe */
>>> +		static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
>>> +	}
>>> +}
>>
>> What would happen if you hotplugged an entire cluster ? You'd lose the
>> SD_ASYM_CPUCAPACITY flag but keep the static key is that right ? Should
>> we care ?
> 
> I don't think we should care. The static key enables additional checks
> and tweaks but AFAICT none of them requires the SD_ASYM_CPUCAPACITY to
> be set and they should all have no effect if that is the case. I
> added the static key just to avoid the overhead on systems where they would
> have no effect. At least that is the intention, I could of course have broken
> things by mistake.

I tend to agree for misfit but it would be easy to just add a
static_branch_disable_cpuslocked() into the else path of if(enable).

>> And also, Peter mentioned an issue with the EAS patches with multiple
>> root_domains. Does that apply here as well ? What if you had a
>> configuration with big and little CPUs in different root_domains for ex?
>>
>> Should we disable the static key in the above cases ?
> 
> Exclusive cpusets are more tricky as the flags will be the same for
> sched_domains at the same level. So we can't set the flag correctly if
> someone configures the exclusive cpusets such that you have one
> root_domain spanning big and a subset of little, and one spanning the
> remaining little cpus if all topology levels are preserved. If we
> imagine a three cluster system where 0-3 and 4-7 are little clusters, and
> 8-11 is a big cluster with cpusets configured as 0-5 and 6-11. The first
> set should _not_ have SD_ASYM_CPUCAPACITY set, while the second should.
> 
> I'm tempted to say we shouldn't care in this situation. Setting the
> flags correctly in the three cluster example would require knowledge
> about the cpuset configuration which we don't have in the arch code so
> SD_ASYM_CPUCAPACITY flag detection would have to be done by the
> sched_domain build code. However, not setting the flag according to the
> actual members of the exclusive cpuset means that homogeneous
> sched_domains might have SD_ASYM_CPUCAPACITY set enabling potentially
> wrong scheduling decisions.

We could easily pass the CPU as an argument to all these 
sched_domain_flags_f functions.

-typedef int (*sched_domain_flags_f)(void);
+typedef int (*sched_domain_flags_f)(int cpu);

In this case, the arch specific flag functions on a sched domain (sd) 
level could use the corresponding sched_domain_mask_f function to 
iterate over the span of the sd seen by CPU instead of all online cpus.

The overall question is ... do we need sd setups which are asymmetric 
(different topology flags for sd hierarchies seen by different CPUs) and 
does the scheduler cope with this?

We have seen 3 cluster systems like the MediaTek X20 (2 A72, 4 A53 (max 
1.85Ghz), 4 A53 (max 1.4Ghz)) but this would rather be a big-little-tiny 
system due to the max CPU frequency differences between the A53's.

We could also say that systems with 2 clusters with the same uArch and 
same max CPU frequency and additional clusters are insane, like we e.g. 
do with the Energy Model and CPUs with different uArch within a 
frequency domain?

> We can actually end up with this problem just by hotplugging too. If you
> unplug the entire big cluster in the three cluster example above, you
> preserve DIE level which would have SD_ASYM_CPUCAPACITY set even though
> we only have little cpus left.
> 
> As I see it, we have two choices: 1) Set the flags correctly for
> exclusive cpusets which means some additional "fun" in the sched_domain
> hierarchy set up, or 2) ignore it and make sure that setting

I assume you refer to this cpu parameter for sched_domain_flags_f under 1).

> SD_ASYM_CPUCAPACITY on homogeneous sched_domains works fairly okay. The
> latter seems easier.
> 
> Morten
> 


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCHv3 1/9] sched: Add static_key for asymmetric cpu capacity optimizations
  2018-06-27 15:41       ` Dietmar Eggemann
@ 2018-06-28  8:48         ` Morten Rasmussen
  2018-06-28 17:16           ` Dietmar Eggemann
  0 siblings, 1 reply; 18+ messages in thread
From: Morten Rasmussen @ 2018-06-28  8:48 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Quentin Perret, peterz, mingo, valentin.schneider,
	vincent.guittot, gaku.inami.xh, linux-kernel

On Wed, Jun 27, 2018 at 05:41:22PM +0200, Dietmar Eggemann wrote:
> On 06/22/2018 04:36 PM, Morten Rasmussen wrote:
> >On Fri, Jun 22, 2018 at 09:22:22AM +0100, Quentin Perret wrote:
> >>Hi Morten,
> >>
> >>On Wednesday 20 Jun 2018 at 10:05:41 (+0100), Morten Rasmussen wrote:
> >>>+static void update_asym_cpucapacity(int cpu)
> >>>+{
> >>>+	int enable = false;
> >>>+
> >>>+	rcu_read_lock();
> >>>+	if (lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY))
> >>>+		enable = true;
> >>>+	rcu_read_unlock();
> >>>+
> >>>+	if (enable) {
> >>>+		/* This expects to be hotplug-safe */
> >>>+		static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
> >>>+	}
> >>>+}
> >>
> >>What would happen if you hotplugged an entire cluster ? You'd lose the
> >>SD_ASYM_CPUCAPACITY flag but keep the static key is that right ? Should
> >>we care ?
> >
> >I don't think we should care. The static key enables additional checks
> >and tweaks but AFAICT none of them requires the SD_ASYM_CPUCAPACITY to
> >be set and they should all have no effect if that is the case. I
> >added the static key just to avoid the overhead on systems where they would
> >have no effect. At least that is the intention, I could of course have broken
> >things by mistake.
> 
> I tend to agree for misfit but it would be easy to just add a
> static_branch_disable_cpuslocked() into the else path of if(enable).

Depending on how we address the exclusive cpuset mess Quentin pointed
out, it isn't as simple as that. As it is with our current
not-yet-posted patches we would never remove the SD_ASYM_CPUCAPACITY
flag, so we would never do a static_branch_disable_cpuslocked().

However, I'm currently thinking that we should probably set the flag
according to the cpus in each root_domain. If we do that, we can't just
disable the static_key because one root_domain is SMP, so we would have
to track if _any_ root_domain (exclusive cpuset topology) has the flag
set.

Is it worth it to iterate over all the exclusive cpusets with all
the locking implications to disable the static_key in the corner case
where exclusive cpusets are set up for all cpu capacities in the system?

> >>And also, Peter mentioned an issue with the EAS patches with multiple
> >>root_domains. Does that apply here as well ? What if you had a
> >>configuration with big and little CPUs in different root_domains for ex?
> >>
> >>Should we disable the static key in the above cases ?
> >
> >Exclusive cpusets are more tricky as the flags will be the same for
> >sched_domains at the same level. So we can't set the flag correctly if
> >someone configures the exclusive cpusets such that you have one
> >root_domain spanning big and a subset of little, and one spanning the
> >remaining little cpus if all topology levels are preserved. If we
> >imagine a three cluster system where 0-3 and 4-7 are little clusters, and
> >8-11 is a big cluster with cpusets configured as 0-5 and 6-11. The first
> >set should _not_ have SD_ASYM_CPUCAPACITY set, while the second should.
> >
> >I'm tempted to say we shouldn't care in this situation. Setting the
> >flags correctly in the three cluster example would require knowledge
> >about the cpuset configuration which we don't have in the arch code so
> >SD_ASYM_CPUCAPACITY flag detection would have to be done by the
> >sched_domain build code. However, not setting the flag according to the
> >actual members of the exclusive cpuset means that homogeneous
> >sched_domains might have SD_ASYM_CPUCAPACITY set enabling potentially
> >wrong scheduling decisions.
> 
> We could easily pass the CPU as an argument to all these
> sched_domain_flags_f functions.
> 
> -typedef int (*sched_domain_flags_f)(void);
> +typedef int (*sched_domain_flags_f)(int cpu);
> 
> In this case, the arch specific flag functions on a sched domain (sd) level
> could use the corresponding sched_domain_mask_f function to iterate over the
> span of the sd seen by CPU instead of all online cpus.

Yeah, but I'm afraid that doesn't solve the problem. The
sched_domain_mask_f function doesn't tell us anything about any
exclusive cpusets. We need sched_domain_span(sd) which is
sched_domain_mask_f & cpu_map where cpu_map is the cpumask of the
exclusive cpuset. So we either need to pass the sched_domain_span() mask
through sched_domain_flags_f or work our way back to that information
based on the cpu id. I'm not sure if the latter is possible.

> The overall question is ... do we need sd setups which are asymmetric
> (different topology flags for sd hierarchies seen by different CPUs) and
> does the scheduler cope with this?

Well, in this case each root_domain is still flag-symmetric, so I don't
think it should cause any problems.

> We have seen 3 cluster systems like the MediaTek X20 (2 A72, 4 A53 (max
> 1.85Ghz), 4 A53 (max 1.4Ghz)) but this would rather be a big-little-tiny
> system due to the max CPU frequency differences between the A53's.
> 
> We could also say that systems with 2 clusters with the same uArch and same
> max CPU frequency and additional clusters are insane, like we e.g. do with
> the Energy Model and CPUs with different uArch within a frequency domain?

I fail to get the point here. Are you saying that SMP is insane? ;-)

> 
> >We can actually end up with this problem just by hotplugging too. If you
> >unplug the entire big cluster in the three cluster example above, you
> >preserve DIE level which would have SD_ASYM_CPUCAPACITY set even though
> >we only have little cpus left.
> >
> >As I see it, we have two choices: 1) Set the flags correctly for
> >exclusive cpusets which means some additional "fun" in the sched_domain
> >hierarchy set up, or 2) ignore it and make sure that setting
> 
> I assume you refer to this cpu parameter for sched_domain_flags_f under 1).

No, what I had in mind was to let sd_init() set SD_ASYM_CPUCAPACITY
flag as the arch-code doesn't know about the exclusive cpusets as
discussed above. In the meantime I have thought about a different
approach where sd_init() only disables the flag when it is no longer
needed due to exclusive cpusets.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCHv3 1/9] sched: Add static_key for asymmetric cpu capacity optimizations
  2018-06-28  8:48         ` Morten Rasmussen
@ 2018-06-28 17:16           ` Dietmar Eggemann
  2018-07-02  8:36             ` Morten Rasmussen
  0 siblings, 1 reply; 18+ messages in thread
From: Dietmar Eggemann @ 2018-06-28 17:16 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Quentin Perret, peterz, mingo, valentin.schneider,
	vincent.guittot, gaku.inami.xh, linux-kernel

On 06/28/2018 10:48 AM, Morten Rasmussen wrote:
> On Wed, Jun 27, 2018 at 05:41:22PM +0200, Dietmar Eggemann wrote:
>> On 06/22/2018 04:36 PM, Morten Rasmussen wrote:
>>> On Fri, Jun 22, 2018 at 09:22:22AM +0100, Quentin Perret wrote:

[...]

>>>> What would happen if you hotplugged an entire cluster? You'd lose the
>>>> SD_ASYM_CPUCAPACITY flag but keep the static key, is that right? Should
>>>> we care?
>>>
>>> I don't think we should care. The static key enables additional checks
>>> and tweaks but AFAICT none of them requires the SD_ASYM_CPUCAPACITY to
>>> be set and they should all have no effect if that is the case. I added
>>> the static key just to avoid the overhead on systems where they would
>>> have no effect. At least that is the intention, I could of course have
>>> broken things by mistake.
>>
>> I tend to agree for misfit but it would be easy to just add a
>> static_branch_disable_cpuslocked() into the else path of if(enable).
> 
> Depending on how we address the exclusive cpuset mess Quentin pointed
> out, it isn't as simple as that. As it is with our current
> not-yet-posted patches we would never remove the SD_ASYM_CPUCAPACITY
> flag, so we would never do a static_branch_disable_cpuslocked().

I was rather referring to the 'hotplug out an entire cluster' mentioned 
above. Looking closer into the code, I see that this will only work for 
traditional big.LITTLE (one big, one little cluster) since on them the 
DIE level vanishes and with it the SD_ASYM_CPUCAPACITY flag.
Currently we only detect the flags when the system comes up (in 
init_cpu_capacity_callback()) and not when we hotplug cpus. That's why 
it doesn't work for your 'three cluster system where 0-3 and 4-7 are 
little clusters, and 8-11 is a big cluster' example.
So we don't re-detect the flags every time we call from the scheduler 
into the arch code, and so the DIE level arch topology flag function 
will still return SD_ASYM_CPUCAPACITY.
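
Conceptually (hand-written sketch, both names below are made up) it
boils down to something like:

    /* set once, from init_cpu_capacity_callback(), never re-evaluated */
    static bool asym_cap_detected;

    /* DIE level topology flag function */
    static int arch_asym_cpucapacity_flags(void)
    {
            return asym_cap_detected ? SD_ASYM_CPUCAPACITY : 0;
    }

so whatever we hotplug afterwards, the flag function keeps returning the
answer computed at boot.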

> However, I'm currently thinking that we should probably set the flag
> according to the cpus in each root_domain. If we do that, we can't
> disable the static_key just because one root_domain is SMP, so we would
> have to track if _any_ root_domain (exclusive cpuset topology) has the
> flag set.

This is then in sync with the energy model where the static key should 
be enabled if any root domain can do EAS. The static key would be system 
wide, not per root domain.

> Is it worth it to iterate over all the exclusive cpusets with all
> the locking implications to disable the static_key in the corner case
> where exclusive cpusets are set up for all cpu capacities in the system?

I don't know either. But if the code to do this would handle 'exclusive 
cpusets' and platforms like your three cluster example, then maybe ...

[...]

>>> I'm tempted to say we shouldn't care in this situation. Setting the
>>> flags correctly in the three cluster example would require knowledge
>>> about the cpuset configuration which we don't have in the arch code so
>>> SD_ASYM_CPUCAPACITY flag detection would have to be done by the
>>> sched_domain build code. However, not setting the flag according to the
>>> actual members of the exclusive cpuset means that homogeneous
>>> sched_domains might have SD_ASYM_CPUCAPACITY set enabling potentially
>>> wrong scheduling decisions.
>>
>> We could easily pass the CPU as an argument to all these
>> sched_domain_flags_f functions.
>>
>> -typedef int (*sched_domain_flags_f)(void);
>> +typedef int (*sched_domain_flags_f)(int cpu);
>>
>> In this case, the arch specific flag functions on a sched domain (sd) level
>> could use the corresponding sched_domain_mask_f function to iterate over the
>> span of the sd seen by CPU instead of all online cpus.
> 
> Yeah, but I'm afraid that doesn't solve the problem. The
> sched_domain_mask_f function doesn't tell us anything about any
> exclusive cpusets. We need sched_domain_span(sd) which is

You're right, I checked again and even if we hotplug, the mask function 
tl->mask(cpu), like cpu_coregroup_mask(), always returns the initial set 
of cpus. So we are only safe if the DIE sd vanishes and with it the 
SD_ASYM_CPUCAPACITY flag, since then nobody will call the appropriate 
arch topology flag function.

> sched_domain_mask_f & cpu_map where cpu_map is the cpumask of the
> exclusive cpuset. So we either need to pass the sched_domain_span() mask
> through sched_domain_flags_f or work our way back to that information
> based on the cpu id. I'm not sure if the latter is possible.

Agreed.

[...]

>> We could also say that systems with 2 clusters with the same uArch and same
>> max CPU frequency and additional clusters are insane, like we e.g. do with
>> the Energy Model and CPUs with different uArch within a frequency domain?
> 
> I fail to get the point here. Are you saying that SMP is insane? ;-)

The '... and additional clusters' part is important, i.e. a system like 
the three cluster system you describe above.

But the problem that the DIE level is still there with the 
SD_ASYM_CPUCAPACITY flag if you hotplug out the big cluster is due to 
the fact that we don't start the flag detection mechanism during 
hotplug, right?

>>> We can actually end up with this problem just by hotplugging too. If you
>>> unplug the entire big cluster in the three cluster example above, you
>>> preserve DIE level which would have SD_ASYM_CPUCAPACITY set even though
>>> we only have little cpus left.

IMHO, again that's because we do flag detection only at system startup.

>>> As I see it, we have two choices: 1) Set the flags correctly for
>>> exclusive cpusets which means some additional "fun" in the sched_domain
>>> hierarchy set up, or 2) ignore it and make sure that setting
>>
>> I assume you refer to this cpu parameter for sched_domain_flags_f under 1).
> 
> No, what I had in mind was to let sd_init() set SD_ASYM_CPUCAPACITY
> flag as the arch-code doesn't know about the exclusive cpusets as
> discussed above. In the meantime I have thought about a different
> approach where sd_init() only disables the flag when it is no longer
> needed due to exclusive cpusets.

In this case sd_init() has to know the cpu capacity values. I assume 
that the arch still has to call rebuild_sched_domains() after cpufreq is 
up and running due to the max cpu frequency dependency for cpu capacity.

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCHv3 1/9] sched: Add static_key for asymmetric cpu capacity optimizations
  2018-06-28 17:16           ` Dietmar Eggemann
@ 2018-07-02  8:36             ` Morten Rasmussen
  0 siblings, 0 replies; 18+ messages in thread
From: Morten Rasmussen @ 2018-07-02  8:36 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Quentin Perret, peterz, mingo, valentin.schneider,
	vincent.guittot, gaku.inami.xh, linux-kernel

On Thu, Jun 28, 2018 at 07:16:46PM +0200, Dietmar Eggemann wrote:
> On 06/28/2018 10:48 AM, Morten Rasmussen wrote:
> >On Wed, Jun 27, 2018 at 05:41:22PM +0200, Dietmar Eggemann wrote:
> >>On 06/22/2018 04:36 PM, Morten Rasmussen wrote:
> >>>On Fri, Jun 22, 2018 at 09:22:22AM +0100, Quentin Perret wrote:
> 
> [...]
> 
> >>>>What would happen if you hotplugged an entire cluster? You'd lose the
> >>>>SD_ASYM_CPUCAPACITY flag but keep the static key, is that right? Should
> >>>>we care?
> >>>
> >>>I don't think we should care. The static key enables additional checks
> >>>and tweaks but AFAICT none of them requires the SD_ASYM_CPUCAPACITY to
> >>>be set and they should all have no effect if that is the case. I added
> >>>the static key just to avoid the overhead on systems where they would
> >>>have no effect. At least that is the intention, I could of course have
> >>>broken things by mistake.
> >>
> >>I tend to agree for misfit but it would be easy to just add a
> >>static_branch_disable_cpuslocked() into the else path of if(enable).
> >
> >Depending on how we address the exclusive cpuset mess Quentin pointed
> >out, it isn't as simple as that. As it is with our current
> >not-yet-posted patches we would never remove the SD_ASYM_CPUCAPACITY
> >flag, so we would never do a static_branch_disable_cpuslocked().
> 
> I was rather referring to the 'hotplug out an entire cluster' mentioned
> above. Looking closer into the code, I see that this will only work for
> traditional big.LITTLE (one big, one little cluster) since on them the
> DIE level vanishes and with it the SD_ASYM_CPUCAPACITY flag.

Agreed, disabling the branch in that case should be possible but only if
we know that there aren't any exclusive cpusets.

> Currently we only detect the flags when the system comes up (in
> init_cpu_capacity_callback()) and not when we hotplug cpus. That's why it
> doesn't work for your 'three cluster system where 0-3 and 4-7 are little
> clusters, and 8-11 is a big cluster' example.
> So we don't re-detect the flags every time we call from the scheduler into
> the arch code, and so the DIE level arch topology flag function will still
> return SD_ASYM_CPUCAPACITY.

It would be fairly easy to re-detect every time we hotplug a cpu in or
out, but that won't help us with the exclusive cpuset issue.

> 
> >However, I'm currently thinking that we should probably set the flag
> >according to the cpus in each root_domain. If we do that, we can't
> >disable the static_key just because one root_domain is SMP, so we would
> >have to track if _any_ root_domain (exclusive cpuset topology) has the
> >flag set.
> 
> This is then in sync with the energy model where the static key should be
> enabled if any root domain can do EAS. The static key would be system wide,
> not per root domain.

The static_key can only be system wide :-)

> 
> >Is it worth it to iterate over all the exclusive cpusets with all
> >the locking implications to disable the static_key in the corner case
> >where exclusive cpusets are set up for all cpu capacities in the system?
> 
> I don't know either. But if the code to do this would handle 'exclusive
> cpusets' and platforms like your three cluster example, then maybe ...

Iterating over all root_domains should work for any platform. IMHO, it
might be a lot of additional complexity to disable some optimizations in
a few corner cases.
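
For the record, this is roughly what it would take (untested sketch;
asym_span() is a made-up helper checking whether a cpumask spans
different cpu capacities, and I assume the static key from patch 1 is
called sched_asym_cpucapacity):

    static void asym_cpu_capacity_update(void)
    {
            bool asym = false;
            int cpu;

            rcu_read_lock();
            for_each_online_cpu(cpu) {
                    /*
                     * One root_domain per exclusive cpuset. We visit each
                     * rd more than once here, but it is just a sketch.
                     */
                    struct root_domain *rd = cpu_rq(cpu)->rd;

                    if (asym_span(rd->span)) {
                            asym = true;
                            break;
                    }
            }
            rcu_read_unlock();

            if (asym)
                    static_branch_enable_cpuslocked(&sched_asym_cpucapacity);
            else
                    static_branch_disable_cpuslocked(&sched_asym_cpucapacity);
    }

plus figuring out where to call it from with the right locks held, which
is exactly the additional complexity I would rather avoid.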

> 
> [...]
> 
> >>>I'm tempted to say we shouldn't care in this situation. Setting the
> >>>flags correctly in the three cluster example would require knowledge
> >>>about the cpuset configuration which we don't have in the arch code so
> >>>SD_ASYM_CPUCAPACITY flag detection would have to be done by the
> >>>sched_domain build code. However, not setting the flag according to the
> >>>actual members of the exclusive cpuset means that homogeneous
> >>>sched_domains might have SD_ASYM_CPUCAPACITY set enabling potentially
> >>>wrong scheduling decisions.
> >>
> >>We could easily pass the CPU as an argument to all these
> >>sched_domain_flags_f functions.
> >>
> >>-typedef int (*sched_domain_flags_f)(void);
> >>+typedef int (*sched_domain_flags_f)(int cpu);
> >>
> >>In this case, the arch specific flag functions on a sched domain (sd) level
> >>could use the corresponding sched_domain_mask_f function to iterate over the
> >>span of the sd seen by CPU instead of all online cpus.
> >
> >Yeah, but I'm afraid that doesn't solve the problem. The
> >sched_domain_mask_f function doesn't tell us anything about any
> >exclusive cpusets. We need sched_domain_span(sd) which is
> 
> You're right, I checked again and even if we hotplug, the mask function
> tl->mask(cpu), like cpu_coregroup_mask(), always returns the initial set
> of cpus. So we are only safe if the DIE sd vanishes and with it the
> SD_ASYM_CPUCAPACITY flag, since then nobody will call the appropriate
> arch topology flag function.

Yeah, that isn't really solving the problem, it is more like working by
accident ;-)

> 
> >sched_domain_mask_f & cpu_map where cpu_map is the cpumask of the
> >exclusive cpuset. So we either need to pass the sched_domain_span() mask
> >through sched_domain_flags_f or work our way back to that information
> >based on the cpu id. I'm not sure if the latter is possible.
> 
> Agreed.
> 
> [...]
> 
> >>We could also say that systems with 2 clusters with the same uArch and same
> >>max CPU frequency and additional clusters are insane, like we e.g. do with
> >>the Energy Model and CPUs with different uArch within a frequency domain?
> >
> >I fail to get the point here. Are you saying that SMP is insane? ;-)
> 
> The '... and additional clusters' part is important, i.e. a system like
> the three cluster system you describe above.
> 
> But the problem that the DIE level is still there with the
> SD_ASYM_CPUCAPACITY flag if you hotplug out the big cluster is due to
> the fact that we don't start the flag detection mechanism during
> hotplug, right?

True, flag detection is currently only done once (well, twice: when the
secondary cpus come up, and again at cpufreq init). We can't rely on
sched_domains being eliminated to get the flags right in the general
case. It just appears to work on systems we have considered so far.

> 
> >>>We can actually end up with this problem just by hotplugging too. If you
> >>>unplug the entire big cluster in the three cluster example above, you
> >>>preserve DIE level which would have SD_ASYM_CPUCAPACITY set even though
> >>>we only have little cpus left.
> 
> IMHO, again that's because we do flag detection only at system startup.

Yes.

> 
> >>>As I see it, we have two choices: 1) Set the flags correctly for
> >>>exclusive cpusets which means some additional "fun" in the sched_domain
> >>>hierarchy set up, or 2) ignore it and make sure that setting
> >>
> >>I assume you refer to this cpu parameter for sched_domain_flags_f under 1).
> >
> >No, what I had in mind was to let sd_init() set SD_ASYM_CPUCAPACITY
> >flag as the arch-code doesn't know about the exclusive cpusets as
> >discussed above. In the meantime I have thought about a different
> >approach where sd_init() only disables the flag when it is no longer
> >needed due to exclusive cpusets.
> 
> In this case sd_init() has to know the cpu capacity values. I assume that
> the arch still has to call rebuild_sched_domains() after cpufreq is up and
> running due to the max cpu frequency dependency for cpu capacity.

Yes, sd_init() will just access rq->cpu_capacity_orig which should be
straightforward. There is no change in when we rebuild the sched_domain
hierarchy. It will be rebuilt once cpufreq is up, and every time we mess
with exclusive cpusets.
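
Roughly what I have in mind (untested sketch, not the actual patch; the
helper name is made up and it would be called from the sched_domain
build code once the span is known):

    /* Clear SD_ASYM_CPUCAPACITY again if the span turns out symmetric */
    static void sd_update_asym_flag(struct sched_domain *sd)
    {
            unsigned long cap;
            int cpu;

            if (!(sd->flags & SD_ASYM_CPUCAPACITY))
                    return;

            cpu = cpumask_first(sched_domain_span(sd));
            cap = cpu_rq(cpu)->cpu_capacity_orig;

            for_each_cpu(cpu, sched_domain_span(sd)) {
                    if (cpu_rq(cpu)->cpu_capacity_orig != cap)
                            return; /* asymmetric, keep the flag */
            }

            sd->flags &= ~SD_ASYM_CPUCAPACITY;
    }

Since rq->cpu_capacity_orig is only meaningful once cpufreq is up, this
relies on the rebuild you mention anyway.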

^ permalink raw reply	[flat|nested] 18+ messages in thread

* RE: [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-06-20  9:05 [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
                   ` (8 preceding siblings ...)
  2018-06-20  9:05 ` [PATCHv3 9/9] sched/fair: Don't move tasks to lower capacity cpus unless necessary Morten Rasmussen
@ 2018-07-03  2:28 ` Gaku Inami
  2018-07-04 10:33   ` Morten Rasmussen
  9 siblings, 1 reply; 18+ messages in thread
From: Gaku Inami @ 2018-07-03  2:28 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: valentin.schneider, dietmar.eggemann, vincent.guittot, linux-kernel

Hi,

> -----Original Message-----
> From: Morten Rasmussen <morten.rasmussen@arm.com>
> Sent: Wednesday, June 20, 2018 6:06 PM
> To: peterz@infradead.org; mingo@redhat.com
> Cc: valentin.schneider@arm.com; dietmar.eggemann@arm.com; vincent.guittot@linaro.org; Gaku Inami
> <gaku.inami.xh@renesas.com>; linux-kernel@vger.kernel.org; Morten Rasmussen <morten.rasmussen@arm.com>
> Subject: [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
[snip]
> 
> The patches have been tested on:
>    1. Arm Juno (r0): 2+4 Cortex A57/A53
>    2. Hikey960: 4+4 Cortex A73/A53
> 
> Test case:
> Big cpus are always kept busy. Pin a shorter running sysbench tasks to
> big cpus, while creating a longer running set of unpinned sysbench
> tasks.

I have tested v3 patches on Renesas SoC again. It looks fine.

You can add:

Tested-by: Gaku Inami <gaku.inami.xh@renesas.com>

The patches have been tested on:
   3. Renesas R-Car H3 : 4+4 Cortex A57/A53

Results:
Single runs with completion time of each task
R-Car H3 (tip)
    total time:                          0.9391s
    total time:                          0.9865s
    total time:                          1.3691s
    total time:                          1.6740s

R-Car H3 (misfit)
    total time:                          0.9368s
    total time:                          0.9475s
    total time:                          0.9471s
    total time:                          0.9505s

10 run summary (tracking longest running task for each run)
	R-Car H3
	avg	max
tip     1.6742  1.6750
misfit  0.9784  0.9905

Regards,
Inami

[snip]

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
  2018-07-03  2:28 ` [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Gaku Inami
@ 2018-07-04 10:33   ` Morten Rasmussen
  0 siblings, 0 replies; 18+ messages in thread
From: Morten Rasmussen @ 2018-07-04 10:33 UTC (permalink / raw)
  To: Gaku Inami
  Cc: peterz, mingo, valentin.schneider, dietmar.eggemann,
	vincent.guittot, linux-kernel

Hi,

On Tue, Jul 03, 2018 at 02:28:28AM +0000, Gaku Inami wrote:
> Hi,
> 
> > -----Original Message-----
> > From: Morten Rasmussen <morten.rasmussen@arm.com>
> > Sent: Wednesday, June 20, 2018 6:06 PM
> > To: peterz@infradead.org; mingo@redhat.com
> > Cc: valentin.schneider@arm.com; dietmar.eggemann@arm.com; vincent.guittot@linaro.org; Gaku Inami
> > <gaku.inami.xh@renesas.com>; linux-kernel@vger.kernel.org; Morten Rasmussen <morten.rasmussen@arm.com>
> > Subject: [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems
> [snip]
> > 
> > The patches have been tested on:
> >    1. Arm Juno (r0): 2+4 Cortex A57/A53
> >    2. Hikey960: 4+4 Cortex A73/A53
> > 
> > Test case:
> > Big cpus are always kept busy. Pin a shorter running sysbench tasks to
> > big cpus, while creating a longer running set of unpinned sysbench
> > tasks.
> 
> I have tested v3 patches on Renesas SoC again. It looks fine.
> 
> You can add:
> 
> Tested-by: Gaku Inami <gaku.inami.xh@renesas.com>
> 
> The patches have been tested on:
>    3. Renesas R-Car H3 : 4+4 Cortex A57/A53
> 
> Results:
> Single runs with completion time of each task
> R-Car H3 (tip)
>     total time:                          0.9391s
>     total time:                          0.9865s
>     total time:                          1.3691s
>     total time:                          1.6740s
> 
> R-Car H3 (misfit)
>     total time:                          0.9368s
>     total time:                          0.9475s
>     total time:                          0.9471s
>     total time:                          0.9505s
> 
> 10 run summary (tracking longest running task for each run)
> 	R-Car H3
> 	avg	max
> tip     1.6742  1.6750
> misfit  0.9784  0.9905

Thanks for testing again. I have just posted v4 with some minor changes.
Behaviour for the test-cases should be the same.

Morten

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2018-07-04 10:33 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-20  9:05 [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Morten Rasmussen
2018-06-20  9:05 ` [PATCHv3 1/9] sched: Add static_key for asymmetric cpu capacity optimizations Morten Rasmussen
2018-06-22  8:22   ` Quentin Perret
2018-06-22 14:36     ` Morten Rasmussen
2018-06-27 15:41       ` Dietmar Eggemann
2018-06-28  8:48         ` Morten Rasmussen
2018-06-28 17:16           ` Dietmar Eggemann
2018-07-02  8:36             ` Morten Rasmussen
2018-06-20  9:05 ` [PATCHv3 2/9] sched/fair: Add group_misfit_task load-balance type Morten Rasmussen
2018-06-20  9:05 ` [PATCHv3 3/9] sched: Add sched_group per-cpu max capacity Morten Rasmussen
2018-06-20  9:05 ` [PATCHv3 4/9] sched/fair: Consider misfit tasks when load-balancing Morten Rasmussen
2018-06-20  9:05 ` [PATCHv3 5/9] sched/fair: Kick nohz balance if rq->misfit_task Morten Rasmussen
2018-06-20  9:05 ` [PATCHv3 6/9] sched: Change root_domain->overload type to int Morten Rasmussen
2018-06-20  9:05 ` [PATCHv3 7/9] sched: Wrap rq->rd->overload accesses with READ/WRITE_ONCE Morten Rasmussen
2018-06-20  9:05 ` [PATCHv3 8/9] sched/fair: Set sd->overload when misfit Morten Rasmussen
2018-06-20  9:05 ` [PATCHv3 9/9] sched/fair: Don't move tasks to lower capacity cpus unless necessary Morten Rasmussen
2018-07-03  2:28 ` [PATCHv3 0/9] sched/fair: Migrate 'misfit' tasks on asymmetric capacity systems Gaku Inami
2018-07-04 10:33   ` Morten Rasmussen
