* [PATCH v4 0/5] sched: Clean-ups and asymmetric cpu capacity support
@ 2016-08-31 10:52 Morten Rasmussen
  2016-08-31 10:52 ` [PATCH v4 1/5] sched/fair: Compute task/cpu utilization at wake-up correctly Morten Rasmussen
                   ` (5 more replies)
  0 siblings, 6 replies; 9+ messages in thread
From: Morten Rasmussen @ 2016-08-31 10:52 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

Hi,

The scheduler currently does little to help performance on systems with
asymmetric compute capacities (read: ARM big.LITTLE). This series improves
the situation with a few tweaks, mainly to the task wake-up path, so that
compute capacity is considered at wake-up on these systems and not just
whether a cpu is idle. This gives us consistent, and potentially higher,
throughput in partially utilized scenarios. SMP behaviour and performance
should be unaffected.

Test 0:
	for i in `seq 1 10`; \
	       do sysbench --test=cpu --max-time=3 --num-threads=1 run; \
	       done \
	| awk '{if ($4=="events:") {print $5; sum +=$5; runs +=1}} \
	       END {print "Average events: " sum/runs}'

Target: ARM TC2 (2xA15+3xA7)

	(Higher is better)
tip:	Average events: 126.9 
patch:	Average events: 217.9

Target: ARM Juno (2xA57+4xA53)

	(Higher is better)
tip:	Average events: 2082.6
patch:	Average events: 2687.5

Test 1:
	perf stat --null --repeat 10 -- \
	perf bench sched messaging -g 50 -l 5000

Target: Intel IVB-EP (2*10*2)

tip:     4.652802973 seconds time elapsed ( +-  0.99% )
patch:   4.643020680 seconds time elapsed ( +-  1.26% )

Target: ARM TC2 A7-only (3xA7) (-l 1000)

tip:    61.902516175 seconds time elapsed ( +-  0.22% )
patch:  63.178903751 seconds time elapsed ( +-  0.30% )

Target: ARM Juno A53-only (4xA53) (-l 1000)

tip:    37.919193364 seconds time elapsed ( +-  0.29% ) 
patch:  37.568717760 seconds time elapsed ( +-  0.13% )

Notes:

Active migration of tasks away from small capacity cpus isn't addressed
in this set, although it is necessary for consistent throughput in other
scenarios on asymmetric cpu capacity systems.

The infrastructure to enable capacity awareness for arm64 and arm is not
provided here but will be based on Juri's DT bindings patch set [1]. A
combined preview branch is available [2]. The test results above are
based on [2].

[1] https://lkml.org/lkml/2016/7/19/419
[2] git://linux-arm.org/linux-power.git capacity_awareness_v4_arm64_v1

Patch    1: Fix task utilization for wake-up decisions.
Patch  2-5: Improve capacity awareness.

Tested-by: Koan-Sin Tan <freedom.tan@mediatek.com>
Tested-by: Keita Kobayashi <keita.kobayashi.ym@renesas.com>

v4:

- Removed patches already in tip/sched/core.

- Fixed wrong use of capacity_of() instead of capacity_orig_of() as
  reported by Wanpeng Li.

- Re-implemented the fix for task wake-up utilization. Instead of
  estimating the utilization, it is now computed and updated correctly.

- Introduced peak utilization tracking to compensate for decay in
  wake-up placement decisions.

- Removed pointless spare capacity selection criteria in
  find_idlest_group() as pointed out by Vincent and added a comment
  describing when we use spare capacity instead of least load.

v3: https://lkml.org/lkml/2016/7/25/245

- Changed SD_ASYM_CPUCAPACITY sched_domain flag semantics as suggested
  by PeterZ.

- Dropped arm specific patches for setting cpu capacity as these are
  superseded by Juri's patches [2].

- Changed capacity-aware pulling during load-balance to use sched_group
  min capacity instead of max as suggested by Sai.

v2: https://lkml.org/lkml/2016/6/22/614

- Dropped patch ignoring wakee_flips for pid=0 for now as we can not
  distinguish cpu time processing irqs from idle time.

- Dropped disabling WAKE_AFFINE as suggested by Vincent to allow more
  scenarios to use fast-path (select_idle_sibling()). Asymmetric wake
  conditions adjusted accordingly.

- Changed use of new SD_ASYM_CPUCAPACITY slightly. Now enables
  SD_BALANCE_WAKE.

- Minor clean-ups and rebased to more recent tip/sched/core.

v1: https://lkml.org/lkml/2014/5/23/621

Morten Rasmussen (5):
  sched/fair: Compute task/cpu utilization at wake-up correctly
  sched/fair: Consider spare capacity in find_idlest_group()
  sched: Add per-cpu min capacity to sched_group_capacity
  sched/fair: Avoid pulling tasks from non-overloaded higher capacity
    groups
  sched/fair: Track peak per-entity utilization

 include/linux/sched.h |   2 +-
 kernel/sched/core.c   |   3 +-
 kernel/sched/fair.c   | 142 ++++++++++++++++++++++++++++++++++++++++++++------
 kernel/sched/sched.h  |   3 +-
 4 files changed, 130 insertions(+), 20 deletions(-)

-- 
1.9.1

^ permalink raw reply	[flat|nested] 9+ messages in thread

* [PATCH v4 1/5] sched/fair: Compute task/cpu utilization at wake-up correctly
  2016-08-31 10:52 [PATCH v4 0/5] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
@ 2016-08-31 10:52 ` Morten Rasmussen
  2016-08-31 10:52 ` [PATCH v4 2/5] sched/fair: Consider spare capacity in find_idlest_group() Morten Rasmussen
                   ` (4 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Morten Rasmussen @ 2016-08-31 10:52 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

At task wake-up, load-tracking isn't updated until the task is enqueued.
The task's own view of its utilization contribution may therefore not be
aligned with its contribution to the cfs_rq load-tracking, which may have
been updated in the meantime. Basically, the task's own utilization hasn't
yet accounted for the sleep decay, while the cfs_rq may have (partially).
Estimating the cfs_rq utilization in case the task is migrated at wake-up
as task_rq(p)->cfs.avg.util_avg - p->se.avg.util_avg is therefore
incorrect as the two load-tracking signals aren't time-synchronized (they
have different last update times).

To solve this problem, this patch synchronizes the task utilization with
its previous rq before the task utilization is used in the wake-up path.
Currently the update/synchronization is done _after_ the task has been
placed by select_task_rq_fair(). The synchronization is done without
having to take the rq lock using the existing mechanism used in
remove_entity_load_avg().

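As a purely illustrative aside (not part of the patch), the standalone
user-space sketch below uses a simplified decay model (utilization roughly
halving per 32ms of sleep, approximating PELT) and hypothetical numbers to
show how subtracting a stale task signal from an up-to-date cfs_rq signal
goes wrong:

#include <stdio.h>
#include <math.h>

/*
 * Standalone sketch, not kernel code: simplified PELT-style decay where a
 * signal roughly halves for every 32ms of sleep (y^32 ~= 0.5). All numbers
 * are hypothetical. Build with -lm.
 */
static double decay(double util, double sleep_ms)
{
	return util * pow(0.5, sleep_ms / 32.0);
}

int main(void)
{
	double task_util  = 400.0;	/* task contribution before sleeping     */
	double other_util = 100.0;	/* contribution of other tasks on the rq */
	double sleep_ms   = 32.0;

	/* cfs_rq signal is up to date: the blocked task's share has decayed */
	double cfs_util = decay(task_util, sleep_ms) + other_util;	/* 300  */

	/* subtracting the stale (un-decayed) task signal goes negative */
	double stale  = cfs_util - task_util;				/* -100 */
	/* subtracting the synchronized task signal gives the right answer */
	double synced = cfs_util - decay(task_util, sleep_ms);		/* 100  */

	printf("stale estimate: %.0f, synchronized estimate: %.0f\n",
	       stale, synced);
	return 0;
}
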
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 39 +++++++++++++++++++++++++++++++++++----
 1 file changed, 35 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 61d485421bed..adff97d6c834 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3129,13 +3129,25 @@ static inline u64 cfs_rq_last_update_time(struct cfs_rq *cfs_rq)
 #endif
 
 /*
+ * Synchronize entity load avg of dequeued entity without locking
+ * the previous rq.
+ */
+void sync_entity_load_avg(struct sched_entity *se)
+{
+	struct cfs_rq *cfs_rq = cfs_rq_of(se);
+	u64 last_update_time;
+
+	last_update_time = cfs_rq_last_update_time(cfs_rq);
+	__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+}
+
+/*
  * Task first catches up with cfs_rq, and then subtract
  * itself from the cfs_rq (task must be off the queue now).
  */
 void remove_entity_load_avg(struct sched_entity *se)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	u64 last_update_time;
 
 	/*
 	 * tasks cannot exit without having gone through wake_up_new_task() ->
@@ -3147,9 +3159,7 @@ void remove_entity_load_avg(struct sched_entity *se)
 	 * calls this.
 	 */
 
-	last_update_time = cfs_rq_last_update_time(cfs_rq);
-
-	__update_load_avg(last_update_time, cpu_of(rq_of(cfs_rq)), &se->avg, 0, 0, NULL);
+	sync_entity_load_avg(se);
 	atomic_long_add(se->avg.load_avg, &cfs_rq->removed_load_avg);
 	atomic_long_add(se->avg.util_avg, &cfs_rq->removed_util_avg);
 }
@@ -5388,6 +5398,24 @@ static inline int task_util(struct task_struct *p)
 }
 
 /*
+ * cpu_util_wake: Compute cpu utilization with any contributions from
+ * the waking task p removed.
+ */
+static int cpu_util_wake(int cpu, struct task_struct *p)
+{
+	unsigned long util, capacity;
+
+	/* Task has no contribution or is new */
+	if (cpu != task_cpu(p) || !p->se.avg.last_update_time)
+		return cpu_util(cpu);
+
+	capacity = capacity_orig_of(cpu);
+	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+
+	return (util >= capacity) ? capacity : util;
+}
+
+/*
  * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
  * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
  *
@@ -5405,6 +5433,9 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	if (max_cap - min_cap < max_cap >> 3)
 		return 0;
 
+	/* Bring task utilization in sync with prev_cpu */
+	sync_entity_load_avg(&p->se);
+
 	return min_cap * 1024 < task_util(p) * capacity_margin;
 }
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 2/5] sched/fair: Consider spare capacity in find_idlest_group()
  2016-08-31 10:52 [PATCH v4 0/5] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
  2016-08-31 10:52 ` [PATCH v4 1/5] sched/fair: Compute task/cpu utilization at wake-up correctly Morten Rasmussen
@ 2016-08-31 10:52 ` Morten Rasmussen
  2016-08-31 10:52 ` [PATCH v4 3/5] sched: Add per-cpu min capacity to sched_group_capacity Morten Rasmussen
                   ` (3 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Morten Rasmussen @ 2016-08-31 10:52 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

In low-utilization scenarios, comparing relative loads in
find_idlest_group() doesn't always lead to the optimal choice. Systems
with groups containing different numbers of cpus and/or cpus of
different compute capacity are significantly better off when considering
spare capacity rather than relative load in those scenarios.

In addition to the existing load-based search, an alternative
spare-capacity-based candidate sched_group is found and selected instead
if sufficient spare capacity exists. If not, the existing behaviour is
preserved.

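To make the scenario concrete, here is a small standalone program (not
part of the patch) with made-up numbers for a two-group big.LITTLE-style
system: the relative-load comparison favours the little group, while the
spare-capacity comparison favours the big group, which actually has room
for the waking task:

#include <stdio.h>

/*
 * Standalone sketch, not kernel code. Hypothetical numbers for two
 * single-cpu groups; utilization is used as a rough stand-in for load.
 */
struct group_stat {
	const char *name;
	unsigned long capacity;	/* capacity of the cpu in the group */
	unsigned long util;	/* current utilization of that cpu  */
};

int main(void)
{
	struct group_stat groups[] = {
		{ "big",    1024, 400 },	/* big cpu at ~40% utilization    */
		{ "little",  430, 100 },	/* little cpu at ~23% utilization */
	};
	unsigned long task_util = 300;		/* utilization of the waking task */

	for (int i = 0; i < 2; i++) {
		/* load relative to group capacity, as the load-based path compares */
		unsigned long avg_load = groups[i].util * 1024 / groups[i].capacity;
		unsigned long spare = groups[i].capacity - groups[i].util;

		printf("%-6s avg_load=%4lu spare=%4lu spare>task_util/2: %s\n",
		       groups[i].name, avg_load, spare,
		       spare > task_util / 2 ? "yes" : "no");
	}
	/*
	 * Least relative load would pick "little" (238 < 400), while spare
	 * capacity picks "big" (624 > 330), which has room for the task.
	 */
	return 0;
}
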
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 50 +++++++++++++++++++++++++++++++++++++++++++++-----
 1 file changed, 45 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index adff97d6c834..ebe8907119b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5184,6 +5184,14 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return 1;
 }
 
+static inline int task_util(struct task_struct *p);
+static int cpu_util_wake(int cpu, struct task_struct *p);
+
+static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
+{
+	return capacity_orig_of(cpu) - cpu_util_wake(cpu, p);
+}
+
 /*
  * find_idlest_group finds and returns the least busy CPU group within the
  * domain.
@@ -5193,7 +5201,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
+	struct sched_group *most_spare_sg = NULL;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long most_spare = 0, this_spare = 0;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -5201,7 +5211,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load;
+		unsigned long load, avg_load, spare_cap, max_spare_cap;
 		int local_group;
 		int i;
 
@@ -5213,8 +5223,12 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		local_group = cpumask_test_cpu(this_cpu,
 					       sched_group_cpus(group));
 
-		/* Tally up the load of all CPUs in the group */
+		/*
+		 * Tally up the load of all CPUs in the group and find
+		 * the group containing the cpu with most spare capacity.
+		 */
 		avg_load = 0;
+		max_spare_cap = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -5224,6 +5238,11 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 				load = target_load(i, load_idx);
 
 			avg_load += load;
+
+			spare_cap = capacity_spare_wake(i, p);
+
+			if (spare_cap > max_spare_cap)
+				max_spare_cap = spare_cap;
 		}
 
 		/* Adjust by relative CPU capacity of the group */
@@ -5231,12 +5250,33 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		if (local_group) {
 			this_load = avg_load;
-		} else if (avg_load < min_load) {
-			min_load = avg_load;
-			idlest = group;
+			this_spare = max_spare_cap;
+		} else {
+			if (avg_load < min_load) {
+				min_load = avg_load;
+				idlest = group;
+			}
+
+			if (most_spare < max_spare_cap) {
+				most_spare = max_spare_cap;
+				most_spare_sg = group;
+			}
 		}
 	} while (group = group->next, group != sd->groups);
 
+	/*
+	 * The cross-over point between using spare capacity or least load
+	 * is too conservative for high utilization tasks on partially
+	 * utilized systems if we require spare_capacity > task_util(p),
+	 * so we allow for some task stuffing by using
+	 * spare_capacity > task_util(p)/2.
+	 */
+	if (this_spare > task_util(p) / 2 &&
+	    imbalance*this_spare > 100*most_spare)
+		return NULL;
+	else if (most_spare > task_util(p) / 2)
+		return most_spare_sg;
+
 	if (!idlest || 100*this_load < imbalance*min_load)
 		return NULL;
 	return idlest;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 3/5] sched: Add per-cpu min capacity to sched_group_capacity
  2016-08-31 10:52 [PATCH v4 0/5] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
  2016-08-31 10:52 ` [PATCH v4 1/5] sched/fair: Compute task/cpu utilization at wake-up correctly Morten Rasmussen
  2016-08-31 10:52 ` [PATCH v4 2/5] sched/fair: Consider spare capacity in find_idlest_group() Morten Rasmussen
@ 2016-08-31 10:52 ` Morten Rasmussen
  2016-08-31 10:52 ` [PATCH v4 4/5] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups Morten Rasmussen
                   ` (2 subsequent siblings)
  5 siblings, 0 replies; 9+ messages in thread
From: Morten Rasmussen @ 2016-08-31 10:52 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

struct sched_group_capacity currently represents the compute capacity
sum of all cpus in the sched_group. Unless it is divided by the
group_weight to get the average capacity per cpu, it hides differences in
cpu capacity for mixed-capacity systems (e.g. high RT/IRQ utilization or
ARM big.LITTLE), and even the average may not be sufficient if the group
covers cpus of different capacities. Instead, by extending struct
sched_group_capacity to indicate the minimum per-cpu capacity in the
group, a suitable group for a given task utilization can more easily be
found, so that cpus with reduced capacity can be avoided for tasks with
high utilization (not implemented by this patch).

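For illustration only (not kernel code), the sketch below contrasts the
sum, average and minimum capacity of a hypothetical mixed group: the
average suggests a fairly big task would fit anywhere in the group, while
the minimum reveals that the weakest cpus cannot take it:

#include <stdio.h>

/* Standalone sketch, not kernel code; per-cpu capacities are hypothetical. */
int main(void)
{
	/* one sched_group covering 2 big and 2 little cpus */
	unsigned long cap[] = { 1024, 1024, 430, 430 };
	unsigned long sum = 0, min = -1UL;
	int n = sizeof(cap) / sizeof(cap[0]);

	for (int i = 0; i < n; i++) {
		sum += cap[i];
		if (cap[i] < min)
			min = cap[i];
	}

	/*
	 * sum = 2908, avg = 727, min = 430: a task with utilization ~600
	 * looks fine going by the average, but does not fit on the little
	 * cpus, which only the minimum reveals.
	 */
	printf("sum=%lu avg=%lu min=%lu\n", sum, sum / n, min);
	return 0;
}
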
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c  |  3 ++-
 kernel/sched/fair.c  | 17 ++++++++++++-----
 kernel/sched/sched.h |  3 ++-
 3 files changed, 16 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 69243142cad1..4df35c448f41 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5656,7 +5656,7 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 		printk(KERN_CONT " %*pbl",
 		       cpumask_pr_args(sched_group_cpus(group)));
 		if (group->sgc->capacity != SCHED_CAPACITY_SCALE) {
-			printk(KERN_CONT " (cpu_capacity = %d)",
+			printk(KERN_CONT " (cpu_capacity = %lu)",
 				group->sgc->capacity);
 		}
 
@@ -6125,6 +6125,7 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		 * die on a /0 trap.
 		 */
 		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
+		sg->sgc->min_capacity = SCHED_CAPACITY_SCALE;
 
 		/*
 		 * Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index ebe8907119b5..112a6d3f3943 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6714,13 +6714,14 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
 	cpu_rq(cpu)->cpu_capacity = capacity;
 	sdg->sgc->capacity = capacity;
+	sdg->sgc->min_capacity = capacity;
 }
 
 void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity;
+	unsigned long capacity, min_capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -6733,6 +6734,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 	}
 
 	capacity = 0;
+	min_capacity = ULONG_MAX;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -6757,11 +6759,12 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 */
 			if (unlikely(!rq->sd)) {
 				capacity += capacity_of(cpu);
-				continue;
+			} else {
+				sgc = rq->sd->groups->sgc;
+				capacity += sgc->capacity;
 			}
 
-			sgc = rq->sd->groups->sgc;
-			capacity += sgc->capacity;
+			min_capacity = min(capacity, min_capacity);
 		}
 	} else  {
 		/*
@@ -6771,12 +6774,16 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 		group = child->groups;
 		do {
-			capacity += group->sgc->capacity;
+			struct sched_group_capacity *sgc = group->sgc;
+
+			capacity += sgc->capacity;
+			min_capacity = min(sgc->min_capacity, min_capacity);
 			group = group->next;
 		} while (group != child->groups);
 	}
 
 	sdg->sgc->capacity = capacity;
+	sdg->sgc->min_capacity = min_capacity;
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 420c05d099c3..ef646f542994 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -868,7 +868,8 @@ struct sched_group_capacity {
 	 * CPU capacity of this group, SCHED_CAPACITY_SCALE being max capacity
 	 * for a single CPU.
 	 */
-	unsigned int capacity;
+	unsigned long capacity;
+	unsigned long min_capacity; /* Min per-cpu capacity in group */
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 	/*
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 4/5] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups
  2016-08-31 10:52 [PATCH v4 0/5] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (2 preceding siblings ...)
  2016-08-31 10:52 ` [PATCH v4 3/5] sched: Add per-cpu min capacity to sched_group_capacity Morten Rasmussen
@ 2016-08-31 10:52 ` Morten Rasmussen
  2016-08-31 10:52 ` [PATCH v4 5/5] sched/fair: Track peak per-entity utilization Morten Rasmussen
  2016-09-01  8:33 ` [PATCH v4 0/5] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
  5 siblings, 0 replies; 9+ messages in thread
From: Morten Rasmussen @ 2016-08-31 10:52 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

For asymmetric cpu capacity systems it is counter-productive for
throughput if low capacity cpus are pulling tasks from non-overloaded
cpus with higher capacity. The assumption is that higher cpu capacity is
preferred over running alone in a group with lower cpu capacity.

This patch rejects higher cpu capacity groups with one or fewer tasks per
cpu as the potential busiest group, which could otherwise lead to a series
of failing load-balancing attempts leading to a force-migration.

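A tiny standalone sketch (not part of the patch) of the
group_smaller_cpu_capacity() check added below, with hypothetical values
and assuming capacity_margin = 1280 (~20%) as used elsewhere in this
series: a group only counts as having smaller per-cpu capacity if its
minimum is more than about 20% below the reference group's:

#include <stdio.h>
#include <stdbool.h>

/*
 * Standalone sketch, not kernel code; assumes capacity_margin = 1280
 * (~20%) and hypothetical min capacities.
 */
static const unsigned long capacity_margin = 1280;

/* mirrors the group_smaller_cpu_capacity() check added below */
static bool smaller_cpu_capacity(unsigned long sg_min, unsigned long ref_min)
{
	return sg_min * capacity_margin < ref_min * 1024;
}

int main(void)
{
	printf(" 430 vs 1024: %d\n", smaller_cpu_capacity(430, 1024));	/* 1: >20% below  */
	printf(" 900 vs 1024: %d\n", smaller_cpu_capacity(900, 1024));	/* 0: within ~20% */
	printf("1024 vs 1024: %d\n", smaller_cpu_capacity(1024, 1024));	/* 0: equal       */
	return 0;
}
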
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 25 +++++++++++++++++++++++++
 1 file changed, 25 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 112a6d3f3943..68d8b40c546b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6878,6 +6878,17 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 	return false;
 }
 
+/*
+ * group_smaller_cpu_capacity: Returns true if sched_group sg has smaller
+ * per-cpu capacity than sched_group ref.
+ */
+static inline bool
+group_smaller_cpu_capacity(struct sched_group *sg, struct sched_group *ref)
+{
+	return sg->sgc->min_capacity * capacity_margin <
+						ref->sgc->min_capacity * 1024;
+}
+
 static inline enum
 group_type group_classify(struct sched_group *group,
 			  struct sg_lb_stats *sgs)
@@ -6981,6 +6992,20 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->avg_load <= busiest->avg_load)
 		return false;
 
+	if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
+		goto asym_packing;
+
+	/*
+	 * Candidate sg has no more than one task per cpu and
+	 * has higher per-cpu capacity. Migrating tasks to less
+	 * capable cpus may harm throughput. Maximize throughput,
+	 * power/energy consequences are not considered.
+	 */
+	if (sgs->sum_nr_running <= sgs->group_weight &&
+	    group_smaller_cpu_capacity(sds->local, sg))
+		return false;
+
+asym_packing:
 	/* This is the busiest node in its class. */
 	if (!(env->sd->flags & SD_ASYM_PACKING))
 		return true;
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* [PATCH v4 5/5] sched/fair: Track peak per-entity utilization
  2016-08-31 10:52 [PATCH v4 0/5] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (3 preceding siblings ...)
  2016-08-31 10:52 ` [PATCH v4 4/5] sched/fair: Avoid pulling tasks from non-overloaded higher capacity groups Morten Rasmussen
@ 2016-08-31 10:52 ` Morten Rasmussen
  2016-09-01 11:51   ` Patrick Bellasi
  2016-09-01  8:33 ` [PATCH v4 0/5] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
  5 siblings, 1 reply; 9+ messages in thread
From: Morten Rasmussen @ 2016-08-31 10:52 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel,
	Morten Rasmussen

When using PELT (per-entity load tracking) utilization to place tasks at
wake-up, the decayed utilization (due to sleep) leads to under-estimation
of the task's true utilization. This could mean putting the task on a cpu
with less available capacity than is actually needed. This issue can be
mitigated by using 'peak' utilization instead of the decayed utilization
for placement decisions, e.g. at task wake-up.

The 'peak' utilization metric, util_peak, tracks util_avg while the task
is running and retains its previous value while the task is
blocked/waiting on the rq. It is instantly updated to track util_avg
again as soon as the task is running again.

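To show the effect of the new update rule outside the kernel, here is a
toy model (not part of the patch, hypothetical numbers, crude stand-in
for PELT decay) of the check added to __update_load_avg() below:
util_peak follows util_avg while the task runs, holds its last value
while util_avg decays during sleep, and tracks util_avg again once the
task runs:

#include <stdio.h>
#include <stdbool.h>

struct demo_avg {
	double util_avg;
	double util_peak;
};

/* the update rule this patch adds, applied to a toy signal */
static void update_peak(struct demo_avg *sa, bool running)
{
	if (running || sa->util_avg > sa->util_peak)
		sa->util_peak = sa->util_avg;
}

int main(void)
{
	struct demo_avg sa = { 0.0, 0.0 };

	/*
	 * Crude stand-in for PELT: ramp towards 1024 while running,
	 * halve per 32ms while sleeping (y^32 ~= 0.5).
	 */
	for (int ms = 0; ms < 192; ms += 32) {
		bool running = ms < 96;		/* run for 96ms, then sleep */

		if (running)
			sa.util_avg += (1024.0 - sa.util_avg) / 2.0;
		else
			sa.util_avg *= 0.5;

		update_peak(&sa, running);
		printf("t=%3dms running=%d util_avg=%6.1f util_peak=%6.1f\n",
		       ms, running, sa.util_avg, sa.util_peak);
	}
	/* while sleeping, util_avg decays but util_peak holds its last value */
	return 0;
}
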
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h |  2 +-
 kernel/sched/fair.c   | 23 ++++++++++++++---------
 2 files changed, 15 insertions(+), 10 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index d75024053e9b..fff4e4b6e654 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1282,7 +1282,7 @@ struct load_weight {
 struct sched_avg {
 	u64 last_update_time, load_sum;
 	u32 util_sum, period_contrib;
-	unsigned long load_avg, util_avg;
+	unsigned long load_avg, util_avg, util_peak;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 68d8b40c546b..27534e36555b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -692,6 +692,7 @@ void init_entity_runnable_average(struct sched_entity *se)
 	 * At this point, util_avg won't be used in select_task_rq_fair anyway
 	 */
 	sa->util_avg = 0;
+	sa->util_peak = 0;
 	sa->util_sum = 0;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
@@ -743,6 +744,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
 		} else {
 			sa->util_avg = cap;
 		}
+		sa->util_peak = sa->util_avg;
 		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
 	}
 
@@ -2804,6 +2806,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
 	}
 
+	if (running || sa->util_avg > sa->util_peak)
+		sa->util_peak = sa->util_avg;
+
 	return decayed;
 }
 
@@ -5184,7 +5189,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
 	return 1;
 }
 
-static inline int task_util(struct task_struct *p);
+static inline int task_util_peak(struct task_struct *p);
 static int cpu_util_wake(int cpu, struct task_struct *p);
 
 static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
@@ -5267,14 +5272,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	/*
 	 * The cross-over point between using spare capacity or least load
 	 * is too conservative for high utilization tasks on partially
-	 * utilized systems if we require spare_capacity > task_util(p),
+	 * utilized systems if we require spare_capacity > task_util_peak(p),
 	 * so we allow for some task stuffing by using
-	 * spare_capacity > task_util(p)/2.
+	 * spare_capacity > task_util_peak(p)/2.
 	 */
-	if (this_spare > task_util(p) / 2 &&
+	if (this_spare > task_util_peak(p) / 2 &&
 	    imbalance*this_spare > 100*most_spare)
 		return NULL;
-	else if (most_spare > task_util(p) / 2)
+	else if (most_spare > task_util_peak(p) / 2)
 		return most_spare_sg;
 
 	if (!idlest || 100*this_load < imbalance*min_load)
@@ -5432,9 +5437,9 @@ static int cpu_util(int cpu)
 	return (util >= capacity) ? capacity : util;
 }
 
-static inline int task_util(struct task_struct *p)
+static inline int task_util_peak(struct task_struct *p)
 {
-	return p->se.avg.util_avg;
+	return p->se.avg.util_peak;
 }
 
 /*
@@ -5450,7 +5455,7 @@ static int cpu_util_wake(int cpu, struct task_struct *p)
 		return cpu_util(cpu);
 
 	capacity = capacity_orig_of(cpu);
-	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
+	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util_peak(p), 0);
 
 	return (util >= capacity) ? capacity : util;
 }
@@ -5476,7 +5481,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	/* Bring task utilization in sync with prev_cpu */
 	sync_entity_load_avg(&p->se);
 
-	return min_cap * 1024 < task_util(p) * capacity_margin;
+	return min_cap * 1024 < task_util_peak(p) * capacity_margin;
 }
 
 /*
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 0/5] sched: Clean-ups and asymmetric cpu capacity support
  2016-08-31 10:52 [PATCH v4 0/5] sched: Clean-ups and asymmetric cpu capacity support Morten Rasmussen
                   ` (4 preceding siblings ...)
  2016-08-31 10:52 ` [PATCH v4 5/5] sched/fair: Track peak per-entity utilization Morten Rasmussen
@ 2016-09-01  8:33 ` Morten Rasmussen
  5 siblings, 0 replies; 9+ messages in thread
From: Morten Rasmussen @ 2016-09-01  8:33 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, yuyang.du, vincent.guittot, mgalbraith,
	sgurrappadi, freedom.tan, keita.kobayashi.ym, linux-kernel

On Wed, Aug 31, 2016 at 11:52:14AM +0100, Morten Rasmussen wrote:
> v4:
> 
> - Removed patches already in tip/sched/core.

Just discovered an error on the comment of one of those patches :-/

----8<----

From a0040f8dfe60696d7e6429f50906d6b17671bbcc Mon Sep 17 00:00:00 2001
From: Morten Rasmussen <morten.rasmussen@arm.com>
Date: Thu, 1 Sep 2016 09:24:35 +0100
Subject: [PATCH] sched/fair: Fix wrong comment for capacity_margin

The comment for capacity_margin introduced in "sched/fair: Let
asymmetric cpu configurations balance at wake-up" got its usage the
wrong way round.

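For reference, a minimal standalone sketch (not part of the patch,
hypothetical numbers) of what the corrected comment describes: with
capacity_margin = 1280, a utilization only 'fits' a capacity if it stays
below roughly 80% of it:

#include <stdio.h>
#include <stdbool.h>

/* Standalone sketch, not kernel code; capacities/utilizations are made up. */
static const unsigned long capacity_margin = 1280;	/* ~20% */

/* the corrected usage: util * margin < capacity * 1024 */
static bool fits_capacity(unsigned long util, unsigned long capacity)
{
	return util * capacity_margin < capacity * 1024;
}

int main(void)
{
	printf("800 in 1024: %d\n", fits_capacity(800, 1024));	/* 1 (~78%) */
	printf("900 in 1024: %d\n", fits_capacity(900, 1024));	/* 0 (~88%) */
	printf("300 in  430: %d\n", fits_capacity(300,  430));	/* 1 (~70%) */
	return 0;
}
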
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 27534e36555b..161fd8eba234 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -116,7 +116,7 @@ unsigned int sysctl_sched_cfs_bandwidth_slice = 5000UL;
 
 /*
  * The margin used when comparing utilization with CPU capacity:
- * util * 1024 < capacity * margin
+ * util * margin < capacity * 1024
  */
 unsigned int capacity_margin = 1280; /* ~20% */
 
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 5/5] sched/fair: Track peak per-entity utilization
  2016-08-31 10:52 ` [PATCH v4 5/5] sched/fair: Track peak per-entity utilization Morten Rasmussen
@ 2016-09-01 11:51   ` Patrick Bellasi
  2016-09-01 12:43     ` Morten Rasmussen
  0 siblings, 1 reply; 9+ messages in thread
From: Patrick Bellasi @ 2016-09-01 11:51 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	mgalbraith, sgurrappadi, freedom.tan, keita.kobayashi.ym,
	linux-kernel

On 31-Aug 11:52, Morten Rasmussen wrote:
> When using PELT (per-entity load tracking) utilization to place tasks at
> wake-up, the decayed utilization (due to sleep) leads to under-estimation
> of the task's true utilization. This could mean putting the task on a cpu
> with less available capacity than is actually needed. This issue can be
> mitigated by using 'peak' utilization instead of the decayed utilization
> for placement decisions, e.g. at task wake-up.
> 
> The 'peak' utilization metric, util_peak, tracks util_avg while the task
> is running and retains its previous value while the task is
> blocked/waiting on the rq. It is instantly updated to track util_avg
> again as soon as the task is running again.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  include/linux/sched.h |  2 +-
>  kernel/sched/fair.c   | 23 ++++++++++++++---------
>  2 files changed, 15 insertions(+), 10 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index d75024053e9b..fff4e4b6e654 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1282,7 +1282,7 @@ struct load_weight {
>  struct sched_avg {
>  	u64 last_update_time, load_sum;
>  	u32 util_sum, period_contrib;
> -	unsigned long load_avg, util_avg;
> +	unsigned long load_avg, util_avg, util_peak;

By adding util_peak here (in sched_avg) we implicitly define a new
signal for CFS RQs as well, but in the rest of this patch it seems we
use it only for tasks?

Overall this seems to be a filtered signal on top of PELT but just for
tasks. Perhaps something similar can be useful for CPUs utilization as
well...

>  };
>  
>  #ifdef CONFIG_SCHEDSTATS
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 68d8b40c546b..27534e36555b 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -692,6 +692,7 @@ void init_entity_runnable_average(struct sched_entity *se)
>  	 * At this point, util_avg won't be used in select_task_rq_fair anyway
>  	 */
>  	sa->util_avg = 0;
> +	sa->util_peak = 0;

For consistency with other sched_avg's signals, perhaps we should report
the value of util_peak from:

   kernel/sched/debug.c::{print_cfs_group_stats,proc_sched_show_task}

>  	sa->util_sum = 0;
>  	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
>  }
> @@ -743,6 +744,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
>  		} else {
>  			sa->util_avg = cap;
>  		}
> +		sa->util_peak = sa->util_avg;
>  		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>  	}
>  
> @@ -2804,6 +2806,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
>  	}
>  
> +	if (running || sa->util_avg > sa->util_peak)
> +		sa->util_peak = sa->util_avg;

Do we really need to update this new signal so often?

It seems that we use it only at wakeup time, is it not enough
to cache the util_avg value in dequeue_task_fair() in case of a
DEQUEUE_SLEEP?

> +
>  	return decayed;
>  }
>  
> @@ -5184,7 +5189,7 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p,
>  	return 1;
>  }
>  
> -static inline int task_util(struct task_struct *p);
> +static inline int task_util_peak(struct task_struct *p);
>  static int cpu_util_wake(int cpu, struct task_struct *p);
>  
>  static unsigned long capacity_spare_wake(int cpu, struct task_struct *p)
> @@ -5267,14 +5272,14 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  	/*
>  	 * The cross-over point between using spare capacity or least load
>  	 * is too conservative for high utilization tasks on partially
> -	 * utilized systems if we require spare_capacity > task_util(p),
> +	 * utilized systems if we require spare_capacity > task_util_peak(p),
>  	 * so we allow for some task stuffing by using
> -	 * spare_capacity > task_util(p)/2.
> +	 * spare_capacity > task_util_peak(p)/2.
>  	 */
> -	if (this_spare > task_util(p) / 2 &&
> +	if (this_spare > task_util_peak(p) / 2 &&
>  	    imbalance*this_spare > 100*most_spare)
>  		return NULL;
> -	else if (most_spare > task_util(p) / 2)
> +	else if (most_spare > task_util_peak(p) / 2)
>  		return most_spare_sg;
>  
>  	if (!idlest || 100*this_load < imbalance*min_load)
> @@ -5432,9 +5437,9 @@ static int cpu_util(int cpu)
>  	return (util >= capacity) ? capacity : util;
>  }
>  
> -static inline int task_util(struct task_struct *p)
> +static inline int task_util_peak(struct task_struct *p)
>  {
> -	return p->se.avg.util_avg;
> +	return p->se.avg.util_peak;
>  }
>  
>  /*
> @@ -5450,7 +5455,7 @@ static int cpu_util_wake(int cpu, struct task_struct *p)
>  		return cpu_util(cpu);
>  
>  	capacity = capacity_orig_of(cpu);
> -	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util(p), 0);
> +	util = max_t(long, cpu_rq(cpu)->cfs.avg.util_avg - task_util_peak(p), 0);
>  
>  	return (util >= capacity) ? capacity : util;
>  }
> @@ -5476,7 +5481,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>  	/* Bring task utilization in sync with prev_cpu */
>  	sync_entity_load_avg(&p->se);
>  
> -	return min_cap * 1024 < task_util(p) * capacity_margin;
> +	return min_cap * 1024 < task_util_peak(p) * capacity_margin;
>  }
>  
>  /*
> -- 
> 1.9.1
> 

-- 
#include <best/regards.h>

Patrick Bellasi

^ permalink raw reply	[flat|nested] 9+ messages in thread

* Re: [PATCH v4 5/5] sched/fair: Track peak per-entity utilization
  2016-09-01 11:51   ` Patrick Bellasi
@ 2016-09-01 12:43     ` Morten Rasmussen
  0 siblings, 0 replies; 9+ messages in thread
From: Morten Rasmussen @ 2016-09-01 12:43 UTC (permalink / raw)
  To: Patrick Bellasi
  Cc: peterz, mingo, dietmar.eggemann, yuyang.du, vincent.guittot,
	mgalbraith, sgurrappadi, freedom.tan, keita.kobayashi.ym,
	linux-kernel

On Thu, Sep 01, 2016 at 12:51:36PM +0100, Patrick Bellasi wrote:
> On 31-Aug 11:52, Morten Rasmussen wrote:
> > When using PELT (per-entity load tracking) utilization to place tasks at
> > wake-up, the decayed utilization (due to sleep) leads to under-estimation
> > of the task's true utilization. This could mean putting the task on a cpu
> > with less available capacity than is actually needed. This issue can be
> > mitigated by using 'peak' utilization instead of the decayed utilization
> > for placement decisions, e.g. at task wake-up.
> > 
> > The 'peak' utilization metric, util_peak, tracks util_avg while the task
> > is running and retains its previous value while the task is
> > blocked/waiting on the rq. It is instantly updated to track util_avg
> > again as soon as the task is running again.
> > 
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> > 
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  include/linux/sched.h |  2 +-
> >  kernel/sched/fair.c   | 23 ++++++++++++++---------
> >  2 files changed, 15 insertions(+), 10 deletions(-)
> > 
> > diff --git a/include/linux/sched.h b/include/linux/sched.h
> > index d75024053e9b..fff4e4b6e654 100644
> > --- a/include/linux/sched.h
> > +++ b/include/linux/sched.h
> > @@ -1282,7 +1282,7 @@ struct load_weight {
> >  struct sched_avg {
> >  	u64 last_update_time, load_sum;
> >  	u32 util_sum, period_contrib;
> > -	unsigned long load_avg, util_avg;
> > +	unsigned long load_avg, util_avg, util_peak;
> 
> By adding util_peak here (in sched_avg) we implicitly define a new
> signal for CFS RQs as well, but in the rest of this patch it seems we
> use it only for tasks?
> 
> Overall this seems to be a filtered signal on top of PELT but just for
> tasks. Perhaps something similar can be useful for CPUs utilization as
> well...

Yes, sched_avg is used by both sched_entity and cfs_rq. Current code is
structured to exploit this, so I didn't want to change that. At the
moment we only need util_peak to cache the non-decayed utilization for
wake-up placement as patch 1 changes the task utilization in
select_task_rq_fair() from non-decayed to being up-to-date. That means
that we would consistently under-estimate the task utilization in the
remaining patches of this v4 patch set if we use the up-to-date task
utilization.

It is currently not used anywhere else. It might be later on. I'm not
sure how it should be defined for cfs_rqs to make any sense.

> 
> >  };
> >  
> >  #ifdef CONFIG_SCHEDSTATS
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 68d8b40c546b..27534e36555b 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -692,6 +692,7 @@ void init_entity_runnable_average(struct sched_entity *se)
> >  	 * At this point, util_avg won't be used in select_task_rq_fair anyway
> >  	 */
> >  	sa->util_avg = 0;
> > +	sa->util_peak = 0;
> 
> For consistency with other sched_avg's signals, perhaps we should report
> the value of util_peak from:
> 
>    kernel/sched/debug.c::{print_cfs_group_stats,proc_sched_show_task}

I forgot to add the debug bits. I'm not sure how useful it is though. It
is just an outdated snapshot, but if the others are there, this one
should be there too.

> 
> >  	sa->util_sum = 0;
> >  	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
> >  }
> > @@ -743,6 +744,7 @@ void post_init_entity_util_avg(struct sched_entity *se)
> >  		} else {
> >  			sa->util_avg = cap;
> >  		}
> > +		sa->util_peak = sa->util_avg;
> >  		sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
> >  	}
> >  
> > @@ -2804,6 +2806,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  		sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
> >  	}
> >  
> > +	if (running || sa->util_avg > sa->util_peak)
> > +		sa->util_peak = sa->util_avg;
> 
> Do we really need to update this new signal so often?
> 
> It seems that we use it only at wakeup time, is it not enough
> to cache the util_avg value in dequeue_task_fair() in case of a
> DEQUEUE_SLEEP?

I think we could hook into the dequeue path and do it there instead. It
does cause an additional comparison and potentially one additional
variable write, but I thought it was a very simple and easy-to-understand
implementation.

This implementation has the side-effect that task_util_peak() does not
maintain its previous peak value until the task sleeps again. But
currently we don't use it in between, so why not.

I will give it a try.

Morten

^ permalink raw reply	[flat|nested] 9+ messages in thread

