* [PATCH v6 0/6] sched: consolidation of cpu_capacity
@ 2014-09-23 16:07 ` Vincent Guittot
  0 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-09-23 16:07 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, pjt, bsegall, Vincent Guittot

During load balance, the scheduler evaluates the number of tasks that a group
of CPUs can handle. The current method assumes that tasks have a fixed load of
SCHED_LOAD_SCALE and that CPUs have a default capacity of SCHED_CAPACITY_SCALE.
This assumption generates wrong decisions, by creating ghost cores or by
removing real ones, when the original capacity of the CPUs differs from the
default SCHED_CAPACITY_SCALE. We no longer try to evaluate the number of
available cores based on the group_capacity; instead, we evaluate the usage
of a group and compare it with its capacity.
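
To give a concrete idea (a simplified sketch only, condensed from the helpers
introduced in patch 5; the real code also compares sum_nr_running with the
group weight), the overload decision roughly becomes:

static inline bool group_overloaded_sketch(unsigned long group_usage,
					   unsigned long group_capacity,
					   unsigned int imbalance_pct)
{
	/* overloaded once the usage, inflated by imbalance_pct, exceeds
	 * the capacity available for CFS tasks */
	return (group_capacity * 100) < (group_usage * imbalance_pct);
}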

This patchset mainly replaces the old capacity method with a new one and keeps
the policy almost unchanged, although we could certainly take advantage of this
new statistic in several other places of the load balancer.

The utilization_avg_contrib is based on the current implementation of load
average tracking. I also have a version of utilization_avg_contrib that is
based on the new implementation proposal [1], but I haven't provided the
patches and results because [1] is still under review. I can provide changes
on top of [1] that adjust how utilization_avg_contrib is computed and adapt it
to the new mechanism.

Changes since V5:
 - remove patches that have been merged since v5: patches 01, 02, 03, 04, 05, 07
 - update commit logs and add more details on the purpose of the patches
 - fix/remove useless code after the rebase on patchset [2]
 - remove capacity_orig from sched_group_capacity as it is not used
 - move code into the right patch
 - add some helper functions to factorize code

Changes since V4:
 - rebase to manage conflicts with changes in selection of busiest group [4]

Changes since V3:
 - add usage_avg_contrib statistic which sums the running time of tasks on a rq
 - use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
 - fix the replacement of power with capacity
 - update some comments

Changes since V2:
 - rebase on top of the capacity renaming
 - fix wake_affine statistic update
 - rework nohz_kick_needed
 - optimize the active migration of a task from a CPU with reduced capacity
 - rename group_activity to group_utilization and remove the unused total_utilization
 - repair SD_PREFER_SIBLING and use it for the SMT level
 - reorder the patchset to gather patches with the same topics

Changes since V1:
 - add 3 fixes
 - correct some commit messages
 - replace capacity computation with activity
 - take into account the current cpu capacity

[1] https://lkml.org/lkml/2014/7/18/110
[2] https://lkml.org/lkml/2014/7/25/589

Vincent Guittot (6):
  sched: add per rq cpu_capacity_orig
  sched: move cfs task on a CPU with higher capacity
  sched: add utilization_avg_contrib
  sched: get CPU's usage statistic
  sched: replace capacity_factor by usage
  sched: add SD_PREFER_SIBLING for SMT level

 include/linux/sched.h |  19 +++-
 kernel/sched/core.c   |  15 +--
 kernel/sched/debug.c  |   9 +-
 kernel/sched/fair.c   | 276 ++++++++++++++++++++++++++++++--------------------
 kernel/sched/sched.h  |  11 +-
 5 files changed, 199 insertions(+), 131 deletions(-)

-- 
1.9.1


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v6 1/6] sched: add per rq cpu_capacity_orig
  2014-09-23 16:07 ` Vincent Guittot
@ 2014-09-23 16:08   ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-09-23 16:08 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, pjt, bsegall, Vincent Guittot

The new field cpu_capacity_orig reflects the original capacity of a CPU
before it is altered by frequency scaling, rt tasks and/or IRQs.

cpu_capacity_orig will be used in several places to detect when the capacity
of a CPU has been noticeably reduced so that we can trigger a load balance to
look for a CPU with better capacity. As an example, we can detect when a CPU
handles a significant amount of IRQ time (with CONFIG_IRQ_TIME_ACCOUNTING) but
is still seen as an idle CPU by the scheduler, whereas CPUs that are really
idle are available.

In addition, this new cpu_capacity_orig will be used to evaluate the usage of
the CPU by CFS tasks.
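
As an illustration only (a sketch with a made-up threshold_pct parameter; the
next patch uses sd->imbalance_pct as the real threshold), the new field makes
a check like the following possible:

static inline bool cpu_capacity_reduced_sketch(int cpu,
					       unsigned int threshold_pct)
{
	/* e.g. threshold_pct = 80: less than 80% of the original
	 * capacity is left for CFS tasks */
	return (capacity_of(cpu) * 100) <
			(capacity_orig_of(cpu) * threshold_pct);
}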

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 8 +++++++-
 kernel/sched/sched.h | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 2a93b87..e20f203 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7056,7 +7056,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
-		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+		rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2a1e6ac..622f8b0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4092,6 +4092,11 @@ static unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }
 
+static unsigned long capacity_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5765,6 +5770,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
+	cpu_rq(cpu)->cpu_capacity_orig = capacity;
 	sdg->sgc->capacity_orig = capacity;
 
 	if (sched_feat(ARCH_CAPACITY))
@@ -5826,7 +5832,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 * Runtime updates will correct capacity_orig.
 			 */
 			if (unlikely(!rq->sd)) {
-				capacity_orig += capacity_of(cpu);
+				capacity_orig += capacity_orig_of(cpu);
 				capacity += capacity_of(cpu);
 				continue;
 			}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1bc6aad..f332e45 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -575,6 +575,7 @@ struct rq {
 	struct sched_domain *sd;
 
 	unsigned long cpu_capacity;
+	unsigned long cpu_capacity_orig;
 
 	unsigned char idle_balance;
 	/* For active balancing */
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v6 2/6] sched: move cfs task on a CPU with higher capacity
  2014-09-23 16:07 ` Vincent Guittot
@ 2014-09-23 16:08   ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-09-23 16:08 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, pjt, bsegall, Vincent Guittot

If a CPU is used for handling a lot of IRQs, trigger a load balance to check
whether it's worth moving its tasks to another CPU that has more capacity.

As a side note, this will not generate more spurious ilb kicks, because we
already trigger an ilb if there is more than 1 busy cpu. If this cpu is the
only one that has a task, we will trigger the ilb once to migrate the task.

The nohz_kick_needed function has been cleaned up a bit while adding the new
test.

env.src_cpu and env.src_rq must be set unconditionally because they are used
in need_active_balance, which is called even if busiest->nr_running equals 1.
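
As a worked example (with an illustrative imbalance_pct of 125; the actual
value depends on the sched_domain level), a CPU with cpu_capacity_orig = 1024
is flagged by check_cpu_capacity() as soon as cpu_capacity * 125 < 1024 * 100,
i.e. as soon as rt tasks and IRQs consume more than roughly 20% of the CPU
(cpu_capacity below ~819).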

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 70 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 50 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 622f8b0..7422044 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5885,6 +5885,18 @@ fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
 }
 
 /*
+ * Check whether the capacity of the rq has been noticeably reduced by side
+ * activity. The imbalance_pct is used for the threshold.
+ * Return true is the capacity is reduced
+ */
+static inline int
+check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
+{
+	return ((rq->cpu_capacity * sd->imbalance_pct) <
+				(rq->cpu_capacity_orig * 100));
+}
+
+/*
  * Group imbalance indicates (and tries to solve) the problem where balancing
  * groups is inadequate due to tsk_cpus_allowed() constraints.
  *
@@ -6555,6 +6567,14 @@ static int need_active_balance(struct lb_env *env)
 		 */
 		if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
 			return 1;
+
+		/*
+		 * The src_cpu's capacity is reduced because of other
+		 * sched_class or IRQs, we trig an active balance to move the
+		 * task
+		 */
+		if (check_cpu_capacity(env->src_rq, sd))
+			return 1;
 	}
 
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
@@ -6656,6 +6676,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
 	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
 
+	env.src_cpu = busiest->cpu;
+	env.src_rq = busiest;
+
 	ld_moved = 0;
 	if (busiest->nr_running > 1) {
 		/*
@@ -6665,8 +6688,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		 * correctly treated as an imbalance.
 		 */
 		env.flags |= LBF_ALL_PINNED;
-		env.src_cpu   = busiest->cpu;
-		env.src_rq    = busiest;
 		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
 more_balance:
@@ -7372,10 +7393,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 
 /*
  * Current heuristic for kicking the idle load balancer in the presence
- * of an idle cpu is the system.
+ * of an idle cpu in the system.
  *   - This rq has more than one task.
- *   - At any scheduler domain level, this cpu's scheduler group has multiple
- *     busy cpu's exceeding the group's capacity.
+ *   - This rq has at least one CFS task and the capacity of the CPU is
+ *     significantly reduced because of RT tasks or IRQs.
+ *   - At parent of LLC scheduler domain level, this cpu's scheduler group has
+ *     multiple busy cpu.
  *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
  *     domain span are idle.
  */
@@ -7385,9 +7408,10 @@ static inline int nohz_kick_needed(struct rq *rq)
 	struct sched_domain *sd;
 	struct sched_group_capacity *sgc;
 	int nr_busy, cpu = rq->cpu;
+	bool kick = false;
 
 	if (unlikely(rq->idle_balance))
-		return 0;
+		return false;
 
        /*
 	* We may be recently in ticked or tickless idle mode. At the first
@@ -7401,38 +7425,44 @@ static inline int nohz_kick_needed(struct rq *rq)
 	 * balancing.
 	 */
 	if (likely(!atomic_read(&nohz.nr_cpus)))
-		return 0;
+		return false;
 
 	if (time_before(now, nohz.next_balance))
-		return 0;
+		return false;
 
 	if (rq->nr_running >= 2)
-		goto need_kick;
+		return true;
 
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_busy, cpu));
-
 	if (sd) {
 		sgc = sd->groups->sgc;
 		nr_busy = atomic_read(&sgc->nr_busy_cpus);
 
-		if (nr_busy > 1)
-			goto need_kick_unlock;
+		if (nr_busy > 1) {
+			kick = true;
+			goto unlock;
+		}
+
 	}
 
-	sd = rcu_dereference(per_cpu(sd_asym, cpu));
+	sd = rcu_dereference(rq->sd);
+	if (sd) {
+		if ((rq->cfs.h_nr_running >= 1) &&
+				check_cpu_capacity(rq, sd)) {
+			kick = true;
+			goto unlock;
+		}
+	}
 
+	sd = rcu_dereference(per_cpu(sd_asym, cpu));
 	if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
 				  sched_domain_span(sd)) < cpu))
-		goto need_kick_unlock;
-
-	rcu_read_unlock();
-	return 0;
+		kick = true;
 
-need_kick_unlock:
+unlock:
 	rcu_read_unlock();
-need_kick:
-	return 1;
+	return kick;
 }
 #else
 static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v6 3/6] sched: add utilization_avg_contrib
  2014-09-23 16:07 ` Vincent Guittot
@ 2014-09-23 16:08   ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-09-23 16:08 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, pjt, bsegall, Vincent Guittot

Add new statistics which reflect the average time a task is running on the CPU
and the sum of these running times for the tasks on a runqueue. The latter is
named utilization_avg_contrib.

This patch is based on the usage metric that was proposed in the first
versions of the per-entity load tracking patchset by Paul Turner
<pjt@google.com> but that was removed afterwards. This version differs from
the original one in that it's not linked to task_group.

The rq's utilization_avg_contrib will be used to check whether a rq is
overloaded or not, instead of trying to compute how many tasks a group of CPUs
can handle.

Rename runnable_avg_period to avg_period as it is now used with both
runnable_avg_sum and running_avg_sum.

Add some descriptions of the variables to explain their differences.
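
As a rough worked example: the per-task utilization is computed as
running_avg_sum * SCHED_LOAD_SCALE / (avg_period + 1), so a task that runs
about half of the time converges towards a utilization_avg_contrib of about
SCHED_LOAD_SCALE / 2 (~512) regardless of its weight, whereas its
load_avg_contrib still scales with the task's weight.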

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h | 19 +++++++++++---
 kernel/sched/debug.c  |  9 ++++---
 kernel/sched/fair.c   | 68 +++++++++++++++++++++++++++++++++++++++------------
 kernel/sched/sched.h  |  8 +++++-
 4 files changed, 81 insertions(+), 23 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 48ae6c4..51df220 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1071,15 +1071,26 @@ struct load_weight {
 };
 
 struct sched_avg {
+	u64 last_runnable_update;
+	s64 decay_count;
+	/*
+	 * utilization_avg_contrib describes the amount of time that a
+	 * sched_entity is running on a CPU. It is based on running_avg_sum
+	 * and is scaled in the range [0..SCHED_LOAD_SCALE].
+	 * load_avg_contrib described the the amount of time that a
+	 * sched_entity is runnable on a rq. It is based on both
+	 * runnable_avg_sum and the weight of the task.
+	 */
+	unsigned long load_avg_contrib, utilization_avg_contrib;
 	/*
 	 * These sums represent an infinite geometric series and so are bound
 	 * above by 1024/(1-y).  Thus we only need a u32 to store them for all
 	 * choices of y < 1-2^(-32)*1024.
+	 * runnable_avg_sum represents the amount of time a sched_entity is on
+	 * the runqueue whereas running_avg_sum reflects the time the
+	 * sched_entity is effectively running on the runqueue.
 	 */
-	u32 runnable_avg_sum, runnable_avg_period;
-	u64 last_runnable_update;
-	s64 decay_count;
-	unsigned long load_avg_contrib;
+	u32 runnable_avg_sum, avg_period, running_avg_sum;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c7fe1ea0..1761db9 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -71,7 +71,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	if (!se) {
 		struct sched_avg *avg = &cpu_rq(cpu)->avg;
 		P(avg->runnable_avg_sum);
-		P(avg->runnable_avg_period);
+		P(avg->avg_period);
 		return;
 	}
 
@@ -94,7 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->load.weight);
 #ifdef CONFIG_SMP
 	P(se->avg.runnable_avg_sum);
-	P(se->avg.runnable_avg_period);
+	P(se->avg.avg_period);
 	P(se->avg.load_avg_contrib);
 	P(se->avg.decay_count);
 #endif
@@ -215,6 +215,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "blocked_load_avg",
 			cfs_rq->blocked_load_avg);
+	SEQ_printf(m, "  .%-30s: %ld\n", "utilization_load_avg",
+			cfs_rq->utilization_load_avg);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_contrib",
 			cfs_rq->tg_load_contrib);
@@ -631,7 +633,8 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.load.weight);
 #ifdef CONFIG_SMP
 	P(se.avg.runnable_avg_sum);
-	P(se.avg.runnable_avg_period);
+	P(se.avg.running_avg_sum);
+	P(se.avg.avg_period);
 	P(se.avg.load_avg_contrib);
 	P(se.avg.decay_count);
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7422044..2cf153d 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -677,8 +677,8 @@ void init_task_runnable_average(struct task_struct *p)
 
 	p->se.avg.decay_count = 0;
 	slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
-	p->se.avg.runnable_avg_sum = slice;
-	p->se.avg.runnable_avg_period = slice;
+	p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
+	p->se.avg.avg_period = slice;
 	__update_task_entity_contrib(&p->se);
 }
 #else
@@ -1547,7 +1547,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
 		*period = now - p->last_task_numa_placement;
 	} else {
 		delta = p->se.avg.runnable_avg_sum;
-		*period = p->se.avg.runnable_avg_period;
+		*period = p->se.avg.avg_period;
 	}
 
 	p->last_sum_exec_runtime = runtime;
@@ -2297,7 +2297,8 @@ static u32 __compute_runnable_contrib(u64 n)
  */
 static __always_inline int __update_entity_runnable_avg(u64 now,
 							struct sched_avg *sa,
-							int runnable)
+							int runnable,
+							int running)
 {
 	u64 delta, periods;
 	u32 runnable_contrib;
@@ -2323,7 +2324,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	sa->last_runnable_update = now;
 
 	/* delta_w is the amount already accumulated against our next period */
-	delta_w = sa->runnable_avg_period % 1024;
+	delta_w = sa->avg_period % 1024;
 	if (delta + delta_w >= 1024) {
 		/* period roll-over */
 		decayed = 1;
@@ -2336,7 +2337,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		delta_w = 1024 - delta_w;
 		if (runnable)
 			sa->runnable_avg_sum += delta_w;
-		sa->runnable_avg_period += delta_w;
+		if (running)
+			sa->running_avg_sum += delta_w;
+		sa->avg_period += delta_w;
 
 		delta -= delta_w;
 
@@ -2346,20 +2349,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 
 		sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
 						  periods + 1);
-		sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
+		sa->running_avg_sum = decay_load(sa->running_avg_sum,
+						  periods + 1);
+		sa->avg_period = decay_load(sa->avg_period,
 						     periods + 1);
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		runnable_contrib = __compute_runnable_contrib(periods);
 		if (runnable)
 			sa->runnable_avg_sum += runnable_contrib;
-		sa->runnable_avg_period += runnable_contrib;
+		if (running)
+			sa->running_avg_sum += runnable_contrib;
+		sa->avg_period += runnable_contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
 	if (runnable)
 		sa->runnable_avg_sum += delta;
-	sa->runnable_avg_period += delta;
+	if (running)
+		sa->running_avg_sum += delta;
+	sa->avg_period += delta;
 
 	return decayed;
 }
@@ -2411,7 +2420,7 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 
 	/* The fraction of a cpu used by this cfs_rq */
 	contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
-			  sa->runnable_avg_period + 1);
+			  sa->avg_period + 1);
 	contrib -= cfs_rq->tg_runnable_contrib;
 
 	if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
@@ -2464,7 +2473,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
+			runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2482,7 +2492,7 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
 
 	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
 	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
-	contrib /= (se->avg.runnable_avg_period + 1);
+	contrib /= (se->avg.avg_period + 1);
 	se->avg.load_avg_contrib = scale_load(contrib);
 }
 
@@ -2501,6 +2511,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
 	return se->avg.load_avg_contrib - old_contrib;
 }
 
+
+static inline void __update_task_entity_utilization(struct sched_entity *se)
+{
+	u32 contrib;
+
+	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
+	contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
+	contrib /= (se->avg.avg_period + 1);
+	se->avg.utilization_avg_contrib = scale_load(contrib);
+}
+
+static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
+{
+	long old_contrib = se->avg.utilization_avg_contrib;
+
+	if (entity_is_task(se))
+		__update_task_entity_utilization(se);
+
+	return se->avg.utilization_avg_contrib - old_contrib;
+}
+
 static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
 						 long load_contrib)
 {
@@ -2517,7 +2548,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	long contrib_delta;
+	long contrib_delta, utilization_delta;
 	u64 now;
 
 	/*
@@ -2529,16 +2560,20 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
+	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+					cfs_rq->curr == se))
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
+	utilization_delta = __update_entity_utilization_avg_contrib(se);
 
 	if (!update_cfs_rq)
 		return;
 
-	if (se->on_rq)
+	if (se->on_rq) {
 		cfs_rq->runnable_load_avg += contrib_delta;
+		cfs_rq->utilization_load_avg += utilization_delta;
+	}
 	else
 		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
 }
@@ -2615,6 +2650,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	}
 
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+	cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
 }
@@ -2633,6 +2669,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 	update_cfs_rq_blocked_load(cfs_rq, !sleep);
 
 	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+	cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
 	if (sleep) {
 		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
@@ -2970,6 +3007,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
+		update_entity_load_avg(se, 1);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index f332e45..3ccb136 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -339,8 +339,14 @@ struct cfs_rq {
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
 	 * This allows for the description of both thread and group usage (in
 	 * the FAIR_GROUP_SCHED case).
+	 * runnable_load_avg is the sum of the load_avg_contrib of the
+	 * sched_entities on the rq.
+	 * blocked_load_avg is similar to runnable_load_avg except that its
+	 * the blocked sched_entities on the rq.
+	 * utilization_load_avg is the sum of the average running time of the
+	 * sched_entities on the rq.
 	 */
-	unsigned long runnable_load_avg, blocked_load_avg;
+	unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
 	atomic64_t decay_counter;
 	u64 last_decay;
 	atomic_long_t removed_load;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v6 4/6] sched: get CPU's usage statistic
  2014-09-23 16:07 ` Vincent Guittot
@ 2014-09-23 16:08   ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-09-23 16:08 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, pjt, bsegall, Vincent Guittot

Monitor the usage level of each group at each sched_domain level. The usage is
the amount of cpu_capacity that is currently used on a CPU or a group of CPUs.
We use utilization_load_avg to evaluate the usage level of each group.
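
For example (illustrative numbers): on a CPU with an original capacity of 640,
a cfs.utilization_load_avg of 512 (half of SCHED_LOAD_SCALE) translates into a
usage of (512 * 640) >> SCHED_LOAD_SHIFT = 320, i.e. half of that CPU's
original capacity, which can then be summed per group and compared with the
group's capacity. When the utilization reaches SCHED_LOAD_SCALE, get_cpu_usage()
returns capacity + 1 to flag the CPU as fully used.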

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2cf153d..4097e3f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4523,6 +4523,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	return target;
 }
 
+static int get_cpu_usage(int cpu)
+{
+	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+	unsigned long capacity = capacity_orig_of(cpu);
+
+	if (usage >= SCHED_LOAD_SCALE)
+		return capacity + 1;
+
+	return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5663,6 +5674,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
+	unsigned long group_usage; /* Total usage of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
@@ -6037,6 +6049,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
+		sgs->group_usage += get_cpu_usage(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		if (rq->nr_running > 1)
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-09-23 16:07 ` Vincent Guittot
@ 2014-09-23 16:08   ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-09-23 16:08 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, pjt, bsegall, Vincent Guittot

The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and that a CPU's capacity is
SCHED_CAPACITY_SCALE, but the capacity_factor hardly works for SMT systems: it
sometimes works for big cores but fails to do the right thing for little
cores.

Below are two examples to illustrate the problem that this patch solves:

1 - capacity_factor makes the assumption that the max capacity of a CPU is
SCHED_CAPACITY_SCALE and that the load of a thread is always
SCHED_LOAD_SCALE. It compares the output of these figures with the sum of
nr_running to decide if a group is overloaded or not.

But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
(640 as an example), a group of 3 CPUs will have a max capacity_factor
of 2 (div_round_closest(3x640/1024) = 2), which means that it will be
seen as overloaded even if we have only one task per CPU.

2 - Then, if the default capacity of a CPU is greater than
SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have
a capacity_factor of 4 (at max, thanks to the fix [0] for SMT systems
that prevents the appearance of ghost CPUs), but if one CPU is fully
used by an rt task (and its capacity is reduced to nearly nothing), the
capacity_factor of the group will still be 4
(div_round_closest(3*1512/1024) = 4).

So, this patch tries to solve this issue by removing capacity_factor
and replacing it with the 2 following metrics:
 - The available CPU capacity for CFS tasks, which is already used by
load_balance.
 - The usage of the CPU by the CFS tasks. For the latter, I have
re-introduced utilization_avg_contrib, which is in the range
[0..SCHED_LOAD_SCALE] whatever the capacity of the CPU is.

This implementation of utilization_avg_contrib doesn't solve the scale
invariance problem, so I have to scale the utilization with the original
capacity of the CPU in order to get the CPU usage and compare it with the
capacity. Once scale invariance has been added to utilization_avg_contrib, we
will remove the scaling of utilization_avg_contrib by cpu_capacity_orig in
get_cpu_usage. But scale invariance will come in another patchset.

Finally, sched_group->sched_group_capacity->capacity_orig has been removed
because it is no longer used during load balance.
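
Taking the first example above with the new metrics (illustrative numbers): a
group of 3 CPUs of original capacity 640, each running a single always-running
task, has a group_usage of about 3 * 640 = 1920 for a group_capacity close to
1920. Since sum_nr_running (3) does not exceed the group weight (3),
group_is_overloaded() stays false, whereas the old capacity_factor of 2 wrongly
reported the same group as overloaded.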

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  |  12 -----
 kernel/sched/fair.c  | 131 ++++++++++++++++++++-------------------------------
 kernel/sched/sched.h |   2 +-
 3 files changed, 52 insertions(+), 93 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e20f203..c7c8ac4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5342,17 +5342,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 			break;
 		}
 
-		/*
-		 * Even though we initialize ->capacity to something semi-sane,
-		 * we leave capacity_orig unset. This allows us to detect if
-		 * domain iteration is still funny without causing /0 traps.
-		 */
-		if (!group->sgc->capacity_orig) {
-			printk(KERN_CONT "\n");
-			printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n");
-			break;
-		}
-
 		if (!cpumask_weight(sched_group_cpus(group))) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: empty group\n");
@@ -5837,7 +5826,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		 * die on a /0 trap.
 		 */
 		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
-		sg->sgc->capacity_orig = sg->sgc->capacity;
 
 		/*
 		 * Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4097e3f..2ba278c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5676,11 +5676,10 @@ struct sg_lb_stats {
 	unsigned long group_capacity;
 	unsigned long group_usage; /* Total usage of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
-	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_has_free_capacity;
+	int group_out_of_capacity;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -5821,7 +5820,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
 	cpu_rq(cpu)->cpu_capacity_orig = capacity;
-	sdg->sgc->capacity_orig = capacity;
 
 	if (sched_feat(ARCH_CAPACITY))
 		capacity *= arch_scale_freq_capacity(sd, cpu);
@@ -5844,7 +5842,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity, capacity_orig;
+	unsigned long capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -5856,7 +5854,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 		return;
 	}
 
-	capacity_orig = capacity = 0;
+	capacity = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -5876,19 +5874,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 * Use capacity_of(), which is set irrespective of domains
 			 * in update_cpu_capacity().
 			 *
-			 * This avoids capacity/capacity_orig from being 0 and
+			 * This avoids capacity from being 0 and
 			 * causing divide-by-zero issues on boot.
-			 *
-			 * Runtime updates will correct capacity_orig.
 			 */
 			if (unlikely(!rq->sd)) {
-				capacity_orig += capacity_orig_of(cpu);
 				capacity += capacity_of(cpu);
 				continue;
 			}
 
 			sgc = rq->sd->groups->sgc;
-			capacity_orig += sgc->capacity_orig;
 			capacity += sgc->capacity;
 		}
 	} else  {
@@ -5899,42 +5893,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 		group = child->groups;
 		do {
-			capacity_orig += group->sgc->capacity_orig;
 			capacity += group->sgc->capacity;
 			group = group->next;
 		} while (group != child->groups);
 	}
 
-	sdg->sgc->capacity_orig = capacity_orig;
 	sdg->sgc->capacity = capacity;
 }
 
 /*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
- */
-static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
-{
-	/*
-	 * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
-	 */
-	if (!(sd->flags & SD_SHARE_CPUCAPACITY))
-		return 0;
-
-	/*
-	 * If ~90% of the cpu_capacity is still there, we're good.
-	 */
-	if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
-		return 1;
-
-	return 0;
-}
-
-/*
  * Check whether the capacity of the rq has been noticeably reduced by side
  * activity. The imbalance_pct is used for the threshold.
  * Return true is the capacity is reduced
@@ -5980,38 +5947,37 @@ static inline int sg_imbalanced(struct sched_group *group)
 	return group->sgc->imbalance;
 }
 
-/*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
- */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
+			struct lb_env *env)
 {
-	unsigned int capacity_factor, smt, cpus;
-	unsigned int capacity, capacity_orig;
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
 
-	capacity = group->sgc->capacity;
-	capacity_orig = group->sgc->capacity_orig;
-	cpus = group->group_weight;
+	if (sgs->sum_nr_running < sgs->group_weight)
+		return true;
 
-	/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
-	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
-	capacity_factor = cpus / smt; /* cores */
+	return false;
+}
 
-	capacity_factor = min_t(unsigned,
-		capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
-	if (!capacity_factor)
-		capacity_factor = fix_small_capacity(env->sd, group);
+static inline int group_is_overloaded(struct sg_lb_stats *sgs,
+			struct lb_env *env)
+{
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return false;
 
-	return capacity_factor;
+	if ((sgs->group_capacity * 100) <
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
+
+	return false;
 }
 
 static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+group_classify(struct sched_group *group, struct sg_lb_stats *sgs,
+			struct lb_env *env)
 {
-	if (sgs->sum_nr_running > sgs->group_capacity_factor)
+	if (group_is_overloaded(sgs, env))
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
@@ -6072,11 +6038,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
 	sgs->group_weight = group->group_weight;
-	sgs->group_capacity_factor = sg_capacity_factor(env, group);
-	sgs->group_type = group_classify(group, sgs);
 
-	if (sgs->group_capacity_factor > sgs->sum_nr_running)
-		sgs->group_has_free_capacity = 1;
+	sgs->group_type = group_classify(group, sgs, env);
+
+	sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
 }
 
 /**
@@ -6198,17 +6163,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity factor to one so that we'll try
+		 * first, lower the sg capacity to one so that we'll try
 		 * and move all the excess tasks away. We lower the capacity
 		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks, i.e. nr_running < group_capacity_factor. The
+		 * these excess tasks, i.e. group_capacity > 0. The
 		 * extra check prevents the case where you always pull from the
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    sds->local_stat.group_has_free_capacity)
-			sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
+		    group_has_free_capacity(&sds->local_stat, env)) {
+			if (sgs->sum_nr_running > 1)
+				sgs->group_out_of_capacity = 1;
+			sgs->group_capacity = min(sgs->group_capacity,
+						SCHED_CAPACITY_SCALE);
+		}
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -6387,11 +6356,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity =
-			(busiest->sum_nr_running - busiest->group_capacity_factor);
-
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
-		load_above_capacity /= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running *
+					SCHED_LOAD_SCALE;
+		if (load_above_capacity > busiest->group_capacity)
+			load_above_capacity -= busiest->group_capacity;
+		else
+			load_above_capacity = ~0UL;
 	}
 
 	/*
@@ -6454,6 +6424,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
+	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -6474,8 +6445,9 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE &&
+			group_has_free_capacity(local, env) &&
+			busiest->group_out_of_capacity)
 		goto force_balance;
 
 	/*
@@ -6533,7 +6505,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long capacity, capacity_factor, wl;
+		unsigned long capacity, wl;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -6562,9 +6534,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;
 
 		capacity = capacity_of(i);
-		capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
-		if (!capacity_factor)
-			capacity_factor = fix_small_capacity(env->sd, group);
 
 		wl = weighted_cpuload(i);
 
@@ -6572,7 +6541,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu capacity.
 		 */
-		if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+		if (rq->nr_running == 1 && wl > env->imbalance &&
+		    !check_cpu_capacity(rq, env->sd))
 			continue;
 
 		/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3ccb136..8d224c5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -750,7 +750,7 @@ struct sched_group_capacity {
 	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
 	 * for a single CPU.
 	 */
-	unsigned int capacity, capacity_orig;
+	unsigned int capacity;
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 	/*
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* [PATCH v6 6/6] sched: add SD_PREFER_SIBLING for SMT level
  2014-09-23 16:07 ` Vincent Guittot
@ 2014-09-23 16:08   ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-09-23 16:08 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, pjt, bsegall, Vincent Guittot

add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
the scheduler will put at least 1 task per core.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c7c8ac4..f72663e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6134,6 +6134,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	 */
 
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
+		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 6/6] sched: add SD_PREFER_SIBLING for SMT level
  2014-09-23 16:08   ` Vincent Guittot
@ 2014-09-24 12:27     ` Preeti U Murthy
  -1 siblings, 0 replies; 72+ messages in thread
From: Preeti U Murthy @ 2014-09-24 12:27 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, pjt, bsegall

On 09/23/2014 09:38 PM, Vincent Guittot wrote:
> add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
> the scheduler will put at least 1 task per core.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/core.c | 1 +
>  1 file changed, 1 insertion(+)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index c7c8ac4..f72663e 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -6134,6 +6134,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
>  	 */
> 
>  	if (sd->flags & SD_SHARE_CPUCAPACITY) {
> +		sd->flags |= SD_PREFER_SIBLING;
>  		sd->imbalance_pct = 110;
>  		sd->smt_gain = 1178; /* ~15% */
> 
Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-09-23 16:08   ` Vincent Guittot
@ 2014-09-24 17:48     ` Dietmar Eggemann
  -1 siblings, 0 replies; 72+ messages in thread
From: Dietmar Eggemann @ 2014-09-24 17:48 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, preeti, linux,
	linux-arm-kernel
  Cc: riel, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, pjt, bsegall

On 23/09/14 17:08, Vincent Guittot wrote:

[...]

> 
> Finally, the sched_group->sched_group_capacity->capacity_orig has been removed
> because it is no longer used during load balance.

So you're not forced to call it rq->cpu_capacity_orig any more, you
could use rq->cpu_capacity_max instead.

[...]

This review (by PeterZ) during v5 of your patch-set recommended some
renaming (e.g. s/group_has_free_capacity/group_has_capacity and
s/group_out_of_capacity/group_no_capacity) as well as reordering of the
parameters, which I agree with:

https://lkml.org/lkml/2014/9/11/706

> 
> -/*
> - * Compute the group capacity factor.
> - *
> - * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
> - * first dividing out the smt factor and computing the actual number of cores
> - * and limit unit capacity with that.
> - */
> -static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
> +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,

s/static inline int/static inline bool

> +                       struct lb_env *env)
>  {
> -       unsigned int capacity_factor, smt, cpus;
> -       unsigned int capacity, capacity_orig;
> +       if ((sgs->group_capacity * 100) >
> +                       (sgs->group_usage * env->sd->imbalance_pct))
> +               return true;
> 
> -       capacity = group->sgc->capacity;
> -       capacity_orig = group->sgc->capacity_orig;
> -       cpus = group->group_weight;
> +       if (sgs->sum_nr_running < sgs->group_weight)
> +               return true;
> 
> -       /* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
> -       smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
> -       capacity_factor = cpus / smt; /* cores */
> +       return false;
> +}
> 
> -       capacity_factor = min_t(unsigned,
> -               capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
> -       if (!capacity_factor)
> -               capacity_factor = fix_small_capacity(env->sd, group);
> +static inline int group_is_overloaded(struct sg_lb_stats *sgs,

s/static inline int/static inline bool

> +                       struct lb_env *env)
> +{
> +       if (sgs->sum_nr_running <= sgs->group_weight)
> +               return false;
> 
> -       return capacity_factor;
> +       if ((sgs->group_capacity * 100) <
> +                       (sgs->group_usage * env->sd->imbalance_pct))
> +               return true;
> +
> +       return false;
>  }
> 
>  static enum group_type
> -group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
> +group_classify(struct sched_group *group, struct sg_lb_stats *sgs,
> +                       struct lb_env *env)
>  {
> -       if (sgs->sum_nr_running > sgs->group_capacity_factor)
> +       if (group_is_overloaded(sgs, env))
>                 return group_overloaded;
> 
>         if (sg_imbalanced(group))
> @@ -6072,11 +6038,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                 sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
> 
>         sgs->group_weight = group->group_weight;
> -       sgs->group_capacity_factor = sg_capacity_factor(env, group);
> -       sgs->group_type = group_classify(group, sgs);
> 
> -       if (sgs->group_capacity_factor > sgs->sum_nr_running)
> -               sgs->group_has_free_capacity = 1;
> +       sgs->group_type = group_classify(group, sgs, env);
> +
> +       sgs->group_out_of_capacity = group_is_overloaded(sgs, env);

In case sgs->group_type is group_overloaded, you could set
sgs->group_out_of_capacity to 1 without calling group_is_overloaded again:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7cdf271e8e52..52d441c92a4f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6037,7 +6037,8 @@ static inline void update_sg_lb_stats(struct
lb_env *env,

        sgs->group_type = group_classify(group, sgs, env);

-       sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
+       if (sgs->group_type == group_overloaded)
+               sgs->group_out_of_capacity = 1;
 }

>  }
> 
>  /**
> @@ -6198,17 +6163,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd

[...]


^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-09-24 17:48     ` Dietmar Eggemann
@ 2014-09-25  8:35       ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-09-25  8:35 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, pjt, bsegall

On 24 September 2014 19:48, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 23/09/14 17:08, Vincent Guittot wrote:
>

[snip]

>
> This review (by PeterZ) during v5 of your patch-set recommended some
> renaming (e.g. s/group_has_free_capacity/group_has_capacity and
> s/group_out_of_capacity/group_no_capacity as well as reordering of the
> parameters which I agree with:
>
> https://lkml.org/lkml/2014/9/11/706

Ah... you're right, these changes slipped through my renaming session

>
>>
>> -/*

[snip]

>> -       if (sgs->group_capacity_factor > sgs->sum_nr_running)
>> -               sgs->group_has_free_capacity = 1;
>> +       sgs->group_type = group_classify(group, sgs, env);
>> +
>> +       sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
>
> In case sgs->group_type is group_overloaded you could set
> sgs->group_out_of_capacity to 1 without calling group_is_overloaded again.

I prefer to keep sgs->group_out_of_capacity = group_is_overloaded(sgs,
env) and use it in group_classify, in case of future changes in the
classification order.

Regards,
Vincent
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7cdf271e8e52..52d441c92a4f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6037,7 +6037,8 @@ static inline void update_sg_lb_stats(struct
> lb_env *env,
>
>         sgs->group_type = group_classify(group, sgs, env);
>
> -       sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
> +       if (sgs->group_type == group_overloaded)
> +               sgs->group_out_of_capacity = 1;
>  }
>
>>  }
>>
>>  /**
>> @@ -6198,17 +6163,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>
> [...]
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-09-23 16:08   ` Vincent Guittot
@ 2014-09-25  8:38     ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-09-25  8:38 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, pjt, bsegall, Vincent Guittot

The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and that a CPU's capacity is
SCHED_CAPACITY_SCALE, but the capacity_factor hardly works for SMT systems:
it sometimes works for big cores but fails to do the right thing for little
cores.

Below are two examples to illustrate the problem that this patch solves:

1 - capacity_factor makes the assumption that the max capacity of a CPU is
SCHED_CAPACITY_SCALE and that the load of a thread is always
SCHED_LOAD_SCALE. It compares the resulting capacity_factor with the sum
of nr_running to decide whether a group is overloaded or not.

But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
(640 as an example), a group of 3 CPUs will have a max capacity_factor
of 2 (div_round_closest(3*640/1024) = 2), which means that it will be
seen as overloaded even if we have only one task per CPU.

2 - Then, if the default capacity of a CPU is greater than
SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have
a capacity_factor of 4 (at max, thanks to the fix [0] for SMT systems
that prevents the appearance of ghost CPUs), but if one CPU is fully
used by an rt task (and its capacity is reduced to nearly nothing), the
capacity_factor of the group will still be 4
(div_round_closest(3*1512/1024) = 4).

So, this patch tries to solve this issue by removing capacity_factor
and replacing it with the two following metrics:
- The CPU capacity available for CFS tasks, which is already used by
load_balance.
- The usage of the CPU by the CFS tasks. For the latter, I have
re-introduced the utilization_avg_contrib, which is in the range
[0..SCHED_LOAD_SCALE] whatever the capacity of the CPU is.
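
As an illustration of how the new metrics behave for example 1 above, here
is a small standalone sketch; the function mirrors the group_is_overloaded()
test added below, but the plain-parameter form, the imbalance_pct value of
125 and the usage numbers are assumptions of this sketch:

#include <stdio.h>

/*
 * Same test as group_is_overloaded(), with the sg_lb_stats and lb_env
 * fields passed as plain parameters.
 */
static int group_is_overloaded(unsigned long group_capacity,
			       unsigned long group_usage,
			       unsigned int sum_nr_running,
			       unsigned int group_weight,
			       unsigned int imbalance_pct)
{
	if (sum_nr_running <= group_weight)
		return 0;

	return (group_capacity * 100) < (group_usage * imbalance_pct);
}

int main(void)
{
	/*
	 * Example 1: a group of 3 CPUs of capacity 640 with one task per
	 * CPU. sum_nr_running == group_weight, so the group is not reported
	 * as overloaded, unlike with the old capacity_factor of 2.
	 */
	printf("overloaded = %d\n",
	       group_is_overloaded(3 * 640, 3 * 640, 3, 3, 125));
	return 0;
}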

This implementation of utilization_avg_contrib doesn't solve the scaling
invariance problem, so I have to scale the utilization with the original
capacity of the CPU in order to get the CPU usage and compare it with the
capacity. Once scaling invariance has been added to utilization_avg_contrib,
we will remove the scaling of utilization_avg_contrib by cpu_capacity_orig
in get_cpu_usage. But scaling invariance will come in another patchset.

Finally, the sched_group->sched_group_capacity->capacity_orig has been removed
because it is no longer used during load balance.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

Hi,

This update of the patch takes into account the missing renamings pointed
out by Dietmar.

Regards,
Vincent

 kernel/sched/core.c  |  12 -----
 kernel/sched/fair.c  | 131 ++++++++++++++++++++-------------------------------
 kernel/sched/sched.h |   2 +-
 3 files changed, 51 insertions(+), 94 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e20f203..c7c8ac4 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5342,17 +5342,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 			break;
 		}
 
-		/*
-		 * Even though we initialize ->capacity to something semi-sane,
-		 * we leave capacity_orig unset. This allows us to detect if
-		 * domain iteration is still funny without causing /0 traps.
-		 */
-		if (!group->sgc->capacity_orig) {
-			printk(KERN_CONT "\n");
-			printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n");
-			break;
-		}
-
 		if (!cpumask_weight(sched_group_cpus(group))) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: empty group\n");
@@ -5837,7 +5826,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		 * die on a /0 trap.
 		 */
 		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
-		sg->sgc->capacity_orig = sg->sgc->capacity;
 
 		/*
 		 * Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4097e3f..f305f17 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5676,11 +5676,10 @@ struct sg_lb_stats {
 	unsigned long group_capacity;
 	unsigned long group_usage; /* Total usage of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
-	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_has_free_capacity;
+	int group_no_capacity;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -5821,7 +5820,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
 	cpu_rq(cpu)->cpu_capacity_orig = capacity;
-	sdg->sgc->capacity_orig = capacity;
 
 	if (sched_feat(ARCH_CAPACITY))
 		capacity *= arch_scale_freq_capacity(sd, cpu);
@@ -5844,7 +5842,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity, capacity_orig;
+	unsigned long capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -5856,7 +5854,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 		return;
 	}
 
-	capacity_orig = capacity = 0;
+	capacity = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -5876,19 +5874,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 * Use capacity_of(), which is set irrespective of domains
 			 * in update_cpu_capacity().
 			 *
-			 * This avoids capacity/capacity_orig from being 0 and
+			 * This avoids capacity from being 0 and
 			 * causing divide-by-zero issues on boot.
-			 *
-			 * Runtime updates will correct capacity_orig.
 			 */
 			if (unlikely(!rq->sd)) {
-				capacity_orig += capacity_orig_of(cpu);
 				capacity += capacity_of(cpu);
 				continue;
 			}
 
 			sgc = rq->sd->groups->sgc;
-			capacity_orig += sgc->capacity_orig;
 			capacity += sgc->capacity;
 		}
 	} else  {
@@ -5899,42 +5893,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 		group = child->groups;
 		do {
-			capacity_orig += group->sgc->capacity_orig;
 			capacity += group->sgc->capacity;
 			group = group->next;
 		} while (group != child->groups);
 	}
 
-	sdg->sgc->capacity_orig = capacity_orig;
 	sdg->sgc->capacity = capacity;
 }
 
 /*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
- */
-static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
-{
-	/*
-	 * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
-	 */
-	if (!(sd->flags & SD_SHARE_CPUCAPACITY))
-		return 0;
-
-	/*
-	 * If ~90% of the cpu_capacity is still there, we're good.
-	 */
-	if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
-		return 1;
-
-	return 0;
-}
-
-/*
  * Check whether the capacity of the rq has been noticeably reduced by side
  * activity. The imbalance_pct is used for the threshold.
  * Return true is the capacity is reduced
@@ -5980,38 +5947,37 @@ static inline int sg_imbalanced(struct sched_group *group)
 	return group->sgc->imbalance;
 }
 
-/*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
- */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline bool
+group_has_capacity(struct sg_lb_stats *sgs, struct lb_env *env)
 {
-	unsigned int capacity_factor, smt, cpus;
-	unsigned int capacity, capacity_orig;
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
 
-	capacity = group->sgc->capacity;
-	capacity_orig = group->sgc->capacity_orig;
-	cpus = group->group_weight;
+	if (sgs->sum_nr_running < sgs->group_weight)
+		return true;
 
-	/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
-	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
-	capacity_factor = cpus / smt; /* cores */
+	return false;
+}
 
-	capacity_factor = min_t(unsigned,
-		capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
-	if (!capacity_factor)
-		capacity_factor = fix_small_capacity(env->sd, group);
+static inline bool
+group_is_overloaded(struct sg_lb_stats *sgs, struct lb_env *env)
+{
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return false;
+
+	if ((sgs->group_capacity * 100) <
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
 
-	return capacity_factor;
+	return false;
 }
 
-static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+static enum group_type group_classify(struct sched_group *group,
+		struct sg_lb_stats *sgs,
+		struct lb_env *env)
 {
-	if (sgs->sum_nr_running > sgs->group_capacity_factor)
+	if (sgs->group_no_capacity)
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
@@ -6072,11 +6038,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
 	sgs->group_weight = group->group_weight;
-	sgs->group_capacity_factor = sg_capacity_factor(env, group);
-	sgs->group_type = group_classify(group, sgs);
 
-	if (sgs->group_capacity_factor > sgs->sum_nr_running)
-		sgs->group_has_free_capacity = 1;
+	sgs->group_no_capacity = group_is_overloaded(sgs, env);
+	sgs->group_type = group_classify(group, sgs, env);
 }
 
 /**
@@ -6198,17 +6162,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity factor to one so that we'll try
+		 * first, lower the sg capacity to one so that we'll try
 		 * and move all the excess tasks away. We lower the capacity
 		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks, i.e. nr_running < group_capacity_factor. The
+		 * these excess tasks, i.e. group_capacity > 0. The
 		 * extra check prevents the case where you always pull from the
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    sds->local_stat.group_has_free_capacity)
-			sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
+		    group_has_capacity(&sds->local_stat, env)) {
+			if (sgs->sum_nr_running > 1)
+				sgs->group_no_capacity = 1;
+			sgs->group_capacity = min(sgs->group_capacity,
+						SCHED_CAPACITY_SCALE);
+		}
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -6387,11 +6355,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity =
-			(busiest->sum_nr_running - busiest->group_capacity_factor);
-
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
-		load_above_capacity /= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running *
+					SCHED_LOAD_SCALE;
+		if (load_above_capacity > busiest->group_capacity)
+			load_above_capacity -= busiest->group_capacity;
+		else
+			load_above_capacity = ~0UL;
 	}
 
 	/*
@@ -6454,6 +6423,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
+	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -6474,8 +6444,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(local, env) &&
+	    busiest->group_no_capacity)
 		goto force_balance;
 
 	/*
@@ -6533,7 +6503,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long capacity, capacity_factor, wl;
+		unsigned long capacity, wl;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -6562,9 +6532,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;
 
 		capacity = capacity_of(i);
-		capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
-		if (!capacity_factor)
-			capacity_factor = fix_small_capacity(env->sd, group);
 
 		wl = weighted_cpuload(i);
 
@@ -6572,7 +6539,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu capacity.
 		 */
-		if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+		if (rq->nr_running == 1 && wl > env->imbalance &&
+		    !check_cpu_capacity(rq, env->sd))
 			continue;
 
 		/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3ccb136..8d224c5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -750,7 +750,7 @@ struct sched_group_capacity {
 	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
 	 * for a single CPU.
 	 */
-	unsigned int capacity, capacity_orig;
+	unsigned int capacity;
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 	/*
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 72+ messages in thread

- */
-static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
-{
-	/*
-	 * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
-	 */
-	if (!(sd->flags & SD_SHARE_CPUCAPACITY))
-		return 0;
-
-	/*
-	 * If ~90% of the cpu_capacity is still there, we're good.
-	 */
-	if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
-		return 1;
-
-	return 0;
-}
-
-/*
  * Check whether the capacity of the rq has been noticeably reduced by side
  * activity. The imbalance_pct is used for the threshold.
  * Return true is the capacity is reduced
@@ -5980,38 +5947,37 @@ static inline int sg_imbalanced(struct sched_group *group)
 	return group->sgc->imbalance;
 }
 
-/*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
- */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline bool
+group_has_capacity(struct sg_lb_stats *sgs, struct lb_env *env)
 {
-	unsigned int capacity_factor, smt, cpus;
-	unsigned int capacity, capacity_orig;
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
 
-	capacity = group->sgc->capacity;
-	capacity_orig = group->sgc->capacity_orig;
-	cpus = group->group_weight;
+	if (sgs->sum_nr_running < sgs->group_weight)
+		return true;
 
-	/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
-	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
-	capacity_factor = cpus / smt; /* cores */
+	return false;
+}
 
-	capacity_factor = min_t(unsigned,
-		capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
-	if (!capacity_factor)
-		capacity_factor = fix_small_capacity(env->sd, group);
+static inline bool
+group_is_overloaded(struct sg_lb_stats *sgs, struct lb_env *env)
+{
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return false;
+
+	if ((sgs->group_capacity * 100) <
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
 
-	return capacity_factor;
+	return false;
 }
 
-static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+static enum group_type group_classify(struct sched_group *group,
+		struct sg_lb_stats *sgs,
+		struct lb_env *env)
 {
-	if (sgs->sum_nr_running > sgs->group_capacity_factor)
+	if (sgs->group_no_capacity)
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
@@ -6072,11 +6038,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
 	sgs->group_weight = group->group_weight;
-	sgs->group_capacity_factor = sg_capacity_factor(env, group);
-	sgs->group_type = group_classify(group, sgs);
 
-	if (sgs->group_capacity_factor > sgs->sum_nr_running)
-		sgs->group_has_free_capacity = 1;
+	sgs->group_no_capacity = group_is_overloaded(sgs, env);
+	sgs->group_type = group_classify(group, sgs, env);
 }
 
 /**
@@ -6198,17 +6162,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity factor to one so that we'll try
+		 * first, lower the sg capacity to one so that we'll try
 		 * and move all the excess tasks away. We lower the capacity
 		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks, i.e. nr_running < group_capacity_factor. The
+		 * these excess tasks, i.e. group_capacity > 0. The
 		 * extra check prevents the case where you always pull from the
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    sds->local_stat.group_has_free_capacity)
-			sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
+		    group_has_capacity(&sds->local_stat, env)) {
+			if (sgs->sum_nr_running > 1)
+				sgs->group_no_capacity = 1;
+			sgs->group_capacity = min(sgs->group_capacity,
+						SCHED_CAPACITY_SCALE);
+		}
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -6387,11 +6355,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity =
-			(busiest->sum_nr_running - busiest->group_capacity_factor);
-
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
-		load_above_capacity /= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running *
+					SCHED_LOAD_SCALE;
+		if (load_above_capacity > busiest->group_capacity)
+			load_above_capacity -= busiest->group_capacity;
+		else
+			load_above_capacity = ~0UL;
 	}
 
 	/*
@@ -6454,6 +6423,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
+	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -6474,8 +6444,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(local, env) &&
+	    busiest->group_no_capacity)
 		goto force_balance;
 
 	/*
@@ -6533,7 +6503,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long capacity, capacity_factor, wl;
+		unsigned long capacity, wl;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -6562,9 +6532,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;
 
 		capacity = capacity_of(i);
-		capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
-		if (!capacity_factor)
-			capacity_factor = fix_small_capacity(env->sd, group);
 
 		wl = weighted_cpuload(i);
 
@@ -6572,7 +6539,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu capacity.
 		 */
-		if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+		if (rq->nr_running == 1 && wl > env->imbalance &&
+		    !check_cpu_capacity(rq, env->sd))
 			continue;
 
 		/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3ccb136..8d224c5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -750,7 +750,7 @@ struct sched_group_capacity {
 	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
 	 * for a single CPU.
 	 */
-	unsigned int capacity, capacity_orig;
+	unsigned int capacity;
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 	/*
-- 
1.9.1

^ permalink raw reply related	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 6/6] sched: add SD_PREFER_SIBLING for SMT level
  2014-09-24 12:27     ` Preeti U Murthy
@ 2014-09-25 12:10       ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-09-25 12:10 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann, Paul Turner, Benjamin Segall

On 24 September 2014 14:27, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> On 09/23/2014 09:38 PM, Vincent Guittot wrote:
>> add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
>> the scheduler will put at least 1 task per core.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/core.c | 1 +
>>  1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index c7c8ac4..f72663e 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -6134,6 +6134,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
>>        */
>>
>>       if (sd->flags & SD_SHARE_CPUCAPACITY) {
>> +             sd->flags |= SD_PREFER_SIBLING;
>>               sd->imbalance_pct = 110;
>>               sd->smt_gain = 1178; /* ~15% */
>>
> Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>

Thanks

>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 4/6] sched: get CPU's usage statistic
  2014-09-23 16:08   ` Vincent Guittot
@ 2014-09-25 19:05     ` Dietmar Eggemann
  -1 siblings, 0 replies; 72+ messages in thread
From: Dietmar Eggemann @ 2014-09-25 19:05 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, preeti, linux,
	linux-arm-kernel
  Cc: riel, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, pjt, bsegall

On 23/09/14 17:08, Vincent Guittot wrote:
> Monitor the usage level of each group of each sched_domain level. The usage is
> the amount of cpu_capacity that is currently used on a CPU or group of CPUs.
> We use the utilization_load_avg to evaluate the usage level of each group.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2cf153d..4097e3f 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4523,6 +4523,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  	return target;
>  }
>  
> +static int get_cpu_usage(int cpu)
> +{
> +	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
> +	unsigned long capacity = capacity_orig_of(cpu);
> +
> +	if (usage >= SCHED_LOAD_SCALE)
> +		return capacity + 1;

Why are you returning rq->cpu_capacity_orig + 1 (1025) in case
utilization_load_avg is greater than or equal to 1024, and not usage or
(usage * capacity) >> SCHED_LOAD_SHIFT too?

In case the weight of a sched group is greater than 1, you might lose
the information that the whole sched group is over-utilized too.

You add up the individual cpu usage values for a group by
sgs->group_usage += get_cpu_usage(i) in update_sg_lb_stats and later use
sgs->group_usage in group_is_overloaded to compare it against
sgs->group_capacity (taking imbalance_pct into consideration).

> +
> +	return (usage * capacity) >> SCHED_LOAD_SHIFT;

Nit-pick: Since you're multiplying by a capacity value
(rq->cpu_capacity_orig) you should shift by SCHED_CAPACITY_SHIFT.

Just to make sure: You do this scaling of usage by cpu_capacity_orig
here only to cater for the fact that cpu_capacity_orig might be uarch
scaled (by arch_scale_cpu_capacity, !SMT) in update_cpu_capacity while
utilization_load_avg is currently not.
We don't even uArch scale on ARM TC2 big.LITTLE platform in mainline
today due to the missing clock-frequency property in the device tree.

I think it's hard for people to grasp that your patch-set takes uArch
scaling of capacity into consideration but not frequency scaling of
capacity (via arch_scale_freq_capacity, not used at the moment).

> +}
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -5663,6 +5674,7 @@ struct sg_lb_stats {
>  	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
>  	unsigned long load_per_task;
>  	unsigned long group_capacity;
> +	unsigned long group_usage; /* Total usage of the group */
>  	unsigned int sum_nr_running; /* Nr tasks running in the group */
>  	unsigned int group_capacity_factor;
>  	unsigned int idle_cpus;
> @@ -6037,6 +6049,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  			load = source_load(i, load_idx);
>  
>  		sgs->group_load += load;
> +		sgs->group_usage += get_cpu_usage(i);
>  		sgs->sum_nr_running += rq->cfs.h_nr_running;
>  
>  		if (rq->nr_running > 1)
> 



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-09-25  8:35       ` Vincent Guittot
@ 2014-09-25 19:19         ` Dietmar Eggemann
  -1 siblings, 0 replies; 72+ messages in thread
From: Dietmar Eggemann @ 2014-09-25 19:19 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, pjt, bsegall

On 25/09/14 09:35, Vincent Guittot wrote:
> On 24 September 2014 19:48, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 23/09/14 17:08, Vincent Guittot wrote:
>>
> 
> [snip]
> 
>>
>> This review (by PeterZ) during v5 of your patch-set recommended some
>> renaming (e.g. s/group_has_free_capacity/group_has_capacity and
>> s/group_out_of_capacity/group_no_capacity as well as reordering of the
>> parameters which I agree with:
>>
>> https://lkml.org/lkml/2014/9/11/706
> 
> Ah... you're right, these changes have passed through my seance of renaming

What about the ordering of the function parameters in
group_has_capacity, group_is_overloaded and group_type group_classify?

All the existing *load balance* related functions in fair.c seem to
follow this (struct lb_env *env, struct sd_lb_stats *sds, struct
sched_group *group, struct sg_lb_stats *sgs) order.

> 
>>
>>>
>>> -/*
> 
> [snip]
> 
>>> -       if (sgs->group_capacity_factor > sgs->sum_nr_running)
>>> -               sgs->group_has_free_capacity = 1;
>>> +       sgs->group_type = group_classify(group, sgs, env);
>>> +
>>> +       sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
>>
>> In case sgs->group_type is group_overloaded you could set
>> sgs->group_out_of_capacity to 1 without calling group_is_overloaded again.
> 
> I prefer to keep sgs->group_out_of_capacity = group_is_overloaded(sgs,
> env) and use it in group_classify in case of future changes in the
> classification order

Ok, but than group_is_overloaded is called twice at the end of
update_sg_lb_stats with exactly the same result. Looks weird in my traces.

> 
[...]



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 4/6] sched: get CPU's usage statistic
  2014-09-25 19:05     ` Dietmar Eggemann
@ 2014-09-26 12:17       ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-09-26 12:17 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, pjt, bsegall

On 25 September 2014 21:05, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 23/09/14 17:08, Vincent Guittot wrote:
>> Monitor the usage level of each group of each sched_domain level. The usage is
>> the amount of cpu_capacity that is currently used on a CPU or group of CPUs.
>> We use the utilization_load_avg to evaluate the usage level of each group.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 2cf153d..4097e3f 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4523,6 +4523,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
>>       return target;
>>  }
>>
>> +static int get_cpu_usage(int cpu)
>> +{
>> +     unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
>> +     unsigned long capacity = capacity_orig_of(cpu);
>> +
>> +     if (usage >= SCHED_LOAD_SCALE)
>> +             return capacity + 1;
>
> Why you are returning rq->cpu_capacity_orig + 1 (1025) in case
> utilization_load_avg is greater or equal than 1024 and not usage or
> (usage * capacity) >> SCHED_LOAD_SHIFT too?

The usage can't be higher than the full capacity of the CPU because
it's about the running time on this CPU. Nevertheless, usage can be
higher than SCHED_LOAD_SCALE because of unfortunate rounding in
avg_period and running_load_avg or just after migrating tasks until
the average stabilizes with the new running time.

>
> In case the weight of a sched group is greater than 1, you might loose
> the information that the whole sched group is over-utilized too.

It's exactly for sched_groups with more than 1 CPU that we need to
cap the usage of a CPU at 100%. Otherwise, the group could be seen as
overloaded (CPU0 usage at 121% + CPU1 usage at 80%) whereas CPU1 still has
20% of available capacity.
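
Roughly, with per-CPU capacities assumed to be 1024, the numbers look like
this (just a sketch of the sums, not the exact group_is_overloaded() check):

	/* uncapped: 1239 + 819 = 2058 > 2 * 1024 -> group looks overloaded */
	/* capped:   1025 + 819 = 1844 < 2 * 1024 -> group still has room   */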

>
> You add up the individual cpu usage values for a group by
> sgs->group_usage += get_cpu_usage(i) in update_sg_lb_stats and later use
> sgs->group_usage in group_is_overloaded to compare it against
> sgs->group_capacity (taking imbalance_pct into consideration).
>
>> +
>> +     return (usage * capacity) >> SCHED_LOAD_SHIFT;
>
> Nit-pick: Since you're multiplying by a capacity value
> (rq->cpu_capacity_orig) you should shift by SCHED_CAPACITY_SHIFT.

We want to compare the output of the function with some capacity
figures, so I think that >> SCHED_LOAD_SHIFT is the right operation.

>
> Just to make sure: You do this scaling of usage by cpu_capacity_orig
> here only to cater for the fact that cpu_capacity_orig might be uarch
> scaled (by arch_scale_cpu_capacity, !SMT) in update_cpu_capacity while

I do this for any system with CPUs that have an original capacity that
is different from SCHED_CAPACITY_SCALE so it's for both uArch and SMT.

> utilization_load_avg is currently not.
> We don't even uArch scale on ARM TC2 big.LITTLE platform in mainline
> today due to the missing clock-frequency property in the device tree.

Sorry, I don't get your point.

>
> I think it's hard for people to grasp that your patch-set takes uArch
> scaling of capacity into consideration but not frequency scaling of
> capacity (via arch_scale_freq_capacity, not used at the moment).
>
>> +}
>> +
>>  /*
>>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>> @@ -5663,6 +5674,7 @@ struct sg_lb_stats {
>>       unsigned long sum_weighted_load; /* Weighted load of group's tasks */
>>       unsigned long load_per_task;
>>       unsigned long group_capacity;
>> +     unsigned long group_usage; /* Total usage of the group */
>>       unsigned int sum_nr_running; /* Nr tasks running in the group */
>>       unsigned int group_capacity_factor;
>>       unsigned int idle_cpus;
>> @@ -6037,6 +6049,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>                       load = source_load(i, load_idx);
>>
>>               sgs->group_load += load;
>> +             sgs->group_usage += get_cpu_usage(i);
>>               sgs->sum_nr_running += rq->cfs.h_nr_running;
>>
>>               if (rq->nr_running > 1)
>>
>
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-09-25 19:19         ` Dietmar Eggemann
@ 2014-09-26 12:39           ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-09-26 12:39 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, pjt, bsegall

On 25 September 2014 21:19, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 25/09/14 09:35, Vincent Guittot wrote:

[snip]

>>>
>>> In case sgs->group_type is group_overloaded you could set
>>> sgs->group_out_of_capacity to 1 without calling group_is_overloaded again.
>>
>> I prefer to keep sgs->group_out_of_capacity = group_is_overloaded(sgs,
>> env) and use it in group_classify in case of future changes in the
>> classification order
>
> Ok, but than group_is_overloaded is called twice at the end of
> update_sg_lb_stats with exactly the same result. Looks weird in my traces.

Sorry, I don't get your point.
group_is_overloaded() is called once for group_out_of_capacity, which
is then used in group_classify, so it should be called only once.

>
>>
> [...]
>
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-09-26 12:39           ` Vincent Guittot
@ 2014-09-26 14:00             ` Dietmar Eggemann
  -1 siblings, 0 replies; 72+ messages in thread
From: Dietmar Eggemann @ 2014-09-26 14:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, pjt, bsegall

On 26/09/14 13:39, Vincent Guittot wrote:
> On 25 September 2014 21:19, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 25/09/14 09:35, Vincent Guittot wrote:
> 
> [snip]
> 
>>>>
>>>> In case sgs->group_type is group_overloaded you could set
>>>> sgs->group_out_of_capacity to 1 without calling group_is_overloaded again.
>>>
>>> I prefer to keep sgs->group_out_of_capacity = group_is_overloaded(sgs,
>>> env) and use it in group_classify in case of future changes in the
>>> classification order
>>
>> Ok, but than group_is_overloaded is called twice at the end of
>> update_sg_lb_stats with exactly the same result. Looks weird in my traces.
> 
> sorry i don't catch your point.
> group_is_overloaded() is called once for group_out_of_capacity which
> is then used in group_classify so it should be called only once

You're right, I made a mistake in my local setup while replacing this
patch with the new version. Problem doesn't exist any more.

> 
>>
>>>
>> [...]
>>
>>
> 



^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 4/6] sched: get CPU's usage statistic
  2014-09-26 12:17       ` Vincent Guittot
@ 2014-09-26 15:58         ` Morten Rasmussen
  -1 siblings, 0 replies; 72+ messages in thread
From: Morten Rasmussen @ 2014-09-26 15:58 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, peterz, mingo, linux-kernel, preeti, linux,
	linux-arm-kernel, riel, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, pjt, bsegall

On Fri, Sep 26, 2014 at 01:17:43PM +0100, Vincent Guittot wrote:
> On 25 September 2014 21:05, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> > On 23/09/14 17:08, Vincent Guittot wrote:
> >> Monitor the usage level of each group of each sched_domain level. The usage is
> >> the amount of cpu_capacity that is currently used on a CPU or group of CPUs.
> >> We use the utilization_load_avg to evaluate the usage level of each group.
> >>
> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> ---
> >>  kernel/sched/fair.c | 13 +++++++++++++
> >>  1 file changed, 13 insertions(+)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 2cf153d..4097e3f 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -4523,6 +4523,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
> >>       return target;
> >>  }
> >>
> >> +static int get_cpu_usage(int cpu)
> >> +{
> >> +     unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
> >> +     unsigned long capacity = capacity_orig_of(cpu);
> >> +
> >> +     if (usage >= SCHED_LOAD_SCALE)
> >> +             return capacity + 1;
> >
> > Why you are returning rq->cpu_capacity_orig + 1 (1025) in case
> > utilization_load_avg is greater or equal than 1024 and not usage or
> > (usage * capacity) >> SCHED_LOAD_SHIFT too?
> 
> The usage can't be higher than the full capacity of the CPU because
> it's about the running time on this CPU. Nevertheless, usage can be
> higher than SCHED_LOAD_SCALE because of unfortunate rounding in
> avg_period and running_load_avg or just after migrating tasks until
> the average stabilizes with the new running time.

I fully agree that the cpu usage should be capped to capacity, but why
do you return capacity + 1? I would just return capacity, no?

Now that you have gotten rid of 'usage' everywhere else, shouldn't this
function be renamed to get_cpu_utilization()?

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 4/6] sched: get CPU's usage statistic
  2014-09-26 12:17       ` Vincent Guittot
@ 2014-09-26 19:57         ` Dietmar Eggemann
  -1 siblings, 0 replies; 72+ messages in thread
From: Dietmar Eggemann @ 2014-09-26 19:57 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, pjt, bsegall

On 26/09/14 13:17, Vincent Guittot wrote:
> On 25 September 2014 21:05, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 23/09/14 17:08, Vincent Guittot wrote:

[...]

>>>
>>> +static int get_cpu_usage(int cpu)
>>> +{
>>> +     unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
>>> +     unsigned long capacity = capacity_orig_of(cpu);
>>> +
>>> +     if (usage >= SCHED_LOAD_SCALE)
>>> +             return capacity + 1;
>>
>> Why you are returning rq->cpu_capacity_orig + 1 (1025) in case
>> utilization_load_avg is greater or equal than 1024 and not usage or
>> (usage * capacity) >> SCHED_LOAD_SHIFT too?
> 
> The usage can't be higher than the full capacity of the CPU because
> it's about the running time on this CPU. Nevertheless, usage can be
> higher than SCHED_LOAD_SCALE because of unfortunate rounding in
> avg_period and running_load_avg or just after migrating tasks until
> the average stabilizes with the new running time.

Ok, I got it now, thanks!


When running 'hackbench -p -T -s 10 -l 1' on TC2, the usage for a cpu
also occasionally goes much higher than SCHED_LOAD_SCALE. After all,
p->se.avg.running_avg_sum is initialized to slice in
init_task_runnable_average.

> 
>>
>> In case the weight of a sched group is greater than 1, you might loose
>> the information that the whole sched group is over-utilized too.
> 
> that's exactly for sched_group with more than 1 CPU that we need to
> cap the usage of a CPU to 100%. Otherwise, the group could be seen as
> overloaded (CPU0 usage at 121% + CPU1 usage at 80%) whereas CPU1 has
> 20% of available capacity

Makes sense: we don't want to do anything in this case at a higher sched
level (e.g. DIE); the appropriate level below (e.g. MC) should balance this
out first. Got it!

> 
>>
>> You add up the individual cpu usage values for a group by
>> sgs->group_usage += get_cpu_usage(i) in update_sg_lb_stats and later use
>> sgs->group_usage in group_is_overloaded to compare it against
>> sgs->group_capacity (taking imbalance_pct into consideration).
>>
>>> +
>>> +     return (usage * capacity) >> SCHED_LOAD_SHIFT;
>>
>> Nit-pick: Since you're multiplying by a capacity value
>> (rq->cpu_capacity_orig) you should shift by SCHED_CAPACITY_SHIFT.
> 
> we want to compare the output of the function with some capacity
> figures so i think that >> SCHED_LOAD_SHIFT is the right operation.
> 
>>
>> Just to make sure: You do this scaling of usage by cpu_capacity_orig
>> here only to cater for the fact that cpu_capacity_orig might be uarch
>> scaled (by arch_scale_cpu_capacity, !SMT) in update_cpu_capacity while
> 
> I do this for any system with CPUs that have an original capacity that
> is different from SCHED_CAPACITY_SCALE so it's for both uArch and SMT.

Understood. So your current patch-set is doing uArch scaling for capacity,
and since you're not doing uArch scaling for utilization, you do this
'* capacity) >> SCHED_LOAD_SHIFT' thing. Correct?

> 
>> utilization_load_avg is currently not.
>> We don't even uArch scale on ARM TC2 big.LITTLE platform in mainline
>> today due to the missing clock-frequency property in the device tree.
> 
> sorry i don't catch your point

With the mainline dts file for ARM TC2, rq->cpu_capacity_orig is 1024
for all 5 cpus (A15's and A7's). The arm topology shim layer barfs a

  /cpus/cpu@x missing clock-frequency property

per cpu in this case and doesn't scale the capacity. Only when I add

 clock-frequency = <xxxxxxxxx>;

per cpuX node into the dts file, I get a system with asymmetric
rq->cpu_capacity_orig values (606 for an A7 and 1441 for an A15).

> 
>>
>> I think it's hard for people to grasp that your patch-set takes uArch
>> scaling of capacity into consideration but not frequency scaling of
>> capacity (via arch_scale_freq_capacity, not used at the moment).

[...]


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-09-23 16:08   ` Vincent Guittot
@ 2014-09-29 13:39     ` Dietmar Eggemann
  -1 siblings, 0 replies; 72+ messages in thread
From: Dietmar Eggemann @ 2014-09-29 13:39 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, preeti, linux,
	linux-arm-kernel
  Cc: riel, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, pjt, bsegall

On 23/09/14 17:08, Vincent Guittot wrote:
> The scheduler tries to compute how many tasks a group of CPUs can handle by
> assuming that a task's load is SCHED_LOAD_SCALE and a CPU capacity is
> SCHED_CAPACITY_SCALE but the capacity_factor is hardly working for SMT system,
> it sometimes works for big cores but fails to do the right thing for little
> cores.
> 
> Below are two examples to illustrate the problem that this patch solves:
> 
> 1 - capacity_factor makes the assumption that max capacity of a CPU is
> SCHED_CAPACITY_SCALE and the load of a thread is always is
> SCHED_LOAD_SCALE. It compares the output of these figures with the sum
> of nr_running to decide if a group is overloaded or not.
> 
> But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
> (640 as an example), a group of 3 CPUS will have a max capacity_factor
> of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
> seen as overloaded if we have only one task per CPU.
> 
> 2 - Then, if the default capacity of a CPU is greater than
> SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have
> a capacity_factor of 4 (at max and thanks to the fix[0] for SMT system
> that prevent the apparition of ghost CPUs) but if one CPU is fully
> used by a rt task (and its capacity is reduced to nearly nothing), the
> capacity factor of the group will still be 4
> (div_round_closest(3*1512/1024) = 5).
> 
> So, this patch tries to solve this issue by removing capacity_factor
> and replacing it with the 2 following metrics :
> -The available CPU's capacity for CFS tasks which is the already used by
> load_balance.
> -The usage of the CPU by the CFS tasks. For the latter, I have
> re-introduced the utilization_avg_contrib which is in the range
> [0..SCHED_CPU_LOAD] whatever the capacity of the CPU is.

IMHO, this last sentence is misleading. The usage of a cpu can be
temporarily unbounded (in case a lot of tasks have just been spawned on
this cpu, testcase: hackbench) but it converges very quickly towards a
value in [0..1024]. Your implementation already handles this
case by capping usage to cpu_rq(cpu)->capacity_orig + 1.
BTW, I couldn't find the definition of SCHED_CPU_LOAD.

[...]


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-09-23 16:08   ` Vincent Guittot
@ 2014-10-02 16:57     ` Morten Rasmussen
  -1 siblings, 0 replies; 72+ messages in thread
From: Morten Rasmussen @ 2014-10-02 16:57 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	Dietmar Eggemann, pjt, bsegall

On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
> Below are two examples to illustrate the problem that this patch solves:
> 
> 1 - capacity_factor makes the assumption that max capacity of a CPU is
> SCHED_CAPACITY_SCALE and the load of a thread is always is
> SCHED_LOAD_SCALE. It compares the output of these figures with the sum
> of nr_running to decide if a group is overloaded or not.
> 
> But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
> (640 as an example), a group of 3 CPUS will have a max capacity_factor
> of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
> seen as overloaded if we have only one task per CPU.

I did some testing on TC2 which has the setup you describe above on the
A7 cluster when the clock-frequency property is set in DT. The two A15s
have max capacities above 1024. When using sysbench with five threads I
still get three tasks on the two A15s and two tasks on the three A7s,
leaving one cpu idle (on average).

Using cpu utilization (usage) does correctly identify the A15 cluster as
overloaded and it even gets as far as selecting the A15 running two
tasks as the source cpu in load_balance(). However, load_balance() bails
out without pulling the task due to calculate_imbalance() returning a
zero imbalance. calculate_imbalance() bails out just before the hunk you
changed due to comparison of the sched_group avg_loads. sgs->avg_load is
basically the sum of load in the group divided by its capacity. Since
load isn't scaled the avg_load of the overloaded A15 group is slightly
_lower_ than the partially idle A7 group. Hence calculate_imbalance()
bails out, which isn't what we want.
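
A rough user-space sketch of the effect described above, assuming each
task contributes a full NICE_0 load of 1024 and reusing the example
capacities from the commit message (1512 per big CPU, 640 per little
CPU) rather than the real TC2 values; it only mimics the avg_load
computation, it is not the load_balance() code:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE 1024UL

/* mimics sgs->avg_load = (group_load * SCHED_CAPACITY_SCALE) / capacity */
static unsigned long avg_load(unsigned long group_load,
			      unsigned long group_capacity)
{
	return group_load * SCHED_CAPACITY_SCALE / group_capacity;
}

int main(void)
{
	/* 3 tasks on the 2 big CPUs vs 2 tasks on the 3 little CPUs */
	unsigned long big = avg_load(3 * 1024, 2 * 1512);
	unsigned long little = avg_load(2 * 1024, 3 * 640);

	/* prints big=1040 little=1092: the overloaded big group ends up
	 * with the lower avg_load, so the imbalance check bails out */
	printf("big=%lu little=%lu\n", big, little);
	return 0;
}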

I think we need to have a closer look at the imbalance calculation code
and any other users of sgs->avg_load to get rid of all code making
assumptions about cpu capacity.

Morten

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-10-02 16:57     ` Morten Rasmussen
@ 2014-10-03  7:24       ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-10-03  7:24 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	Dietmar Eggemann, pjt, bsegall

On 2 October 2014 18:57, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
>> Below are two examples to illustrate the problem that this patch solves:
>>
>> 1 - capacity_factor makes the assumption that max capacity of a CPU is
>> SCHED_CAPACITY_SCALE and the load of a thread is always is
>> SCHED_LOAD_SCALE. It compares the output of these figures with the sum
>> of nr_running to decide if a group is overloaded or not.
>>
>> But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
>> (640 as an example), a group of 3 CPUS will have a max capacity_factor
>> of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
>> seen as overloaded if we have only one task per CPU.
>
> I did some testing on TC2 which has the setup you describe above on the
> A7 cluster when the clock-frequency property is set in DT. The two A15s
> have max capacities above 1024. When using sysbench with five threads I
> still get three tasks on the two A15s and two tasks on the three A7
> leaving one cpu idle (on average).
>
> Using cpu utilization (usage) does correctly identify the A15 cluster as
> overloaded and it even gets as far as selecting the A15 running two
> tasks as the source cpu in load_balance(). However, load_balance() bails
> out without pulling the task due to calculate_imbalance() returning a
> zero imbalance. calculate_imbalance() bails out just before the hunk you
> changed due to comparison of the sched_group avg_loads. sgs->avg_load is
> basically the sum of load in the group divided by its capacity. Since
> load isn't scaled the avg_load of the overloaded A15 group is slightly
> _lower_ than the partially idle A7 group. Hence calculate_imbalance()
> bails out, which isn't what we want.
>
> I think we need to have a closer look at the imbalance calculation code
> and any other users of sgs->avg_load to get rid of all code making
> assumptions about cpu capacity.

We already had this discussion during the review of a previous version
https://lkml.org/lkml/2014/6/3/422
and my answer has not changed: this patch is necessary to solve the 1
task per CPU issue of HMP systems but is not enough on its own. I have
a patch for solving the imbalance calculation issue and I plan to send
it once this patch is in good enough shape to be accepted by Peter.

I don't want to mix this patch and the next one because they are
addressing different problems: this one is about how to evaluate the
capacity of a system and detect when a group is overloaded, and the
next one will handle the case where the balance of the system can't
rely on the average load figures of the groups because there is a
significant capacity difference between groups. Note that I haven't
specifically mentioned HMP for the last patch because SMP systems can
also take advantage of it.

Regards,
Vincent

>
> Morten

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-10-03  7:24       ` Vincent Guittot
@ 2014-10-03  9:35         ` Morten Rasmussen
  -1 siblings, 0 replies; 72+ messages in thread
From: Morten Rasmussen @ 2014-10-03  9:35 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	Dietmar Eggemann, pjt, bsegall

On Fri, Oct 03, 2014 at 08:24:23AM +0100, Vincent Guittot wrote:
> On 2 October 2014 18:57, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
> >> Below are two examples to illustrate the problem that this patch solves:
> >>
> >> 1 - capacity_factor makes the assumption that max capacity of a CPU is
> >> SCHED_CAPACITY_SCALE and the load of a thread is always is
> >> SCHED_LOAD_SCALE. It compares the output of these figures with the sum
> >> of nr_running to decide if a group is overloaded or not.
> >>
> >> But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
> >> (640 as an example), a group of 3 CPUS will have a max capacity_factor
> >> of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
> >> seen as overloaded if we have only one task per CPU.
> >
> > I did some testing on TC2 which has the setup you describe above on the
> > A7 cluster when the clock-frequency property is set in DT. The two A15s
> > have max capacities above 1024. When using sysbench with five threads I
> > still get three tasks on the two A15s and two tasks on the three A7
> > leaving one cpu idle (on average).
> >
> > Using cpu utilization (usage) does correctly identify the A15 cluster as
> > overloaded and it even gets as far as selecting the A15 running two
> > tasks as the source cpu in load_balance(). However, load_balance() bails
> > out without pulling the task due to calculate_imbalance() returning a
> > zero imbalance. calculate_imbalance() bails out just before the hunk you
> > changed due to comparison of the sched_group avg_loads. sgs->avg_load is
> > basically the sum of load in the group divided by its capacity. Since
> > load isn't scaled the avg_load of the overloaded A15 group is slightly
> > _lower_ than the partially idle A7 group. Hence calculate_imbalance()
> > bails out, which isn't what we want.
> >
> > I think we need to have a closer look at the imbalance calculation code
> > and any other users of sgs->avg_load to get rid of all code making
> > assumptions about cpu capacity.
> 
> We already had this discussion during the review of a previous version
> https://lkml.org/lkml/2014/6/3/422
> and my answer has not changed; This patch is necessary to solve the 1
> task per CPU issue of the HMP system but is not enough. I have a patch
> for solving the imbalance calculation issue and i have planned to send
> it once this patch will be in a good shape for being accepted by
> Peter.
> 
> I don't want to mix this patch and the next one because there are
> addressing different problem: this one is how evaluate the capacity of
> a system and detect when a group is overloaded and the next one will
> handle the case when the balance of the system can't rely on the
> average load figures of the group because we have a significant
> capacity difference between groups. Not that i haven't specifically
> mentioned HMP for the last patch because SMP system can also take
> advantage of it

You do mention 'big' and 'little' cores in your commit message and quote
example numbers which are identical to the cpu capacities for TC2. That
led me to believe that this patch would address the issues we see for
HMP systems. But that is clearly wrong. I would suggest that you drop
mentioning big and little cores and stick to only describing cpu
capacity reductions due to rt tasks and irq to avoid any confusion about
the purpose of the patch. Maybe explicitly say that non-default cpu
capacities (capacity_orig) aren't addressed yet.

I think the two problems you describe are very much related. This patch
set is half the solution of the HMP balancing problem. We just need the
last bit for avg_load and then we can add scale-invariance on top.

IMHO, it would be good to have all the bits and pieces for cpu capacity
scaling and scaling of per-entity load-tracking on the table so we fit
things together. We only have patches for parts of the solution posted
so far.

Morten

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-10-03  9:35         ` Morten Rasmussen
@ 2014-10-03 12:50           ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-10-03 12:50 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	Dietmar Eggemann, pjt, bsegall

On 3 October 2014 11:35, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Fri, Oct 03, 2014 at 08:24:23AM +0100, Vincent Guittot wrote:
>> On 2 October 2014 18:57, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
>> >> Below are two examples to illustrate the problem that this patch solves:
>> >>
>> >> 1 - capacity_factor makes the assumption that max capacity of a CPU is
>> >> SCHED_CAPACITY_SCALE and the load of a thread is always is
>> >> SCHED_LOAD_SCALE. It compares the output of these figures with the sum
>> >> of nr_running to decide if a group is overloaded or not.
>> >>
>> >> But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
>> >> (640 as an example), a group of 3 CPUS will have a max capacity_factor
>> >> of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
>> >> seen as overloaded if we have only one task per CPU.
>> >
>> > I did some testing on TC2 which has the setup you describe above on the
>> > A7 cluster when the clock-frequency property is set in DT. The two A15s
>> > have max capacities above 1024. When using sysbench with five threads I
>> > still get three tasks on the two A15s and two tasks on the three A7
>> > leaving one cpu idle (on average).
>> >
>> > Using cpu utilization (usage) does correctly identify the A15 cluster as
>> > overloaded and it even gets as far as selecting the A15 running two
>> > tasks as the source cpu in load_balance(). However, load_balance() bails
>> > out without pulling the task due to calculate_imbalance() returning a
>> > zero imbalance. calculate_imbalance() bails out just before the hunk you
>> > changed due to comparison of the sched_group avg_loads. sgs->avg_load is
>> > basically the sum of load in the group divided by its capacity. Since
>> > load isn't scaled the avg_load of the overloaded A15 group is slightly
>> > _lower_ than the partially idle A7 group. Hence calculate_imbalance()
>> > bails out, which isn't what we want.
>> >
>> > I think we need to have a closer look at the imbalance calculation code
>> > and any other users of sgs->avg_load to get rid of all code making
>> > assumptions about cpu capacity.
>>
>> We already had this discussion during the review of a previous version
>> https://lkml.org/lkml/2014/6/3/422
>> and my answer has not changed; This patch is necessary to solve the 1
>> task per CPU issue of the HMP system but is not enough. I have a patch
>> for solving the imbalance calculation issue and i have planned to send
>> it once this patch will be in a good shape for being accepted by
>> Peter.
>>
>> I don't want to mix this patch and the next one because there are
>> addressing different problem: this one is how evaluate the capacity of
>> a system and detect when a group is overloaded and the next one will
>> handle the case when the balance of the system can't rely on the
>> average load figures of the group because we have a significant
>> capacity difference between groups. Not that i haven't specifically
>> mentioned HMP for the last patch because SMP system can also take
>> advantage of it
>
> You do mention 'big' and 'little' cores in your commit message and quote
> example numbers with are identical to the cpu capacities for TC2. That

By last patch, I mean the patch about imbalance that I haven't sent
yet, but it's not related to this patch.

> lead me to believe that this patch would address the issues we see for
> HMP systems. But that is clearly wrong. I would suggest that you drop

This patch addresses one issue: correctly detecting how much capacity
we have and when a group is overloaded; HMP systems clearly fall into
this category, but they are not the only ones.
This is the only purpose of this patch, not the "1 task per CPU issue"
that you mentioned.

The second patch is about correctly balancing the tasks on systems
where the capacity of one group is significantly less than that of
another. It has nothing to do with capacity computation and overload
detection, but it will use these corrected/new metrics to make a
better choice.

Then, there is the "1 task per CPU issue on HMP" that you mentioned;
the latter will be solved thanks to these 2 patchsets, but it is not
the only/main target of these patchsets, so I don't want to reduce
them to: "solve an HMP issue".

> mentioning big and little cores and stick to only describing cpu
> capacity reductions due to rt tasks and irq to avoid any confusion about
> the purpose of the patch. Maybe explicitly say that non-default cpu
> capacities (capacity_orig) isn't addressed yet.

But they are addressed by this patchset; you just mixed various goals
together (see above).

>
> I think the two problems you describe are very much related. This patch
> set is half the solution of the HMP balancing problem. We just need the
> last bit for avg_load and then we can add scale-invariance on top.

I don't see the link between scale invariance and a bad load-balancing
choice. The fact that the current load balancer puts more tasks than
CPUs in a group on an HMP system should not influence or break the
scale-invariant load tracking, should it?

Now, I could send the other patchset, but I'm afraid that this will
generate more confusion than help in the review process.

Regards,
Vincent

>
> IMHO, it would be good to have all the bits and pieces for cpu capacity
> scaling and scaling of per-entity load-tracking on the table so we fit
> things together. We only have patches for parts of the solution posted
> so far.
>
> Morten

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 3/6] sched: add utilization_avg_contrib
  2014-09-23 16:08   ` Vincent Guittot
@ 2014-10-03 14:15     ` Peter Zijlstra
  -1 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2014-10-03 14:15 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, linux, linux-arm-kernel, riel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, pjt, bsegall

On Tue, Sep 23, 2014 at 06:08:02PM +0200, Vincent Guittot wrote:
>  struct sched_avg {
> +	u64 last_runnable_update;
> +	s64 decay_count;
> +	/*
> +	 * utilization_avg_contrib describes the amount of time that a
> +	 * sched_entity is running on a CPU. It is based on running_avg_sum
> +	 * and is scaled in the range [0..SCHED_LOAD_SCALE].
> +	 * load_avg_contrib described the the amount of time that a
> +	 * sched_entity is runnable on a rq. It is based on both
> +	 * runnable_avg_sum and the weight of the task.
> +	 */
> +	unsigned long load_avg_contrib, utilization_avg_contrib;
>  	/*
>  	 * These sums represent an infinite geometric series and so are bound
>  	 * above by 1024/(1-y).  Thus we only need a u32 to store them for all
>  	 * choices of y < 1-2^(-32)*1024.
> +	 * runnable_avg_sum represents the amount of time a sched_entity is on
> +	 * the runqueue whereas running_avg_sum reflects the time the
> +	 * sched_entity is effectively running on the runqueue.

I would say: 'running on the cpu'. I would further clarify that runnable
also includes running; the above could be read such that runnable is
only the time spent waiting on the queue, excluding the time spent on
the cpu.

>  	 */
> +	u32 runnable_avg_sum, avg_period, running_avg_sum;
>  };
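
Purely as an illustrative sketch of the distinction being discussed
(not the kernel implementation, and ignoring the per-1024us geometric
decay that the real code applies), the three sums relate roughly like
this:

#include <stdint.h>

struct toy_sched_avg {
	uint32_t runnable_avg_sum;	/* time on the rq: running + waiting */
	uint32_t running_avg_sum;	/* time actually executing on the cpu */
	uint32_t avg_period;		/* total time accounted, asleep or not */
};

static void toy_account(struct toy_sched_avg *sa, uint32_t delta_us,
			int runnable, int running)
{
	if (runnable)			/* running implies runnable */
		sa->runnable_avg_sum += delta_us;
	if (running)
		sa->running_avg_sum += delta_us;
	sa->avg_period += delta_us;	/* always advances */
}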

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 3/6] sched: add utilization_avg_contrib
  2014-09-23 16:08   ` Vincent Guittot
@ 2014-10-03 14:36     ` Peter Zijlstra
  -1 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2014-10-03 14:36 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, linux, linux-arm-kernel, riel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, pjt, bsegall

On Tue, Sep 23, 2014 at 06:08:02PM +0200, Vincent Guittot wrote:
> +++ b/kernel/sched/sched.h
> @@ -339,8 +339,14 @@ struct cfs_rq {
>  	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
>  	 * This allows for the description of both thread and group usage (in
>  	 * the FAIR_GROUP_SCHED case).
> +	 * runnable_load_avg is the sum of the load_avg_contrib of the
> +	 * sched_entities on the rq.

> +	 * blocked_load_avg is similar to runnable_load_avg except that its
> +	 * the blocked sched_entities on the rq.

Strictly speaking blocked entities aren't on a rq as such, but yeah, no
idea how to better put it. Just being pedantic, which isn't helpful I
guess :-)

> +	 * utilization_load_avg is the sum of the average running time of the
> +	 * sched_entities on the rq.
>  	 */

So I think there was some talk about a blocked_utilization thingy, which
would track the avg running time of the tasks currently asleep, right?

> +	unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
>  	atomic64_t decay_counter;
>  	u64 last_decay;
>  	atomic_long_t removed_load;

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 3/6] sched: add utilization_avg_contrib
  2014-10-03 14:15     ` Peter Zijlstra
@ 2014-10-03 14:44       ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-10-03 14:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann, Paul Turner, Benjamin Segall

On 3 October 2014 16:15, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Sep 23, 2014 at 06:08:02PM +0200, Vincent Guittot wrote:
>>  struct sched_avg {
>> +     u64 last_runnable_update;
>> +     s64 decay_count;
>> +     /*
>> +      * utilization_avg_contrib describes the amount of time that a
>> +      * sched_entity is running on a CPU. It is based on running_avg_sum
>> +      * and is scaled in the range [0..SCHED_LOAD_SCALE].
>> +      * load_avg_contrib described the the amount of time that a
>> +      * sched_entity is runnable on a rq. It is based on both
>> +      * runnable_avg_sum and the weight of the task.
>> +      */
>> +     unsigned long load_avg_contrib, utilization_avg_contrib;
>>       /*
>>        * These sums represent an infinite geometric series and so are bound
>>        * above by 1024/(1-y).  Thus we only need a u32 to store them for all
>>        * choices of y < 1-2^(-32)*1024.
>> +      * runnable_avg_sum represents the amount of time a sched_entity is on
>> +      * the runqueue whereas running_avg_sum reflects the time the
>> +      * sched_entity is effectively running on the runqueue.
>
> I would say: 'running on the cpu'. I would further clarify that runnable
> also includes running, the above could be read such that runnable is
> only the time spend waiting on the queue, excluding the time spend on
> the cpu.

ok

>
>>        */
>> +     u32 runnable_avg_sum, avg_period, running_avg_sum;
>>  };

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 3/6] sched: add utilization_avg_contrib
  2014-10-03 14:36     ` Peter Zijlstra
@ 2014-10-03 14:51       ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-10-03 14:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann, Paul Turner, Benjamin Segall

On 3 October 2014 16:36, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> +      * utilization_load_avg is the sum of the average running time of the
>> +      * sched_entities on the rq.
>>        */
>
> So I think there was some talk about a blocked_utilization thingy, which
> would track the avg running time of the tasks currently asleep, right?
>

Yes. Do you mean that we should anticipate and rename
utilization_load_avg into utilization_runnable_avg to make space for a
utilization_blocked_avg that could be added in the future?

>> +     unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
>>       atomic64_t decay_counter;
>>       u64 last_decay;
>>       atomic_long_t removed_load;

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 3/6] sched: add utilization_avg_contrib
  2014-10-03 14:51       ` Vincent Guittot
@ 2014-10-03 15:14         ` Peter Zijlstra
  -1 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2014-10-03 15:14 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann, Paul Turner, Benjamin Segall

On Fri, Oct 03, 2014 at 04:51:01PM +0200, Vincent Guittot wrote:
> On 3 October 2014 16:36, Peter Zijlstra <peterz@infradead.org> wrote:
> >
> >> +      * utilization_load_avg is the sum of the average running time of the
> >> +      * sched_entities on the rq.
> >>        */
> >
> > So I think there was some talk about a blocked_utilization thingy, which
> > would track the avg running time of the tasks currently asleep, right?
> >
> 
> yes. Do you mean that we should anticipate and rename
> utilization_load_avg into utilization_runnable_avg to make space for a
> utilization_blocked_avg that could be added in future ?

nah, just trying to put things straight in my brain, including what is
'missing'.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-09-23 16:08   ` Vincent Guittot
@ 2014-10-03 15:38     ` Peter Zijlstra
  -1 siblings, 0 replies; 72+ messages in thread
From: Peter Zijlstra @ 2014-10-03 15:38 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, linux, linux-arm-kernel, riel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, pjt, bsegall

On Tue, Sep 23, 2014 at 06:08:04PM +0200, Vincent Guittot wrote:

> This implementation of utilization_avg_contrib doesn't solve the scaling
> in-variance problem, so i have to scale the utilization with original
> capacity of the CPU in order to get the CPU usage and compare it with the
> capacity. Once the scaling invariance will have been added in
> utilization_avg_contrib, we will remove the scale of utilization_avg_contrib
> by cpu_capacity_orig in get_cpu_usage. But the scaling invariance will come
> in another patchset.

I would have expected this in the previous patch that introduced that
lot. Including a few words on how/why the cpu_capacity is a 'good'
approximation etc..

> Finally, the sched_group->sched_group_capacity->capacity_orig has been removed
> because it's more used during load balance.

That sentence is a contradiction, I expect there's a negative gone
missing someplace.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 3/6] sched: add utilization_avg_contrib
  2014-10-03 15:14         ` Peter Zijlstra
@ 2014-10-03 16:05           ` Morten Rasmussen
  -1 siblings, 0 replies; 72+ messages in thread
From: Morten Rasmussen @ 2014-10-03 16:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Mike Galbraith,
	Nicolas Pitre, linaro-kernel, Daniel Lezcano, Dietmar Eggemann,
	Paul Turner, Benjamin Segall

On Fri, Oct 03, 2014 at 04:14:51PM +0100, Peter Zijlstra wrote:
> On Fri, Oct 03, 2014 at 04:51:01PM +0200, Vincent Guittot wrote:
> > On 3 October 2014 16:36, Peter Zijlstra <peterz@infradead.org> wrote:
> > >
> > >> +      * utilization_load_avg is the sum of the average running time of the
> > >> +      * sched_entities on the rq.
> > >>        */
> > >
> > > So I think there was some talk about a blocked_utilization thingy, which
> > > would track the avg running time of the tasks currently asleep, right?
> > >
> > 
> > yes. Do you mean that we should anticipate and rename
> > utilization_load_avg into utilization_runnable_avg to make space for a
> > utilization_blocked_avg that could be added in future ?
> 
> nah, just trying to put things straight in my brain, including what is
> 'missing'.

As Ben pointed out in the scale-invariance thread, we need blocked
utilization. I fully agree with that. It doesn't make any sense not to
include it. In fact I do have the patch already.

If you want to rename utilization_load_avg you should name it
utilization_running_avg, not utilization_runnable_avg :) Or even better,
something shorter.
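
Purely as a sketch of the idea (not the actual patch; the structure and
function names below are invented for illustration), blocked
utilization would mirror what blocked_load_avg already does for load:

/* Toy sketch only: keep the decayed utilization of sleeping entities so
 * the rq's total utilization doesn't drop to zero the moment they block. */
struct toy_cfs_rq_util {
	unsigned long utilization_load_avg;	/* runnable + running entities */
	unsigned long utilization_blocked_avg;	/* entities currently asleep */
};

static void toy_entity_blocks(struct toy_cfs_rq_util *cfs,
			      unsigned long util_contrib)
{
	cfs->utilization_load_avg -= util_contrib;
	cfs->utilization_blocked_avg += util_contrib;	/* decays over time */
}

static void toy_entity_wakes(struct toy_cfs_rq_util *cfs,
			     unsigned long decayed_contrib)
{
	cfs->utilization_blocked_avg -= decayed_contrib;
	cfs->utilization_load_avg += decayed_contrib;
}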

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-10-03 15:38     ` Peter Zijlstra
@ 2014-10-06  8:55       ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-10-06  8:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann, Paul Turner, Benjamin Segall

On 3 October 2014 17:38, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Sep 23, 2014 at 06:08:04PM +0200, Vincent Guittot wrote:
>
>> This implementation of utilization_avg_contrib doesn't solve the scaling
>> in-variance problem, so i have to scale the utilization with original
>> capacity of the CPU in order to get the CPU usage and compare it with the
>> capacity. Once the scaling invariance will have been added in
>> utilization_avg_contrib, we will remove the scale of utilization_avg_contrib
>> by cpu_capacity_orig in get_cpu_usage. But the scaling invariance will come
>> in another patchset.
>
> I would have expected this in the previous patch that introduced that
> lot. Including a few words on how/why the cpu_capacity is a 'good'
> approximation etc..

That's fair, I can move the explanation

>
>> Finally, the sched_group->sched_group_capacity->capacity_orig has been removed
>> because it's more used during load balance.
>
> That sentence is a contradiction, I expect there's a negative gone
> missing someplace.

Yes, the "no" of "because it is no more used during load balance" is missing

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 4/6] sched: get CPU's usage statistic
  2014-09-26 12:17       ` Vincent Guittot
@ 2014-11-21  5:36         ` Wanpeng Li
  -1 siblings, 0 replies; 72+ messages in thread
From: Wanpeng Li @ 2014-11-21  5:36 UTC (permalink / raw)
  To: Vincent Guittot, Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, pjt, bsegall

Hi Vincent,
On 9/26/14, 8:17 PM, Vincent Guittot wrote:
> On 25 September 2014 21:05, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 23/09/14 17:08, Vincent Guittot wrote:
>>> Monitor the usage level of each group of each sched_domain level. The usage is
>>> the amount of cpu_capacity that is currently used on a CPU or group of CPUs.
>>> We use the utilization_load_avg to evaluate the usage level of each group.
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>>   kernel/sched/fair.c | 13 +++++++++++++
>>>   1 file changed, 13 insertions(+)
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 2cf153d..4097e3f 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -4523,6 +4523,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
>>>        return target;
>>>   }
>>>
>>> +static int get_cpu_usage(int cpu)
>>> +{
>>> +     unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
>>> +     unsigned long capacity = capacity_orig_of(cpu);
>>> +
>>> +     if (usage >= SCHED_LOAD_SCALE)
>>> +             return capacity + 1;
>> Why you are returning rq->cpu_capacity_orig + 1 (1025) in case
>> utilization_load_avg is greater or equal than 1024 and not usage or
>> (usage * capacity) >> SCHED_LOAD_SHIFT too?
> The usage can't be higher than the full capacity of the CPU because
> it's about the running time on this CPU. Nevertheless, usage can be
> higher than SCHED_LOAD_SCALE because of unfortunate rounding in
> avg_period and running_load_avg or just after migrating tasks until
> the average stabilizes with the new running time.
>
>> In case the weight of a sched group is greater than 1, you might loose
>> the information that the whole sched group is over-utilized too.
> that's exactly for sched_group with more than 1 CPU that we need to
> cap the usage of a CPU to 100%. Otherwise, the group could be seen as
> overloaded (CPU0 usage at 121% + CPU1 usage at 80%) whereas CPU1 has
> 20% of available capacity
>
>> You add up the individual cpu usage values for a group by
>> sgs->group_usage += get_cpu_usage(i) in update_sg_lb_stats and later use
>> sgs->group_usage in group_is_overloaded to compare it against
>> sgs->group_capacity (taking imbalance_pct into consideration).
>>
>>> +
>>> +     return (usage * capacity) >> SCHED_LOAD_SHIFT;
>> Nit-pick: Since you're multiplying by a capacity value
>> (rq->cpu_capacity_orig) you should shift by SCHED_CAPACITY_SHIFT.
> we want to compare the output of the function with some capacity
> figures so i think that >> SCHED_LOAD_SHIFT is the right operation.

Could you explain in more detail why '>> SCHED_LOAD_SHIFT' is used instead
of '>> SCHED_CAPACITY_SHIFT'?

Regards,
Wanpeng Li

>
>> Just to make sure: You do this scaling of usage by cpu_capacity_orig
>> here only to cater for the fact that cpu_capacity_orig might be uarch
>> scaled (by arch_scale_cpu_capacity, !SMT) in update_cpu_capacity while
> I do this for any system with CPUs that have an original capacity that
> is different from SCHED_CAPACITY_SCALE so it's for both uArch and SMT.
>
>> utilization_load_avg is currently not.
>> We don't even uArch scale on ARM TC2 big.LITTLE platform in mainline
>> today due to the missing clock-frequency property in the device tree.
> sorry i don't catch your point
>
>> I think it's hard for people to grasp that your patch-set takes uArch
>> scaling of capacity into consideration but not frequency scaling of
>> capacity (via arch_scale_freq_capacity, not used at the moment).
>>
>>> +}
>>> +
>>>   /*
>>>    * select_task_rq_fair: Select target runqueue for the waking task in domains
>>>    * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>>> @@ -5663,6 +5674,7 @@ struct sg_lb_stats {
>>>        unsigned long sum_weighted_load; /* Weighted load of group's tasks */
>>>        unsigned long load_per_task;
>>>        unsigned long group_capacity;
>>> +     unsigned long group_usage; /* Total usage of the group */
>>>        unsigned int sum_nr_running; /* Nr tasks running in the group */
>>>        unsigned int group_capacity_factor;
>>>        unsigned int idle_cpus;
>>> @@ -6037,6 +6049,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>>                        load = source_load(i, load_idx);
>>>
>>>                sgs->group_load += load;
>>> +             sgs->group_usage += get_cpu_usage(i);
>>>                sgs->sum_nr_running += rq->cfs.h_nr_running;
>>>
>>>                if (rq->nr_running > 1)
>>>
>>


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 4/6] sched: get CPU's usage statistic
  2014-11-21  5:36         ` Wanpeng Li
@ 2014-11-21 12:17           ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-11-21 12:17 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Dietmar Eggemann, peterz, mingo, linux-kernel, preeti, linux,
	linux-arm-kernel, riel, Morten Rasmussen, efault, nicolas.pitre,
	linaro-kernel, daniel.lezcano, pjt, bsegall

On 21 November 2014 06:36, Wanpeng Li <kernellwp@gmail.com> wrote:
> Hi Vincent,
>
> On 9/26/14, 8:17 PM, Vincent Guittot wrote:

 [snip]

>>
>>> You add up the individual cpu usage values for a group by
>>> sgs->group_usage += get_cpu_usage(i) in update_sg_lb_stats and later use
>>> sgs->group_usage in group_is_overloaded to compare it against
>>> sgs->group_capacity (taking imbalance_pct into consideration).
>>>
>>>> +
>>>> +     return (usage * capacity) >> SCHED_LOAD_SHIFT;
>>>
>>> Nit-pick: Since you're multiplying by a capacity value
>>> (rq->cpu_capacity_orig) you should shift by SCHED_CAPACITY_SHIFT.
>>
>> we want to compare the output of the function with some capacity
>> figures so i think that >> SCHED_LOAD_SHIFT is the right operation.
>
>
> Could you explain more why '>> SCHED_LOAD_SHIFT' instead of '>>
> SCHED_CAPACITY_SHIFT'?

The return value of get_cpu_usage() is going to be compared with capacity
figures, so we need to return a value in CAPACITY units.

The usage is in LOAD units and the capacity is in CAPACITY units, so we
have to divide by the LOAD unit to get a result in CAPACITY units:

LOAD unit * CAPACITY unit / LOAD unit -> CAPACITY unit
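
As a concrete illustration (the numbers are made up for the example and
assume SCHED_LOAD_SCALE == SCHED_CAPACITY_SCALE == 1024, i.e. a shift of 10):

	usage         = 512   (LOAD unit, ~50% of SCHED_LOAD_SCALE)
	capacity_orig = 640   (CAPACITY unit)

	get_cpu_usage() = (512 * 640) >> SCHED_LOAD_SHIFT = 320   (CAPACITY unit)

so the CPU reports a usage of 320, which is directly comparable with its
capacity of 640.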

Regards,
Vincent

>
> Regards,
> Wanpeng Li
>
>>
>>> Just to make sure: You do this scaling of usage by cpu_capacity_orig
>>> here only to cater for the fact that cpu_capacity_orig might be uarch
>>> scaled (by arch_scale_cpu_capacity, !SMT) in update_cpu_capacity while
>>
>> I do this for any system with CPUs that have an original capacity that
>> is different from SCHED_CAPACITY_SCALE so it's for both uArch and SMT.
>>
>>> utilization_load_avg is currently not.
>>> We don't even uArch scale on ARM TC2 big.LITTLE platform in mainline
>>> today due to the missing clock-frequency property in the device tree.
>>
>> sorry i don't catch your point
>>
>>> I think it's hard for people to grasp that your patch-set takes uArch
>>> scaling of capacity into consideration but not frequency scaling of
>>> capacity (via arch_scale_freq_capacity, not used at the moment).
>>>
>>>> +}
>>>> +
>>>>   /*
>>>>    * select_task_rq_fair: Select target runqueue for the waking task in
>>>> domains
>>>>    * that have the 'sd_flag' flag set. In practice, this is
>>>> SD_BALANCE_WAKE,
>>>> @@ -5663,6 +5674,7 @@ struct sg_lb_stats {
>>>>        unsigned long sum_weighted_load; /* Weighted load of group's
>>>> tasks */
>>>>        unsigned long load_per_task;
>>>>        unsigned long group_capacity;
>>>> +     unsigned long group_usage; /* Total usage of the group */
>>>>        unsigned int sum_nr_running; /* Nr tasks running in the group */
>>>>        unsigned int group_capacity_factor;
>>>>        unsigned int idle_cpus;
>>>> @@ -6037,6 +6049,7 @@ static inline void update_sg_lb_stats(struct
>>>> lb_env *env,
>>>>                        load = source_load(i, load_idx);
>>>>
>>>>                sgs->group_load += load;
>>>> +             sgs->group_usage += get_cpu_usage(i);
>>>>                sgs->sum_nr_running += rq->cfs.h_nr_running;
>>>>
>>>>                if (rq->nr_running > 1)
>>>>
>>>
>
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-10-03 12:50           ` Vincent Guittot
@ 2014-11-23  0:22             ` Wanpeng Li
  -1 siblings, 0 replies; 72+ messages in thread
From: Wanpeng Li @ 2014-11-23  0:22 UTC (permalink / raw)
  To: Vincent Guittot, Morten Rasmussen
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	Dietmar Eggemann, pjt, bsegall

Hi Vincent,
On 10/3/14, 8:50 PM, Vincent Guittot wrote:
> On 3 October 2014 11:35, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> On Fri, Oct 03, 2014 at 08:24:23AM +0100, Vincent Guittot wrote:
>>> On 2 October 2014 18:57, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>>>> On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
>>>>> Below are two examples to illustrate the problem that this patch solves:
>>>>>
>>>>> 1 - capacity_factor makes the assumption that max capacity of a CPU is
>>>>> SCHED_CAPACITY_SCALE and the load of a thread is always is
>>>>> SCHED_LOAD_SCALE. It compares the output of these figures with the sum
>>>>> of nr_running to decide if a group is overloaded or not.
>>>>>
>>>>> But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
>>>>> (640 as an example), a group of 3 CPUS will have a max capacity_factor
>>>>> of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
>>>>> seen as overloaded if we have only one task per CPU.
>>>> I did some testing on TC2 which has the setup you describe above on the
>>>> A7 cluster when the clock-frequency property is set in DT. The two A15s
>>>> have max capacities above 1024. When using sysbench with five threads I
>>>> still get three tasks on the two A15s and two tasks on the three A7
>>>> leaving one cpu idle (on average).
>>>>
>>>> Using cpu utilization (usage) does correctly identify the A15 cluster as
>>>> overloaded and it even gets as far as selecting the A15 running two
>>>> tasks as the source cpu in load_balance(). However, load_balance() bails
>>>> out without pulling the task due to calculate_imbalance() returning a
>>>> zero imbalance. calculate_imbalance() bails out just before the hunk you
>>>> changed due to comparison of the sched_group avg_loads. sgs->avg_load is
>>>> basically the sum of load in the group divided by its capacity. Since
>>>> load isn't scaled the avg_load of the overloaded A15 group is slightly
>>>> _lower_ than the partially idle A7 group. Hence calculate_imbalance()
>>>> bails out, which isn't what we want.
>>>>
>>>> I think we need to have a closer look at the imbalance calculation code
>>>> and any other users of sgs->avg_load to get rid of all code making
>>>> assumptions about cpu capacity.
>>> We already had this discussion during the review of a previous version
>>> https://lkml.org/lkml/2014/6/3/422
>>> and my answer has not changed; This patch is necessary to solve the 1
>>> task per CPU issue of the HMP system but is not enough. I have a patch
>>> for solving the imbalance calculation issue and i have planned to send
>>> it once this patch will be in a good shape for being accepted by
>>> Peter.
>>>
>>> I don't want to mix this patch and the next one because there are
>>> addressing different problem: this one is how evaluate the capacity of
>>> a system and detect when a group is overloaded and the next one will
>>> handle the case when the balance of the system can't rely on the
>>> average load figures of the group because we have a significant
>>> capacity difference between groups. Not that i haven't specifically
>>> mentioned HMP for the last patch because SMP system can also take
>>> advantage of it
>> You do mention 'big' and 'little' cores in your commit message and quote
>> example numbers with are identical to the cpu capacities for TC2. That
> By last patch, i mean the patch about imbalance that i haven't sent
> yet, but it's not related with this patch
>
>> lead me to believe that this patch would address the issues we see for
>> HMP systems. But that is clearly wrong. I would suggest that you drop
> This patch addresses one issue: correctly detect how much capacity we
> have and correctly detect when the group is overloaded; HMP system

What's the meaning of "HMP system"?

Regards,
Wanpeng Li

> clearly falls in this category but not only.
> This is the only purpose of this patch and not the "1 task per CPU
> issue" that you mentioned.
>
> The second patch is about correctly balance the tasks on system where
> the capacity of a group is significantly less than another group. It
> has nothing to do in capacity computation and overload detection but
> it will use these corrected/new metrics to make a better choice.
>
> Then, there is the "1 task per CPU issue on HMP" that you mentioned
> and this latter will be solved thanks to these 2 patchsets but this is
> not the only/main target of these patchsets so i don't want to reduce
> them into: "solve an HMP issue"
>
>> mentioning big and little cores and stick to only describing cpu
>> capacity reductions due to rt tasks and irq to avoid any confusion about
>> the purpose of the patch. Maybe explicitly say that non-default cpu
>> capacities (capacity_orig) isn't addressed yet.
> But they are addressed by this patchset. you just mixed various goal
> together (see above)
>
>> I think the two problems you describe are very much related. This patch
>> set is half the solution of the HMP balancing problem. We just need the
>> last bit for avg_load and then we can add scale-invariance on top.
> i don't see the link between scale invariance and a bad load-balancing
> choice. The fact that the current load balancer puts more tasks than
> CPUs in a group on HMP system should not influence or break the scale
> invariance load tracking. Isn't it ?
>
> Now, i could send the other patchset but i'm afraid that this will
> generate more confusion than help in the process review.
>
> Regards,
> Vincent
>
>> IMHO, it would be good to have all the bits and pieces for cpu capacity
>> scaling and scaling of per-entity load-tracking on the table so we fit
>> things together. We only have patches for parts of the solution posted
>> so far.
>>
>> Morten


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v6 5/6] sched: replace capacity_factor by usage
  2014-11-23  0:22             ` Wanpeng Li
@ 2014-11-24  8:26               ` Vincent Guittot
  -1 siblings, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2014-11-24  8:26 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Morten Rasmussen, peterz, mingo, linux-kernel, preeti, linux,
	linux-arm-kernel, riel, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Dietmar Eggemann, pjt, bsegall

On 23 November 2014 at 01:22, Wanpeng Li <kernellwp@gmail.com> wrote:
> Hi Vincent,
>
> On 10/3/14, 8:50 PM, Vincent Guittot wrote:
>>
>> On 3 October 2014 11:35, Morten Rasmussen <morten.rasmussen@arm.com>
>> wrote:
>>>
>>> On Fri, Oct 03, 2014 at 08:24:23AM +0100, Vincent Guittot wrote:
>>>>
>>>> On 2 October 2014 18:57, Morten Rasmussen <morten.rasmussen@arm.com>
>>>> wrote:
>>>>>
>>>>> On Tue, Sep 23, 2014 at 05:08:04PM +0100, Vincent Guittot wrote:
>>>>>>
>>>>>> Below are two examples to illustrate the problem that this patch
>>>>>> solves:
>>>>>>
>>>>>> 1 - capacity_factor makes the assumption that max capacity of a CPU is
>>>>>> SCHED_CAPACITY_SCALE and the load of a thread is always is
>>>>>> SCHED_LOAD_SCALE. It compares the output of these figures with the sum
>>>>>> of nr_running to decide if a group is overloaded or not.
>>>>>>
>>>>>> But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
>>>>>> (640 as an example), a group of 3 CPUS will have a max capacity_factor
>>>>>> of 2 ( div_round_closest(3x640/1024) = 2) which means that it will be
>>>>>> seen as overloaded if we have only one task per CPU.
>>>>>
>>>>> I did some testing on TC2 which has the setup you describe above on the
>>>>> A7 cluster when the clock-frequency property is set in DT. The two A15s
>>>>> have max capacities above 1024. When using sysbench with five threads I
>>>>> still get three tasks on the two A15s and two tasks on the three A7
>>>>> leaving one cpu idle (on average).
>>>>>
>>>>> Using cpu utilization (usage) does correctly identify the A15 cluster
>>>>> as
>>>>> overloaded and it even gets as far as selecting the A15 running two
>>>>> tasks as the source cpu in load_balance(). However, load_balance()
>>>>> bails
>>>>> out without pulling the task due to calculate_imbalance() returning a
>>>>> zero imbalance. calculate_imbalance() bails out just before the hunk
>>>>> you
>>>>> changed due to comparison of the sched_group avg_loads. sgs->avg_load
>>>>> is
>>>>> basically the sum of load in the group divided by its capacity. Since
>>>>> load isn't scaled the avg_load of the overloaded A15 group is slightly
>>>>> _lower_ than the partially idle A7 group. Hence calculate_imbalance()
>>>>> bails out, which isn't what we want.
>>>>>
>>>>> I think we need to have a closer look at the imbalance calculation code
>>>>> and any other users of sgs->avg_load to get rid of all code making
>>>>> assumptions about cpu capacity.
>>>>
>>>> We already had this discussion during the review of a previous version
>>>> https://lkml.org/lkml/2014/6/3/422
>>>> and my answer has not changed; This patch is necessary to solve the 1
>>>> task per CPU issue of the HMP system but is not enough. I have a patch
>>>> for solving the imbalance calculation issue and i have planned to send
>>>> it once this patch will be in a good shape for being accepted by
>>>> Peter.
>>>>
>>>> I don't want to mix this patch and the next one because there are
>>>> addressing different problem: this one is how evaluate the capacity of
>>>> a system and detect when a group is overloaded and the next one will
>>>> handle the case when the balance of the system can't rely on the
>>>> average load figures of the group because we have a significant
>>>> capacity difference between groups. Not that i haven't specifically
>>>> mentioned HMP for the last patch because SMP system can also take
>>>> advantage of it
>>>
>>> You do mention 'big' and 'little' cores in your commit message and quote
>>> example numbers with are identical to the cpu capacities for TC2. That
>>
>> By last patch, i mean the patch about imbalance that i haven't sent
>> yet, but it's not related with this patch
>>
>>> lead me to believe that this patch would address the issues we see for
>>> HMP systems. But that is clearly wrong. I would suggest that you drop
>>
>> This patch addresses one issue: correctly detect how much capacity we
>> have and correctly detect when the group is overloaded; HMP system
>
>
> What's the meaning of "HMP system"?

Heterogeneous Multi Processor (HMP) systems are systems that gather
processors with different micro-architectures and/or different max
frequencies. The main consequence is that the processors don't have the
same maximum capacity or the same DMIPS/W efficiency.

Regards,
Vincent
>
> Regards,
> Wanpeng Li
>
>> clearly falls in this category but not only.
>> This is the only purpose of this patch and not the "1 task per CPU
>> issue" that you mentioned.
>>
>> The second patch is about correctly balance the tasks on system where
>> the capacity of a group is significantly less than another group. It
>> has nothing to do in capacity computation and overload detection but
>> it will use these corrected/new metrics to make a better choice.
>>
>> Then, there is the "1 task per CPU issue on HMP" that you mentioned
>> and this latter will be solved thanks to these 2 patchsets but this is
>> not the only/main target of these patchsets so i don't want to reduce
>> them into: "solve an HMP issue"
>>
>>> mentioning big and little cores and stick to only describing cpu
>>> capacity reductions due to rt tasks and irq to avoid any confusion about
>>> the purpose of the patch. Maybe explicitly say that non-default cpu
>>> capacities (capacity_orig) isn't addressed yet.
>>
>> But they are addressed by this patchset. you just mixed various goal
>> together (see above)
>>
>>> I think the two problems you describe are very much related. This patch
>>> set is half the solution of the HMP balancing problem. We just need the
>>> last bit for avg_load and then we can add scale-invariance on top.
>>
>> i don't see the link between scale invariance and a bad load-balancing
>> choice. The fact that the current load balancer puts more tasks than
>> CPUs in a group on HMP system should not influence or break the scale
>> invariance load tracking. Isn't it ?
>>
>> Now, i could send the other patchset but i'm afraid that this will
>> generate more confusion than help in the process review.
>>
>> Regards,
>> Vincent
>>
>>> IMHO, it would be good to have all the bits and pieces for cpu capacity
>>> scaling and scaling of per-entity load-tracking on the table so we fit
>>> things together. We only have patches for parts of the solution posted
>>> so far.
>>>
>>> Morten
>>
>
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

end of thread, other threads:[~2014-11-24  8:27 UTC | newest]

Thread overview: 72+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-09-23 16:07 [PATCH v6 0/6] sched: consolidation of cpu_capacity Vincent Guittot
2014-09-23 16:07 ` Vincent Guittot
2014-09-23 16:08 ` [PATCH v6 1/6] sched: add per rq cpu_capacity_orig Vincent Guittot
2014-09-23 16:08   ` Vincent Guittot
2014-09-23 16:08 ` [PATCH v6 2/6] sched: move cfs task on a CPU with higher capacity Vincent Guittot
2014-09-23 16:08   ` Vincent Guittot
2014-09-23 16:08 ` [PATCH v6 3/6] sched: add utilization_avg_contrib Vincent Guittot
2014-09-23 16:08   ` Vincent Guittot
2014-10-03 14:15   ` Peter Zijlstra
2014-10-03 14:15     ` Peter Zijlstra
2014-10-03 14:44     ` Vincent Guittot
2014-10-03 14:44       ` Vincent Guittot
2014-10-03 14:36   ` Peter Zijlstra
2014-10-03 14:36     ` Peter Zijlstra
2014-10-03 14:51     ` Vincent Guittot
2014-10-03 14:51       ` Vincent Guittot
2014-10-03 15:14       ` Peter Zijlstra
2014-10-03 15:14         ` Peter Zijlstra
2014-10-03 16:05         ` Morten Rasmussen
2014-10-03 16:05           ` Morten Rasmussen
2014-09-23 16:08 ` [PATCH v6 4/6] sched: get CPU's usage statistic Vincent Guittot
2014-09-23 16:08   ` Vincent Guittot
2014-09-25 19:05   ` Dietmar Eggemann
2014-09-25 19:05     ` Dietmar Eggemann
2014-09-26 12:17     ` Vincent Guittot
2014-09-26 12:17       ` Vincent Guittot
2014-09-26 15:58       ` Morten Rasmussen
2014-09-26 15:58         ` Morten Rasmussen
2014-09-26 19:57       ` Dietmar Eggemann
2014-09-26 19:57         ` Dietmar Eggemann
2014-11-21  5:36       ` Wanpeng Li
2014-11-21  5:36         ` Wanpeng Li
2014-11-21 12:17         ` Vincent Guittot
2014-11-21 12:17           ` Vincent Guittot
2014-09-23 16:08 ` [PATCH v6 5/6] sched: replace capacity_factor by usage Vincent Guittot
2014-09-23 16:08   ` Vincent Guittot
2014-09-24 17:48   ` Dietmar Eggemann
2014-09-24 17:48     ` Dietmar Eggemann
2014-09-25  8:35     ` Vincent Guittot
2014-09-25  8:35       ` Vincent Guittot
2014-09-25 19:19       ` Dietmar Eggemann
2014-09-25 19:19         ` Dietmar Eggemann
2014-09-26 12:39         ` Vincent Guittot
2014-09-26 12:39           ` Vincent Guittot
2014-09-26 14:00           ` Dietmar Eggemann
2014-09-26 14:00             ` Dietmar Eggemann
2014-09-25  8:38   ` Vincent Guittot
2014-09-25  8:38     ` Vincent Guittot
2014-09-29 13:39   ` Dietmar Eggemann
2014-09-29 13:39     ` Dietmar Eggemann
2014-10-02 16:57   ` Morten Rasmussen
2014-10-02 16:57     ` Morten Rasmussen
2014-10-03  7:24     ` Vincent Guittot
2014-10-03  7:24       ` Vincent Guittot
2014-10-03  9:35       ` Morten Rasmussen
2014-10-03  9:35         ` Morten Rasmussen
2014-10-03 12:50         ` Vincent Guittot
2014-10-03 12:50           ` Vincent Guittot
2014-11-23  0:22           ` Wanpeng Li
2014-11-23  0:22             ` Wanpeng Li
2014-11-24  8:26             ` Vincent Guittot
2014-11-24  8:26               ` Vincent Guittot
2014-10-03 15:38   ` Peter Zijlstra
2014-10-03 15:38     ` Peter Zijlstra
2014-10-06  8:55     ` Vincent Guittot
2014-10-06  8:55       ` Vincent Guittot
2014-09-23 16:08 ` [PATCH v6 6/6] sched: add SD_PREFER_SIBLING for SMT level Vincent Guittot
2014-09-23 16:08   ` Vincent Guittot
2014-09-24 12:27   ` Preeti U Murthy
2014-09-24 12:27     ` Preeti U Murthy
2014-09-25 12:10     ` Vincent Guittot
2014-09-25 12:10       ` Vincent Guittot
