* [PATCH v2 00/11]  sched: consolidation of cpu_power
@ 2014-05-23 15:52 ` Vincent Guittot
  0 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-23 15:52 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Vincent Guittot

Part of this patchset was previously part of the larger tasks packing patchset
[1]. I have split the latter into (at least) 3 different patchsets to make
things easier:
-configuration of sched_domain topology [2]
-update and consolidation of cpu_power (this patchset)
-tasks packing algorithm

SMT systems are no longer the only systems that can have CPUs with an original
capacity that differs from the default value. We need to extend the use of
cpu_power_orig to all kinds of platforms so that the scheduler has both the
maximum capacity (cpu_power_orig/power_orig) and the current capacity
(cpu_power/power) of CPUs and sched_groups. A new function, arch_scale_cpu_power,
has been created to replace arch_scale_smt_power, which is SMT-specific, in the
computation of the capacity of a CPU.

During load balance, the scheduler evaluates the number of tasks that a group
of CPUs can handle. The current method assumes that tasks have a fixed load of
SCHED_LOAD_SCALE and that CPUs have a default capacity of SCHED_POWER_SCALE.
This assumption generates wrong decisions by creating ghost cores or by
removing real ones when the original capacity of CPUs is different from the
default SCHED_POWER_SCALE.

Now that we have the original capacity of a CPU and its activity/utilization,
we can evaluate the capacity of a group of CPUs more accurately.
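
As a rough illustration (the cpu_power value of 606 for a Cortex-A7 is a
made-up number for this example, and the old sg_capacity() is simplified to its
DIV_ROUND_CLOSEST() step): a pair of A7s with a group power of 1212 is rounded
to a capacity of 1 core, so one real core disappears, whereas the new method of
patch 11 keeps the spare capacity in cpu_power units. A small stand-alone
sketch of the two computations, not kernel code:

	#include <stdio.h>

	int main(void)
	{
		unsigned long power_orig = 2 * 606;	/* hypothetical group power    */
		unsigned long activity   = 900;		/* hypothetical group activity */

		/* old: capacity in cores, rounded -> 1, one A7 is lost */
		printf("old capacity: %lu core(s)\n", (power_orig + 512) / 1024);
		/* new: spare capacity in cpu_power units, may go negative */
		printf("new capacity: %ld\n", (long)power_orig - (long)activity);
		return 0;
	}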

This patchset mainly replaces the old capacity method with a new one and keeps
the policy almost unchanged, although we can certainly take advantage of this
new statistic in several other places of the load balance.

TODO:
 - align variable and field names with the renaming [3]

Tests results:
Below are the results of 2 tests:
- hackbench -l 500 -s 4096
- scp of 100MB file on the platform

on a dual Cortex-A7
                  hackbench        scp    
tip/master        25.75s(+/-0.25)  5.16MB/s(+/-1.49)
+ patches 1,2     25.89s(+/-0.31)  5.18MB/s(+/-1.45)
+ patches 3-10    25.68s(+/-0.22)  7.00MB/s(+/-1.88)
+ irq accounting  25.80s(+/-0.25)  8.06MB/s(+/-0.05)

on a quad Cortex-A15
                  hackbench        scp    
tip/master        15.69s(+/-0.16)  9.70MB/s(+/-0.04)
+ patches 1,2     15.53s(+/-0.13)  9.72MB/s(+/-0.05)
+ patches 3-10    15.56s(+/-0.22)  9.88MB/s(+/-0.05)
+ irq accounting  15.99s(+/-0.08) 10.37MB/s(+/-0.03)

The improvement in scp bandwidth happens when tasks and irqs use different
CPUs, which is a bit random without the irq accounting config.

Changes since V1:
 - add 3 fixes
 - correct some commit messages
 - replace capacity computation by activity
 - take into account current cpu capacity

[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2014/3/19/377
[3] https://lkml.org/lkml/2014/5/14/622

Vincent Guittot (11):
  sched: fix imbalance flag reset
  sched: remove a wake_affine condition
  sched: fix avg_load computation
  sched: Allow all archs to set the power_orig
  ARM: topology: use new cpu_power interface
  sched: add per rq cpu_power_orig
  Revert "sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED"
  sched: get CPU's activity statistic
  sched: test the cpu's capacity in wake affine
  sched: move cfs task on a CPU with higher capacity
  sched: replace capacity by activity

 arch/arm/kernel/topology.c |   4 +-
 kernel/sched/core.c        |   2 +-
 kernel/sched/fair.c        | 229 ++++++++++++++++++++++-----------------------
 kernel/sched/sched.h       |   5 +-
 4 files changed, 118 insertions(+), 122 deletions(-)

-- 
1.9.1


^ permalink raw reply	[flat|nested] 216+ messages in thread

* [PATCH v2 01/11] sched: fix imbalance flag reset
  2014-05-23 15:52 ` Vincent Guittot
@ 2014-05-23 15:52   ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-23 15:52 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Vincent Guittot

The imbalance flag can stay set although there is no imbalance.

Let's assume that we have 3 tasks running on a dual-core/dual-cluster system.
Some idle load balances will be triggered during the tick.
Unfortunately, the tick is also used to queue background work, so we can reach
the situation where a short work item has been queued on a CPU which already
runs a task. The load balance will detect this imbalance (2 tasks on 1 CPU and
an idle CPU) and will try to pull the waiting task onto the idle CPU. The
waiting task is a worker thread that is pinned to a CPU, so an imbalance due to
a pinned task is detected and the imbalance flag is set.
Then, we will not be able to clear the flag because we have at most 1 task on
each CPU, but the imbalance flag will trigger useless active load balances
between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced
state.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 14 +++++++++++---
 1 file changed, 11 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9617b7..9587ed1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6610,10 +6610,8 @@ more_balance:
 		if (sd_parent) {
 			int *group_imbalance = &sd_parent->groups->sgp->imbalance;
 
-			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
 				*group_imbalance = 1;
-			} else if (*group_imbalance)
-				*group_imbalance = 0;
 		}
 
 		/* All tasks on this runqueue were pinned by CPU affinity */
@@ -6698,6 +6696,16 @@ more_balance:
 	goto out;
 
 out_balanced:
+	/*
+	 * We reach balance although we may have faced some affinity
+	 * constraints. Clear the imbalance flag if it was set.
+	 */
+	if (sd_parent) {
+		int *group_imbalance = &sd_parent->groups->sgp->imbalance;
+		if (*group_imbalance)
+			*group_imbalance = 0;
+	}
+
 	schedstat_inc(sd, lb_balanced[idle]);
 
 	sd->nr_balance_failed = 0;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 216+ messages in thread

* [PATCH v2 02/11] sched: remove a wake_affine condition
  2014-05-23 15:52 ` Vincent Guittot
@ 2014-05-23 15:52   ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-23 15:52 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Vincent Guittot

I have tried to understand the meaning of the condition:
 (this_load <= load &&
  this_load + target_load(prev_cpu, idx) <= tl_per_task)
but I failed to find a use case that can take advantage of it and I haven't
found a description of it in the previous commits' logs.
Furthermore, the comment of the condition refers to the task_hot function that
was used before being replaced by the current condition:
/*
 * This domain has SD_WAKE_AFFINE and
 * p is cache cold in this domain, and
 * there is no bad imbalance.
 */

If we look more deeply at the condition below:
 this_load + target_load(prev_cpu, idx) <= tl_per_task

When sync is clear, we have:
 tl_per_task = runnable_load_avg / nr_running
 this_load = max(runnable_load_avg, cpuload[idx])
 target_load = max(runnable_load_avg', cpuload'[idx])

It implies that runnable_load_avg' == 0 and nr_running <= 1 in order to match
the condition. This also implies that runnable_load_avg == 0 because of the
condition this_load <= load, but if this_load is null, balanced is already set
and the test is redundant.

If sync is set, it's not as straightforward as above (especially if cgroups
are involved) but the policy should be similar, as we have removed a task
that's going to sleep in order to get more accurate load and this_load values.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 25 +++++++------------------
 1 file changed, 7 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9587ed1..30240ab 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4238,7 +4238,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
 	int idx, this_cpu, prev_cpu;
-	unsigned long tl_per_task;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -4296,32 +4295,22 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		balanced = this_eff_load <= prev_eff_load;
 	} else
 		balanced = true;
+	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
 
+	if (!balanced)
+		return 0;
 	/*
 	 * If the currently running task will sleep within
 	 * a reasonable amount of time then attract this newly
 	 * woken task:
 	 */
-	if (sync && balanced)
+	if (sync)
 		return 1;
 
-	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
-	tl_per_task = cpu_avg_load_per_task(this_cpu);
-
-	if (balanced ||
-	    (this_load <= load &&
-	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
-		/*
-		 * This domain has SD_WAKE_AFFINE and
-		 * p is cache cold in this domain, and
-		 * there is no bad imbalance.
-		 */
-		schedstat_inc(sd, ttwu_move_affine);
-		schedstat_inc(p, se.statistics.nr_wakeups_affine);
+	schedstat_inc(sd, ttwu_move_affine);
+	schedstat_inc(p, se.statistics.nr_wakeups_affine);
 
-		return 1;
-	}
-	return 0;
+	return 1;
 }
 
 /*
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 216+ messages in thread

* [PATCH v2 03/11] sched: fix avg_load computation
  2014-05-23 15:52 ` Vincent Guittot
@ 2014-05-23 15:52   ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-23 15:52 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Vincent Guittot

The computation of avg_load and avg_load_per_task should only take into
account the number of cfs tasks. The non-cfs tasks are already taken into
account by decreasing the CPU's capacity (cpu_power), and they will be tracked
in the CPU's utilization (group_activity) in the next patches.
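
As a hypothetical example of the difference: a CPU running 1 cfs task and 1 rt
task has rq->nr_running == 2 but rq->cfs.h_nr_running == 1, so the old code
divides the cfs runnable_load_avg by 2 and reports only half of the real load
per cfs task.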

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 30240ab..6a84114 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4048,7 +4048,7 @@ static unsigned long power_of(int cpu)
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+	unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
 	unsigned long load_avg = rq->cfs.runnable_load_avg;
 
 	if (nr_running)
@@ -5868,7 +5868,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
-		sgs->sum_nr_running += rq->nr_running;
+		sgs->sum_nr_running += rq->cfs.h_nr_running;
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
 		sgs->nr_preferred_running += rq->nr_preferred_running;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 216+ messages in thread

* [PATCH v2 04/11] sched: Allow all archs to set the power_orig
  2014-05-23 15:52 ` Vincent Guittot
@ 2014-05-23 15:52   ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-23 15:52 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Vincent Guittot

power_orig is only changed for systems with an SMT sched_domain level in order
to reflect the lower capacity of CPUs. Heterogeneous systems also have to
reflect an original capacity that is different from the default value.

Create a more generic function arch_scale_cpu_power that can also be used by
non-SMT platforms to set power_orig.

The weak behavior of arch_scale_cpu_power is the previous SMT one in order to
keep backward compatibility in the use of power_orig.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 24 ++++++++++++++++--------
 1 file changed, 16 insertions(+), 8 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6a84114..6dc05f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5618,6 +5618,20 @@ unsigned long __weak arch_scale_smt_power(struct sched_domain *sd, int cpu)
 	return default_scale_smt_power(sd, cpu);
 }
 
+unsigned long __weak arch_scale_cpu_power(struct sched_domain *sd, int cpu)
+{
+	unsigned long weight = sd->span_weight;
+
+	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
+		if (sched_feat(ARCH_POWER))
+			return arch_scale_smt_power(sd, cpu);
+		else
+			return default_scale_smt_power(sd, cpu);
+	}
+
+	return SCHED_POWER_SCALE;
+}
+
 static unsigned long scale_rt_power(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5654,18 +5668,12 @@ static unsigned long scale_rt_power(int cpu)
 
 static void update_cpu_power(struct sched_domain *sd, int cpu)
 {
-	unsigned long weight = sd->span_weight;
 	unsigned long power = SCHED_POWER_SCALE;
 	struct sched_group *sdg = sd->groups;
 
-	if ((sd->flags & SD_SHARE_CPUPOWER) && weight > 1) {
-		if (sched_feat(ARCH_POWER))
-			power *= arch_scale_smt_power(sd, cpu);
-		else
-			power *= default_scale_smt_power(sd, cpu);
+	power *= arch_scale_cpu_power(sd, cpu);
 
-		power >>= SCHED_POWER_SHIFT;
-	}
+	power >>= SCHED_POWER_SHIFT;
 
 	sdg->sgp->power_orig = power;
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 216+ messages in thread

* [PATCH v2 05/11] ARM: topology: use new cpu_power interface
  2014-05-23 15:52 ` Vincent Guittot
@ 2014-05-23 15:52   ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-23 15:52 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Vincent Guittot

Use the new arch_scale_cpu_power to reflect the original capacity of a CPU
instead of arch_scale_freq_power, which is more related to scaling the
capacity with the frequency.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/arm/kernel/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 71e1fec..6cc25a8 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale);
 
-unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_cpu_power(struct sched_domain *sd, int cpu)
 {
 	return per_cpu(cpu_scale, cpu);
 }
@@ -166,7 +166,7 @@ static void update_cpu_power(unsigned int cpu)
 	set_power_scale(cpu, cpu_capacity(cpu) / middle_capacity);
 
 	printk(KERN_INFO "CPU%u: update cpu_power %lu\n",
-		cpu, arch_scale_freq_power(NULL, cpu));
+		cpu, arch_scale_cpu_power(NULL, cpu));
 }
 
 #else
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 216+ messages in thread

* [PATCH v2 06/11] sched: add per rq cpu_power_orig
  2014-05-23 15:52 ` Vincent Guittot
@ 2014-05-23 15:53   ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-23 15:53 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Vincent Guittot

This new field cpu_power_orig reflects the available capacity of a CPU, unlike
cpu_power which reflects the current capacity that can be altered by frequency
and rt tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 1 +
 kernel/sched/sched.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 240aa83..686c56f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6946,7 +6946,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
-		rq->cpu_power = SCHED_POWER_SCALE;
+		rq->cpu_power = rq->cpu_power_orig = SCHED_POWER_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6dc05f7..b3dc2b3 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5675,6 +5675,7 @@ static void update_cpu_power(struct sched_domain *sd, int cpu)
 
 	power >>= SCHED_POWER_SHIFT;
 
+	cpu_rq(cpu)->cpu_power_orig = power;
 	sdg->sgp->power_orig = power;
 
 	if (sched_feat(ARCH_POWER))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 600e229..4c74014 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -568,6 +568,7 @@ struct rq {
 	struct sched_domain *sd;
 
 	unsigned long cpu_power;
+	unsigned long cpu_power_orig;
 
 	unsigned char idle_balance;
 	/* For active balancing */
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 216+ messages in thread

* [PATCH v2 07/11] Revert "sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED"
  2014-05-23 15:52 ` Vincent Guittot
@ 2014-05-23 15:53   ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-23 15:53 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Vincent Guittot

This reverts commit f5f9739d7a0ccbdcf913a0b3604b134129d14f7e.

We are going to use runnable_avg_sum and runnable_avg_period in order to get
the utilization of the CPU. This statistic includes all tasks that run on the
CPU, not only CFS tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c  | 13 ++++++-------
 kernel/sched/sched.h |  4 ++--
 2 files changed, 8 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b3dc2b3..b7c51be 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2427,19 +2427,12 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 		se->avg.load_avg_contrib >>= NICE_0_SHIFT;
 	}
 }
-
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
-{
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
-	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
-}
 #else /* CONFIG_FAIR_GROUP_SCHED */
 static inline void __update_cfs_rq_tg_load_contrib(struct cfs_rq *cfs_rq,
 						 int force_update) {}
 static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 						  struct cfs_rq *cfs_rq) {}
 static inline void __update_group_entity_contrib(struct sched_entity *se) {}
-static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 static inline void __update_task_entity_contrib(struct sched_entity *se)
@@ -2538,6 +2531,12 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 	__update_cfs_rq_tg_load_contrib(cfs_rq, force_update);
 }
 
+static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
+{
+	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
+}
+
 /* Add the load generated by se into cfs_rq's child load-average */
 static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 						  struct sched_entity *se,
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 4c74014..929ee6c 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -542,8 +542,6 @@ struct rq {
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* list of leaf cfs_rq on this cpu: */
 	struct list_head leaf_cfs_rq_list;
-
-	struct sched_avg avg;
 #endif /* CONFIG_FAIR_GROUP_SCHED */
 
 	/*
@@ -634,6 +632,8 @@ struct rq {
 #ifdef CONFIG_SMP
 	struct llist_head wake_list;
 #endif
+
+	struct sched_avg avg;
 };
 
 static inline int cpu_of(struct rq *rq)
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 216+ messages in thread

* [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-23 15:52 ` Vincent Guittot
@ 2014-05-23 15:53   ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-23 15:53 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Vincent Guittot

Monitor the activity level of each group of each sched_domain level. The
activity is the amount of cpu_power that is currently used on a CPU or group
of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
level. In the special use case where the CPU is fully loaded by more than 1
task, the activity level is set above the cpu_power in order to reflect the
overload of the CPU.
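
For example (with made-up numbers): a CPU with cpu_power_orig = 1024 that has
been runnable for half of the period (runnable_avg_sum = period / 2) gets an
activity of 512, while a fully loaded CPU running 3 tasks gets 1024 + 2 = 1026,
slightly above its cpu_power_orig, to reflect the overload.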

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 22 ++++++++++++++++++++++
 1 file changed, 22 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b7c51be..c01d8b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
 	return cpu_rq(cpu)->cpu_power;
 }
 
+static unsigned long power_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_power_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -4438,6 +4443,18 @@ done:
 	return target;
 }
 
+static int get_cpu_activity(int cpu)
+{
+	struct rq *rq = cpu_rq(cpu);
+	u32 sum = rq->avg.runnable_avg_sum;
+	u32 period = rq->avg.runnable_avg_period;
+
+	if (sum >= period)
+		return power_orig_of(cpu) + rq->nr_running - 1;
+
+	return (sum * power_orig_of(cpu)) / period;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5518,6 +5535,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_power;
+	unsigned long group_activity; /* Total activity of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity;
 	unsigned int idle_cpus;
@@ -5538,6 +5556,7 @@ struct sd_lb_stats {
 	struct sched_group *busiest;	/* Busiest group in this sd */
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
+	unsigned long total_activity;  /* Total activity of all groups in sd */
 	unsigned long total_pwr;	/* Total power of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
 
@@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.busiest = NULL,
 		.local = NULL,
 		.total_load = 0UL,
+		.total_activity = 0UL,
 		.total_pwr = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
@@ -5876,6 +5896,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
+		sgs->group_activity += get_cpu_activity(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
@@ -6034,6 +6055,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 next_group:
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
+		sds->total_activity += sgs->group_activity;
 		sds->total_pwr += sgs->group_power;
 
 		sg = sg->next;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 216+ messages in thread

* [PATCH v2 09/11] sched: test the cpu's capacity in wake affine
  2014-05-23 15:52 ` Vincent Guittot
@ 2014-05-23 15:53   ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-23 15:53 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Vincent Guittot

Currently the task always wakes affine on this_cpu if the latter is idle.
Before waking up the task on this_cpu, we check that this_cpu's capacity is not
significantly reduced because of RT tasks or irq activity.

Use cases where the number of irqs and the time spent under irq are significant
will benefit from this, because the task that is woken up by an irq or softirq
will not use the same CPU as the irq (and softirq) but an idle one that shares
its LLC.
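
To make the new check concrete, here is a small stand-alone sketch (not kernel
code; the imbalance_pct and cpu_power values are made up) of the comparison
performed when this_load == 0:

	#include <stdio.h>

	int main(void)
	{
		long imbalance_pct = 125;	/* assumed sched_domain setting */
		long power_prev = 1024;		/* prev_cpu at full capacity    */
		long power_this = 800;		/* this_cpu reduced by irq/rt   */

		long this_eff_load = 100 * power_prev;
		long prev_eff_load = (100 + (imbalance_pct - 100) / 2) * power_this;

		/*
		 * Prints 0 here: this_cpu has lost too much capacity, so the
		 * task is not pulled onto it even though it is idle.
		 */
		printf("balanced = %d\n", this_eff_load <= prev_eff_load);
		return 0;
	}

With the previous code, this_load == 0 forced balanced to true, and the woken
task would have been stacked on the CPU that handles the interrupts.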

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c01d8b6..e8a30f9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4241,6 +4241,7 @@ static int wake_wide(struct task_struct *p)
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
+	s64 this_eff_load, prev_eff_load;
 	int idx, this_cpu, prev_cpu;
 	struct task_group *tg;
 	unsigned long weight;
@@ -4284,21 +4285,21 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	 * Otherwise check if either cpus are near enough in load to allow this
 	 * task to be woken on this_cpu.
 	 */
-	if (this_load > 0) {
-		s64 this_eff_load, prev_eff_load;
+	this_eff_load = 100;
+	this_eff_load *= power_of(prev_cpu);
+
+	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+	prev_eff_load *= power_of(this_cpu);
 
-		this_eff_load = 100;
-		this_eff_load *= power_of(prev_cpu);
+	if (this_load > 0) {
 		this_eff_load *= this_load +
 			effective_load(tg, this_cpu, weight, weight);
 
-		prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
-		prev_eff_load *= power_of(this_cpu);
 		prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+	}
+
+	balanced = this_eff_load <= prev_eff_load;
 
-		balanced = this_eff_load <= prev_eff_load;
-	} else
-		balanced = true;
 	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
 
 	if (!balanced)
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 216+ messages in thread

* [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-05-23 15:52 ` Vincent Guittot
@ 2014-05-23 15:53   ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-23 15:53 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Vincent Guittot

If the CPU is used for handling a lot of IRQs, trigger a load balance to check
whether it's worth moving its tasks to another CPU that has more capacity.
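
For example (assuming imbalance_pct = 125 for this sched_domain and
cpu_power_orig = 1024), the new nohz kick condition
(rq->cpu_power * sd->imbalance_pct) < (rq->cpu_power_orig * 100) fires once
cpu_power drops below ~819, i.e. once irq or rt activity consumes more than
about 20% of the original capacity while at least one cfs task is running.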

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 13 +++++++++++++
 1 file changed, 13 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e8a30f9..2501e49 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->sum_nr_running > sgs->group_capacity)
 		return true;
 
+	/*
+	 * The group capacity is reduced probably because of activity from other
+	 * sched class or interrupts which use part of the available capacity
+	 */
+	if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
+		return true;
+
 	if (sgs->group_imb)
 		return true;
 
@@ -7282,6 +7289,12 @@ static inline int nohz_kick_needed(struct rq *rq)
 
 		if (nr_busy > 1)
 			goto need_kick_unlock;
+
+		if ((rq->cfs.h_nr_running >= 1)
+		 && ((rq->cpu_power * sd->imbalance_pct) <
+					(rq->cpu_power_orig * 100)))
+			goto need_kick_unlock;
+
 	}
 
 	sd = rcu_dereference(per_cpu(sd_asym, cpu));
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 216+ messages in thread

* [PATCH v2 11/11] sched: replace capacity by activity
  2014-05-23 15:52 ` Vincent Guittot
@ 2014-05-23 15:53   ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-23 15:53 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Vincent Guittot

The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
SCHED_POWER_SCALE.
We can now have a better idea of the utilization of a group of CPUs thanks to
group_activity, and deduce how much capacity is still available.
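
For example (hypothetical values): a group of 2 CPUs with a power_orig of 2048
and a group_activity of 1536 ends up with group_capacity = 512, i.e. about
half a CPU of spare capacity, whereas a group_activity above 2048 makes
group_capacity negative, which update_sd_pick_busiest() now treats as a
long-run overload.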

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 94 ++++++++++++++---------------------------------------
 1 file changed, 24 insertions(+), 70 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2501e49..05b9502 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5538,11 +5538,10 @@ struct sg_lb_stats {
 	unsigned long group_power;
 	unsigned long group_activity; /* Total activity of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
-	unsigned int group_capacity;
+	long group_capacity;
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	int group_imb; /* Is there an imbalance in the group ? */
-	int group_has_capacity; /* Is there extra capacity in the group? */
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -5785,31 +5784,6 @@ void update_group_power(struct sched_domain *sd, int cpu)
 }
 
 /*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
- */
-static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
-{
-	/*
-	 * Only siblings can have significantly less than SCHED_POWER_SCALE
-	 */
-	if (!(sd->flags & SD_SHARE_CPUPOWER))
-		return 0;
-
-	/*
-	 * If ~90% of the cpu_power is still there, we're good.
-	 */
-	if (group->sgp->power * 32 > group->sgp->power_orig * 29)
-		return 1;
-
-	return 0;
-}
-
-/*
  * Group imbalance indicates (and tries to solve) the problem where balancing
  * groups is inadequate due to tsk_cpus_allowed() constraints.
  *
@@ -5843,33 +5817,6 @@ static inline int sg_imbalanced(struct sched_group *group)
 	return group->sgp->imbalance;
 }
 
-/*
- * Compute the group capacity.
- *
- * Avoid the issue where N*frac(smt_power) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit power unit capacity with that.
- */
-static inline int sg_capacity(struct lb_env *env, struct sched_group *group)
-{
-	unsigned int capacity, smt, cpus;
-	unsigned int power, power_orig;
-
-	power = group->sgp->power;
-	power_orig = group->sgp->power_orig;
-	cpus = group->group_weight;
-
-	/* smt := ceil(cpus / power), assumes: 1 < smt_power < 2 */
-	smt = DIV_ROUND_UP(SCHED_POWER_SCALE * cpus, power_orig);
-	capacity = cpus / smt; /* cores */
-
-	capacity = min_t(unsigned, capacity, DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE));
-	if (!capacity)
-		capacity = fix_small_capacity(env->sd, group);
-
-	return capacity;
-}
-
 /**
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
@@ -5918,10 +5865,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 	sgs->group_weight = group->group_weight;
 
 	sgs->group_imb = sg_imbalanced(group);
-	sgs->group_capacity = sg_capacity(env, group);
+	sgs->group_capacity = group->sgp->power_orig - sgs->group_activity;
+
 
-	if (sgs->group_capacity > sgs->sum_nr_running)
-		sgs->group_has_capacity = 1;
 }
 
 /**
@@ -5945,7 +5891,15 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	if (sgs->avg_load <= sds->busiest_stat.avg_load)
 		return false;
 
-	if (sgs->sum_nr_running > sgs->group_capacity)
+	/* The group has an obvious long run overload */
+	if (sgs->group_capacity < 0)
+		return true;
+
+	/*
+	 * The group has a short run overload because more tasks than available
+	 * CPUs are running
+	 */
+	if (sgs->sum_nr_running > sgs->group_weight)
 		return true;
 
 	/*
@@ -6052,8 +6006,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		 * with a large weight task outweighs the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    sds->local_stat.group_has_capacity)
-			sgs->group_capacity = min(sgs->group_capacity, 1U);
+		    sds->local_stat.group_capacity > 0)
+			sgs->group_capacity = min(sgs->group_capacity, 1L);
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -6228,7 +6182,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 		 * have to drop below capacity to reach cpu-load equilibrium.
 		 */
 		load_above_capacity =
-			(busiest->sum_nr_running - busiest->group_capacity);
+			(busiest->sum_nr_running - busiest->group_weight);
 
 		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
 		load_above_capacity /= busiest->group_power;
@@ -6294,6 +6248,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
+	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -6313,8 +6268,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_capacity &&
-	    !busiest->group_has_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && (local->group_capacity > 0)
+			&& (busiest->group_capacity < 0))
 		goto force_balance;
 
 	/*
@@ -6372,7 +6327,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long power, capacity, wl;
+		unsigned long power, wl;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -6400,18 +6355,17 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		if (rt > env->fbq_type)
 			continue;
 
-		power = power_of(i);
-		capacity = DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE);
-		if (!capacity)
-			capacity = fix_small_capacity(env->sd, group);
-
 		wl = weighted_cpuload(i);
 
 		/*
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu power.
 		 */
-		if (capacity && rq->nr_running == 1 && wl > env->imbalance)
+
+		power = power_of(i);
+
+		if (rq->nr_running == 1 && wl > env->imbalance &&
+		    ((power * env->sd->imbalance_pct) >= (rq->cpu_power_orig * 100)))
 			continue;
 
 		/*
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 01/11] sched: fix imbalance flag reset
  2014-05-23 15:52   ` Vincent Guittot
@ 2014-05-25 10:33     ` Preeti U Murthy
  -1 siblings, 0 replies; 216+ messages in thread
From: Preeti U Murthy @ 2014-05-25 10:33 UTC (permalink / raw)
  To: Vincent Guittot, peterz
  Cc: mingo, linux-kernel, linux, linux-arm-kernel, Morten.Rasmussen,
	efault, nicolas.pitre, linaro-kernel, daniel.lezcano

Hi Vincent,

On 05/23/2014 09:22 PM, Vincent Guittot wrote:
> The imbalance flag can stay set whereas there is no imbalance.
> 
> Let assume that we have 3 tasks that run on a dual cores /dual cluster system.
> We will have some idle load balance which are triggered during tick.
> Unfortunately, the tick is also used to queue background work so we can reach
> the situation where short work has been queued on a CPU which already runs a
> task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
> CPU) and will try to pull the waiting task on the idle CPU. The waiting task is
> a worker thread that is pinned on a CPU so an imbalance due to pinned task is
> detected and the imbalance flag is set.
> Then, we will not be able to clear the flag because we have at most 1 task on
> each CPU but the imbalance flag will trig to useless active load balance
> between the idle CPU and the busy CPU.

Why do we do active balancing today when there is at-most 1 task on the
busiest cpu? Shouldn't we be skipping load balancing altogether? If we
do active balancing when the number of tasks = 1, it will lead to a ping
pong right?

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 05/11] ARM: topology: use new cpu_power interface
  2014-05-23 15:52   ` Vincent Guittot
@ 2014-05-25 13:22     ` Preeti U Murthy
  -1 siblings, 0 replies; 216+ messages in thread
From: Preeti U Murthy @ 2014-05-25 13:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

Hi Vincent,

Why do we have two interfaces arch_scale_freq_power() and
arch_scale_cpu_power()? Does it make sense to consolidate them now ?

Regards
Preeti U Murthy


On 05/23/2014 09:22 PM, Vincent Guittot wrote:
> Use the new arch_scale_cpu_power in order to reflect the original capacity of
> a CPU instead of arch_scale_freq_power which is more linked to a scaling of
> the capacity linked to the frequency.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  arch/arm/kernel/topology.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index 71e1fec..6cc25a8 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -42,7 +42,7 @@
>   */
>  static DEFINE_PER_CPU(unsigned long, cpu_scale);
> 
> -unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
> +unsigned long arch_scale_cpu_power(struct sched_domain *sd, int cpu)
>  {
>  	return per_cpu(cpu_scale, cpu);
>  }
> @@ -166,7 +166,7 @@ static void update_cpu_power(unsigned int cpu)
>  	set_power_scale(cpu, cpu_capacity(cpu) / middle_capacity);
> 
>  	printk(KERN_INFO "CPU%u: update cpu_power %lu\n",
> -		cpu, arch_scale_freq_power(NULL, cpu));
> +		cpu, arch_scale_cpu_power(NULL, cpu));
>  }
> 
>  #else
> 


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 01/11] sched: fix imbalance flag reset
  2014-05-25 10:33     ` Preeti U Murthy
@ 2014-05-26  7:49       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-26  7:49 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Morten Rasmussen, Mike Galbraith,
	Nicolas Pitre, linaro-kernel, Daniel Lezcano

On 25 May 2014 12:33, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi Vincent,
>
> On 05/23/2014 09:22 PM, Vincent Guittot wrote:
>> The imbalance flag can stay set whereas there is no imbalance.
>>
>> Let assume that we have 3 tasks that run on a dual cores /dual cluster system.
>> We will have some idle load balance which are triggered during tick.
>> Unfortunately, the tick is also used to queue background work so we can reach
>> the situation where short work has been queued on a CPU which already runs a
>> task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
>> CPU) and will try to pull the waiting task on the idle CPU. The waiting task is
>> a worker thread that is pinned on a CPU so an imbalance due to pinned task is
>> detected and the imbalance flag is set.
>> Then, we will not be able to clear the flag because we have at most 1 task on
>> each CPU but the imbalance flag will trig to useless active load balance
>> between the idle CPU and the busy CPU.
>
> Why do we do active balancing today when there is at-most 1 task on the
> busiest cpu? Shouldn't we be skipping load balancing altogether? If we
> do active balancing when the number of tasks = 1, it will lead to a ping
> pong right?

That's exactly the purpose of the patch: to prevent this useless active
load balance. When the imbalance flag is set, an active load balance is
triggered regardless of the actual load, because pinned tasks prevent
reaching a balanced state.

Vincent

>
> Regards
> Preeti U Murthy
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 05/11] ARM: topology: use new cpu_power interface
  2014-05-25 13:22     ` Preeti U Murthy
@ 2014-05-26  8:25       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-26  8:25 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Morten Rasmussen, Mike Galbraith,
	Nicolas Pitre, linaro-kernel, Daniel Lezcano

On 25 May 2014 15:22, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi Vincent,
>
> Why do we have two interfaces arch_scale_freq_power() and
> arch_scale_cpu_power()? Does it make sense to consolidate them now ?
Hi Preeti,

They don't have the same purpose. arch_scale_cpu_power sets the max
capacity of a CPU whereas arch_scale_freq_power can be used to give
the current capacity. ARM platforms were using arch_scale_freq_power
because it was the only hook available for non-SMT systems, but this
induces some misunderstanding and some limitations in the
characterization of a CPU. This consolidation is a necessary step so
that we now have the max capacity of a CPU and leave
arch_scale_freq_power for other purposes (or even remove it if it
turns out to be useless).
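
As a purely hypothetical illustration of that split (this is not the ARM
implementation; the per-cpu variables and their update paths are invented
for the example), an architecture could back the two hooks like this:

	#include <linux/percpu.h>
	#include <linux/sched.h>

	/*
	 * Hypothetical arch back-end, for illustration only: how the values
	 * below get filled in is an assumption, not ARM code.
	 */
	static DEFINE_PER_CPU(unsigned long, max_capacity);	/* e.g. set from DT at boot */
	static DEFINE_PER_CPU(unsigned long, cur_capacity);	/* e.g. updated on freq change */

	/* maximum capacity: a fixed property of the CPU */
	unsigned long arch_scale_cpu_power(struct sched_domain *sd, int cpu)
	{
		return per_cpu(max_capacity, cpu);
	}

	/* currently available capacity, e.g. reduced at lower frequencies */
	unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
	{
		return per_cpu(cur_capacity, cpu);
	}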

Regards,
Vincent

>
> Regards
> Preeti U Murthy
>
>
> On 05/23/2014 09:22 PM, Vincent Guittot wrote:
>> Use the new arch_scale_cpu_power in order to reflect the original capacity of
>> a CPU instead of arch_scale_freq_power which is more linked to a scaling of
>> the capacity linked to the frequency.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  arch/arm/kernel/topology.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
>> index 71e1fec..6cc25a8 100644
>> --- a/arch/arm/kernel/topology.c
>> +++ b/arch/arm/kernel/topology.c
>> @@ -42,7 +42,7 @@
>>   */
>>  static DEFINE_PER_CPU(unsigned long, cpu_scale);
>>
>> -unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
>> +unsigned long arch_scale_cpu_power(struct sched_domain *sd, int cpu)
>>  {
>>       return per_cpu(cpu_scale, cpu);
>>  }
>> @@ -166,7 +166,7 @@ static void update_cpu_power(unsigned int cpu)
>>       set_power_scale(cpu, cpu_capacity(cpu) / middle_capacity);
>>
>>       printk(KERN_INFO "CPU%u: update cpu_power %lu\n",
>> -             cpu, arch_scale_freq_power(NULL, cpu));
>> +             cpu, arch_scale_cpu_power(NULL, cpu));
>>  }
>>
>>  #else
>>
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 01/11] sched: fix imbalance flag reset
  2014-05-26  7:49       ` Vincent Guittot
@ 2014-05-26  9:16         ` Preeti U Murthy
  -1 siblings, 0 replies; 216+ messages in thread
From: Preeti U Murthy @ 2014-05-26  9:16 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Morten Rasmussen, Mike Galbraith,
	Nicolas Pitre, linaro-kernel, Daniel Lezcano

On 05/26/2014 01:19 PM, Vincent Guittot wrote:
> On 25 May 2014 12:33, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> Hi Vincent,
>>
>> On 05/23/2014 09:22 PM, Vincent Guittot wrote:
>>> The imbalance flag can stay set whereas there is no imbalance.
>>>
>>> Let assume that we have 3 tasks that run on a dual cores /dual cluster system.
>>> We will have some idle load balance which are triggered during tick.
>>> Unfortunately, the tick is also used to queue background work so we can reach
>>> the situation where short work has been queued on a CPU which already runs a
>>> task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
>>> CPU) and will try to pull the waiting task on the idle CPU. The waiting task is
>>> a worker thread that is pinned on a CPU so an imbalance due to pinned task is
>>> detected and the imbalance flag is set.
>>> Then, we will not be able to clear the flag because we have at most 1 task on
>>> each CPU but the imbalance flag will trig to useless active load balance
>>> between the idle CPU and the busy CPU.
>>
>> Why do we do active balancing today when there is at-most 1 task on the
>> busiest cpu? Shouldn't we be skipping load balancing altogether? If we
>> do active balancing when the number of tasks = 1, it will lead to a ping
>> pong right?
> 
> That's exactly the purpose of the patch: to prevent this useless active
> load balance. When the imbalance flag is set, an active load balance is
> triggered regardless of the actual load, because pinned tasks prevent
> reaching a balanced state.

No I mean this:

sched:Do not continue load balancing when the busiest cpu has one
running task

From: Preeti U Murthy <preeti@linux.vnet.ibm.com>


---
 kernel/sched/fair.c |    2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9617b7..b175333 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6626,6 +6626,8 @@ more_balance:
 			}
 			goto out_balanced;
 		}
+	} else {
+		goto out;
 	}

 	if (!ld_moved) {


}

Regards
Preeti U Murthy
> 
> Vincent
> 
>>
>> Regards
>> Preeti U Murthy
>>
> 


^ permalink raw reply related	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 05/11] ARM: topology: use new cpu_power interface
  2014-05-26  8:25       ` Vincent Guittot
@ 2014-05-26  9:19         ` Preeti U Murthy
  -1 siblings, 0 replies; 216+ messages in thread
From: Preeti U Murthy @ 2014-05-26  9:19 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Morten Rasmussen, Mike Galbraith,
	Nicolas Pitre, linaro-kernel, Daniel Lezcano

On 05/26/2014 01:55 PM, Vincent Guittot wrote:
> On 25 May 2014 15:22, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> Hi Vincent,
>>
>> Why do we have two interfaces arch_scale_freq_power() and
>> arch_scale_cpu_power()? Does it make sense to consolidate them now ?
> Hi Preeti,
> 
> They don't have the same purpose. arch_scale_cpu_power sets the max
> capacity of a CPU whereas arch_scale_freq_power can be used to give
> the current capacity. ARM platforms were using arch_scale_freq_power
> because it was the only hook available for non-SMT systems, but this
> induces some misunderstanding and some limitations in the
> characterization of a CPU. This consolidation is a necessary step so
> that we now have the max capacity of a CPU and leave
> arch_scale_freq_power for other purposes (or even remove it if it
> turns out to be useless).

Ah ok! Thanks :) This was insightful :)

Regards
Preeti U Murthy
> 
> Regards,
> Vincent
> 
>>
>> Regards
>> Preeti U Murthy
>>
>>
>> On 05/23/2014 09:22 PM, Vincent Guittot wrote:
>>> Use the new arch_scale_cpu_power in order to reflect the original capacity of
>>> a CPU instead of arch_scale_freq_power which is more linked to a scaling of
>>> the capacity linked to the frequency.
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>>  arch/arm/kernel/topology.c | 4 ++--
>>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>>
>>> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
>>> index 71e1fec..6cc25a8 100644
>>> --- a/arch/arm/kernel/topology.c
>>> +++ b/arch/arm/kernel/topology.c
>>> @@ -42,7 +42,7 @@
>>>   */
>>>  static DEFINE_PER_CPU(unsigned long, cpu_scale);
>>>
>>> -unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
>>> +unsigned long arch_scale_cpu_power(struct sched_domain *sd, int cpu)
>>>  {
>>>       return per_cpu(cpu_scale, cpu);
>>>  }
>>> @@ -166,7 +166,7 @@ static void update_cpu_power(unsigned int cpu)
>>>       set_power_scale(cpu, cpu_capacity(cpu) / middle_capacity);
>>>
>>>       printk(KERN_INFO "CPU%u: update cpu_power %lu\n",
>>> -             cpu, arch_scale_freq_power(NULL, cpu));
>>> +             cpu, arch_scale_cpu_power(NULL, cpu));
>>>  }
>>>
>>>  #else
>>>
>>
> 


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 00/11]  sched: consolidation of cpu_power
  2014-05-23 15:52 ` Vincent Guittot
@ 2014-05-26  9:44   ` Preeti U Murthy
  -1 siblings, 0 replies; 216+ messages in thread
From: Preeti U Murthy @ 2014-05-26  9:44 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

Hi Vincent,

I conducted test runs of ebizzy on a Power8 box which had 48 cpus.
6 cores with SMT-8 to be precise. It's a single-socket box. The results
are as below.

On 05/23/2014 09:22 PM, Vincent Guittot wrote:
> Part of this patchset was previously part of the larger tasks packing patchset
> [1]. I have splitted the latter in 3 different patchsets (at least) to make the
> thing easier.
> -configuration of sched_domain topology [2]
> -update and consolidation of cpu_power (this patchset)
> -tasks packing algorithm
> 
> SMT system is no more the only system that can have a CPUs with an original
> capacity that is different from the default value. We need to extend the use of
> cpu_power_orig to all kind of platform so the scheduler will have both the
> maximum capacity (cpu_power_orig/power_orig) and the current capacity
> (cpu_power/power) of CPUs and sched_groups. A new function arch_scale_cpu_power
> has been created and replace arch_scale_smt_power, which is SMT specifc in the
> computation of the capapcity of a CPU.
> 
> During load balance, the scheduler evaluates the number of tasks that a group
> of CPUs can handle. The current method assumes that tasks have a fix load of 
> SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_POWER_SCALE.
> This assumption generates wrong decision by creating ghost cores and by
> removing real ones when the original capacity of CPUs is different from the
> default SCHED_POWER_SCALE.
> 
> Now that we have the original capacity of a CPUS and its activity/utilization,
> we can evaluate more accuratly the capacity of a group of CPUs.
> 
> This patchset mainly replaces the old capacity method by a new one and has kept
> the policy almost unchanged whereas we can certainly take advantage of this new
> statistic in several other places of the load balance.
> 
> TODO:
>  - align variable's and field's name with the renaming [3]
> 
> Tests results:
> I have put below results of 2 tests:
> - hackbench -l 500 -s 4096
> - scp of 100MB file on the platform
> 
> on a dual cortex-A7 
>                   hackbench        scp    
> tip/master        25.75s(+/-0.25)  5.16MB/s(+/-1.49)
> + patches 1,2     25.89s(+/-0.31)  5.18MB/s(+/-1.45)
> + patches 3-10    25.68s(+/-0.22)  7.00MB/s(+/-1.88)
> + irq accounting  25.80s(+/-0.25)  8.06MB/s(+/-0.05)
> 
> on a quad cortex-A15 
>                   hackbench        scp    
> tip/master        15.69s(+/-0.16)  9.70MB/s(+/-0.04)
> + patches 1,2     15.53s(+/-0.13)  9.72MB/s(+/-0.05)
> + patches 3-10    15.56s(+/-0.22)  9.88MB/s(+/-0.05)
> + irq accounting  15.99s(+/-0.08) 10.37MB/s(+/-0.03)
> 
> The improvement of scp bandwidth happens when tasks and irq are using
> different CPU which is a bit random without irq accounting config

N -> Number of threads of ebizzy

Each 'N' run was for 30 seconds with multiple iterations and averaging them.

N          %change in number of records
           read after patching
------------------------------------------
1          + 0.0038
4          -17.6429
8          -26.3989
12         -29.5070
16         -38.4842
20         -44.5747
24         -51.9792
28         -34.1863
32         -38.4029
38         -22.2490
42          -7.4843
47         -0.69676

Let me profile it and check where the cause of this degradation is.


Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 00/11] sched: consolidation of cpu_power
  2014-05-26  9:44   ` Preeti U Murthy
@ 2014-05-26 10:04     ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-26 10:04 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Morten Rasmussen, Mike Galbraith,
	Nicolas Pitre, linaro-kernel, Daniel Lezcano

On 26 May 2014 11:44, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi Vincent,
>
> I conducted test runs of ebizzy on a Power8 box which had 48 cpus.
> 6 cores with SMT-8 to be precise. It's a single-socket box. The results
> are as below.
>
> On 05/23/2014 09:22 PM, Vincent Guittot wrote:
>> Part of this patchset was previously part of the larger tasks packing patchset
>> [1]. I have splitted the latter in 3 different patchsets (at least) to make the
>> thing easier.
>> -configuration of sched_domain topology [2]
>> -update and consolidation of cpu_power (this patchset)
>> -tasks packing algorithm
>>
>> SMT system is no more the only system that can have a CPUs with an original
>> capacity that is different from the default value. We need to extend the use of
>> cpu_power_orig to all kind of platform so the scheduler will have both the
>> maximum capacity (cpu_power_orig/power_orig) and the current capacity
>> (cpu_power/power) of CPUs and sched_groups. A new function arch_scale_cpu_power
>> has been created and replace arch_scale_smt_power, which is SMT specifc in the
>> computation of the capapcity of a CPU.
>>
>> During load balance, the scheduler evaluates the number of tasks that a group
>> of CPUs can handle. The current method assumes that tasks have a fix load of
>> SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_POWER_SCALE.
>> This assumption generates wrong decision by creating ghost cores and by
>> removing real ones when the original capacity of CPUs is different from the
>> default SCHED_POWER_SCALE.
>>
>> Now that we have the original capacity of a CPUS and its activity/utilization,
>> we can evaluate more accuratly the capacity of a group of CPUs.
>>
>> This patchset mainly replaces the old capacity method by a new one and has kept
>> the policy almost unchanged whereas we can certainly take advantage of this new
>> statistic in several other places of the load balance.
>>
>> TODO:
>>  - align variable's and field's name with the renaming [3]
>>
>> Tests results:
>> I have put below results of 2 tests:
>> - hackbench -l 500 -s 4096
>> - scp of 100MB file on the platform
>>
>> on a dual cortex-A7
>>                   hackbench        scp
>> tip/master        25.75s(+/-0.25)  5.16MB/s(+/-1.49)
>> + patches 1,2     25.89s(+/-0.31)  5.18MB/s(+/-1.45)
>> + patches 3-10    25.68s(+/-0.22)  7.00MB/s(+/-1.88)
>> + irq accounting  25.80s(+/-0.25)  8.06MB/s(+/-0.05)
>>
>> on a quad cortex-A15
>>                   hackbench        scp
>> tip/master        15.69s(+/-0.16)  9.70MB/s(+/-0.04)
>> + patches 1,2     15.53s(+/-0.13)  9.72MB/s(+/-0.05)
>> + patches 3-10    15.56s(+/-0.22)  9.88MB/s(+/-0.05)
>> + irq accounting  15.99s(+/-0.08) 10.37MB/s(+/-0.03)
>>
>> The improvement of scp bandwidth happens when tasks and irq are using
>> different CPU which is a bit random without irq accounting config
>
> N -> Number of threads of ebizzy
>
> Each 'N' run was for 30 seconds with multiple iterations and averaging them.
>
> N          %change in number of records
>            read after patching
> ------------------------------------------
> 1          + 0.0038
> 4          -17.6429
> 8          -26.3989
> 12         -29.5070
> 16         -38.4842
> 20         -44.5747
> 24         -51.9792
> 28         -34.1863
> 32         -38.4029
> 38         -22.2490
> 42          -7.4843
> 47         -0.69676
>
> Let me profile it and check where the cause of this degradation is.

Hi Preeti,

Thanks for the test and for the help in finding the root cause of the
degradation. I'm going to run the test on my platforms too and see if I
get similar results.

Regards
Vincent
>
>
> Regards
> Preeti U Murthy
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 01/11] sched: fix imbalance flag reset
  2014-05-26  9:16         ` Preeti U Murthy
@ 2014-05-26 10:14           ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-26 10:14 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Morten Rasmussen, Mike Galbraith,
	Nicolas Pitre, linaro-kernel, Daniel Lezcano

On 26 May 2014 11:16, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> On 05/26/2014 01:19 PM, Vincent Guittot wrote:
>> On 25 May 2014 12:33, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>>> Hi Vincent,
>>>
>>> On 05/23/2014 09:22 PM, Vincent Guittot wrote:
>>>> The imbalance flag can stay set whereas there is no imbalance.
>>>>
>>>> Let assume that we have 3 tasks that run on a dual cores /dual cluster system.
>>>> We will have some idle load balance which are triggered during tick.
>>>> Unfortunately, the tick is also used to queue background work so we can reach
>>>> the situation where short work has been queued on a CPU which already runs a
>>>> task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
>>>> CPU) and will try to pull the waiting task on the idle CPU. The waiting task is
>>>> a worker thread that is pinned on a CPU so an imbalance due to pinned task is
>>>> detected and the imbalance flag is set.
>>>> Then, we will not be able to clear the flag because we have at most 1 task on
>>>> each CPU but the imbalance flag will trig to useless active load balance
>>>> between the idle CPU and the busy CPU.
>>>
>>> Why do we do active balancing today when there is at-most 1 task on the
>>> busiest cpu? Shouldn't we be skipping load balancing altogether? If we
>>> do active balancing when the number of tasks = 1, it will lead to a ping
>>> pong right?
>>
>> That's exactly the purpose of the patch: to prevent this useless active
>> load balance. When the imbalance flag is set, an active load balance is
>> triggered regardless of the actual load, because pinned tasks prevent
>> reaching a balanced state.
>
> No I mean this:
>
> sched:Do not continue load balancing when the busiest cpu has one
> running task

But you can have situations where you have to migrate the task even if
the busiest CPU has only 1 task. The use of the imbalance flag is one
example. You can also be in a situation where the busiest group has too
much load compared to the local group, and the busiest CPU, even with
only 1 task (at this instant), has been selected so that a task is moved
away from the busiest group.

Vincent

>
> From: Preeti U Murthy <preeti@linux.vnet.ibm.com>
>
>
> ---
>  kernel/sched/fair.c |    2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c9617b7..b175333 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6626,6 +6626,8 @@ more_balance:
>                         }
>                         goto out_balanced;
>                 }
> +       } else {
> +               goto out;
>         }
>
>         if (!ld_moved) {
>
>
> }
>
> Regards
> Preeti U Murthy
>>
>> Vincent
>>
>>>
>>> Regards
>>> Preeti U Murthy
>>>
>>
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 00/11] sched: consolidation of cpu_power
  2014-05-26 10:04     ` Vincent Guittot
@ 2014-05-26 15:54       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-26 15:54 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Morten Rasmussen, Mike Galbraith,
	Nicolas Pitre, linaro-kernel, Daniel Lezcano

Hi Preeti,

I have run ebizzy tests on my platforms but don't get results similar
to yours (my results below). It seems to be linked to SMT. I'm going
to look at that part more deeply and try to find more suitable HW for
tests.

ebizzy -t N -S 20
Quad cores
 N  tip                 +patchset
 1  100.00% (+/- 0.30%)  97.00% (+/- 0.42%)
 2  100.00% (+/- 0.80%) 100.48% (+/- 0.88%)
 4  100.00% (+/- 1.18%)  99.32% (+/- 1.05%)
 6  100.00% (+/- 8.54%)  98.84% (+/- 1.39%)
 8  100.00% (+/- 0.45%)  98.89% (+/- 0.91%)
10  100.00% (+/- 0.32%)  99.25% (+/- 0.31%)
12  100.00% (+/- 0.15%)  99.20% (+/- 0.86%)
14  100.00% (+/- 0.58%)  99.44% (+/- 0.55%)

Dual cores
 N  tip                 +patchset
 1  100.00% (+/- 1.70%)  99.35% (+/- 2.82%)
 2  100.00% (+/- 2.75%) 100.48% (+/- 1.51%)
 4  100.00% (+/- 2.37%) 102.63% (+/- 2.35%)
 6  100.00% (+/- 3.11%)  97.65% (+/- 1.02%)
 8  100.00% (+/- 0.26%) 103.68% (+/- 5.90%)
10  100.00% (+/- 0.30%) 106.71% (+/- 10.85%)
12  100.00% (+/- 1.18%)  98.95% (+/- 0.75%)
14  100.00% (+/- 1.82%) 102.89% (+/- 2.32%)

Regards,
Vincent

On 26 May 2014 12:04, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 26 May 2014 11:44, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> Hi Vincent,
>>
>> I conducted test runs of ebizzy on a Power8 box which had 48 cpus.
>> 6 cores with SMT-8 to be precise. Its a single socket box. The results
>> are as below.
>>
>> On 05/23/2014 09:22 PM, Vincent Guittot wrote:
>>> Part of this patchset was previously part of the larger tasks packing patchset
>>> [1]. I have splitted the latter in 3 different patchsets (at least) to make the
>>> thing easier.
>>> -configuration of sched_domain topology [2]
>>> -update and consolidation of cpu_power (this patchset)
>>> -tasks packing algorithm
>>>
>>> SMT system is no more the only system that can have a CPUs with an original
>>> capacity that is different from the default value. We need to extend the use of
>>> cpu_power_orig to all kind of platform so the scheduler will have both the
>>> maximum capacity (cpu_power_orig/power_orig) and the current capacity
>>> (cpu_power/power) of CPUs and sched_groups. A new function arch_scale_cpu_power
>>> has been created and replace arch_scale_smt_power, which is SMT specifc in the
>>> computation of the capapcity of a CPU.
>>>
>>> During load balance, the scheduler evaluates the number of tasks that a group
>>> of CPUs can handle. The current method assumes that tasks have a fix load of
>>> SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_POWER_SCALE.
>>> This assumption generates wrong decision by creating ghost cores and by
>>> removing real ones when the original capacity of CPUs is different from the
>>> default SCHED_POWER_SCALE.
>>>
>>> Now that we have the original capacity of a CPUS and its activity/utilization,
>>> we can evaluate more accuratly the capacity of a group of CPUs.
>>>
>>> This patchset mainly replaces the old capacity method by a new one and has kept
>>> the policy almost unchanged whereas we can certainly take advantage of this new
>>> statistic in several other places of the load balance.
>>>
>>> TODO:
>>>  - align variable's and field's name with the renaming [3]
>>>
>>> Tests results:
>>> I have put below results of 2 tests:
>>> - hackbench -l 500 -s 4096
>>> - scp of 100MB file on the platform
>>>
>>> on a dual cortex-A7
>>>                   hackbench        scp
>>> tip/master        25.75s(+/-0.25)  5.16MB/s(+/-1.49)
>>> + patches 1,2     25.89s(+/-0.31)  5.18MB/s(+/-1.45)
>>> + patches 3-10    25.68s(+/-0.22)  7.00MB/s(+/-1.88)
>>> + irq accounting  25.80s(+/-0.25)  8.06MB/s(+/-0.05)
>>>
>>> on a quad cortex-A15
>>>                   hackbench        scp
>>> tip/master        15.69s(+/-0.16)  9.70MB/s(+/-0.04)
>>> + patches 1,2     15.53s(+/-0.13)  9.72MB/s(+/-0.05)
>>> + patches 3-10    15.56s(+/-0.22)  9.88MB/s(+/-0.05)
>>> + irq accounting  15.99s(+/-0.08) 10.37MB/s(+/-0.03)
>>>
>>> The improvement of scp bandwidth happens when tasks and irq are using
>>> different CPU which is a bit random without irq accounting config
>>
>> N -> Number of threads of ebizzy
>>
>> Each 'N' run was for 30 seconds with multiple iterations and averaging them.
>>
>> N          %change in number of records
>>            read after patching
>> ------------------------------------------
>> 1          + 0.0038
>> 4          -17.6429
>> 8          -26.3989
>> 12         -29.5070
>> 16         -38.4842
>> 20         -44.5747
>> 24         -51.9792
>> 28         -34.1863
>> 32         -38.4029
>> 38         -22.2490
>> 42          -7.4843
>> 47         -0.69676
>>
>> Let me profile it and check where the cause of this degradation is.
>
> Hi Preeti,
>
> Thanks for the test and the help to find the root cause of the
> degradation. I'm going to run the test on my platforms too and see if I
> have similar results on my platforms
>
> Regards
> Vincent
>>
>>
>> Regards
>> Preeti U Murthy
>>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 00/11] sched: consolidation of cpu_power
  2014-05-26 15:54       ` Vincent Guittot
@ 2014-05-27  5:47         ` Preeti U Murthy
  -1 siblings, 0 replies; 216+ messages in thread
From: Preeti U Murthy @ 2014-05-27  5:47 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Morten Rasmussen, Mike Galbraith,
	Nicolas Pitre, linaro-kernel, Daniel Lezcano

On 05/26/2014 09:24 PM, Vincent Guittot wrote:
> Hi Preeti,
> 
> I have run ebizzy tests on my platforms but don't get results similar
> to yours (my results below). It seems to be linked to SMT. I'm going
> to look at that part more deeply and try to find more suitable HW for
> tests.

You are right, Vincent. I tested this in smt-off mode and the regression
was not seen. But the regression was of the order of 27% with a higher
number of threads in smt-on mode. What is interesting is that the
regression increases in the range N=1 to N=24 and then dips to 0 at N=48
on a 6-core, SMT-8 machine. Let me dig into this further.

Regards
Preeti U Murthy
> 
> ebizzy -t N -S 20
> Quad cores
>  N  tip                 +patchset
>  1  100.00% (+/- 0.30%)  97.00% (+/- 0.42%)
>  2  100.00% (+/- 0.80%) 100.48% (+/- 0.88%)
>  4  100.00% (+/- 1.18%)  99.32% (+/- 1.05%)
>  6  100.00% (+/- 8.54%)  98.84% (+/- 1.39%)
>  8  100.00% (+/- 0.45%)  98.89% (+/- 0.91%)
> 10  100.00% (+/- 0.32%)  99.25% (+/- 0.31%)
> 12  100.00% (+/- 0.15%)  99.20% (+/- 0.86%)
> 14  100.00% (+/- 0.58%)  99.44% (+/- 0.55%)
> 
> Dual cores
>  N  tip                 +patchset
>  1  100.00% (+/- 1.70%)  99.35% (+/- 2.82%)
>  2  100.00% (+/- 2.75%) 100.48% (+/- 1.51%)
>  4  100.00% (+/- 2.37%) 102.63% (+/- 2.35%)
>  6  100.00% (+/- 3.11%)  97.65% (+/- 1.02%)
>  8  100.00% (+/- 0.26%) 103.68% (+/- 5.90%)
> 10  100.00% (+/- 0.30%) 106.71% (+/- 10.85%)
> 12  100.00% (+/- 1.18%)  98.95% (+/- 0.75%)
> 14  100.00% (+/- 1.82%) 102.89% (+/- 2.32%)
> 
> Regards,
> Vincent
> 
> On 26 May 2014 12:04, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>> On 26 May 2014 11:44, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>>> Hi Vincent,
>>>
>>> I conducted test runs of ebizzy on a Power8 box which had 48 cpus.
>>> 6 cores with SMT-8 to be precise. Its a single socket box. The results
>>> are as below.
>>>
>>> On 05/23/2014 09:22 PM, Vincent Guittot wrote:
>>>> Part of this patchset was previously part of the larger tasks packing patchset
>>>> [1]. I have splitted the latter in 3 different patchsets (at least) to make the
>>>> thing easier.
>>>> -configuration of sched_domain topology [2]
>>>> -update and consolidation of cpu_power (this patchset)
>>>> -tasks packing algorithm
>>>>
>>>> SMT system is no more the only system that can have a CPUs with an original
>>>> capacity that is different from the default value. We need to extend the use of
>>>> cpu_power_orig to all kind of platform so the scheduler will have both the
>>>> maximum capacity (cpu_power_orig/power_orig) and the current capacity
>>>> (cpu_power/power) of CPUs and sched_groups. A new function arch_scale_cpu_power
>>>> has been created and replace arch_scale_smt_power, which is SMT specifc in the
>>>> computation of the capapcity of a CPU.
>>>>
>>>> During load balance, the scheduler evaluates the number of tasks that a group
>>>> of CPUs can handle. The current method assumes that tasks have a fix load of
>>>> SCHED_LOAD_SCALE and CPUs have a default capacity of SCHED_POWER_SCALE.
>>>> This assumption generates wrong decision by creating ghost cores and by
>>>> removing real ones when the original capacity of CPUs is different from the
>>>> default SCHED_POWER_SCALE.
>>>>
>>>> Now that we have the original capacity of a CPUS and its activity/utilization,
>>>> we can evaluate more accuratly the capacity of a group of CPUs.
>>>>
>>>> This patchset mainly replaces the old capacity method by a new one and has kept
>>>> the policy almost unchanged whereas we can certainly take advantage of this new
>>>> statistic in several other places of the load balance.
>>>>
>>>> TODO:
>>>>  - align variable's and field's name with the renaming [3]
>>>>
>>>> Tests results:
>>>> I have put below results of 2 tests:
>>>> - hackbench -l 500 -s 4096
>>>> - scp of 100MB file on the platform
>>>>
>>>> on a dual cortex-A7
>>>>                   hackbench        scp
>>>> tip/master        25.75s(+/-0.25)  5.16MB/s(+/-1.49)
>>>> + patches 1,2     25.89s(+/-0.31)  5.18MB/s(+/-1.45)
>>>> + patches 3-10    25.68s(+/-0.22)  7.00MB/s(+/-1.88)
>>>> + irq accounting  25.80s(+/-0.25)  8.06MB/s(+/-0.05)
>>>>
>>>> on a quad cortex-A15
>>>>                   hackbench        scp
>>>> tip/master        15.69s(+/-0.16)  9.70MB/s(+/-0.04)
>>>> + patches 1,2     15.53s(+/-0.13)  9.72MB/s(+/-0.05)
>>>> + patches 3-10    15.56s(+/-0.22)  9.88MB/s(+/-0.05)
>>>> + irq accounting  15.99s(+/-0.08) 10.37MB/s(+/-0.03)
>>>>
>>>> The improvement of scp bandwidth happens when tasks and irq are using
>>>> different CPU which is a bit random without irq accounting config
>>>
>>> N -> Number of threads of ebizzy
>>>
>>> Each 'N' run was for 30 seconds with multiple iterations and averaging them.
>>>
>>> N          %change in number of records
>>>            read after patching
>>> ------------------------------------------
>>> 1          + 0.0038
>>> 4          -17.6429
>>> 8          -26.3989
>>> 12         -29.5070
>>> 16         -38.4842
>>> 20         -44.5747
>>> 24         -51.9792
>>> 28         -34.1863
>>> 32         -38.4029
>>> 38         -22.2490
>>> 42          -7.4843
>>> 47         -0.69676
>>>
>>> Let me profile it and check where the cause of this degradation is.
>>
>> Hi Preeti,
>>
>> Thanks for the test and the help to find the root cause of the
>> degradation. I'm going to run the test on my platforms too and see if I
>> have similar results on my platforms
>>
>> Regards
>> Vincent
>>>
>>>
>>> Regards
>>> Preeti U Murthy
>>>
> 


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 02/11] sched: remove a wake_affine condition
  2014-05-23 15:52   ` Vincent Guittot
@ 2014-05-27 12:48     ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-27 12:48 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
> I have tried to understand the meaning of the condition :
>  (this_load <= load &&
>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
> but i failed to find a use case that can take advantage of it and i haven't
> found description of it in the previous commits' log.

commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48

    int try_to_wake_up():
    
    in this function the value SCHED_LOAD_BALANCE is used to represent the load
    contribution of a single task in various calculations in the code that
    decides which CPU to put the waking task on.  While this would be a valid
    on a system where the nice values for the runnable tasks were distributed
    evenly around zero it will lead to anomalous load balancing if the
    distribution is skewed in either direction.  To overcome this problem
    SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
    or by the average load_weight per task for the queue in question (as
    appropriate).

                        if ((tl <= load &&
-                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
-                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
+                               tl + target_load(cpu, idx) <= tl_per_task) ||
+                               100*(tl + p->load_weight) <= imbalance*load) {


commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f


+                       if ((tl <= load &&
+                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {


So back when the code got introduced, it read:

	target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
	target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE

So while the first line makes some sense, the second line is still
somewhat challenging.

I read the second line something like: if there's less than one full
task running on the combined cpus.

Now for idx==0 this is hard, because even when sync=1 you can only make
it true if both cpus are completely idle, in which case you really want
to move to the waking cpu I suppose.

One task running will have it == SCHED_LOAD_SCALE.

But for idx>0 this can trigger in all kinds of situations of light load.
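
To make that reading concrete, here is a toy user-space sketch (not kernel
code; the load values are invented) that evaluates only the second clause,
this_load + target_load(prev_cpu, idx) <= tl_per_task, with tl_per_task
taken as one SCHED_LOAD_SCALE worth of load:

#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL

int main(void)
{
	unsigned long tl_per_task = SCHED_LOAD_SCALE;	/* one average task */
	unsigned long samples[4][2] = {			/* { this_load, prev_load } */
		{    0,    0 },	/* both cpus idle */
		{    0, 1024 },	/* exactly one full task on prev_cpu */
		{  100,  200 },	/* decayed loads at idx > 0: light load */
		{  512,  768 },	/* more than one task's worth combined */
	};

	for (unsigned int i = 0; i < 4; i++) {
		unsigned long combined = samples[i][0] + samples[i][1];

		printf("this_load=%4lu prev_load=%4lu -> second clause %s\n",
		       samples[i][0], samples[i][1],
		       combined <= tl_per_task ? "holds" : "fails");
	}
	return 0;
}

The clause holds whenever the two cpus together carry at most one task's
worth of load, which matches the "less than one full task" reading; at
idx > 0 the decayed loads make it easier to hit under light load.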

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 02/11] sched: remove a wake_affine condition
  2014-05-23 15:52   ` Vincent Guittot
@ 2014-05-27 13:43     ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-27 13:43 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
> 
> If sync is set, it's not as straight forward as above (especially if cgroup
> are involved) 

avg load with cgroups is 'interesting' alright.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 02/11] sched: remove a wake_affine condition
  2014-05-23 15:52   ` Vincent Guittot
@ 2014-05-27 13:45     ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-27 13:45 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9587ed1..30240ab 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4238,7 +4238,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  {
>  	s64 this_load, load;
>  	int idx, this_cpu, prev_cpu;
> -	unsigned long tl_per_task;
>  	struct task_group *tg;
>  	unsigned long weight;
>  	int balanced;
> @@ -4296,32 +4295,22 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>  		balanced = this_eff_load <= prev_eff_load;
>  	} else
>  		balanced = true;
> +	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
>  
> +	if (!balanced)
> +		return 0;
>  	/*
>  	 * If the currently running task will sleep within
>  	 * a reasonable amount of time then attract this newly
>  	 * woken task:
>  	 */
> +	if (sync)
>  		return 1;
>  
> +	schedstat_inc(sd, ttwu_move_affine);
> +	schedstat_inc(p, se.statistics.nr_wakeups_affine);
>  
> +	return 1;
>  }

So I'm not usually one for schedstat nitpicking, but should we fix it in
the sync case? That is, for sync we return 1 but do not inc
nr_wakeups_affine, even though it's going to be an affine wakeup.



^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 03/11] sched: fix avg_load computation
  2014-05-23 15:52   ` Vincent Guittot
@ 2014-05-27 13:48     ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-27 13:48 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On Fri, May 23, 2014 at 05:52:57PM +0200, Vincent Guittot wrote:
> The computation of avg_load and avg_load_per_task should only take into
> account the number of cfs tasks. The non-cfs tasks are already taken into
> account by decreasing the cpu's capacity (cpu_power) and they will be tracked
> in the CPU's utilization (group_activity) introduced by the next patches

Ah yes, very good.  I don't think we had h_nr_running back when I
frobbed this last.
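
For illustration, a toy sketch (invented numbers, not kernel code) of the
skew described above: dividing the cfs load by all running tasks instead of
only the cfs ones (h_nr_running) halves the apparent per-task load as soon
as an rt task shares the CPU.

#include <stdio.h>

int main(void)
{
	unsigned long cfs_runnable_load = 1024;	/* one cfs task worth of load */
	unsigned int nr_running = 2;		/* 1 cfs task + 1 rt task on the cpu */
	unsigned int h_nr_running = 1;		/* cfs tasks only */

	printf("avg load per task, all tasks counted: %lu\n",
	       cfs_runnable_load / nr_running);
	printf("avg load per task, cfs tasks only   : %lu\n",
	       cfs_runnable_load / h_nr_running);
	return 0;
}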

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 02/11] sched: remove a wake_affine condition
  2014-05-27 12:48     ` Peter Zijlstra
@ 2014-05-27 15:19       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-27 15:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On 27 May 2014 14:48, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
>> I have tried to understand the meaning of the condition :
>>  (this_load <= load &&
>>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
>> but i failed to find a use case that can take advantage of it and i haven't
>> found description of it in the previous commits' log.
>
> commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
>
>     int try_to_wake_up():
>
>     in this function the value SCHED_LOAD_BALANCE is used to represent the load
>     contribution of a single task in various calculations in the code that
>     decides which CPU to put the waking task on.  While this would be a valid
>     on a system where the nice values for the runnable tasks were distributed
>     evenly around zero it will lead to anomalous load balancing if the
>     distribution is skewed in either direction.  To overcome this problem
>     SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
>     or by the average load_weight per task for the queue in question (as
>     appropriate).
>
>                         if ((tl <= load &&
> -                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> -                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
> +                               tl + target_load(cpu, idx) <= tl_per_task) ||
> +                               100*(tl + p->load_weight) <= imbalance*load) {

The oldest patch I had found was: https://lkml.org/lkml/2005/2/24/34
where task_hot had been replaced by
+ if ((tl <= load &&
+ tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+ 100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {

but as explained, I haven't found a clear explanation of this condition

>
>
> commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>
>
> +                       if ((tl <= load &&
> +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>
>
> So back when the code got introduced, it read:
>
>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>
> So while the first line makes some sense, the second line is still
> somewhat challenging.
>
> I read the second line something like: if there's less than one full
> task running on the combined cpus.

ok. your explanation makes sense

>
> Now for idx==0 this is hard, because even when sync=1 you can only make
> it true if both cpus are completely idle, in which case you really want
> to move to the waking cpu I suppose.

This use case is already taken into account by

if (this_load > 0)
..
else
 balance = true

>
> One task running will have it == SCHED_LOAD_SCALE.
>
> But for idx>0 this can trigger in all kinds of situations of light load.

target_load is the max between the load for idx == 0 and the load for the
selected idx, so we have even less chance to match the condition: both
cpus are completely idle
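
A minimal sketch of that point (toy code, invented values; target_load() is
modelled here as just the max of the instantaneous load and the decayed
cpu_load[idx] value):

#include <stdio.h>

/* toy model of target_load() at idx > 0: the max of the current load and
 * the decayed load kept for that idx */
static unsigned long target_load_toy(unsigned long raw, unsigned long decayed)
{
	return raw > decayed ? raw : decayed;
}

int main(void)
{
	/* a cpu that just went idle: instantaneous load 0, decayed load 300 */
	printf("just idle: %lu\n", target_load_toy(0, 300));
	/* a cpu idle long enough for the decayed load to drain as well */
	printf("long idle: %lu\n", target_load_toy(0, 0));
	return 0;
}

Because of the max, the decayed idx load can never make a cpu look less
loaded than it instantaneously is, so the light-load case is less likely
than it first appears.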

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 02/11] sched: remove a wake_affine condition
  2014-05-27 13:45     ` Peter Zijlstra
@ 2014-05-27 15:20       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-27 15:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On 27 May 2014 15:45, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 9587ed1..30240ab 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4238,7 +4238,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>  {
>>       s64 this_load, load;
>>       int idx, this_cpu, prev_cpu;
>> -     unsigned long tl_per_task;
>>       struct task_group *tg;
>>       unsigned long weight;
>>       int balanced;
>> @@ -4296,32 +4295,22 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
>>               balanced = this_eff_load <= prev_eff_load;
>>       } else
>>               balanced = true;
>> +     schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
>>
>> +     if (!balanced)
>> +             return 0;
>>       /*
>>        * If the currently running task will sleep within
>>        * a reasonable amount of time then attract this newly
>>        * woken task:
>>        */
>> +     if (sync)
>>               return 1;
>>
>> +     schedstat_inc(sd, ttwu_move_affine);
>> +     schedstat_inc(p, se.statistics.nr_wakeups_affine);
>>
>> +     return 1;
>>  }
>
> So I'm not usually one for schedstat nitpicking, but should we fix it in
> the sync case? That is, for sync we return 1 but do not inc
> nr_wakeups_affine, even though it's going to be an affine wakeup.

OK, I'm going to move the schedstat update to the right place

>
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 02/11] sched: remove a wake_affine condition
  2014-05-27 15:19       ` Vincent Guittot
@ 2014-05-27 15:39         ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-27 15:39 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On Tue, May 27, 2014 at 05:19:02PM +0200, Vincent Guittot wrote:
> On 27 May 2014 14:48, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
> >> I have tried to understand the meaning of the condition :
> >>  (this_load <= load &&
> >>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
> >> but i failed to find a use case that can take advantage of it and i haven't
> >> found description of it in the previous commits' log.
> >
> > commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
> >
> >     int try_to_wake_up():
> >
> >     in this function the value SCHED_LOAD_BALANCE is used to represent the load
> >     contribution of a single task in various calculations in the code that
> >     decides which CPU to put the waking task on.  While this would be a valid
> >     on a system where the nice values for the runnable tasks were distributed
> >     evenly around zero it will lead to anomalous load balancing if the
> >     distribution is skewed in either direction.  To overcome this problem
> >     SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
> >     or by the average load_weight per task for the queue in question (as
> >     appropriate).
> >
> >                         if ((tl <= load &&
> > -                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> > -                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
> > +                               tl + target_load(cpu, idx) <= tl_per_task) ||
> > +                               100*(tl + p->load_weight) <= imbalance*load) {
> 
> The oldest patch I had found was: https://lkml.org/lkml/2005/2/24/34
> where task_hot had been replaced by
> + if ((tl <= load &&
> + tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> + 100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
> 
> but as explained, I haven't found a clear explanation of this condition

Yeah, that's the commit I had below; but I suppose we could ask Nick if
we really want, I've heard he still replies to email, even though he's
locked up in a basement somewhere :-)

> > commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
> >
> >
> > +                       if ((tl <= load &&
> > +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
> > +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
> >
> >
> > So back when the code got introduced, it read:
> >
> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
> >
> > So while the first line makes some sense, the second line is still
> > somewhat challenging.
> >
> > I read the second line something like: if there's less than one full
> > task running on the combined cpus.
> 
> ok. your explanation makes sense

Maybe, it's still slightly weird :-)

> >
> > Now for idx==0 this is hard, because even when sync=1 you can only make
> > it true if both cpus are completely idle, in which case you really want
> > to move to the waking cpu I suppose.
> 
> This use case is already taken into account by
> 
> if (this_load > 0)
> ..
> else
>  balance = true

Agreed.

> > One task running will have it == SCHED_LOAD_SCALE.
> >
> > But for idx>0 this can trigger in all kinds of situations of light load.
> 
> target_load is the max between the load for idx == 0 and the load for the
> selected idx, so we have even less chance to match the condition: both
> cpus are completely idle

Ah, yes, I forgot to look at the target_load() thing and missed the max,
yes that all makes it entirely less likely.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 02/11] sched: remove a wake_affine condition
  2014-05-27 15:39         ` Peter Zijlstra
@ 2014-05-27 16:14           ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-27 16:14 UTC (permalink / raw)
  To: Peter Zijlstra, nickpiggin
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On 27 May 2014 17:39, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, May 27, 2014 at 05:19:02PM +0200, Vincent Guittot wrote:
>> On 27 May 2014 14:48, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
>> >> I have tried to understand the meaning of the condition :
>> >>  (this_load <= load &&
>> >>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
>> >> but i failed to find a use case that can take advantage of it and i haven't
>> >> found description of it in the previous commits' log.
>> >
>> > commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
>> >
>> >     int try_to_wake_up():
>> >
>> >     in this function the value SCHED_LOAD_BALANCE is used to represent the load
>> >     contribution of a single task in various calculations in the code that
>> >     decides which CPU to put the waking task on.  While this would be a valid
>> >     on a system where the nice values for the runnable tasks were distributed
>> >     evenly around zero it will lead to anomalous load balancing if the
>> >     distribution is skewed in either direction.  To overcome this problem
>> >     SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
>> >     or by the average load_weight per task for the queue in question (as
>> >     appropriate).
>> >
>> >                         if ((tl <= load &&
>> > -                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>> > -                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>> > +                               tl + target_load(cpu, idx) <= tl_per_task) ||
>> > +                               100*(tl + p->load_weight) <= imbalance*load) {
>>
>> The oldest patch i had found was: https://lkml.org/lkml/2005/2/24/34
>> where task_hot had been replaced by
>> + if ((tl <= load &&
>> + tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>> + 100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>
>> but as explained, i haven't found a clear explanation of this condition
>
> Yeah, that's the commit I had below; but I suppose we could ask Nick if
> we really want, I've heard he still replies to email, even though he's
> locked up in a basement somewhere :-)

ok, I have added him to the list.

Nick,

While doing some rework on the wake affine part of the scheduler, I
failed to catch the use case that takes advantage of a condition that
you added a while ago with commit
a3f21bce1fefdf92a4d1705e888d390b10f3ac6f

Could you help us clarify the first two lines of the test that you added?

+                       if ((tl <= load &&
+                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {

Regards,
Vincent

>
>> > commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>> >
>> >
>> > +                       if ((tl <= load &&
>> > +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>> > +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>> >
>> >
>> > So back when the code got introduced, it read:
>> >
>> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>> >
>> > So while the first line makes some sense, the second line is still
>> > somewhat challenging.
>> >
>> > I read the second line something like: if there's less than one full
>> > task running on the combined cpus.
>>
>> ok. your explanation makes sense
>
> Maybe, its still slightly weird :-)
>
>> >
>> > Now for idx==0 this is hard, because even when sync=1 you can only make
>> > it true if both cpus are completely idle, in which case you really want
>> > to move to the waking cpu I suppose.
>>
>> This use case is already taken into account by
>>
>> if (this_load > 0)
>> ..
>> else
>>  balance = true
>
> Agreed.
>
>> > One task running will have it == SCHED_LOAD_SCALE.
>> >
>> > But for idx>0 this can trigger in all kinds of situations of light load.
>>
>> target_load is the max between load for idx == 0 and load for the
>> selected idx so we have even less chance to match the condition : both
>> cpu are completely idle
>
> Ah, yes, I forgot to look at the target_load() thing and missed the max,
> yes that all makes it entirely less likely.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-23 15:53   ` Vincent Guittot
@ 2014-05-27 17:32     ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-27 17:32 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On Fri, May 23, 2014 at 05:53:02PM +0200, Vincent Guittot wrote:
> Monitor the activity level of each group of each sched_domain level. The
> activity is the amount of cpu_power that is currently used on a CPU or group
> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
> level. In the special use case where the CPU is fully loaded by more than 1
> task, the activity level is set above the cpu_power in order to reflect the
> overload of the CPU
> 

> +static int get_cpu_activity(int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +	u32 sum = rq->avg.runnable_avg_sum;
> +	u32 period = rq->avg.runnable_avg_period;
> +
> +	if (sum >= period)
> +		return power_orig_of(cpu) + rq->nr_running - 1;
> +
> +	return (sum * power_orig_of(cpu)) / period;
> +}

While I appreciate the need to signify the overload situation, I don't
think adding nr_running makes sense. The number of tasks has no bearing
on the amount of overload.

Also, and note I've not yet seen the use, it strikes me as odd to use
the orig power here. I would've thought the current capacity (not the
max capacity) is relevant to balancing.
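
To make that concrete, a hypothetical variant along those lines (not part
of the posted series) would saturate at the capacity instead of encoding
nr_running, and use the current rather than the original power:

static int get_cpu_activity(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	u32 sum = rq->avg.runnable_avg_sum;
	u32 period = rq->avg.runnable_avg_period;

	/* saturated: report the full (current) capacity */
	if (sum >= period)
		return power_of(cpu);

	return (sum * power_of(cpu)) / period;
}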

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 02/11] sched: remove a wake_affine condition
  2014-05-27 16:14           ` Vincent Guittot
@ 2014-05-28  6:49             ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-28  6:49 UTC (permalink / raw)
  To: Peter Zijlstra, npiggin
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

Using another email address for Nick

On 27 May 2014 18:14, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 27 May 2014 17:39, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Tue, May 27, 2014 at 05:19:02PM +0200, Vincent Guittot wrote:
>>> On 27 May 2014 14:48, Peter Zijlstra <peterz@infradead.org> wrote:
>>> > On Fri, May 23, 2014 at 05:52:56PM +0200, Vincent Guittot wrote:
>>> >> I have tried to understand the meaning of the condition :
>>> >>  (this_load <= load &&
>>> >>   this_load + target_load(prev_cpu, idx) <= tl_per_task)
>>> >> but i failed to find a use case that can take advantage of it and i haven't
>>> >> found description of it in the previous commits' log.
>>> >
>>> > commit 2dd73a4f09beacadde827a032cf15fd8b1fa3d48
>>> >
>>> >     int try_to_wake_up():
>>> >
>>> >     in this function the value SCHED_LOAD_BALANCE is used to represent the load
>>> >     contribution of a single task in various calculations in the code that
>>> >     decides which CPU to put the waking task on.  While this would be a valid
>>> >     on a system where the nice values for the runnable tasks were distributed
>>> >     evenly around zero it will lead to anomalous load balancing if the
>>> >     distribution is skewed in either direction.  To overcome this problem
>>> >     SCHED_LOAD_SCALE has been replaced by the load_weight for the relevant task
>>> >     or by the average load_weight per task for the queue in question (as
>>> >     appropriate).
>>> >
>>> >                         if ((tl <= load &&
>>> > -                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>>> > -                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>> > +                               tl + target_load(cpu, idx) <= tl_per_task) ||
>>> > +                               100*(tl + p->load_weight) <= imbalance*load) {
>>>
>>> The oldest patch i had found was: https://lkml.org/lkml/2005/2/24/34
>>> where task_hot had been replaced by
>>> + if ((tl <= load &&
>>> + tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>>> + 100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>>
>>> but as explained, i haven't found a clear explanation of this condition
>>
>> Yeah, that's the commit I had below; but I suppose we could ask Nick if
>> we really want, I've heard he still replies to email, even though he's
>> locked up in a basement somewhere :-)

ok, I have added him to the list.

Nick,

While doing some rework on the wake affine part of the scheduler, I
failed to catch the use case that takes advantage of a condition that
you added a while ago with commit
a3f21bce1fefdf92a4d1705e888d390b10f3ac6f

Could you help us clarify the first two lines of the test that you added?

+                       if ((tl <= load &&
+                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
+                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {

Regards,
Vincent
>
>>
>>> > commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>>> >
>>> >
>>> > +                       if ((tl <= load &&
>>> > +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>>> > +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>> >
>>> >
>>> > So back when the code got introduced, it read:
>>> >
>>> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>>> >         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>>> >
>>> > So while the first line makes some sense, the second line is still
>>> > somewhat challenging.
>>> >
>>> > I read the second line something like: if there's less than one full
>>> > task running on the combined cpus.
>>>
>>> ok. your explanation makes sense
>>
>> Maybe, its still slightly weird :-)
>>
>>> >
>>> > Now for idx==0 this is hard, because even when sync=1 you can only make
>>> > it true if both cpus are completely idle, in which case you really want
>>> > to move to the waking cpu I suppose.
>>>
>>> This use case is already taken into account by
>>>
>>> if (this_load > 0)
>>> ..
>>> else
>>>  balance = true
>>
>> Agreed.
>>
>>> > One task running will have it == SCHED_LOAD_SCALE.
>>> >
>>> > But for idx>0 this can trigger in all kinds of situations of light load.
>>>
>>> target_load is the max between load for idx == 0 and load for the
>>> selected idx so we have even less chance to match the condition : both
>>> cpu are completely idle
>>
>> Ah, yes, I forgot to look at the target_load() thing and missed the max,
>> yes that all makes it entirely less likely.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-27 17:32     ` Peter Zijlstra
@ 2014-05-28  7:01       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-28  7:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On 27 May 2014 19:32, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:53:02PM +0200, Vincent Guittot wrote:
>> Monitor the activity level of each group of each sched_domain level. The
>> activity is the amount of cpu_power that is currently used on a CPU or group
>> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
>> level. In the special use case where the CPU is fully loaded by more than 1
>> task, the activity level is set above the cpu_power in order to reflect the
>> overload of the CPU
>>
>
>> +static int get_cpu_activity(int cpu)
>> +{
>> +     struct rq *rq = cpu_rq(cpu);
>> +     u32 sum = rq->avg.runnable_avg_sum;
>> +     u32 period = rq->avg.runnable_avg_period;
>> +
>> +     if (sum >= period)
>> +             return power_orig_of(cpu) + rq->nr_running - 1;
>> +
>> +     return (sum * power_orig_of(cpu)) / period;
>> +}
>
> While I appreciate the need to signify the overload situation, I don't
> think adding nr_running makes sense. The amount of tasks has no bearing
> on the amount of overload.

I agree that it's not the best way to evaluate overload, but it's the
only simple one that is available without additional computation. I'm
going to try to find a better metric, or come back to the solution which
only adds +1 and computes the overload somewhere else.
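
A sketch of that simpler variant, for illustration only:

	if (sum >= period)
		return power_orig_of(cpu) + 1;	/* only flag the saturation */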

>
> Also, and note I've not yet seen the use, it strikes me as odd to use
> the orig power here. I would've thought the current capacity (not the
> max capacity) is relevant to balancing.

Activity also measures the non-cfs task activity, whereas current
capacity removes the capacity used by non-cfs tasks. So as long as
arch_scale_cpu_freq is not used (which is the case for now), original
capacity is ok, but we might need a 3rd metric so we would have:
- original capacity (max capacity that a CPU can provide)
- usable capacity (new value that reflects the current capacity available
  for any kind of processing: rt tasks, irq, cfs tasks)
- current capacity (capacity available for cfs tasks)
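
A rough sketch of how those three views could be exposed; the accessor
names and the 'usable' field are hypothetical, only cpu_power_orig and
cpu_power exist in this series:

/* max capacity the CPU can provide */
static unsigned long capacity_orig_of(int cpu)
{
	return cpu_rq(cpu)->cpu_power_orig;
}

/* hypothetical: capacity left for any processing (rt tasks, irq, cfs) */
static unsigned long capacity_usable_of(int cpu)
{
	return cpu_rq(cpu)->cpu_power_usable;	/* field does not exist yet */
}

/* capacity left for cfs tasks */
static unsigned long capacity_of(int cpu)
{
	return cpu_rq(cpu)->cpu_power;
}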

Vincent

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 09/11] sched: test the cpu's capacity in wake affine
  2014-05-23 15:53   ` Vincent Guittot
@ 2014-05-28 10:58     ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-28 10:58 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On Fri, May 23, 2014 at 05:53:03PM +0200, Vincent Guittot wrote:
> Currently the task always wakes affine on this_cpu if the latter is idle.
> Before waking up the task on this_cpu, we check that this_cpu capacity is not
> significantly reduced because of RT tasks or irq activity.
> 
> Use case where the number of irq and the time spent under irq is important
> will take benefit of this because the task that is woken up by irq or softirq
> will not use the same CPU than irq (and softirq) but a idle one which share
> its LLC.

OK, so I'm having a terrible time parsing the above.

So looking at the patch, you make balance false even when this_load==0,
when the effective power/capacity (nico's patches haven't fully sunk in
yet) of this cpu is less than that of the previous cpu.

Is that right?

Now I'm only struggling to understand the rationale for this; it's got
LLC in there somewhere, but I'm failing to comprehend.
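
In code terms, a minimal sketch of that reading (not the actual patch, and
the 25% margin is made up) would be something like this inside wake_affine():

	/*
	 * Even when this_load == 0, don't pull the task if this_cpu's
	 * capacity is significantly reduced compared to prev_cpu's.
	 */
	if (this_load == 0 &&
	    power_of(this_cpu) * 4 < power_of(prev_cpu) * 3)
		return 0;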

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 09/11] sched: test the cpu's capacity in wake affine
  2014-05-28 10:58     ` Peter Zijlstra
@ 2014-05-28 11:15       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-28 11:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On 28 May 2014 12:58, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:53:03PM +0200, Vincent Guittot wrote:
>> Currently the task always wakes affine on this_cpu if the latter is idle.
>> Before waking up the task on this_cpu, we check that this_cpu capacity is not
>> significantly reduced because of RT tasks or irq activity.
>>
>> Use case where the number of irq and the time spent under irq is important
>> will take benefit of this because the task that is woken up by irq or softirq
>> will not use the same CPU than irq (and softirq) but a idle one which share
>> its LLC.
>
> OK, so I'm having a terrible time parsing the above.
>
> So looking at the patch you make balance false even when this_load==0
> when the effective power/capacity (nico's patches haven't fully sunk in
> yet) of this cpu is less than that of the previous cpu.
>
> Is that right?

yes,

>
> Now I'm only struggling to understand the rationale for this, its got
> LLC in there somewhere, but I'm failing to comprehend.

Ah.. I had probably overestimated the fact that wake_affine was only
done at the MC or SMT level, but after reading the flags configuration
of all sched_domain levels more closely, my assumption is not true.
So I need to test the sd flags to make sure that the CPUs share their
cache at this level.
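
Something along these lines, as a sketch (SD_SHARE_PKG_RESOURCES is the
existing flag that marks cache sharing at a sched_domain level):

	/* only apply the capacity check where the CPUs share their LLC */
	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
		/* compare power_of(this_cpu) with power_of(prev_cpu) here */
	}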

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-23 15:53   ` Vincent Guittot
@ 2014-05-28 12:10     ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-05-28 12:10 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
> Monitor the activity level of each group of each sched_domain level. The
> activity is the amount of cpu_power that is currently used on a CPU or group
> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
> level. In the special use case where the CPU is fully loaded by more than 1
> task, the activity level is set above the cpu_power in order to reflect the
> overload of the CPU
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 22 ++++++++++++++++++++++
>  1 file changed, 22 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b7c51be..c01d8b6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
>  	return cpu_rq(cpu)->cpu_power;
>  }
>  
> +static unsigned long power_orig_of(int cpu)
> +{
> +	return cpu_rq(cpu)->cpu_power_orig;
> +}
> +
>  static unsigned long cpu_avg_load_per_task(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> @@ -4438,6 +4443,18 @@ done:
>  	return target;
>  }
>  
> +static int get_cpu_activity(int cpu)
> +{
> +	struct rq *rq = cpu_rq(cpu);
> +	u32 sum = rq->avg.runnable_avg_sum;
> +	u32 period = rq->avg.runnable_avg_period;
> +
> +	if (sum >= period)
> +		return power_orig_of(cpu) + rq->nr_running - 1;
> +
> +	return (sum * power_orig_of(cpu)) / period;
> +}

The rq runnable_avg_{sum, period} give a very long term view of the cpu
utilization (I will use the term utilization instead of activity as I
think that is what we are talking about here). IMHO, it is too slow to
be used as basis for load balancing decisions. I think that was also
agreed upon in the last discussion related to this topic [1].

The basic problem is that worst case: sum starting from 0 and period
already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
(ms) for sum to reach 47742. In other words, the cpu might have been
fully utilized for 345 ms before it is considered fully utilized.
Periodic load-balancing happens much more frequently than that.

Also, if load-balancing actually moves tasks around it may take quite a
while before runnable_avg_sum actually reflects this change. The next
periodic load-balance is likely to happen before runnable_avg_sum has
reflected the result of the previous periodic load-balance.

To avoid these problems, we need to base utilization on a metric which
is updated instantaneously when we add/remove tasks to a cpu (or at least
fast enough that we don't see the above problems). In the previous
discussion [1] it was suggested to use a sum of unweighted task
runnable_avg_{sum,period} ratios instead, that is, an unweighted
equivalent of weighted_cpuload(). That isn't a perfect solution either.
It is fine as long as the cpus are not fully utilized, but when they are
we need to use weighted_cpuload() to preserve smp_nice. What to do
around the tipping point needs more thought, but I think that is
currently the best proposal for a solution for task and cpu utilization.
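
For reference, a sketch of that unweighted per-task ratio, modelled on
__update_task_entity_contrib() but with the task weight replaced by
NICE_0_LOAD; the helper name is made up:

static inline unsigned long task_unweighted_contrib(struct sched_entity *se)
{
	u32 contrib;

	/* runnable ratio scaled by NICE_0_LOAD instead of se->load.weight */
	contrib = se->avg.runnable_avg_sum * scale_load_down(NICE_0_LOAD);
	contrib /= (se->avg.runnable_avg_period + 1);

	return contrib;
}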

rq runnable_avg_sum is useful for decisions where we need a longer term
view of the cpu utilization, but I don't see how we can use it as a cpu
utilization metric for load-balancing decisions at wakeup or
periodically.

Morten

[1] https://lkml.org/lkml/2014/1/8/251

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-28 12:10     ` Morten Rasmussen
@ 2014-05-28 13:15       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-28 13:15 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
>> Monitor the activity level of each group of each sched_domain level. The
>> activity is the amount of cpu_power that is currently used on a CPU or group
>> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
>> level. In the special use case where the CPU is fully loaded by more than 1
>> task, the activity level is set above the cpu_power in order to reflect the
>> overload of the CPU
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 22 ++++++++++++++++++++++
>>  1 file changed, 22 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b7c51be..c01d8b6 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
>>       return cpu_rq(cpu)->cpu_power;
>>  }
>>
>> +static unsigned long power_orig_of(int cpu)
>> +{
>> +     return cpu_rq(cpu)->cpu_power_orig;
>> +}
>> +
>>  static unsigned long cpu_avg_load_per_task(int cpu)
>>  {
>>       struct rq *rq = cpu_rq(cpu);
>> @@ -4438,6 +4443,18 @@ done:
>>       return target;
>>  }
>>
>> +static int get_cpu_activity(int cpu)
>> +{
>> +     struct rq *rq = cpu_rq(cpu);
>> +     u32 sum = rq->avg.runnable_avg_sum;
>> +     u32 period = rq->avg.runnable_avg_period;
>> +
>> +     if (sum >= period)
>> +             return power_orig_of(cpu) + rq->nr_running - 1;
>> +
>> +     return (sum * power_orig_of(cpu)) / period;
>> +}
>
> The rq runnable_avg_{sum, period} give a very long term view of the cpu
> utilization (I will use the term utilization instead of activity as I
> think that is what we are talking about here). IMHO, it is too slow to
> be used as basis for load balancing decisions. I think that was also
> agreed upon in the last discussion related to this topic [1].
>
> The basic problem is that worst case: sum starting from 0 and period
> already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> (ms) for sum to reach 47742. In other words, the cpu might have been
> fully utilized for 345 ms before it is considered fully utilized.
> Periodic load-balancing happens much more frequently than that.

I agree that it's not really responsive, but several statistics of the
scheduler use the same kind of metric and have the same kind of
responsiveness.
I agree that it's not enough on its own, and that's why I'm not using
only this metric: it gives information that the unweighted
load_avg_contrib (that you are speaking about below) can't give. So I
would be less categorical than you and would say that we probably need
additional metrics.

>
> Also, if load-balancing actually moves tasks around it may take quite a
> while before runnable_avg_sum actually reflects this change. The next
> periodic load-balance is likely to happen before runnable_avg_sum has
> reflected the result of the previous periodic load-balance.

runnable_avg_sum uses a 1ms unit step, so I tend to disagree with your
point above.

>
> To avoid these problems, we need to base utilization on a metric which
> is updated instantaneously when we add/remove tasks to a cpu (or a least
> fast enough that we don't see the above problems). In the previous
> discussion [1] it was suggested that a sum of unweighted task
> runnable_avg_{sum,period} ratio instead. That is, an unweighted
> equivalent to weighted_cpuload(). That isn't a perfect solution either.

Regarding the unweighted load_avg_contrib, you will have a similar issue
because of the slowness in the variation of each sched_entity load
that will be added/removed in the unweighted load_avg_contrib.

The update of the runnable_avg_{sum,period} of a sched_entity is
quite similar to cpu utilization. This value is linked to the CPU on
which it has run previously because of the time sharing with other
tasks, so the unweighted load of a freshly migrated task will reflect
its load on the previous CPU (with the time sharing with the other
tasks on the prev CPU).

I'm not saying that such a metric is useless, but it's not perfect either.

Vincent

> It is fine as long as the cpus are not fully utilized, but when they are
> we need to use weighted_cpuload() to preserve smp_nice. What to do
> around the tipping point needs more thought, but I think that is
> currently the best proposal for a solution for task and cpu utilization.
>
> rq runnable_avg_sum is useful for decisions where we need a longer term
> view of the cpu utilization, but I don't see how we can use as cpu
> utilization metric for load-balancing decisions at wakeup or
> periodically.
>
> Morten
>
> [1] https://lkml.org/lkml/2014/1/8/251

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 02/11] sched: remove a wake_affine condition
  2014-05-28  6:49             ` Vincent Guittot
@ 2014-05-28 15:09               ` Dietmar Eggemann
  -1 siblings, 0 replies; 216+ messages in thread
From: Dietmar Eggemann @ 2014-05-28 15:09 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra, npiggin
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

Hi Vincent & Peter,

On 28/05/14 07:49, Vincent Guittot wrote:
[...]
> 
> Nick,
> 
> While doing some rework on the wake affine part of the scheduler, i
> failed to catch the use case that takes advantage of a condition that
> you added some while ago with the commit
> a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
> 
> Could you help us to clarify the 2 first lines of the test that you added ?
> +                       if ((tl <= load &&
> +                               tl + target_load(cpu, idx) <=
> SCHED_LOAD_SCALE) ||
> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
> 
> Regards,
> Vincent
>>
>>>
>>>>> commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>>>>>
>>>>>
>>>>> +                       if ((tl <= load &&
>>>>> +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>>>>> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>>>>
>>>>>
>>>>> So back when the code got introduced, it read:
>>>>>
>>>>>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>>>>>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>>>>>

Shouldn't this be 

target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE <= source_load(prev_cpu, idx) &&
target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(prev_cpu, idx) <= SCHED_LOAD_SCALE

"[PATCH] sched: implement smpnice" (2dd73a4f09beacadde827a032cf15fd8b1fa3d48)
mentions that SCHED_LOAD_BALANCE (IMHO, should be SCHED_LOAD_SCALE) represents
the load contribution of a single task. So I read the second part as: the
sum of the load of this_cpu and prev_cpu is smaller than or equal to the
(maximal) load contribution (maximal possible effect) of a single task.

There is even a comment in "[PATCH] sched: tweak affine wakeups"
(a3f21bce1fefdf92a4d1705e888d390b10f3ac6f) in try_to_wake_up() when
SCHED_LOAD_SCALE gets subtracted from tl = this_load =
target_load(this_cpu, idx):

+ * If sync wakeup then subtract the (maximum possible)
+ * effect of the currently running task from the load
+ * of the current CPU:

"[PATCH] sched: implement smpnice" then replaces SCHED_LOAD_SCALE w/ 

+static inline unsigned long cpu_avg_load_per_task(int cpu)
+{
+       runqueue_t *rq = cpu_rq(cpu);
+       unsigned long n = rq->nr_running;
+
+       return n ?  rq->raw_weighted_load / n : SCHED_LOAD_SCALE;
+}

-- Dietmar

>>>>> So while the first line makes some sense, the second line is still
>>>>> somewhat challenging.
>>>>>
>>>>> I read the second line something like: if there's less than one full
>>>>> task running on the combined cpus.
>>>>
>>>> ok. your explanation makes sense
>>>
>>> Maybe, its still slightly weird :-)
>>>
>>>>>
[...]


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-28 12:10     ` Morten Rasmussen
@ 2014-05-28 15:17       ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-28 15:17 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote:
> The basic problem is that worst case: sum starting from 0 and period
> already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> (ms) for sum to reach 47742. 

But it takes only 4 half-lives (4*32 ms) to get ~94% of the way there.

The half-life is 32 periods of 1024 us, i.e. ~32 ms; now look at
\Sum_i=1,4 1/(2^i) = 0.9375. So 4 half-lives (4*32*1024 us ~= 131 ms) get
us ~94% of the way there.

Now, granted, 4 half-lives of 32-ish ms is still very long, but nowhere
near as horrid as the full convergence.

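For anyone who wants to check the numbers, here is a small stand-alone
sketch (plain user-space C, not kernel code; the 1024 contribution per
~1 ms period, the 32 ms half-life and the ~47742 maximum are simply the
values from the discussion above) that sums the same geometric series:

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* decay per ~1 ms period, y^32 = 0.5 */
	double max = 1024.0 / (1.0 - y);	/* ~47742 up to rounding (LOAD_AVG_MAX) */
	double sum = 0.0;			/* history starts empty */
	int ms;

	for (ms = 1; ms <= 345; ms++) {
		sum = sum * y + 1024.0;		/* one fully busy period */
		if (ms == 32 || ms == 128 || ms == 345)
			printf("after %3d ms: %5.0f (%2.0f%% of %5.0f)\n",
			       ms, sum, 100.0 * sum / max, max);
	}
	return 0;
}

This prints ~50% at 32 ms, ~94% at 128 ms (the 4 half-lives above) and
only approaches 100% around 345 ms.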

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 02/11] sched: remove a wake_affine condition
  2014-05-28 15:09               ` Dietmar Eggemann
@ 2014-05-28 15:25                 ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-28 15:25 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, npiggin, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Preeti U Murthy, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano

On 28 May 2014 17:09, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> Hi Vincent & Peter,
>
> On 28/05/14 07:49, Vincent Guittot wrote:
> [...]
>>
>> Nick,
>>
>> While doing some rework on the wake affine part of the scheduler, i
>> failed to catch the use case that takes advantage of a condition that
>> you added some while ago with the commit
>> a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>>
>> Could you help us to clarify the 2 first lines of the test that you added ?
>> +                       if ((tl <= load &&
>> +                               tl + target_load(cpu, idx) <=
>> SCHED_LOAD_SCALE) ||
>> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>
>> Regards,
>> Vincent
>>>
>>>>
>>>>>> commit a3f21bce1fefdf92a4d1705e888d390b10f3ac6f
>>>>>>
>>>>>>
>>>>>> +                       if ((tl <= load &&
>>>>>> +                               tl + target_load(cpu, idx) <= SCHED_LOAD_SCALE) ||
>>>>>> +                               100*(tl + SCHED_LOAD_SCALE) <= imbalance*load) {
>>>>>>
>>>>>>
>>>>>> So back when the code got introduced, it read:
>>>>>>
>>>>>>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE < source_load(this_cpu, idx) &&
>>>>>>         target_load(prev_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(this_cpu, idx) < SCHED_LOAD_SCALE
>>>>>>
>
> Shouldn't this be
>
> target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE <= source_load(prev_cpu, idx) &&
> target_load(this_cpu, idx) - sync*SCHED_LOAD_SCALE + target_load(prev_cpu, idx) <= SCHED_LOAD_SCALE

yes, there was a typo in Peter's explanation

>
> "[PATCH] sched: implement smpnice" (2dd73a4f09beacadde827a032cf15fd8b1fa3d48)
> mentions that SCHED_LOAD_BALANCE (IMHO, should be SCHED_LOAD_SCALE) represents
> the load contribution of a single task. So I read the second part as if
> the sum of the load of this_cpu and prev_cpu is smaller or equal to the
> (maximal) load contribution (maximal possible effect) of a single task.
>
> There is even a comment in "[PATCH] sched: tweak affine wakeups"
> (a3f21bce1fefdf92a4d1705e888d390b10f3ac6f) in try_to_wake_up() when
> SCHED_LOAD_SCALE gets subtracted from tl = this_load =
> target_load(this_cpu, idx):
>
> + * If sync wakeup then subtract the (maximum possible)
> + * effect of the currently running task from the load
> + * of the current CPU:
>
> "[PATCH] sched: implement smpnice" then replaces SCHED_LOAD_SCALE w/
>
> +static inline unsigned long cpu_avg_load_per_task(int cpu)
> +{
> +       runqueue_t *rq = cpu_rq(cpu);
> +       unsigned long n = rq->nr_running;
> +
> +       return n ?  rq->raw_weighted_load / n : SCHED_LOAD_SCALE;
>
> -- Dietmar
>
>>>>>> So while the first line makes some sense, the second line is still
>>>>>> somewhat challenging.
>>>>>>
>>>>>> I read the second line something like: if there's less than one full
>>>>>> task running on the combined cpus.
>>>>>
>>>>> ok. your explanation makes sense
>>>>
>>>> Maybe, its still slightly weird :-)
>>>>
>>>>>>
> [...]
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-28 13:15       ` Vincent Guittot
@ 2014-05-28 15:47         ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-05-28 15:47 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
> On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
> >> Monitor the activity level of each group of each sched_domain level. The
> >> activity is the amount of cpu_power that is currently used on a CPU or group
> >> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
> >> level. In the special use case where the CPU is fully loaded by more than 1
> >> task, the activity level is set above the cpu_power in order to reflect the
> >> overload of the CPU
> >>
> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> ---
> >>  kernel/sched/fair.c | 22 ++++++++++++++++++++++
> >>  1 file changed, 22 insertions(+)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index b7c51be..c01d8b6 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -4044,6 +4044,11 @@ static unsigned long power_of(int cpu)
> >>       return cpu_rq(cpu)->cpu_power;
> >>  }
> >>
> >> +static unsigned long power_orig_of(int cpu)
> >> +{
> >> +     return cpu_rq(cpu)->cpu_power_orig;
> >> +}
> >> +
> >>  static unsigned long cpu_avg_load_per_task(int cpu)
> >>  {
> >>       struct rq *rq = cpu_rq(cpu);
> >> @@ -4438,6 +4443,18 @@ done:
> >>       return target;
> >>  }
> >>
> >> +static int get_cpu_activity(int cpu)
> >> +{
> >> +     struct rq *rq = cpu_rq(cpu);
> >> +     u32 sum = rq->avg.runnable_avg_sum;
> >> +     u32 period = rq->avg.runnable_avg_period;
> >> +
> >> +     if (sum >= period)
> >> +             return power_orig_of(cpu) + rq->nr_running - 1;
> >> +
> >> +     return (sum * power_orig_of(cpu)) / period;
> >> +}
> >
> > The rq runnable_avg_{sum, period} give a very long term view of the cpu
> > utilization (I will use the term utilization instead of activity as I
> > think that is what we are talking about here). IMHO, it is too slow to
> > be used as basis for load balancing decisions. I think that was also
> > agreed upon in the last discussion related to this topic [1].
> >
> > The basic problem is that worst case: sum starting from 0 and period
> > already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> > (ms) for sum to reach 47742. In other words, the cpu might have been
> > fully utilized for 345 ms before it is considered fully utilized.
> > Periodic load-balancing happens much more frequently than that.
> 
> I agree that it's not really responsive but several statistics of the
> scheduler use the same kind of metrics and have the same kind of
> responsiveness.

I might be wrong, but I don't think we use anything similar to this to
estimate cpu load/utilization for load-balancing purposes except for
{source, target}_load() where it is used to bias the decisions whether
or not to balance if the difference is small. That is what the
discussion was about last time.

> I agree that it's not enough and that's why i'm not using only this
> metric but it gives information that the unweighted load_avg_contrib
> (that you are speaking about below) can't give. So i would be less
> contrasted than you and would say that we probably need additional
> metrics

I'm not saying that we shouldn't use this metric at all, I'm just saying
that I don't think it is suitable for estimating the short-term cpu
utilization, which is what you need to make load-balancing decisions. We
can't observe the effect of recent load-balancing decisions if the
metric is too slow to react.

I realize that what I mean by 'slow' might be unclear. Load tracking
(both task and rq) takes a certain amount of history into account in
runnable_avg_{sum, period}. This amount is determined by the 'y'-weight,
which has been chosen such that we consider the load in the past 345
time units, where the time unit is ~1 ms. The contribution is smaller
the further you go back due to y^n, which diminishes to 0 for n > 345.
So, if a task or cpu goes from having been idle for >345 ms to being
constantly busy, it will take 345 ms until the entire history that we
care about will reflect this change. Only then runnable_avg_sum will
reach 47742. The rate of change is faster to begin with since the weight
of the most recent history is higher. runnable_avg_sum will get to
47742/2 in just 32 ms.

Since we may do periodic load-balance every 10 ms or so, we will perform
a number of load-balances where runnable_avg_sum will mostly be
reflecting the state of the world before a change (new task queued or
moved a task to a different cpu). If you have two tasks running continuously
on one cpu and your other cpu is idle, and you move one of the tasks to
the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
first cpu while it starts from 0 on the other one. 10 ms later it will
have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
it reaches 47742. In the mean time the cpu doesn't appear fully utilized
and we might decide to put more tasks on it because we don't know if
runnable_avg_sum represents a partially utilized cpu (for example a 50%
task) or if it will continue to rise and eventually get to 47742.
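
To put numbers on that migration scenario, here is a stand-alone sketch
(plain user-space C, not kernel code; it assumes the destination cpu was
idle long enough for runnable_avg_period to have converged and takes
power_orig = 1024) of what a get_cpu_activity()-style ratio would report:

#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);	/* y^32 = 0.5, ~1 ms periods */
	double period = 1024.0 / (1.0 - y);	/* already converged (~47742) */
	double sum = 0.0;			/* cpu was idle before the move */
	int ms;

	for (ms = 1; ms <= 345; ms++) {
		sum = sum * y + 1024.0;		/* cpu now 100% busy */
		period = period * y + 1024.0;	/* period accrues in any case */
		if (ms == 10 || ms == 32 || ms == 128 || ms == 345)
			printf("%3d ms after the move: activity ~%4.0f / 1024\n",
			       ms, 1024.0 * sum / period);
	}
	return 0;
}

So for several load-balance intervals the destination cpu still reports
only a fraction of its real utilization (roughly 200/1024 after 10 ms,
512/1024 after 32 ms), which is the ambiguity described above.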

IMO, we need cpu utilization to clearly represent the current
utilization of the cpu such that any changes since the last wakeup or
load-balance are clearly visible.

> 
> >
> > Also, if load-balancing actually moves tasks around it may take quite a
> > while before runnable_avg_sum actually reflects this change. The next
> > periodic load-balance is likely to happen before runnable_avg_sum has
> > reflected the result of the previous periodic load-balance.
> 
> runnable_avg_sum uses a 1ms unit step so i tend to disagree with your
> point above

See explanation above. The important thing is how much history we take
into account. That is 345x 1 ms time units. The rate at which the sum is
updated doesn't change anything. 1 ms after a change (wakeup,
load-balance,...) runnable_avg_sum can only change by 1024. The
remaining ~98% of your weighted history still reflects the world before
the change.

> > To avoid these problems, we need to base utilization on a metric which
> > is updated instantaneously when we add/remove tasks to a cpu (or a least
> > fast enough that we don't see the above problems). In the previous
> > discussion [1] it was suggested that a sum of unweighted task
> > runnable_avg_{sum,period} ratio instead. That is, an unweighted
> > equivalent to weighted_cpuload(). That isn't a perfect solution either.
> 
> Regarding the unweighted load_avg_contrib, you will have similar issue
> because of the slowness in the variation of each sched_entity load
> that will be added/removed in the unweighted load_avg_contrib.
> 
> The update of the runnable_avg_{sum,period}  of an sched_entity is
> quite similar to cpu utilization.

Yes, runnable_avg_{sum, period} for tasks and rqs are exactly the same.
No difference there :)

> This value is linked to the CPU on
> which it has run previously because of the time sharing with others
> tasks, so the unweighted load of a freshly migrated task will reflect
> its load on the previous CPU (with the time sharing with other tasks
> on prev CPU).

I agree that the task runnable_avg_sum is always affected by the
circumstances on the cpu where it is running, and that it takes this
history with it. However, I think cfs.runnable_load_avg leads to fewer
problems than using the rq runnable_avg_sum. It would work nicely for
the two-tasks-on-two-cpus example I mentioned earlier. We don't need to add
something on top when the cpu is fully utilized by more than one task.
It comes more naturally with cfs.runnable_load_avg. If it is much larger
than 47742, it should be fairly safe to assume that you shouldn't stick
more tasks on that cpu.

> 
> I'm not saying that such metric is useless but it's not perfect as well.

It comes with its own set of problems, agreed. Based on my current
understanding (or lack thereof) they just seem smaller :)

Morten

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-28 15:47         ` Morten Rasmussen
@ 2014-05-28 16:39           ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-28 16:39 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On 28 May 2014 17:47, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
>> On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:

[snip]

>
>> This value is linked to the CPU on
>> which it has run previously because of the time sharing with others
>> tasks, so the unweighted load of a freshly migrated task will reflect
>> its load on the previous CPU (with the time sharing with other tasks
>> on prev CPU).
>
> I agree that the task runnable_avg_sum is always affected by the
> circumstances on the cpu where it is running, and that it takes this
> history with it. However, I think cfs.runnable_load_avg leads to less
> problems than using the rq runnable_avg_sum. It would work nicely for
> the two tasks on two cpus example I mentioned earlier. We don't need add

I would say that nr_running is an even better metric for such a
situation, as the load doesn't give any additional information.
Just to point out that we could spend a lot of time listing which use
cases are better covered by which metric :-)

> something on top when the cpu is fully utilized by more than one task.
> It comes more naturally with cfs.runnable_load_avg. If it is much larger
> than 47742, it should be fairly safe to assume that you shouldn't stick
> more tasks on that cpu.
>
>>
>> I'm not saying that such metric is useless but it's not perfect as well.
>
> It comes with its own set of problems, agreed. Based on my current
> understanding (or lack thereof) they just seem smaller :)

I think it's worth using the cpu utilization for some cases because it
carries information that is not available elsewhere, and the
replacement of the current capacity computation is one example.
As explained previously, I'm not against adding other metrics, and I'm
not sure I understand why you oppose these 2 metrics when they
could be complementary.

Vincent

>
> Morten

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-05-23 15:53   ` Vincent Guittot
@ 2014-05-29  9:50     ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-29  9:50 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
> If the CPU is used for handling lot of IRQs, trig a load balance to check if
> it's worth moving its tasks on another CPU that has more capacity
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e8a30f9..2501e49 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  	if (sgs->sum_nr_running > sgs->group_capacity)
>  		return true;
>  
> +	/*
> +	 * The group capacity is reduced probably because of activity from other
> +	 * sched class or interrupts which use part of the available capacity
> +	 */
> +	if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
> +		return true;
> +
>  	if (sgs->group_imb)
>  		return true;
>  

But we should already do this because the load numbers are scaled with
the power/capacity figures. If one CPU gets significantly less time to run
fair tasks, its effective load would spike and it'd get to be selected
here anyway.

Or am I missing something?
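
Just to put a number on the quoted test: with the common imbalance_pct
of 125 and power_orig = SCHED_POWER_SCALE = 1024 (values assumed here for
illustration, not taken from this thread), the group gets flagged busiest
once more than roughly 20% of its original capacity is consumed by rt/irq
activity. A quick stand-alone check of the same inequality:

#include <stdio.h>

int main(void)
{
	unsigned long power_orig = 1024, imbalance_pct = 125;
	unsigned long group_power;

	/* same comparison as the quoted hunk, evaluated for a few values */
	for (group_power = 1024; group_power >= 704; group_power -= 64)
		printf("group_power %4lu -> flagged busiest: %s\n", group_power,
		       (power_orig * 100 > group_power * imbalance_pct) ?
		       "yes" : "no");
	return 0;
}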

> @@ -7282,6 +7289,12 @@ static inline int nohz_kick_needed(struct rq *rq)
>  
>  		if (nr_busy > 1)
>  			goto need_kick_unlock;
> +
> +		if ((rq->cfs.h_nr_running >= 1)
> +		 && ((rq->cpu_power * sd->imbalance_pct) <
> +					(rq->cpu_power_orig * 100)))
> +			goto need_kick_unlock;
> +
>  	}
>  
>  	sd = rcu_dereference(per_cpu(sd_asym, cpu));

OK, so there you're kicking the idle balancer to try and get another CPU
to pull some load? That makes sense I suppose.

That function is pretty horrible though; how about something like this
first?

---
 kernel/sched/fair.c | 29 +++++++++++++----------------
 1 file changed, 13 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c9617b73bcc0..47fb96e6fa83 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7215,15 +7215,16 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
  *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
  *     domain span are idle.
  */
-static inline int nohz_kick_needed(struct rq *rq)
+static inline bool nohz_kick_needed(struct rq *rq)
 {
 	unsigned long now = jiffies;
 	struct sched_domain *sd;
 	struct sched_group_power *sgp;
 	int nr_busy, cpu = rq->cpu;
+	bool kick = false;
 
 	if (unlikely(rq->idle_balance))
-		return 0;
+		return false;
 
        /*
 	* We may be recently in ticked or tickless idle mode. At the first
@@ -7237,38 +7238,34 @@ static inline int nohz_kick_needed(struct rq *rq)
 	 * balancing.
 	 */
 	if (likely(!atomic_read(&nohz.nr_cpus)))
-		return 0;
+		return false;
 
 	if (time_before(now, nohz.next_balance))
-		return 0;
+		return false;
 
 	if (rq->nr_running >= 2)
-		goto need_kick;
+		return true;
 
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_busy, cpu));
-
 	if (sd) {
 		sgp = sd->groups->sgp;
 		nr_busy = atomic_read(&sgp->nr_busy_cpus);
 
-		if (nr_busy > 1)
-			goto need_kick_unlock;
+		if (nr_busy > 1) {
+			kick = true;
+			goto unlock;
+		}
 	}
 
 	sd = rcu_dereference(per_cpu(sd_asym, cpu));
-
 	if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
 				  sched_domain_span(sd)) < cpu))
-		goto need_kick_unlock;
-
-	rcu_read_unlock();
-	return 0;
+		kick = true;
 
-need_kick_unlock:
+unlock:
 	rcu_read_unlock();
-need_kick:
-	return 1;
+	return kick;
 }
 #else
 static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }

^ permalink raw reply related	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 11/11] sched: replace capacity by activity
  2014-05-23 15:53   ` Vincent Guittot
@ 2014-05-29 13:55     ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-29 13:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
> The scheduler tries to compute how many tasks a group of CPUs can handle by
> assuming that a task's load is SCHED_LOAD_SCALE and a CPU capacity is
> SCHED_POWER_SCALE.
> We can now have a better idea of the utilization of a group fo CPUs thanks to
> group_actitvity and deduct how many capacity is still available.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---

Right, so as Preeti already mentioned, this wrecks SMT. It also seems to
lose the aggressive spread, where we want to run 1 task on each 'core'
before we start 'balancing'.

So I think we should be able to fix this by setting PREFER_SIBLING on
the SMT domain, that way we'll get single tasks running on each SMT
domain before filling them up until capacity.

Now, it's been a while since I looked at PREFER_SIBLING, and I've not yet
looked at what your patch does to it, but it seems to me that that is
the first direction we should look for an answer to this.
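
A minimal sketch of that direction, assuming the per-level flags set up
by the sched_domain topology code (SD_SHARE_CPUPOWER marking the SMT
level here; illustrative only, not a tested patch):

	/* while building the domains, make the SMT level spread first */
	if (sd->flags & SD_SHARE_CPUPOWER)
		sd->flags |= SD_PREFER_SIBLING;

The idea being that the existing prefer_sibling handling in
update_sd_lb_stats() then spreads single tasks across cores before the
SMT threads are filled up to capacity.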

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 11/11] sched: replace capacity by activity
  2014-05-23 15:53   ` Vincent Guittot
@ 2014-05-29 14:02     ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-29 14:02 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
> @@ -6052,8 +6006,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		 * with a large weight task outweighs the tasks on the system).
>  		 */
>  		if (prefer_sibling && sds->local &&
> -		    sds->local_stat.group_has_capacity)
> -			sgs->group_capacity = min(sgs->group_capacity, 1U);
> +		    sds->local_stat.group_capacity > 0)
> +			sgs->group_capacity = min(sgs->group_capacity, 1L);
>  
>  		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
>  			sds->busiest = sg;
> @@ -6228,7 +6182,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>  		 * have to drop below capacity to reach cpu-load equilibrium.
>  		 */
>  		load_above_capacity =
> -			(busiest->sum_nr_running - busiest->group_capacity);
> +			(busiest->sum_nr_running - busiest->group_weight);
>  
>  		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
>  		load_above_capacity /= busiest->group_power;

I think you just broke PREFER_SIBLING here..

So we want PREFER_SIBLING to work on nr_running, not utilization, because
we want to spread single tasks around, regardless of their utilization.


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-05-23 15:53   ` Vincent Guittot
@ 2014-05-29 14:04     ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-29 14:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
> @@ -7282,6 +7289,12 @@ static inline int nohz_kick_needed(struct rq *rq)
>  
>  		if (nr_busy > 1)
>  			goto need_kick_unlock;
> +
> +		if ((rq->cfs.h_nr_running >= 1)
> +		 && ((rq->cpu_power * sd->imbalance_pct) <
> +					(rq->cpu_power_orig * 100)))
> +			goto need_kick_unlock;
> +
>  	}
>  
>  	sd = rcu_dereference(per_cpu(sd_asym, cpu));

So what happens when a cpu is consistently low on power (say due to a
pinned RT task)? The balancer would quickly adjust the load level, but
this would endlessly kick things into action, even though we're balanced
just fine.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-05-29  9:50     ` Peter Zijlstra
@ 2014-05-29 19:37       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-29 19:37 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On 29 May 2014 11:50, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
>> If the CPU is used for handling lot of IRQs, trig a load balance to check if
>> it's worth moving its tasks on another CPU that has more capacity
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e8a30f9..2501e49 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>       if (sgs->sum_nr_running > sgs->group_capacity)
>>               return true;
>>
>> +     /*
>> +      * The group capacity is reduced probably because of activity from other
>> +      * sched class or interrupts which use part of the available capacity
>> +      */
>> +     if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
>> +             return true;
>> +
>>       if (sgs->group_imb)
>>               return true;
>>
>
> But we should already do this because the load numbers are scaled with
> the power/capacity figures. If one CPU gets significant less time to run
> fair tasks, its effective load would spike and it'd get to be selected
> here anyway.
>
> Or am I missing something?

The CPU could have been picked when the capacity becomes null (which
occurs when the cpu_power goes below half the default
SCHED_POWER_SCALE). And even after that, there were some conditions in
find_busiest_group that were bypassing this busiest group.
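
As an illustration of that rounding (a stand-alone rewrite, assuming the
DIV_ROUND_CLOSEST(group_power, SCHED_POWER_SCALE) computation used by
update_sg_lb_stats() at this point), a single-CPU group only reports a
null capacity once its cpu_power drops below half of SCHED_POWER_SCALE:

#include <stdio.h>

#define SCHED_POWER_SCALE	1024UL
#define DIV_ROUND_CLOSEST(x, d)	(((x) + (d) / 2) / (d))

int main(void)
{
	unsigned long power;

	for (power = 1024; power >= 256; power -= 128)
		printf("cpu_power %4lu -> group_capacity %lu\n",
		       power, DIV_ROUND_CLOSEST(power, SCHED_POWER_SCALE));
	return 0;
}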

>
>> @@ -7282,6 +7289,12 @@ static inline int nohz_kick_needed(struct rq *rq)
>>
>>               if (nr_busy > 1)
>>                       goto need_kick_unlock;
>> +
>> +             if ((rq->cfs.h_nr_running >= 1)
>> +              && ((rq->cpu_power * sd->imbalance_pct) <
>> +                                     (rq->cpu_power_orig * 100)))
>> +                     goto need_kick_unlock;
>> +
>>       }
>>
>>       sd = rcu_dereference(per_cpu(sd_asym, cpu));
>
> OK, so there you're kicking the idle balancer to try and get another CPU
> to pull some load? That makes sense I suppose.

and especially if we have idle CPUs and one task on the CPU with
reduced capacity

>
> That function is pretty horrible though; how about something like this
> first?

ok, I will integrate this modification in the next version

>
> ---
>  kernel/sched/fair.c | 29 +++++++++++++----------------
>  1 file changed, 13 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c9617b73bcc0..47fb96e6fa83 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7215,15 +7215,16 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>   *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
>   *     domain span are idle.
>   */
> -static inline int nohz_kick_needed(struct rq *rq)
> +static inline bool nohz_kick_needed(struct rq *rq)
>  {
>         unsigned long now = jiffies;
>         struct sched_domain *sd;
>         struct sched_group_power *sgp;
>         int nr_busy, cpu = rq->cpu;
> +       bool kick = false;
>
>         if (unlikely(rq->idle_balance))
> -               return 0;
> +               return false;
>
>         /*
>         * We may be recently in ticked or tickless idle mode. At the first
> @@ -7237,38 +7238,34 @@ static inline int nohz_kick_needed(struct rq *rq)
>          * balancing.
>          */
>         if (likely(!atomic_read(&nohz.nr_cpus)))
> -               return 0;
> +               return false;
>
>         if (time_before(now, nohz.next_balance))
> -               return 0;
> +               return false;
>
>         if (rq->nr_running >= 2)
> -               goto need_kick;
> +               return true;
>
>         rcu_read_lock();
>         sd = rcu_dereference(per_cpu(sd_busy, cpu));
> -
>         if (sd) {
>                 sgp = sd->groups->sgp;
>                 nr_busy = atomic_read(&sgp->nr_busy_cpus);
>
> -               if (nr_busy > 1)
> -                       goto need_kick_unlock;
> +               if (nr_busy > 1) {
> +                       kick = true;
> +                       goto unlock;
> +               }
>         }
>
>         sd = rcu_dereference(per_cpu(sd_asym, cpu));
> -
>         if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
>                                   sched_domain_span(sd)) < cpu))
> -               goto need_kick_unlock;
> -
> -       rcu_read_unlock();
> -       return 0;
> +               kick = true;
>
> -need_kick_unlock:
> +unlock:
>         rcu_read_unlock();
> -need_kick:
> -       return 1;
> +       return kick;
>  }
>  #else
>  static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-05-29 14:04     ` Peter Zijlstra
@ 2014-05-29 19:44       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-29 19:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On 29 May 2014 16:04, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
>> @@ -7282,6 +7289,12 @@ static inline int nohz_kick_needed(struct rq *rq)
>>
>>               if (nr_busy > 1)
>>                       goto need_kick_unlock;
>> +
>> +             if ((rq->cfs.h_nr_running >= 1)
>> +              && ((rq->cpu_power * sd->imbalance_pct) <
>> +                                     (rq->cpu_power_orig * 100)))
>> +                     goto need_kick_unlock;
>> +
>>       }
>>
>>       sd = rcu_dereference(per_cpu(sd_asym, cpu));
>
> So what happens when a cpu is consistently low on power (say due to a
> pinned RT task) the balancer would quickly adjust the load level, but
> this would endlessly kick things into action, even though we're balanced
> just fine.

If there is more than 1 running task or more than 1 busy CPU, we will
kick the ILB because of the former conditions. Then, if there is only
1 task and no other busy CPU, we should still trigger the ILB.
Nevertheless, I can add a test to check that there is an idle CPU in
the sched_domain.
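
A stand-alone toy model of what that extra test could look like (my
sketch, not code from the series; the bit masks merely stand in for
nohz.idle_cpus_mask and the sched_domain span):

#include <stdio.h>
#include <stdbool.h>
#include <stdint.h>

static bool should_kick(unsigned int h_nr_running,
			unsigned long cpu_power,
			unsigned long cpu_power_orig,
			unsigned int imbalance_pct,
			uint64_t idle_cpus, uint64_t sd_span)
{
	if (h_nr_running < 1)
		return false;
	/* capacity not reduced enough: nothing to gain from a kick */
	if (cpu_power * imbalance_pct >= cpu_power_orig * 100)
		return false;
	/* only kick if the domain actually has an idle CPU to pull the task */
	return (idle_cpus & sd_span) != 0;
}

int main(void)
{
	/* CPU0 runs one task at ~68% of its original capacity */
	printf("%d\n", should_kick(1, 700, 1024, 125, 0x2, 0x3));	/* 1: CPU1 is idle */
	printf("%d\n", should_kick(1, 700, 1024, 125, 0x0, 0x3));	/* 0: nobody to pull it */
	return 0;
}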

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 11/11] sched: replace capacity by activity
  2014-05-29 13:55     ` Peter Zijlstra
@ 2014-05-29 19:51       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-29 19:51 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On 29 May 2014 15:55, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
>> The scheduler tries to compute how many tasks a group of CPUs can handle by
>> assuming that a task's load is SCHED_LOAD_SCALE and a CPU capacity is
>> SCHED_POWER_SCALE.
>> We can now have a better idea of the utilization of a group of CPUs thanks to
>> group_activity and deduce how much capacity is still available.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>
> Right, so as Preeti already mentioned, this wrecks SMT. It also seems to
> loose the aggressive spread, where we want to run 1 task on each 'core'
> before we start 'balancing'.
>
> So I think we should be able to fix this by setting PREFER_SIBLING on
> the SMT domain, that way we'll get single tasks running on each SMT
> domain before filling them up until capacity.
>
> Now, its been a while since I looked at PREFER_SIBLING, and I've not yet
> looked at what your patch does to it, but it seems to me that that is
> the first direction we should look for an answer to this.

OK, I'm going to look more deeply into PREFER_SIBLING too

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 11/11] sched: replace capacity by activity
  2014-05-29 14:02     ` Peter Zijlstra
@ 2014-05-29 19:56       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-29 19:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On 29 May 2014 16:02, Peter Zijlstra <peterz@infradead.org> wrote:
> On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
>> @@ -6052,8 +6006,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>                * with a large weight task outweighs the tasks on the system).
>>                */
>>               if (prefer_sibling && sds->local &&
>> -                 sds->local_stat.group_has_capacity)
>> -                     sgs->group_capacity = min(sgs->group_capacity, 1U);
>> +                 sds->local_stat.group_capacity > 0)
>> +                     sgs->group_capacity = min(sgs->group_capacity, 1L);
>>
>>               if (update_sd_pick_busiest(env, sds, sg, sgs)) {
>>                       sds->busiest = sg;
>> @@ -6228,7 +6182,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>>                * have to drop below capacity to reach cpu-load equilibrium.
>>                */
>>               load_above_capacity =
>> -                     (busiest->sum_nr_running - busiest->group_capacity);
>> +                     (busiest->sum_nr_running - busiest->group_weight);
>>
>>               load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
>>               load_above_capacity /= busiest->group_power;
>
> I think you just broke PREFER_SIBLING here..

You mean by replacing the capacity, which reflected the number of
cores for SMT, with the group_weight?

>
> So we want PREFER_SIBLING to work on nr_running, not utilization because
> we want to spread single tasks around, regardless of their utilization.
>
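
As a rough worked example of what that hunk changes for a 2-thread SMT
group running 3 tasks (my numbers: group_power ~1178 from the default
smt_gain, so the old group_capacity rounded to 1 while group_weight is 2):

  old: load_above_capacity = (3 - 1) * 1024 * 1024 / 1178 ~ 1780
  new: load_above_capacity = (3 - 2) * 1024 * 1024 / 1178 ~  890

so the busiest SMT group is asked to shed about half as much load, which
is part of how the PREFER_SIBLING behaviour gets lost.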

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-05-29 19:37       ` Vincent Guittot
@ 2014-05-30  6:29         ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-30  6:29 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On Thu, May 29, 2014 at 09:37:39PM +0200, Vincent Guittot wrote:
> On 29 May 2014 11:50, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
> >> If the CPU is used for handling a lot of IRQs, trigger a load balance to check if
> >> it's worth moving its tasks to another CPU that has more capacity
> >>
> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> ---
> >>  kernel/sched/fair.c | 13 +++++++++++++
> >>  1 file changed, 13 insertions(+)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index e8a30f9..2501e49 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >>       if (sgs->sum_nr_running > sgs->group_capacity)
> >>               return true;
> >>
> >> +     /*
> >> +      * The group capacity is reduced probably because of activity from other
> >> +      * sched class or interrupts which use part of the available capacity
> >> +      */
> >> +     if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
> >> +             return true;
> >> +
> >>       if (sgs->group_imb)
> >>               return true;
> >>
> >
> > But we should already do this because the load numbers are scaled with
> > the power/capacity figures. If one CPU gets significantly less time to run
> > fair tasks, its effective load would spike and it'd get to be selected
> > here anyway.
> >
> > Or am I missing something?
> 
> The CPU could have been picked only when its capacity became null (which
> happens when the cpu_power drops below half the default
> SCHED_POWER_SCALE). And even after that, some conditions in
> find_busiest_group were bypassing this busiest group.

Could you detail those conditions? FWIW those make excellent Changelog
material.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 11/11] sched: replace capacity by activity
  2014-05-29 19:56       ` Vincent Guittot
@ 2014-05-30  6:34         ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-30  6:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On Thu, May 29, 2014 at 09:56:24PM +0200, Vincent Guittot wrote:
> On 29 May 2014 16:02, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
> >> @@ -6052,8 +6006,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
> >>                * with a large weight task outweighs the tasks on the system).
> >>                */
> >>               if (prefer_sibling && sds->local &&
> >> -                 sds->local_stat.group_has_capacity)
> >> -                     sgs->group_capacity = min(sgs->group_capacity, 1U);
> >> +                 sds->local_stat.group_capacity > 0)
> >> +                     sgs->group_capacity = min(sgs->group_capacity, 1L);
> >>
> >>               if (update_sd_pick_busiest(env, sds, sg, sgs)) {
> >>                       sds->busiest = sg;
> >> @@ -6228,7 +6182,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
> >>                * have to drop below capacity to reach cpu-load equilibrium.
> >>                */
> >>               load_above_capacity =
> >> -                     (busiest->sum_nr_running - busiest->group_capacity);
> >> +                     (busiest->sum_nr_running - busiest->group_weight);
> >>
> >>               load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
> >>               load_above_capacity /= busiest->group_power;
> >
> > I think you just broke PREFER_SIBLING here..
> 
> you mean by replacing the capacity which was reflecting the number of
> core for SMT by the group_weight ?

Right, so in the first hunk we lower group_capacity to 1 when prefer_sibling,
then in the second hunk, you replace that group_capacity usage with
group_weight.

With the end result that prefer_sibling is now ineffective.

That said, I fudged the prefer_sibling usage into the capacity logic,
mostly because I could and it was already how the SMT stuff was working.
But there is no reason we should continue to intertwine these two
things.

So I think it would be good to have a patch that implements
prefer_sibling on nr_running separate from the existing capacity bits,
and then convert the remaining capacity bits to utilization (or activity
or whatever you did call it; see Morten's comments etc.).
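
One possible shape of that split, as a stand-alone sketch (my
interpretation of the suggestion, not code from the series): with
prefer_sibling the pick is made purely on task counts, independent of
any capacity arithmetic.

#include <stdio.h>
#include <stdbool.h>

struct group_stats {
	unsigned int sum_nr_running;
	unsigned int group_weight;
};

/* with prefer_sibling, a group is worth draining as soon as it runs more
 * than one task while the local group still has an idle CPU */
static bool prefer_sibling_pick(const struct group_stats *busiest,
				const struct group_stats *local)
{
	return busiest->sum_nr_running > 1 &&
	       local->sum_nr_running < local->group_weight;
}

int main(void)
{
	struct group_stats loaded = { .sum_nr_running = 2, .group_weight = 2 };
	struct group_stats idle   = { .sum_nr_running = 0, .group_weight = 2 };
	struct group_stats full   = { .sum_nr_running = 2, .group_weight = 2 };

	printf("%d\n", prefer_sibling_pick(&loaded, &idle));	/* 1: spread the tasks */
	printf("%d\n", prefer_sibling_pick(&loaded, &full));	/* 0: nowhere to put them */
	return 0;
}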



^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-23 15:53   ` Vincent Guittot
@ 2014-05-30  9:50     ` Dietmar Eggemann
  -1 siblings, 0 replies; 216+ messages in thread
From: Dietmar Eggemann @ 2014-05-30  9:50 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On 23/05/14 16:53, Vincent Guittot wrote:
> Monitor the activity level of each group of each sched_domain level. The
> activity is the amount of cpu_power that is currently used on a CPU or group
> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
> level. In the special use case where the CPU is fully loaded by more than 1
> task, the activity level is set above the cpu_power in order to reflect the
> overload of the CPU
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

[...]

>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -5518,6 +5535,7 @@ struct sg_lb_stats {
>  	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
>  	unsigned long load_per_task;
>  	unsigned long group_power;
> +	unsigned long group_activity; /* Total activity of the group */
>  	unsigned int sum_nr_running; /* Nr tasks running in the group */
>  	unsigned int group_capacity;
>  	unsigned int idle_cpus;
> @@ -5538,6 +5556,7 @@ struct sd_lb_stats {
>  	struct sched_group *busiest;	/* Busiest group in this sd */
>  	struct sched_group *local;	/* Local group in this sd */
>  	unsigned long total_load;	/* Total load of all groups in sd */
> +	unsigned long total_activity;  /* Total activity of all groups in sd */
>  	unsigned long total_pwr;	/* Total power of all groups in sd */
>  	unsigned long avg_load;	/* Average load across all groups in sd */
>  
> @@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>  		.busiest = NULL,
>  		.local = NULL,
>  		.total_load = 0UL,
> +		.total_activity = 0UL,

AFAICS, total_activity is not used right now. Do you intend to use it to
calculate something like avg_activity later (like total_load/avg_load)?

>  		.total_pwr = 0UL,
>  		.busiest_stat = {
>  			.avg_load = 0UL,

[...]
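
For reference, a stand-alone model of the activity figure as described in
the commit message quoted above (my reading of it; the actual patch code
may differ in detail):

#include <stdio.h>

static unsigned long activity(unsigned long power_orig,
			      unsigned long runnable_avg_sum,
			      unsigned long runnable_avg_period,
			      unsigned int nr_running)
{
	/* fully loaded: report more than the full capacity to flag overload */
	if (runnable_avg_sum >= runnable_avg_period)
		return power_orig + nr_running - 1;

	/* otherwise: the share of power_orig that is actually used */
	return power_orig * runnable_avg_sum / runnable_avg_period;
}

int main(void)
{
	printf("%lu\n", activity(1024, 2500, 5000, 1));	/* ~50% busy -> 512 */
	printf("%lu\n", activity(1024, 5000, 5000, 3));	/* overloaded -> 1026 */
	return 0;
}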


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-05-23 15:53   ` Vincent Guittot
@ 2014-05-30 13:26     ` Dietmar Eggemann
  -1 siblings, 0 replies; 216+ messages in thread
From: Dietmar Eggemann @ 2014-05-30 13:26 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On 23/05/14 16:53, Vincent Guittot wrote:
> If the CPU is used for handling a lot of IRQs, trigger a load balance to check if
> it's worth moving its tasks to another CPU that has more capacity
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 13 +++++++++++++
>  1 file changed, 13 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index e8a30f9..2501e49 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  	if (sgs->sum_nr_running > sgs->group_capacity)
>  		return true;
>  
> +	/*
> +	 * The group capacity is reduced probably because of activity from other

Here 'group capacity' refers to sgs->group_power and not to
sgs->group_capacity, right?

> +	 * sched class or interrupts which use part of the available capacity
> +	 */

... 'interrupts' only w/ CONFIG_IRQ_TIME_ACCOUNTING=y, right ?

> +	if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
> +		return true;
> +
[...]



^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 04/11] sched: Allow all archs to set the power_orig
  2014-05-23 15:52   ` Vincent Guittot
@ 2014-05-30 14:04     ` Dietmar Eggemann
  -1 siblings, 0 replies; 216+ messages in thread
From: Dietmar Eggemann @ 2014-05-30 14:04 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: preeti, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On 23/05/14 16:52, Vincent Guittot wrote:
> power_orig is only changed for systems with an SMT sched_domain level, in order to
> reflect the lower capacity of CPUs. Heterogeneous systems also have to reflect an
> original capacity that is different from the default value.
> 
> Create a more generic function arch_scale_cpu_power that can also be used by
> non-SMT platforms to set power_orig.
> 
> The weak behavior of arch_scale_cpu_power is the previous SMT one in order to
> keep backward compatibility in the use of power_orig.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

As you know, besides uarch scaled cpu power for HMP, freq scaled cpu
power is important for energy-aware scheduling to achieve freq scale
invariance for task load.

I know that your patch-set is not about introducing freq scaled cpu
power, but we were discussing how this can be achieved w/ your patch-set
in place, so maybe you can share your opinion regarding the easiest way
to achieve freq scale invariance with us?

(1) We assume that the current way (update_cpu_power() calls
arch_scale_freq_power() to get the avg power(freq) over the time period
since the last call to arch_scale_freq_power()) is suitable
for us. Do you have another opinion here?

(2) Is the current layout of update_cpu_power() adequate for this, where
we scale power_orig related to freq and then related to rt/(irq):

  power_orig = scale_cpu(SCHED_POWER_SCALE)
  power = scale_rt(scale_freq(power_orig))

or do we need an extra power_freq data member on the rq and do:

  power_orig = scale_cpu(SCHED_POWER_SCALE)
  power_freq = scale_freq(power_orig))
  power = scale_rt(power_orig))

In other words, do we consider rt/(irq) pressure when calculating freq
scale invariant task load or not?

Thanks,

-- Dietmar
[...]


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 04/11] sched: Allow all archs to set the power_orig
  2014-05-30 14:04     ` Dietmar Eggemann
@ 2014-05-30 14:46       ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-05-30 14:46 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On Fri, May 30, 2014 at 03:04:32PM +0100, Dietmar Eggemann wrote:
> On 23/05/14 16:52, Vincent Guittot wrote:
> > power_orig is only changed for systems with an SMT sched_domain level, in order to
> > reflect the lower capacity of CPUs. Heterogeneous systems also have to reflect an
> > original capacity that is different from the default value.
> > 
> > Create a more generic function arch_scale_cpu_power that can also be used by
> > non-SMT platforms to set power_orig.
> > 
> > The weak behavior of arch_scale_cpu_power is the previous SMT one in order to
> > keep backward compatibility in the use of power_orig.
> > 
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> 
> As you know, besides uarch scaled cpu power for HMP, freq scaled cpu
> power is important for energy-aware scheduling to achieve freq scale
> invariance for task load.
> 
> I know that your patch-set is not about introducing freq scaled cpu
> power, but we were discussing how this can be achieved w/ your patch-set
> in place, so maybe you can share your opinion regarding the easiest way
> to achieve freq scale invariance with us?
> 
> (1) We assume that the current way (update_cpu_power() calls
> arch_scale_freq_power() to get the avg power(freq) over the time period
> since the last call to arch_scale_freq_power()) is suitable
> for us. Do you have another opinion here?
> 
> (2) Is the current layout of update_cpu_power() adequate for this, where
> we scale power_orig related to freq and then related to rt/(irq):
> 
>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>   power = scale_rt(scale_freq(power_orig))
> 
> or do we need an extra power_freq data member on the rq and do:
> 
>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>   power_freq = scale_freq(power_orig))
>   power = scale_rt(power_orig))
> 
> In other words, do we consider rt/(irq) pressure when calculating freq
> scale invariant task load or not?

I don't think you should. The work done depends on the frequency, not on
the other tasks present on the cpu. The same is true for an over-utilized
cpu: a task will run less than the desired amount of time, and this is no
different from RT/irq work preempting the task and taking its time.
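
A minimal stand-alone sketch of that point (mine, not from the series):
the frequency-invariant load of a task only needs the current versus
maximum frequency, with no rt/irq term in it.

#include <stdio.h>

static unsigned long freq_scaled_load(unsigned long raw_load,
				      unsigned long curr_freq,
				      unsigned long max_freq)
{
	return raw_load * curr_freq / max_freq;
}

int main(void)
{
	/* a task tracked at 800 load units while the CPU runs at 1.0 GHz
	 * out of 2.0 GHz only represents half that much work */
	printf("%lu\n", freq_scaled_load(800, 1000000, 2000000));	/* 400 */
	return 0;
}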

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 11/11] sched: replace capacity by activity
  2014-05-30  6:34         ` Peter Zijlstra
@ 2014-05-30 19:13           ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-30 19:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On 30 May 2014 08:34, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, May 29, 2014 at 09:56:24PM +0200, Vincent Guittot wrote:
>> On 29 May 2014 16:02, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
>> >> @@ -6052,8 +6006,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>> >>                * with a large weight task outweighs the tasks on the system).
>> >>                */
>> >>               if (prefer_sibling && sds->local &&
>> >> -                 sds->local_stat.group_has_capacity)
>> >> -                     sgs->group_capacity = min(sgs->group_capacity, 1U);
>> >> +                 sds->local_stat.group_capacity > 0)
>> >> +                     sgs->group_capacity = min(sgs->group_capacity, 1L);
>> >>
>> >>               if (update_sd_pick_busiest(env, sds, sg, sgs)) {
>> >>                       sds->busiest = sg;
>> >> @@ -6228,7 +6182,7 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
>> >>                * have to drop below capacity to reach cpu-load equilibrium.
>> >>                */
>> >>               load_above_capacity =
>> >> -                     (busiest->sum_nr_running - busiest->group_capacity);
>> >> +                     (busiest->sum_nr_running - busiest->group_weight);
>> >>
>> >>               load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_POWER_SCALE);
>> >>               load_above_capacity /= busiest->group_power;
>> >
>> > I think you just broke PREFER_SIBLING here..
>>
>> You mean by replacing the capacity, which reflected the number of
>> cores for SMT, with the group_weight?
>
> Right, so in the first hunk we lower group_capacity to 1 when prefer_sibling,
> then in the second hunk, you replace that group_capacity usage with
> group_weight.
>
> With the end result that prefer_sibling is now ineffective.

ok

>
> That said, I fudged the prefer_sibling usage into the capacity logic,
> mostly because I could and it was already how the SMT stuff was working.
> But there is no reason we should continue to intertwine these two
> things.
>
> So I think it would be good to have a patch that implements
> prefer_sibling on nr_running separate from the existing capacity bits,
> and then convert the remaining capacity bits to utilization (or activity
> or whatever you did call it; see Morten's comments etc.).

OK, I'm going to prepare such a change

>
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-30  9:50     ` Dietmar Eggemann
@ 2014-05-30 19:20       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-30 19:20 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On 30 May 2014 11:50, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 23/05/14 16:53, Vincent Guittot wrote:
>> Monitor the activity level of each group of each sched_domain level. The
>> activity is the amount of cpu_power that is currently used on a CPU or group
>> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
>> level. In the special use case where the CPU is fully loaded by more than 1
>> task, the activity level is set above the cpu_power in order to reflect the
>> overload of the CPU
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> [...]
>
>>  /*
>>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>> @@ -5518,6 +5535,7 @@ struct sg_lb_stats {
>>       unsigned long sum_weighted_load; /* Weighted load of group's tasks */
>>       unsigned long load_per_task;
>>       unsigned long group_power;
>> +     unsigned long group_activity; /* Total activity of the group */
>>       unsigned int sum_nr_running; /* Nr tasks running in the group */
>>       unsigned int group_capacity;
>>       unsigned int idle_cpus;
>> @@ -5538,6 +5556,7 @@ struct sd_lb_stats {
>>       struct sched_group *busiest;    /* Busiest group in this sd */
>>       struct sched_group *local;      /* Local group in this sd */
>>       unsigned long total_load;       /* Total load of all groups in sd */
>> +     unsigned long total_activity;  /* Total activity of all groups in sd */
>>       unsigned long total_pwr;        /* Total power of all groups in sd */
>>       unsigned long avg_load; /* Average load across all groups in sd */
>>
>> @@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>>               .busiest = NULL,
>>               .local = NULL,
>>               .total_load = 0UL,
>> +             .total_activity = 0UL,
>
> AFAICS, total_activity is not used right now. Do you intend to use it to
> calculate something like avg_activity later (like total_load/avg_load)?

This patch only creates group_activity and total_activity. The use of
these statistics appears in the next patches. I have not planned to
compute such an avg_activity for the moment, mainly because I haven't
needed it yet.

Do you have in mind any use of such an avg_activity?

Vincent

>
>>               .total_pwr = 0UL,
>>               .busiest_stat = {
>>                       .avg_load = 0UL,
>
> [...]
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-05-30 13:26     ` Dietmar Eggemann
@ 2014-05-30 19:24       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-30 19:24 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On 30 May 2014 15:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 23/05/14 16:53, Vincent Guittot wrote:
>> If the CPU is used for handling a lot of IRQs, trigger a load balance to check if
>> it's worth moving its tasks to another CPU that has more capacity
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 13 +++++++++++++
>>  1 file changed, 13 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index e8a30f9..2501e49 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>       if (sgs->sum_nr_running > sgs->group_capacity)
>>               return true;
>>
>> +     /*
>> +      * The group capacity is reduced probably because of activity from other
>
> Here 'group capacity' refers to sgs->group_power and not to
> sgs->group_capacity, right?

Yes, you're right, it's cpu_power; I will correct the comment.

>
>> +      * sched class or interrupts which use part of the available capacity
>> +      */
>
> ... 'interrupts' only w/ CONFIG_IRQ_TIME_ACCOUNTING=y, right ?

Yes, we need CONFIG_IRQ_TIME_ACCOUNTING in order to scale the
cpu_power with the time spent handling irqs. I have made tests with and
without this config to show the impact (in the cover letter).
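
As a rough stand-alone model of that scaling (simplified from the
scale_rt_power-style accounting, an approximation rather than the actual
kernel code): the time spent in rt/irq over the last period is simply
removed from the capacity seen by CFS.

#include <stdio.h>

static unsigned long scaled_power(unsigned long power_orig,
				  unsigned long rt_irq_time,
				  unsigned long period)
{
	if (rt_irq_time >= period)
		return 1;	/* keep a minimum, never return zero */

	return power_orig * (period - rt_irq_time) / period;
}

int main(void)
{
	/* 30% of the last period eaten by IRQs: 1024 -> ~716 */
	printf("%lu\n", scaled_power(1024, 30, 100));
	return 0;
}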

Thanks
Vincent

>
>> +     if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
>> +             return true;
>> +
> [...]
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-05-30 19:24       ` Vincent Guittot
@ 2014-05-30 19:45         ` Nicolas Pitre
  -1 siblings, 0 replies; 216+ messages in thread
From: Nicolas Pitre @ 2014-05-30 19:45 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, peterz, mingo, linux-kernel, linux,
	linux-arm-kernel, preeti, Morten Rasmussen, efault,
	linaro-kernel, daniel.lezcano

On Fri, 30 May 2014, Vincent Guittot wrote:

> On 30 May 2014 15:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >> +     /*
> >> +      * The group capacity is reduced probably because of activity from other
> >
> > Here 'group capacity' refers to sgs->group_power and not to
> > sgs->group_capacity, right?
> 
> Yes, you're right, it's cpu_power; I will correct the comment.

Note that peterz applied my rename patches so sgs->group_power has 
become sgs->group_capacity.


Nicolas

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-05-30  6:29         ` Peter Zijlstra
@ 2014-05-30 20:05           ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-30 20:05 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On 30 May 2014 08:29, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, May 29, 2014 at 09:37:39PM +0200, Vincent Guittot wrote:
>> On 29 May 2014 11:50, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
>> >> If the CPU is used for handling lot of IRQs, trig a load balance to check if
>> >> it's worth moving its tasks on another CPU that has more capacity
>> >>
>> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> >> ---
>> >>  kernel/sched/fair.c | 13 +++++++++++++
>> >>  1 file changed, 13 insertions(+)
>> >>
>> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> index e8a30f9..2501e49 100644
>> >> --- a/kernel/sched/fair.c
>> >> +++ b/kernel/sched/fair.c
>> >> @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>> >>       if (sgs->sum_nr_running > sgs->group_capacity)
>> >>               return true;
>> >>
>> >> +     /*
>> >> +      * The group capacity is reduced probably because of activity from other
>> >> +      * sched class or interrupts which use part of the available capacity
>> >> +      */
>> >> +     if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
>> >> +             return true;
>> >> +
>> >>       if (sgs->group_imb)
>> >>               return true;
>> >>
>> >
>> > But we should already do this because the load numbers are scaled with
>> > the power/capacity figures. If one CPU gets significant less time to run
>> > fair tasks, its effective load would spike and it'd get to be selected
>> > here anyway.
>> >
>> > Or am I missing something?
>>
>> The CPU could have been picked when the capacity becomes null (which
>> occurred when the cpu_power goes below half the default
>> SCHED_POWER_SCALE). And even after that, there were some conditions in
>> find_busiest_group that was bypassing this busiest group
>
> Could you detail those conditions? FWIW those make excellent Changelog
> material.

I need to go back to my traces and use case to be sure that I'm
pointing at the right culprit, but IIRC the imbalance was null. I will
come back with more details once I'm back in front of the boards.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-05-30 19:45         ` Nicolas Pitre
@ 2014-05-30 20:07           ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-30 20:07 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Dietmar Eggemann, peterz, mingo, linux-kernel, linux,
	linux-arm-kernel, preeti, Morten Rasmussen, efault,
	linaro-kernel, daniel.lezcano

On 30 May 2014 21:45, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Fri, 30 May 2014, Vincent Guittot wrote:
>
>> On 30 May 2014 15:26, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> >> +     /*
>> >> +      * The group capacity is reduced probably because of activity from other
>> >
>> > Here 'group capacity' refers to sgs->group_power and not to
>> > sgs->group_capacity, right?
>>
>> yes, you're right it's cpu_power, i will correct the comment
>
> Note that peterz applied my rename patches so sgs->group_power has
> become sgs->group_capacity.

I will update the naming in the next version when I rebase

Vincent

>
>
> Nicolas

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 04/11] sched: Allow all archs to set the power_orig
  2014-05-30 14:04     ` Dietmar Eggemann
@ 2014-05-30 20:50       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-05-30 20:50 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On 30 May 2014 16:04, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 23/05/14 16:52, Vincent Guittot wrote:
>> power_orig is only changed for system with a SMT sched_domain level in order to
>> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
>> original capacity that is different from the default value.
>>
>> Create a more generic function arch_scale_cpu_power that can be also used by
>> non SMT platform to set power_orig.
>>
>> The weak behavior of arch_scale_cpu_power is the previous SMT one in order to
>> keep backward compatibility in the use of power_orig.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> As you know, besides uarch scaled cpu power for HMP, freq scaled cpu
> power is important for energy-aware scheduling to achieve freq scale
> invariance for task load.
>
> I know that your patch-set is not about introducing freq scaled cpu
> power, but we were discussing how this can be achieved w/ your patch-set
> in place, so maybe you can share your opinion regarding the easiest way
> to achieve freq scale invariance with us?
>
> (1) We assume that the current way (update_cpu_power() calls
> arch_scale_freq_power() to get the avg power(freq) over the time period
> since the last call to arch_scale_freq_power()) is suitable
> for us. Do you have another opinion here?

Using power (or power_freq as you mentioned below) is probably the
easiest and most straightforward solution. You can use it to scale
each contribution when updating the runnable averages of an entity.
Nevertheless, I see 2 potential issues:
- is power updated often enough to correctly follow the frequency
scaling? We need to compare the power update rate with the
runnable_avg_sum variation speed and the rate at which we change the
CPU's frequency.
- the max value of runnable_avg_sum will also be scaled, so a task
running on a CPU with less capacity could be seen as a "low" load even
if it is an always-running task. So we need to find a way to reach the
max value in such a situation (see the sketch below).
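
For illustration, a minimal sketch of what such scaling could look
like, assuming a helper returning the freq-scaled capacity;
arch_scale_freq_capacity() is an assumed name, not something this
series adds, and SCHED_POWER_SHIFT is the existing 2^10 scale shift:

/*
 * Sketch only: scale each runnable contribution by the current
 * freq-scaled capacity of the CPU. arch_scale_freq_capacity() is an
 * assumed helper returning the "power_freq" value discussed below.
 */
static u32 freq_scaled_contrib(u32 contrib, int cpu)
{
	unsigned long power_freq = arch_scale_freq_capacity(cpu); /* assumed */

	return (contrib * power_freq) >> SCHED_POWER_SHIFT;
}

With this, an always-running task on a CPU capped at half capacity
saturates at roughly half of the 47742 maximum, which is exactly the
second issue above.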

>
> (2) Is the current layout of update_cpu_power() adequate for this, where
> we scale power_orig related to freq and then related to rt/(irq):
>
>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>   power = scale_rt(scale_freq(power_orig))
>
> or do we need an extra power_freq data member on the rq and do:
>
>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>   power_freq = scale_freq(power_orig))
>   power = scale_rt(power_orig))

Do you really mean power = scale_rt(power_orig), or rather power = scale_rt(power_freq)?

>
> In other words, do we consider rt/(irq) pressure when calculating freq
> scale invariant task load or not?

We should take power_freq, which indeed implies a new field on the rq.
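
In other words, restating your second layout above with the rt/irq
scaling applied on top of the freq-scaled value (power_freq being the
new rq field; this is only a sketch of the intent, not code from this
series):

  power_orig = scale_cpu(SCHED_POWER_SCALE)
  power_freq = scale_freq(power_orig)
  power      = scale_rt(power_freq)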

Thanks,
Vincent
>
> Thanks,
>
> -- Dietmar
> [...]
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-30 19:20       ` Vincent Guittot
@ 2014-06-01 11:33         ` Dietmar Eggemann
  -1 siblings, 0 replies; 216+ messages in thread
From: Dietmar Eggemann @ 2014-06-01 11:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On 30/05/14 20:20, Vincent Guittot wrote:
> On 30 May 2014 11:50, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 23/05/14 16:53, Vincent Guittot wrote:
>>> Monitor the activity level of each group of each sched_domain level. The
>>> activity is the amount of cpu_power that is currently used on a CPU or group
>>> of CPUs. We use the runnable_avg_sum and _period to evaluate this activity
>>> level. In the special use case where the CPU is fully loaded by more than 1
>>> task, the activity level is set above the cpu_power in order to reflect the
>>> overload of the CPU
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>
>> [...]
>>
>>>   /*
>>>    * select_task_rq_fair: Select target runqueue for the waking task in domains
>>>    * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>>> @@ -5518,6 +5535,7 @@ struct sg_lb_stats {
>>>        unsigned long sum_weighted_load; /* Weighted load of group's tasks */
>>>        unsigned long load_per_task;
>>>        unsigned long group_power;
>>> +     unsigned long group_activity; /* Total activity of the group */
>>>        unsigned int sum_nr_running; /* Nr tasks running in the group */
>>>        unsigned int group_capacity;
>>>        unsigned int idle_cpus;
>>> @@ -5538,6 +5556,7 @@ struct sd_lb_stats {
>>>        struct sched_group *busiest;    /* Busiest group in this sd */
>>>        struct sched_group *local;      /* Local group in this sd */
>>>        unsigned long total_load;       /* Total load of all groups in sd */
>>> +     unsigned long total_activity;  /* Total activity of all groups in sd */
>>>        unsigned long total_pwr;        /* Total power of all groups in sd */
>>>        unsigned long avg_load; /* Average load across all groups in sd */
>>>
>>> @@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>>>                .busiest = NULL,
>>>                .local = NULL,
>>>                .total_load = 0UL,
>>> +             .total_activity = 0UL,
>>
>> AFAICS, total_activity is not used right now. Do you intend to use it to
>> calculate something like avg_activity later (like total_load/avg_load)?
>
> this patch only creates group_activity and total_activity.The use of
> these statistics appears on the next patches. I have not planned to
> compute such avg_activity for the moment mainly because i have need it
> yet.
>
> Do you have in mind any use of such avg_activity ?

No, not really, but I still don't see (taking all your patches into
account) where sds->total_activity is used.

$ find . -name "*.[ch]" | xargs grep total_activity
./kernel/sched/fair.c:	unsigned long total_activity;  /* Total activity of all groups in sd */
./kernel/sched/fair.c:		.total_activity = 0UL,
./kernel/sched/fair.c:		sds->total_activity += sgs->group_activity;

OTOH, sgs->group_activity is used to set sgs->group_capacity, which is
then used in conditions in update_sd_pick_busiest(),
update_sd_lb_stats() and find_busiest_group().

[...]



^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 11/11] sched: replace capacity by activity
  2014-05-29 13:55     ` Peter Zijlstra
@ 2014-06-02  6:21       ` Preeti U Murthy
  -1 siblings, 0 replies; 216+ messages in thread
From: Preeti U Murthy @ 2014-06-02  6:21 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot
  Cc: mingo, linux-kernel, linux, linux-arm-kernel, Morten.Rasmussen,
	efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On 05/29/2014 07:25 PM, Peter Zijlstra wrote:
> On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
>> The scheduler tries to compute how many tasks a group of CPUs can handle by
>> assuming that a task's load is SCHED_LOAD_SCALE and a CPU capacity is
>> SCHED_POWER_SCALE.
>> We can now have a better idea of the utilization of a group fo CPUs thanks to
>> group_actitvity and deduct how many capacity is still available.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
> 
> Right, so as Preeti already mentioned, this wrecks SMT. It also seems to
> loose the aggressive spread, where we want to run 1 task on each 'core'
> before we start 'balancing'.

True. I just profiled the ebizzy runs and found that ebizzy threads
were being packed onto a single core, which is SMT-8 capable, before
spreading. This was a 6-core, SMT-8 machine. So, for instance, when I
ran 8 threads of ebizzy, the load balancing as recorded by perf sched
record showed that two cores were packed with up to 3 ebizzy threads
each and one core ran two ebizzy threads, while the remaining 3 cores
were idle.

I am unable to understand which part of this patch is aiding packing
to a core. There is this check in this patch, right?

if (sgs->group_capacity < 0)
		return true;

which should ideally prevent such packing? Because irrespective of the
number of SMT threads, the capacity of a core is unchanged. And in the
above scenario, we have 6 tasks on 3 cores. So shouldn't the above
check have caught it?

Regards
Preeti U Murthy
> 
> So I think we should be able to fix this by setting PREFER_SIBLING on
> the SMT domain, that way we'll get single tasks running on each SMT
> domain before filling them up until capacity.
> 
> Now, its been a while since I looked at PREFER_SIBLING, and I've not yet
> looked at what your patch does to it, but it seems to me that that is
> the first direction we should look for an answer to this.
> 


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-01 11:33         ` Dietmar Eggemann
@ 2014-06-02 14:07           ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-06-02 14:07 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On 1 June 2014 13:33, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 30/05/14 20:20, Vincent Guittot wrote:
>>
>> On 30 May 2014 11:50, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>
>>> On 23/05/14 16:53, Vincent Guittot wrote:
>>>>
>>>> Monitor the activity level of each group of each sched_domain level. The
>>>> activity is the amount of cpu_power that is currently used on a CPU or
>>>> group
>>>> of CPUs. We use the runnable_avg_sum and _period to evaluate this
>>>> activity
>>>> level. In the special use case where the CPU is fully loaded by more
>>>> than 1
>>>> task, the activity level is set above the cpu_power in order to reflect
>>>> the
>>>> overload of the CPU
>>>>
>>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>>
>>>
>>> [...]
>>>
>>>>   /*
>>>>    * select_task_rq_fair: Select target runqueue for the waking task in
>>>> domains
>>>>    * that have the 'sd_flag' flag set. In practice, this is
>>>> SD_BALANCE_WAKE,
>>>> @@ -5518,6 +5535,7 @@ struct sg_lb_stats {
>>>>        unsigned long sum_weighted_load; /* Weighted load of group's
>>>> tasks */
>>>>        unsigned long load_per_task;
>>>>        unsigned long group_power;
>>>> +     unsigned long group_activity; /* Total activity of the group */
>>>>        unsigned int sum_nr_running; /* Nr tasks running in the group */
>>>>        unsigned int group_capacity;
>>>>        unsigned int idle_cpus;
>>>> @@ -5538,6 +5556,7 @@ struct sd_lb_stats {
>>>>        struct sched_group *busiest;    /* Busiest group in this sd */
>>>>        struct sched_group *local;      /* Local group in this sd */
>>>>        unsigned long total_load;       /* Total load of all groups in sd
>>>> */
>>>> +     unsigned long total_activity;  /* Total activity of all groups in
>>>> sd */
>>>>        unsigned long total_pwr;        /* Total power of all groups in
>>>> sd */
>>>>        unsigned long avg_load; /* Average load across all groups in sd
>>>> */
>>>>
>>>> @@ -5557,6 +5576,7 @@ static inline void init_sd_lb_stats(struct
>>>> sd_lb_stats *sds)
>>>>                .busiest = NULL,
>>>>                .local = NULL,
>>>>                .total_load = 0UL,
>>>> +             .total_activity = 0UL,
>>>
>>>
>>> AFAICS, total_activity is not used right now. Do you intend to use it to
>>> calculate something like avg_activity later (like total_load/avg_load)?
>>
>>
>> this patch only creates group_activity and total_activity.The use of
>> these statistics appears on the next patches. I have not planned to
>> compute such avg_activity for the moment mainly because i have need it
>> yet.
>>
>> Do you have in mind any use of such avg_activity ?
>
>
> No, not really but I still don't see (taken all your patches) where
> sds->total_activity is used.
>
> $ find . -name "*.[ch]" | xargs grep total_activity
> ./kernel/sched/fair.c:  unsigned long total_activity;  /* Total activity of
> all groups in sd */
> ./kernel/sched/fair.c:          .total_activity = 0UL,
> ./kernel/sched/fair.c:          sds->total_activity += sgs->group_activity;
>
> OOTH, sgs->group_activity is used to set sgs->group_capacity which is then
> used in if conditions in update_sd_pick_busiest(), update_sd_lb_stats() and
> find_busiest_group().

Sorry, I misread your 1st email and didn't understand that you were
specifically speaking about total_activity.
When I created the activity statistic, I added both the sg and sd
fields, but in the end I only use the sg one. If I still don't need
the sd field after updating my patches, I will remove it.

Thanks
Vincent

>
> [...]
>
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-05-30  6:29         ` Peter Zijlstra
@ 2014-06-02 17:06           ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-06-02 17:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On 30 May 2014 08:29, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, May 29, 2014 at 09:37:39PM +0200, Vincent Guittot wrote:
>> On 29 May 2014 11:50, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Fri, May 23, 2014 at 05:53:04PM +0200, Vincent Guittot wrote:
>> >> If the CPU is used for handling lot of IRQs, trig a load balance to check if
>> >> it's worth moving its tasks on another CPU that has more capacity
>> >>
>> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> >> ---
>> >>  kernel/sched/fair.c | 13 +++++++++++++
>> >>  1 file changed, 13 insertions(+)
>> >>
>> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> index e8a30f9..2501e49 100644
>> >> --- a/kernel/sched/fair.c
>> >> +++ b/kernel/sched/fair.c
>> >> @@ -5948,6 +5948,13 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>> >>       if (sgs->sum_nr_running > sgs->group_capacity)
>> >>               return true;
>> >>
>> >> +     /*
>> >> +      * The group capacity is reduced probably because of activity from other
>> >> +      * sched class or interrupts which use part of the available capacity
>> >> +      */
>> >> +     if ((sg->sgp->power_orig * 100) > (sgs->group_power * env->sd->imbalance_pct))
>> >> +             return true;
>> >> +
>> >>       if (sgs->group_imb)
>> >>               return true;
>> >>
>> >
>> > But we should already do this because the load numbers are scaled with
>> > the power/capacity figures. If one CPU gets significant less time to run
>> > fair tasks, its effective load would spike and it'd get to be selected
>> > here anyway.
>> >
>> > Or am I missing something?
>>
>> The CPU could have been picked when the capacity becomes null (which
>> occurred when the cpu_power goes below half the default
>> SCHED_POWER_SCALE). And even after that, there were some conditions in
>> find_busiest_group that was bypassing this busiest group
>
> Could you detail those conditions? FWIW those make excellent Changelog
> material.

I have looked back into my tests and traces:

In a 1st test, the capacity of the CPU was still above half the
default value (power=538), unlike what I remembered. So it's somewhat
"normal" to keep the task on CPU0, which also handles the IRQs,
because sg_capacity still returns 1.

In a 2nd test, the main task runs (most of the time) on CPU0 whereas
the max power of the latter is only 623 and the cpu_power goes below
512 (power=330) during the use case. So the sg_capacity of CPU0 is
null but the main task still stays on CPU0.
The use case (scp transfer) is made of a long running task (ssh) and a
periodic short task (scp). ssh runs on CPU0 and scp runs every 6ms on
CPU1. The newly idle load balance on CPU1 doesn't pull the long
running task, although sg_capacity is null, because
sd->nr_balance_failed is never incremented and load_balance doesn't
trigger an active load_balance (see the simplified
need_active_balance() below). When a periodic idle balance occurs in
the middle of the newly idle balances, the ssh long task migrates to
CPU1, but as soon as it sleeps and wakes up, it goes back to CPU0
because of wake affine, which migrates it back to CPU0 (issue solved
by patch 09).
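
For reference, the nr_balance_failed dependency comes from the last
test in need_active_balance(); a simplified paraphrase of the 3.15-era
kernel/sched/fair.c code (from memory, not an exact quote):

static int need_active_balance(struct lb_env *env)
{
	struct sched_domain *sd = env->sd;

	if (env->idle == CPU_NEWLY_IDLE) {
		/*
		 * ASYM_PACKING needs to force migrate tasks from busy
		 * but higher numbered CPUs in order to pack all tasks
		 * in the lowest numbered CPUs.
		 */
		if ((sd->flags & SD_ASYM_PACKING) &&
		    env->src_cpu > env->dst_cpu)
			return 1;
	}

	/* never true in this use case: nr_balance_failed stays at 0 */
	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries + 2);
}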

Vincent

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 11/11] sched: replace capacity by activity
  2014-06-02  6:21       ` Preeti U Murthy
@ 2014-06-03  9:50         ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-06-03  9:50 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Morten Rasmussen, Mike Galbraith,
	Nicolas Pitre, linaro-kernel, Daniel Lezcano

On 2 June 2014 08:21, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> On 05/29/2014 07:25 PM, Peter Zijlstra wrote:
>> On Fri, May 23, 2014 at 05:53:05PM +0200, Vincent Guittot wrote:
>>> The scheduler tries to compute how many tasks a group of CPUs can handle by
>>> assuming that a task's load is SCHED_LOAD_SCALE and a CPU capacity is
>>> SCHED_POWER_SCALE.
>>> We can now have a better idea of the utilization of a group fo CPUs thanks to
>>> group_actitvity and deduct how many capacity is still available.
>>>
>>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>>> ---
>>
>> Right, so as Preeti already mentioned, this wrecks SMT. It also seems to
>> loose the aggressive spread, where we want to run 1 task on each 'core'
>> before we start 'balancing'.
>
> True. I just profiled the ebizzy runs and found that ebizzy threads were
> being packed onto a single core which is SMT-8 capable before spreading.
> This was a 6 core, SMT-8 machine. So for instance if I run 8 threads of
> ebizzy. the load balancing as record by perf sched record showed that
> two cores were packed upto 3 ebizzy threads and one core ran two ebizzy
> threads while the rest of the 3 cores were idle.
>
> I am unable to understand which part of this patch is aiding packing to
> a core.  There is this check in this patch right?
>
> if (sgs->group_capacity < 0)
>                 return true;
>
> which should ideally prevent such packing? Because irrespective of the
> number of SMT threads, the capacity of a core is unchanged. And in the
> above scenario, we have 6 tasks on 3 cores. So shouldn't the above check
> have caught it?

Yes, it should. The group_capacity should become < 0 because the CPUs
are fully loaded and the activity reaches the max capacity value +
nr_running (see the sketch below).
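
For illustration only, the relation described above could be sketched
as follows; this is not the exact expression from patch 11, just a
hedged sketch of why an overloaded group ends up with a negative
capacity (field names as in v2 of the series):

/*
 * Sketch: activity is pushed above the group power (up to power +
 * nr_running) when the CPUs are fully loaded, so the remaining
 * capacity goes negative and update_sd_pick_busiest() should select
 * the group.
 */
long capacity = (long)sgs->group_power - (long)sgs->group_activity;

if (capacity < 0)
	return true;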

>
> Regards
> Preeti U Murthy
>>
>> So I think we should be able to fix this by setting PREFER_SIBLING on
>> the SMT domain, that way we'll get single tasks running on each SMT
>> domain before filling them up until capacity.
>>
>> Now, its been a while since I looked at PREFER_SIBLING, and I've not yet
>> looked at what your patch does to it, but it seems to me that that is
>> the first direction we should look for an answer to this.
>>
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-06-02 17:06           ` Vincent Guittot
@ 2014-06-03 11:15             ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-06-03 11:15 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On Mon, Jun 02, 2014 at 07:06:44PM +0200, Vincent Guittot wrote:
> > Could you detail those conditions? FWIW those make excellent Changelog
> > material.
> 
> I have looked back into my tests and traces:
> 
> In a 1st test, the capacity of the CPU was still above half default
> value (power=538) unlike what i remembered. So it's some what "normal"
> to keep the task on CPU0 which also handles IRQ because sg_capacity
> still returns 1.

OK, so I suspect that once we move to the utilization-based capacity
stuff we'll do the migration IF the task indeed requires more cpu than
can be provided by the reduced one, right?

> In a 2nd test,the main task runs (most of the time) on CPU0 whereas
> the max power of the latter is only 623 and the cpu_power goes below
> 512 (power=330) during the use case. So the sg_capacity of CPU0 is
> null but the main task still stays on CPU0.
> The use case (scp transfer) is made of a long running task (ssh) and a
> periodic short task (scp). ssh runs on CPU0 and scp runs each 6ms on
> CPU1. The newly idle load balance on CPU1 doesn't pull the long
> running task although sg_capacity is null because of
> sd->nr_balance_failed is never incremented and load_balance doesn't
> trig an active load_balance. When an idle balance occurs in the middle
> of the newly idle balance, the ssh long task migrates on CPU1 but as
> soon as it sleeps and wakes up, it goes back on CPU0 because of the
> wake affine which migrates it back on CPU0 (issue solved by patch 09).

OK, so there are two problems here, right?
 1) we don't migrate away from cpu0
 2) if we do, we get pulled back.

And patch 9 solves 2, so maybe enhance its changelog to mention this
a bit more explicitly.

Which leaves us with 1... an interesting problem. I'm just not sure
endlessly kicking a low capacity cpu is the right fix for that.



^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-28 16:39           ` Vincent Guittot
@ 2014-06-03 12:03             ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-03 12:03 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, May 28, 2014 at 05:39:10PM +0100, Vincent Guittot wrote:
> On 28 May 2014 17:47, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
> >> On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:
> 
> [snip]
> 
> >
> >> This value is linked to the CPU on
> >> which it has run previously because of the time sharing with others
> >> tasks, so the unweighted load of a freshly migrated task will reflect
> >> its load on the previous CPU (with the time sharing with other tasks
> >> on prev CPU).
> >
> > I agree that the task runnable_avg_sum is always affected by the
> > circumstances on the cpu where it is running, and that it takes this
> > history with it. However, I think cfs.runnable_load_avg leads to less
> > problems than using the rq runnable_avg_sum. It would work nicely for
> > the two tasks on two cpus example I mentioned earlier. We don't need add
> 
> i would say that nr_running is an even better metrics for such
> situation as the load doesn't give any additional information.

I fail to understand how nr_running can be used. nr_running doesn't
tell you anything about the utilization of the cpu, just the number of
tasks that happen to be runnable at a point in time on a specific cpu.
It might be two small tasks that just happened to be running while you
read nr_running.

An unweighted version of cfs.runnable_load_avg gives you a metric that
captures cpu utilization to some extent, but not the number of tasks.
And it reflects task migrations immediately, unlike the rq
runnable_avg_sum.

> Just to point that we can spent a lot of time listing which use case
> are better covered by which metrics :-)

Agreed, but I think it is quite important to discuss what we understand
by cpu utilization. It seems to be different depending on what you want
to use it for. I think it is also clear that none of the metrics that
have been proposed are perfect. We therefore have to be careful to only
use metrics in scenarios where they make sense. IMHO, both rq
runnable_avg_sum and unweighted cfs.runnable_load_avg capture cpu
utilization, but in different ways.

We have done experiments internally with rq runnable_avg_sum for
load-balancing decisions in the past and found it unsuitable due to its
slow response to task migrations. That is why I brought it up here.
AFAICT, you use rq runnable_avg_sum more like a flag than a quantitative
measure of cpu utilization. Viewing things from an energy-awareness
point of view I'm more interested in the latter for estimating the
implications of moving tasks around. I don't have any problems with
using rq runnable_avg_sum for other things as long as we are fully
aware of how this metric works.
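
To make the two views concrete, here is a minimal sketch using the
3.15-era per-entity load tracking fields (assuming the rq-level
sched_avg is available); cpu_util_cfs() relies on an
"unweighted_load_avg" field that does not exist in mainline and is
only assumed here for illustration:

/* rq-wide view: geometric history, slow to follow task migrations */
static unsigned long cpu_util_rq(struct rq *rq)
{
	return rq->avg.runnable_avg_sum * SCHED_POWER_SCALE /
		(rq->avg.runnable_avg_period + 1);
}

/*
 * per-entity view: updated when tasks enqueue/dequeue/migrate, but
 * weight-free only if an unweighted sum is maintained (assumed field)
 */
static unsigned long cpu_util_cfs(struct rq *rq)
{
	return rq->cfs.unweighted_load_avg;
}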

> > something on top when the cpu is fully utilized by more than one task.
> > It comes more naturally with cfs.runnable_load_avg. If it is much larger
> > than 47742, it should be fairly safe to assume that you shouldn't stick
> > more tasks on that cpu.
> >
> >>
> >> I'm not saying that such metric is useless but it's not perfect as well.
> >
> > It comes with its own set of problems, agreed. Based on my current
> > understanding (or lack thereof) they just seem smaller :)
> 
> I think it's worth using the cpu utilization for some cases because it
> has got some information that are not available elsewhere. And the
> replacement of the current capacity computation is one example.
> As explained previously, I'm not against adding other metrics and i'm
> not sure to understand why you oppose these 2 metrics whereas they
> could be complementary

I think we more or less agree :) I'm fine with both metrics and I agree
that they complement each other. My concern is using the right metric
for the right job. If you choose to use rq runnable_avg_sum you have to
keep its slow reaction time in mind. I think that might be
difficult/not possible for some load-balancing decisions. That is
basically my point :)

Morten

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity
  2014-06-03 11:15             ` Peter Zijlstra
@ 2014-06-03 12:31               ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-06-03 12:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

On 3 June 2014 13:15, Peter Zijlstra <peterz@infradead.org> wrote:
> On Mon, Jun 02, 2014 at 07:06:44PM +0200, Vincent Guittot wrote:
>> > Could you detail those conditions? FWIW those make excellent Changelog
>> > material.
>>
>> I have looked back into my tests and traces:
>>
>> In a 1st test, the capacity of the CPU was still above half default
>> value (power=538) unlike what i remembered. So it's some what "normal"
>> to keep the task on CPU0 which also handles IRQ because sg_capacity
>> still returns 1.
>
> OK, so I suspect that once we move to utilization based capacity stuff
> we'll do the migration IF the task indeed requires more cpu than can be
> provided by the reduced, one, right?

The current version of the patchset only checks whether the capacity of a
CPU has been significantly reduced, in which case we should look for
another CPU. But we could effectively also compare the remaining capacity
with the task load.

>
>> In a 2nd test,the main task runs (most of the time) on CPU0 whereas
>> the max power of the latter is only 623 and the cpu_power goes below
>> 512 (power=330) during the use case. So the sg_capacity of CPU0 is
>> null but the main task still stays on CPU0.
>> The use case (scp transfer) is made of a long running task (ssh) and a
>> periodic short task (scp). ssh runs on CPU0 and scp runs each 6ms on
>> CPU1. The newly idle load balance on CPU1 doesn't pull the long
>> running task although sg_capacity is null because of
>> sd->nr_balance_failed is never incremented and load_balance doesn't
>> trig an active load_balance. When an idle balance occurs in the middle
>> of the newly idle balance, the ssh long task migrates on CPU1 but as
>> soon as it sleeps and wakes up, it goes back on CPU0 because of the
>> wake affine which migrates it back on CPU0 (issue solved by patch 09).
>
> OK, so there's two problems here, right?
>  1) we don't migrate away from cpu0
>  2) if we do, we get pulled back.
>
> And patch 9 solves 2, so maybe enhance its changelog to mention this
> slightly more explicit.
>
> Which leaves us with 1.. interesting problem. I'm just not sure
> endlessly kicking a low capacity cpu is the right fix for that.

What prevents us from migrating the task directly is that
nr_balance_failed is not incremented for newly idle balance, and it is the
only condition for active migration (apart from the asym packing feature).

We could add an additional test in need_active_balance for the newly idle
load balance. Something like:

if ((sd->flags & SD_SHARE_PKG_RESOURCES) &&
    (env->src_rq->cpu_power_orig * 100) > (env->src_rq->group_power *
                                           env->sd->imbalance_pct))
        return 1;
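
[ Purely for illustration: a standalone version of that check, using the
  numbers from the 2nd test above (power_orig = 623, power down to 330) and
  a typical imbalance_pct of 125; "group_power" is read here as the source
  CPU's current power, and none of the names are actual kernel API. ]

#include <stdbool.h>
#include <stdio.h>

static bool newly_idle_wants_active_balance(unsigned long power_orig,
                                            unsigned long power_cur,
                                            unsigned int imbalance_pct,
                                            bool share_pkg_resources)
{
        /* same shape as the test above: orig * 100 > current * pct */
        return share_pkg_resources &&
               power_orig * 100 > power_cur * imbalance_pct;
}

int main(void)
{
        /* 2nd test: 623 * 100 > 330 * 125, so the pull would be triggered */
        printf("trigger active balance: %d\n",
               newly_idle_wants_active_balance(623, 330, 125, true));
        return 0;
}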

>
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 04/11] sched: Allow all archs to set the power_orig
  2014-05-23 15:52   ` Vincent Guittot
@ 2014-06-03 13:22     ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-03 13:22 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote:
> power_orig is only changed for system with a SMT sched_domain level in order to
> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
> original capacity that is different from the default value.
> 
> Create a more generic function arch_scale_cpu_power that can be also used by
> non SMT platform to set power_orig.

I did a quick test of the patch set with adjusting cpu_power on
big.LITTLE (ARM TC2) to reflect the different compute capacities of the
A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and
without the patches applied, but with non-default cpu_powers.

I didn't see any difference in the load-balance. Three tasks ended up on
the two A15s and two tasks ended up on two of the three A7s leaving one
unused in both cases.

Using default cpu_power I get one task on each of the five cpus (best
throughput). Unless I messed something up, it seems that setting
cpu_power doesn't give me the best throughput with these patches
applied.

Have you done any tests on big.LITTLE?

Morten

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 04/11] sched: Allow all archs to set the power_orig
  2014-06-03 13:22     ` Morten Rasmussen
@ 2014-06-03 14:02       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-06-03 14:02 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On 3 June 2014 15:22, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote:
>> power_orig is only changed for system with a SMT sched_domain level in order to
>> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
>> original capacity that is different from the default value.
>>
>> Create a more generic function arch_scale_cpu_power that can be also used by
>> non SMT platform to set power_orig.
>
> I did a quick test of the patch set with adjusting cpu_power on
> big.LITTLE (ARM TC2) to reflect the different compute capacities of the
> A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and
> without the patches applied, but with non-default cpu_powers.
>
> I didn't see any difference in the load-balance. Three tasks ended up on
> the two A15s and two tasks ended up on two of the three A7s leaving one
> unused in both cases.
>
> Using default cpu_power I get one task on each of the five cpus (best
> throughput). Unless I messed something up, it seems that setting
> cpu_power doesn't give me the best throughput with these patches
> applied.

That's expected: this patchset is necessary but not sufficient to solve
the issue you mention. We also need to fix the way the imbalance is
calculated in such a situation. I plan to push that in another patchset
in order not to mix too many things together.

Vincent

>
> Have you done any tests on big.LITTLE?
>
> Morten

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-28 12:10     ` Morten Rasmussen
@ 2014-06-03 15:40       ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-06-03 15:40 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote:
> The rq runnable_avg_{sum, period} give a very long term view of the cpu
> utilization (I will use the term utilization instead of activity as I
> think that is what we are talking about here). IMHO, it is too slow to
> be used as basis for load balancing decisions. I think that was also
> agreed upon in the last discussion related to this topic [1].
> 
> The basic problem is that worst case: sum starting from 0 and period
> already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> (ms) for sum to reach 47742. In other words, the cpu might have been
> fully utilized for 345 ms before it is considered fully utilized.
> Periodic load-balancing happens much more frequently than that.

Like said earlier the 94% mark is actually hit much sooner, but yes,
likely still too slow.

50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc..
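
[ For reference, a tiny user-space program reproducing those numbers from
  y^32 = 0.5; the kernel's fixed-point version saturates at LOAD_AVG_MAX =
  47742 after LOAD_AVG_MAX_N = 345 periods. ]

#include <math.h>
#include <stdio.h>

int main(void)
{
        const double y = pow(0.5, 1.0 / 32.0);  /* per-ms decay, y^32 = 0.5 */
        const int ms[] = { 32, 64, 96, 128, 345 };
        unsigned int i;

        /* a fully busy cpu ramps from 0 to a fraction 1 - y^n after n ms */
        for (i = 0; i < sizeof(ms) / sizeof(ms[0]); i++)
                printf("%3d ms: %5.1f%% of LOAD_AVG_MAX\n",
                       ms[i], 100.0 * (1.0 - pow(y, ms[i])));
        return 0;
}

(compile with -lm; prints 50.0, 75.0, 87.5, about 93.8 and 99.9 respectively)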

> Also, if load-balancing actually moves tasks around it may take quite a
> while before runnable_avg_sum actually reflects this change. The next
> periodic load-balance is likely to happen before runnable_avg_sum has
> reflected the result of the previous periodic load-balance.
> 
> To avoid these problems, we need to base utilization on a metric which
> is updated instantaneously when we add/remove tasks to a cpu (or a least
> fast enough that we don't see the above problems).

So the per-task-load-tracking stuff already does that. It updates the
per-cpu load metrics on migration. See {de,en}queue_entity_load_avg().

And keeping an unweighted per-cpu variant isn't that much more work.
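
[ A toy model (not kernel code) of what keeping an unweighted per-cpu
  variant could look like: both sums are adjusted on enqueue/dequeue, so
  both react instantly to a migration. All names are made up for
  illustration. ]

#include <stdio.h>

struct task {
        unsigned long weight;           /* nice-scaled load, nice 0 = 1024 */
        unsigned long runnable_ratio;   /* runnable_avg_sum/period, scaled to 1024 */
};

struct cpu {
        unsigned long weighted_load;    /* ~ cfs.runnable_load_avg */
        unsigned long unweighted_load;  /* the proposed unweighted variant */
};

static void enqueue(struct cpu *c, const struct task *t)
{
        c->weighted_load   += t->weight * t->runnable_ratio / 1024;
        c->unweighted_load += t->runnable_ratio;
}

static void dequeue(struct cpu *c, const struct task *t)
{
        c->weighted_load   -= t->weight * t->runnable_ratio / 1024;
        c->unweighted_load -= t->runnable_ratio;
}

int main(void)
{
        struct cpu cpu0 = { 0, 0 }, cpu1 = { 0, 0 };
        struct task t = { 1024, 512 };  /* nice-0 task, ~50% runnable */

        enqueue(&cpu0, &t);
        dequeue(&cpu0, &t);             /* migrate: ...              */
        enqueue(&cpu1, &t);             /* ...both sums move at once */
        printf("cpu0 %lu/%lu  cpu1 %lu/%lu\n",
               cpu0.weighted_load, cpu0.unweighted_load,
               cpu1.weighted_load, cpu1.unweighted_load);
        return 0;
}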

> In the previous
> discussion [1] it was suggested that a sum of unweighted task
> runnable_avg_{sum,period} ratio instead. That is, an unweighted
> equivalent to weighted_cpuload(). That isn't a perfect solution either.
> It is fine as long as the cpus are not fully utilized, but when they are
> we need to use weighted_cpuload() to preserve smp_nice. What to do
> around the tipping point needs more thought, but I think that is
> currently the best proposal for a solution for task and cpu utilization.

I'm not too worried about the tipping point, per task runnable figures
of an overloaded cpu are higher, so migration between an overloaded cpu
and an underloaded cpu are going to be tricky no matter what we do.

> rq runnable_avg_sum is useful for decisions where we need a longer term
> view of the cpu utilization, but I don't see how we can use as cpu
> utilization metric for load-balancing decisions at wakeup or
> periodically.

So keeping one with a faster decay would add extra per-task storage. But
would be possible..


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-05-28 15:47         ` Morten Rasmussen
@ 2014-06-03 15:50           ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-06-03 15:50 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> Since we may do periodic load-balance every 10 ms or so, we will perform
> a number of load-balances where runnable_avg_sum will mostly be
> reflecting the state of the world before a change (new task queued or
> moved a task to a different cpu). If you had have two tasks continuously
> on one cpu and your other cpu is idle, and you move one of the tasks to
> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> first cpu while it starts from 0 on the other one. 10 ms later it will
> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> and we might decide to put more tasks on it because we don't know if
> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> task) or if it will continue to rise and eventually get to 47742.

Ah, no, since we track per task, and update the per-cpu ones when we
migrate tasks, the per-cpu values should be instantly updated.

If we were to increase per task storage, we might as well also track
running_avg not only runnable_avg.
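
[ A minimal sketch (not the kernel implementation) of what tracking
  running_avg next to runnable_avg could look like: same decay, but the
  running sum only accumulates while the task is actually on the cpu. ]

#include <math.h>
#include <stdio.h>

struct avg {
        double runnable_sum;    /* time spent runnable (running + waiting) */
        double running_sum;     /* time spent actually running */
        double period;
};

static void update(struct avg *a, double ms, int runnable, int running)
{
        const double d = pow(0.5, ms / 32.0);   /* y^32 = 0.5 per-ms decay */

        a->runnable_sum = a->runnable_sum * d + (runnable ? ms : 0);
        a->running_sum  = a->running_sum  * d + (running  ? ms : 0);
        a->period       = a->period       * d + ms;
}

int main(void)
{
        struct avg a = { 0, 0, 0 };
        int t;

        /* a task that is runnable all the time but only runs every other ms */
        for (t = 0; t < 1000; t++)
                update(&a, 1.0, 1, t & 1);
        printf("runnable %.0f%%  running %.0f%%\n",
               100 * a.runnable_sum / a.period,
               100 * a.running_sum / a.period);
        return 0;
}

(compile with -lm; prints roughly 100% runnable vs 50% running)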

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-03 12:03             ` Morten Rasmussen
@ 2014-06-03 15:59               ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-06-03 15:59 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Tue, Jun 03, 2014 at 01:03:54PM +0100, Morten Rasmussen wrote:
> On Wed, May 28, 2014 at 05:39:10PM +0100, Vincent Guittot wrote:
> > On 28 May 2014 17:47, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > > On Wed, May 28, 2014 at 02:15:03PM +0100, Vincent Guittot wrote:
> > >> On 28 May 2014 14:10, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > >> > On Fri, May 23, 2014 at 04:53:02PM +0100, Vincent Guittot wrote:

> > > I agree that the task runnable_avg_sum is always affected by the
> > > circumstances on the cpu where it is running, and that it takes this
> > > history with it. However, I think cfs.runnable_load_avg leads to less
> > > problems than using the rq runnable_avg_sum. It would work nicely for
> > > the two tasks on two cpus example I mentioned earlier. We don't need add
> > 
> > i would say that nr_running is an even better metrics for such
> > situation as the load doesn't give any additional information.
> 
> I fail to understand how nr_running can be used. nr_running doesn't tell
> you anything about the utilization of the cpu, just the number tasks
> that happen to be runnable at a point in time on a specific cpu. It
> might be two small tasks that just happened to be running while you read
> nr_running.

Agreed, I'm not at all seeing how nr_running is useful here.

> An unweighted version of cfs.runnable_load_avg gives you a metric that
> captures cpu utilization to some extend, but not the number of tasks.
> And it reflects task migrations immediately unlike the rq
> runnable_avg_sum.

So runnable_avg would be equal to the utilization as long as
there's idle time, as soon as we're over-loaded the metric shows how
much extra cpu is required.

That is, runnable_avg - running_avg >= 0 and the amount is the
exact amount of extra cpu required to make all tasks run but not have
idle time.

> Agreed, but I think it is quite important to discuss what we understand
> by cpu utilization. It seems to be different depending on what you want
> to use it for.

I understand utilization to be however much cpu is actually used, so I
would, per the existing naming, call running_avg to be the avg
utilization of a task/group/cpu whatever.

> We have done experiments internally with rq runnable_avg_sum for
> load-balancing decisions in the past and found it unsuitable due to its
> slow response to task migrations. That is why I brought it up here.

So I'm not entirely seeing that from the code (I've not traced this),
afaict we actually update the per-cpu values on migration based on the
task values.

old_rq->sum -= p->val;
new_rq->sum += p->val;

like,.. except of course totally obscured.


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-03 15:40       ` Peter Zijlstra
@ 2014-06-03 17:16         ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-03 17:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	Paul Turner, Benjamin Segall

On Tue, Jun 03, 2014 at 04:40:58PM +0100, Peter Zijlstra wrote:
> On Wed, May 28, 2014 at 01:10:01PM +0100, Morten Rasmussen wrote:
> > The rq runnable_avg_{sum, period} give a very long term view of the cpu
> > utilization (I will use the term utilization instead of activity as I
> > think that is what we are talking about here). IMHO, it is too slow to
> > be used as basis for load balancing decisions. I think that was also
> > agreed upon in the last discussion related to this topic [1].
> > 
> > The basic problem is that worst case: sum starting from 0 and period
> > already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> > (ms) for sum to reach 47742. In other words, the cpu might have been
> > fully utilized for 345 ms before it is considered fully utilized.
> > Periodic load-balancing happens much more frequently than that.
> 
> Like said earlier the 94% mark is actually hit much sooner, but yes,
> likely still too slow.
> 
> 50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc..

Agreed.

> 
> > Also, if load-balancing actually moves tasks around it may take quite a
> > while before runnable_avg_sum actually reflects this change. The next
> > periodic load-balance is likely to happen before runnable_avg_sum has
> > reflected the result of the previous periodic load-balance.
> > 
> > To avoid these problems, we need to base utilization on a metric which
> > is updated instantaneously when we add/remove tasks to a cpu (or a least
> > fast enough that we don't see the above problems).
> 
> So the per-task-load-tracking stuff already does that. It updates the
> per-cpu load metrics on migration. See {de,en}queue_entity_load_avg().

I think there is some confusion here. There are two per-cpu load metrics
that track differently.

The cfs.runnable_load_avg is basically the sum of the load contributions
of the tasks on the cfs rq. The sum gets updated whenever tasks are
{en,de}queued by adding/subtracting the load contribution of the task
being added/removed. That is the one you are referring to.

The rq runnable_avg_sum (actually rq->avg.runnable_avg_{sum, period}) is
tracking whether the cpu has something to do or not. It doesn't matter how
many tasks are runnable or what their load is. It is updated in
update_rq_runnable_avg(). It increases when rq->nr_running > 0 and
decays if not. It also takes time spent running rt tasks into account in
idle_{enter, exit}_fair(). So if you remove tasks from the rq, this
metric will start decaying and eventually get to 0, unlike the
cfs.runnable_load_avg where the task load contribution is subtracted every
time a task is removed. The rq runnable_avg_sum is the one being used in
this patch set.

Ben, pjt, please correct me if I'm wrong.
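
[ A toy comparison (not kernel code) of how the two metrics described above
  react to a migration: two always-running tasks on cpu0, one of them moved
  to cpu1 at t = 400 ms. The enqueue/dequeue-based sums jump immediately;
  the decayed "cpu has work" signal on cpu1 needs hundreds of ms to catch
  up. ]

#include <math.h>
#include <stdio.h>

int main(void)
{
        const double y = pow(0.5, 1.0 / 32.0);  /* y^32 = 0.5 */
        double cfs0 = 2048, cfs1 = 0;   /* enqueue/dequeue based, 1024/task */
        double busy0 = 0, busy1 = 0;    /* decayed busy signal, 0..1 */
        int t;

        for (t = 1; t <= 800; t++) {
                if (t == 400) {                 /* migrate one task */
                        cfs0 -= 1024;
                        cfs1 += 1024;
                }
                busy0 = busy0 * y + (1 - y);            /* cpu0 busy all along */
                busy1 = busy1 * y + (cfs1 ? 1 - y : 0); /* cpu1 busy from t=400 */
                if (t == 400 || t == 432 || t == 496 || t == 745)
                        printf("t=%3d  cfs0=%4.0f cfs1=%4.0f  busy0=%3.0f%% busy1=%3.0f%%\n",
                               t, cfs0, cfs1, 100 * busy0, 100 * busy1);
        }
        return 0;
}

(compile with -lm; busy1 is ~2% right after the move, ~51% at t=432, ~88%
at t=496 and only reaches ~100% around t=745)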

> And keeping an unweighted per-cpu variant isn't that much more work.

Agreed.

> 
> > In the previous
> > discussion [1] it was suggested that a sum of unweighted task
> > runnable_avg_{sum,period} ratio instead. That is, an unweighted
> > equivalent to weighted_cpuload(). That isn't a perfect solution either.
> > It is fine as long as the cpus are not fully utilized, but when they are
> > we need to use weighted_cpuload() to preserve smp_nice. What to do
> > around the tipping point needs more thought, but I think that is
> > currently the best proposal for a solution for task and cpu utilization.
> 
> I'm not too worried about the tipping point, per task runnable figures
> of an overloaded cpu are higher, so migration between an overloaded cpu
> and an underloaded cpu are going to be tricky no matter what we do.

Yes, agreed. I just got the impression that you were concerned about
smp_nice last time we discussed this.

> > rq runnable_avg_sum is useful for decisions where we need a longer term
> > view of the cpu utilization, but I don't see how we can use as cpu
> > utilization metric for load-balancing decisions at wakeup or
> > periodically.
> 
> So keeping one with a faster decay would add extra per-task storage. But
> would be possible..

I have had that thought when we discussed potential replacements for
cpu_load[]. It will require some messing around with the nicely
optimized load tracking maths if we want to have load tracking with a
different y-coefficient.


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-03 15:50           ` Peter Zijlstra
@ 2014-06-03 17:20             ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-03 17:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	Benjamin Segall, Paul Turner

On Tue, Jun 03, 2014 at 04:50:07PM +0100, Peter Zijlstra wrote:
> On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> > Since we may do periodic load-balance every 10 ms or so, we will perform
> > a number of load-balances where runnable_avg_sum will mostly be
> > reflecting the state of the world before a change (new task queued or
> > moved a task to a different cpu). If you had have two tasks continuously
> > on one cpu and your other cpu is idle, and you move one of the tasks to
> > the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> > first cpu while it starts from 0 on the other one. 10 ms later it will
> > have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> > it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> > and we might decide to put more tasks on it because we don't know if
> > runnable_avg_sum represents a partially utilized cpu (for example a 50%
> > task) or if it will continue to rise and eventually get to 47742.
> 
> Ah, no, since we track per task, and update the per-cpu ones when we
> migrate tasks, the per-cpu values should be instantly updated.

No, not for this per-cpu tracking metric :) For cfs.runnable_load_avg
you are right, but this patch set is using rq->avg.runnable_avg_sum
which is different. See my other reply.

> If we were to increase per task storage, we might as well also track
> running_avg not only runnable_avg.

That could probably make sense. We had that in pjt's first proposal.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-03 17:16         ` Morten Rasmussen
@ 2014-06-03 17:37           ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-06-03 17:37 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	Paul Turner, Benjamin Segall

On Tue, Jun 03, 2014 at 06:16:28PM +0100, Morten Rasmussen wrote:
> > I'm not too worried about the tipping point, per task runnable figures
> > of an overloaded cpu are higher, so migration between an overloaded cpu
> > and an underloaded cpu are going to be tricky no matter what we do.
> 
> Yes, agreed. I just got the impression that you were concerned about
> smp_nice last time we discussed this.

Well, yes, we need to keep that working, but the exact details around the
tipping point are near impossible to get right, so I'm not too bothered
there.

> > > rq runnable_avg_sum is useful for decisions where we need a longer term
> > > view of the cpu utilization, but I don't see how we can use as cpu
> > > utilization metric for load-balancing decisions at wakeup or
> > > periodically.
> > 
> > So keeping one with a faster decay would add extra per-task storage. But
> > would be possible..
> 
> I have had that thought when we discussed potential replacements for
> cpu_load[]. It will require some messing around with the nicely
> optimized load tracking maths if we want to have load tracking with a
> different y-coefficient.

My initial thought was a y=0.5, which is really >>=1. But yes, if we
want something else that'll get messy real fast methinks.
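
[ Illustration of the ">>= 1" idea, assuming a ~1 ms period and a
  contribution of 1024 for a fully busy period: with y = 0.5 the decay is a
  single shift and the signal settles within a handful of periods. ]

#include <stdio.h>

int main(void)
{
        unsigned long sum = 0;
        const unsigned long contrib = 1024;     /* fully busy period */
        int period;

        for (period = 1; period <= 6; period++) {
                sum = (sum >> 1) + contrib;     /* y = 0.5 decay is a shift */
                printf("after %d ms: %lu (max 2048)\n", period, sum);
        }
        return 0;
}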

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-03 17:16         ` Morten Rasmussen
@ 2014-06-03 17:39           ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-06-03 17:39 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	Paul Turner, Benjamin Segall

On Tue, Jun 03, 2014 at 06:16:28PM +0100, Morten Rasmussen wrote:
> > So the per-task-load-tracking stuff already does that. It updates the
> > per-cpu load metrics on migration. See {de,en}queue_entity_load_avg().
> 
> I think there is some confusion here. There are two per-cpu load metrics
> that track differently.
> 
> The cfs.runnable_load_avg is basically the sum of the load contributions
> of the tasks on the cfs rq. The sum gets updated whenever tasks are
> {en,de}queued by adding/subtracting the load contribution of the task
> being added/removed. That is the one you are referring to.
> 
> The rq runnable_avg_sum (actually rq->avg.runnable_avg_{sum, period}) is
> tracking whether the cpu has something to do or not. It doesn't matter how
> many tasks are runnable or what their load is. It is updated in
> update_rq_runnable_avg(). It increases when rq->nr_running > 0 and
> decays if not. It also takes time spent running rt tasks into account in
> idle_{enter, exit}_fair(). So if you remove tasks from the rq, this
> metric will start decaying and eventually get to 0, unlike the
> cfs.runnable_load_avg where the task load contribution is subtracted every
> time a task is removed. The rq runnable_avg_sum is the one being used in
> this patch set.
> 
> Ben, pjt, please correct me if I'm wrong.

Argh, ok I completely missed that. I think the cfs.runnable_load_avg is
the sane number; I'm not entirely sure what rq->avg.runnable_avg is
good for, it seems a weird metric on first consideration.

Will have to ponder that a bit more.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-03 15:59               ` Peter Zijlstra
@ 2014-06-03 17:41                 ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-03 17:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	Paul Turner, Benjamin Segall

On Tue, Jun 03, 2014 at 04:59:39PM +0100, Peter Zijlstra wrote:
> On Tue, Jun 03, 2014 at 01:03:54PM +0100, Morten Rasmussen wrote:
> > An unweighted version of cfs.runnable_load_avg gives you a metric that
> > captures cpu utilization to some extend, but not the number of tasks.
> > And it reflects task migrations immediately unlike the rq
> > runnable_avg_sum.
> 
> So runnable_avg would be equal to the utilization as long as
> there's idle time, as soon as we're over-loaded the metric shows how
> much extra cpu is required.
> 
> That is, runnable_avg - running_avg >= 0 and the amount is the
> exact amount of extra cpu required to make all tasks run but not have
> idle time.

Yes, roughly. runnable_avg goes up quite steeply if you have many tasks
on a fully utilized cpu, so the actual amount of extra cpu required
might be somewhat lower. I can't come up with something better, so I
agree.

> 
> > Agreed, but I think it is quite important to discuss what we understand
> > by cpu utilization. It seems to be different depending on what you want
> > to use it for.
> 
> I understand utilization to be however much cpu is actually used, so I
> would, per the existing naming, call running_avg to be the avg
> utilization of a task/group/cpu whatever.

I see your point, but for load balancing purposes we are more interested
in the runnable_avg as it tells us about the cpu capacity requirements.
I don't like to throw more terms into the mix, but you could call
runnable_avg the potential task/group/cpu utilization. This is an
estimate of how much utilization a task would cause if we moved it to an
idle cpu. That might be quite different from running_avg on an
over-utilized cpu.

> 
> > We have done experiments internally with rq runnable_avg_sum for
> > load-balancing decisions in the past and found it unsuitable due to its
> > slow response to task migrations. That is why I brought it up here.
> 
> So I'm not entirely seeing that from the code (I've not traced this),
> afaict we actually update the per-cpu values on migration based on the
> task values.
> 
> old_rq->sum -= p->val;
> new_rq->sum += p->val;
> 
> like,.. except of course totally obscured.

Yes, that holds for cfs.runnable_load_avg; rq->avg.runnable_avg_sum is different.
See the other reply.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-03 15:40       ` Peter Zijlstra
@ 2014-06-03 23:11         ` Yuyang Du
  -1 siblings, 0 replies; 216+ messages in thread
From: Yuyang Du @ 2014-06-03 23:11 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, Vincent Guittot, mingo, linux-kernel, linux,
	linux-arm-kernel, preeti, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

> > The basic problem is that worst case: sum starting from 0 and period
> > already at LOAD_AVG_MAX = 47742, it takes LOAD_AVG_MAX_N = 345 periods
> > (ms) for sum to reach 47742. In other words, the cpu might have been
> > fully utilized for 345 ms before it is considered fully utilized.
> > Periodic load-balancing happens much more frequently than that.
> 
> Like said earlier the 94% mark is actually hit much sooner, but yes,
> likely still too slow.
> 
> 50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, etc..
> 
> > In the previous
> > discussion [1] it was suggested that a sum of unweighted task
> > runnable_avg_{sum,period} ratio instead. That is, an unweighted
> > equivalent to weighted_cpuload(). That isn't a perfect solution either.
> > It is fine as long as the cpus are not fully utilized, but when they are
> > we need to use weighted_cpuload() to preserve smp_nice. What to do
> > around the tipping point needs more thought, but I think that is
> > currently the best proposal for a solution for task and cpu utilization.
> 
> I'm not too worried about the tipping point, per task runnable figures
> of an overloaded cpu are higher, so migration between an overloaded cpu
> and an underloaded cpu are going to be tricky no matter what we do.
> 
Hi,

Can I join this discussion late?

As I understand it, you are talking about a metric for cpu activity, and
the issues with runnable_avg_sum are its sluggish response to the latest
changes and the need for unweighted load averages.

You might be aware of my recent proposal on CPU ConCurrency (CC). It is
1) an average of nr_running, or 2) nr_running-weighted CPU utilization.
So it is a combination of CPU utilization and the run queue (both
factored in naturally). It meets the needs you talked about, I think.
You can take it as a candidate, or at least we can talk about it?

Thanks,
Yuyang
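
For reference, here is a small standalone calculation (not part of the
patchset) that reproduces the convergence numbers quoted above, assuming
the usual PELT decay in which a contribution is halved every 32 ms:

  #include <stdio.h>
  #include <math.h>

  int main(void)
  {
          /* Decay factor: y^32 = 1/2. */
          const double y = pow(0.5, 1.0 / 32.0);
          /* Analytical maximum of the sum, 1024/(1-y) ~ 47789; the
           * kernel's integer arithmetic gives LOAD_AVG_MAX = 47742. */
          const double max = 1024.0 / (1.0 - y);

          /* Fraction of the maximum reached after t ms of continuous
           * running, starting from an empty sum: 1 - y^t. */
          for (int t = 32; t <= 352; t += 32)
                  printf("%3d ms: %5.1f%% of max\n",
                         t, 100.0 * (1.0 - pow(y, t)));
          printf("max ~ %.0f\n", max);
          return 0;
  }

This prints 50% at 32 ms, 75% at 64 ms, 87.5% at 96 ms, ~94% at 128 ms,
and only gets within a rounding error of the maximum after roughly
345 ms, which is the sluggishness being discussed.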

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-03 15:50           ` Peter Zijlstra
@ 2014-06-04  7:47             ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-06-04  7:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
>> Since we may do periodic load-balance every 10 ms or so, we will perform
>> a number of load-balances where runnable_avg_sum will mostly be
>> reflecting the state of the world before a change (new task queued or
>> moved a task to a different cpu). If you had have two tasks continuously
>> on one cpu and your other cpu is idle, and you move one of the tasks to
>> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
>> first cpu while it starts from 0 on the other one. 10 ms later it will
>> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
>> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
>> and we might decide to put more tasks on it because we don't know if
>> runnable_avg_sum represents a partially utilized cpu (for example a 50%
>> task) or if it will continue to rise and eventually get to 47742.
>
> Ah, no, since we track per task, and update the per-cpu ones when we
> migrate tasks, the per-cpu values should be instantly updated.
>
> If we were to increase per task storage, we might as well also track
> running_avg not only runnable_avg.

I agree that the removed running_avg should give more useful
information about the load of a CPU.

The main issue with running_avg is that it's disturbed by other tasks
(as pointed out previously). As a typical example, if we have 2 tasks
with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
in the range of [50% - 100%] depending on the parallelism of the
runtime of the tasks, whereas the reality is 50%, and the use of
running_avg will return this value.

Vincent

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04  7:47             ` Vincent Guittot
@ 2014-06-04  8:08               ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-06-04  8:08 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> >> Since we may do periodic load-balance every 10 ms or so, we will perform
> >> a number of load-balances where runnable_avg_sum will mostly be
> >> reflecting the state of the world before a change (new task queued or
> >> moved a task to a different cpu). If you had have two tasks continuously
> >> on one cpu and your other cpu is idle, and you move one of the tasks to
> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> >> first cpu while it starts from 0 on the other one. 10 ms later it will
> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> >> and we might decide to put more tasks on it because we don't know if
> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> >> task) or if it will continue to rise and eventually get to 47742.
> >
> > Ah, no, since we track per task, and update the per-cpu ones when we
> > migrate tasks, the per-cpu values should be instantly updated.
> >
> > If we were to increase per task storage, we might as well also track
> > running_avg not only runnable_avg.
> 
> I agree that the removed running_avg should give more useful
> information about the the load of a CPU.
> 
> The main issue with running_avg is that it's disturbed by other tasks
> (as point out previously). As a typical example,  if we have 2 tasks
> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
> in the range of [100% - 50%] depending of the parallelism of the
> runtime of the tasks whereas the reality is 50% and the use of
> running_avg will return this value

I'm not sure I see how 100% is possible, but yes I agree that runnable
can indeed be inflated due to this queueing effect.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04  8:08               ` Peter Zijlstra
@ 2014-06-04  8:55                 ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-04  8:55 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, Jun 04, 2014 at 09:08:09AM +0100, Peter Zijlstra wrote:
> On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
> > On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> > >> Since we may do periodic load-balance every 10 ms or so, we will perform
> > >> a number of load-balances where runnable_avg_sum will mostly be
> > >> reflecting the state of the world before a change (new task queued or
> > >> moved a task to a different cpu). If you had have two tasks continuously
> > >> on one cpu and your other cpu is idle, and you move one of the tasks to
> > >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> > >> first cpu while it starts from 0 on the other one. 10 ms later it will
> > >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> > >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> > >> and we might decide to put more tasks on it because we don't know if
> > >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> > >> task) or if it will continue to rise and eventually get to 47742.
> > >
> > > Ah, no, since we track per task, and update the per-cpu ones when we
> > > migrate tasks, the per-cpu values should be instantly updated.
> > >
> > > If we were to increase per task storage, we might as well also track
> > > running_avg not only runnable_avg.
> > 
> > I agree that the removed running_avg should give more useful
> > information about the the load of a CPU.
> > 
> > The main issue with running_avg is that it's disturbed by other tasks
> > (as point out previously). As a typical example,  if we have 2 tasks
> > with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
> > in the range of [100% - 50%] depending of the parallelism of the
> > runtime of the tasks whereas the reality is 50% and the use of
> > running_avg will return this value

Both running_avg and runnable_avg are affected by other tasks on the
same cpus, but in different ways. They are equal if you only have one
task on a cpu. If you have more, running_avg will give you the true
requirement of the tasks until the cpu is fully utilized, at which point
the task running_avg will drop if you add more tasks (the unweighted sum
of task running_avgs remains constant).

runnable_avg, on the other hand, might be affected as soon as you have
two tasks running on the same cpu if they are runnable at the same time.
That isn't necessarily a bad thing for load-balancing purposes, because
tasks that are runnable at the same time are likely to be run more
efficiently by placing them on different cpus. You might view it as a
sort of built-in concurrency factor, somewhat similar to what Yuyang is
proposing. runnable_avg increases rapidly when the cpu is over-utilized.

> I'm not sure I see how 100% is possible, but yes I agree that runnable
> can indeed be inflated due to this queueing effect.

You should only be able to get to 75% worst case for runnable_avg in
that example. The total running_avg is 50% no matter whether the tasks
overlap or not.

If you had five tasks on one cpu that each have a 25% requirement, you
can get individual task runnable_avgs of up to 100% (the cpu's
unweighted runnable_load_avg can get up to 500%, I think), but the task
running_avgs would be 20% each (total of 100%).
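
As a toy check of that arithmetic (purely illustrative, not scheduler
code), with n tasks that each want a fraction 'demand' of one CPU:

  #include <stdio.h>

  int main(void)
  {
          const double demand = 0.25;     /* each task wants 25% of a CPU */
          const int n = 5;

          /* Once total demand exceeds the CPU, each task only actually
           * runs for an equal share of it... */
          double running = (n * demand > 1.0) ? 1.0 / n : demand;

          printf("per-task running ~ %.0f%%, total ~ %.0f%%\n",
                 100.0 * running, 100.0 * n * running);

          /* ...but every task is runnable whenever it wants the CPU and
           * has to queue behind the others, so per-task runnable can
           * approach 100% and the unweighted per-cpu sum can approach
           * n * 100%. */
          return 0;
  }

The exact runnable figures depend on the queueing pattern and on the
geometric series, as discussed further down the thread.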

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04  8:55                 ` Morten Rasmussen
@ 2014-06-04  9:23                   ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-06-04  9:23 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, Jun 04, 2014 at 09:55:42AM +0100, Morten Rasmussen wrote:
> Both running_avg and runnable_avg are affected by other tasks on the
> same cpus, but in different ways. They are equal if you only have one
> task on a cpu. If you have more, running_avg will give you the true
> requirement of the tasks until the cpu is fully utilized. At which point
> the task running_avg will drop if you add more tasks (the unweighted sum
> of task running_avgs remains constant).
> 
> runnable_avg on the other hand, might be affected as soon as you have
> two task running on the same cpu if they are runnable at the same time.
> That isn't necessarily a bad thing for load-balancing purposes, because
> tasks that are runnable at the same time are likely to be run more
> efficiently by placing them on different cpus. You might view as at sort
> of built in concurrency factor, somewhat similar to what Yuyang is
> proposing. runnable_avg increases rapidly when the cpu is over-utilized.

Agreed.

> > I'm not sure I see how 100% is possible, but yes I agree that runnable
> > can indeed be inflated due to this queueing effect.
> 
> You should only be able to get to 75% worst case for runnable_avg for
> that example. The total running_avg is 50% no matter if the tasks
> overlaps or not.

Yes, 75% is what I ended up with.

> f you had five tasks on one cpu that each have a 25% requirement you can
> get individual task runnable_avgs of up to 100% (cpu unweighted
> runnable_load_avg can get up 500%, I think), but the task running_avgs
> would be 20% each (total of 100%). 

Yeah, more or less so indeed. I had not considered the queueing effects
on runnable_avg yesterday, so it's good that that got raised.

That does indeed invalidate my "runnable - running := extra cpu
required" thing. It ends up being the extra cpu required for 0 latency
but gobs of idle time, which is something else entirely.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04  8:08               ` Peter Zijlstra
@ 2014-06-04  9:32                 ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-06-04  9:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
>> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
>> >> Since we may do periodic load-balance every 10 ms or so, we will perform
>> >> a number of load-balances where runnable_avg_sum will mostly be
>> >> reflecting the state of the world before a change (new task queued or
>> >> moved a task to a different cpu). If you had have two tasks continuously
>> >> on one cpu and your other cpu is idle, and you move one of the tasks to
>> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
>> >> first cpu while it starts from 0 on the other one. 10 ms later it will
>> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
>> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
>> >> and we might decide to put more tasks on it because we don't know if
>> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
>> >> task) or if it will continue to rise and eventually get to 47742.
>> >
>> > Ah, no, since we track per task, and update the per-cpu ones when we
>> > migrate tasks, the per-cpu values should be instantly updated.
>> >
>> > If we were to increase per task storage, we might as well also track
>> > running_avg not only runnable_avg.
>>
>> I agree that the removed running_avg should give more useful
>> information about the the load of a CPU.
>>
>> The main issue with running_avg is that it's disturbed by other tasks
>> (as point out previously). As a typical example,  if we have 2 tasks
>> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
>> in the range of [100% - 50%] depending of the parallelism of the
>> runtime of the tasks whereas the reality is 50% and the use of
>> running_avg will return this value
>
> I'm not sure I see how 100% is possible, but yes I agree that runnable
> can indeed be inflated due to this queueing effect.

In fact, it can be even worse than that, because I forgot to take into
account the geometric series effect, which implies that it depends on
the runtime (idle time) of the task.

Take 3 examples:

2 tasks that need to run 10 ms simultaneously every 40 ms. If they share
the same CPU, they will be on the runqueue 20 ms (in fact a bit less
for one of them); their load (runnable_avg_sum/runnable_avg_period)
will be 33% each, so the unweighted runnable_load_avg of the CPU will
be 66%.

2 tasks that need to run 25 ms simultaneously every 100 ms. If they
share the same CPU, they will be on the runqueue 50 ms (in fact a bit
less for one of them); their load (runnable_avg_sum/runnable_avg_period)
will be 74% each, so the unweighted runnable_load_avg of the CPU will
be 148%.

2 tasks that need to run 50 ms simultaneously every 200 ms. If they
share the same CPU, they will be on the runqueue 100 ms (in fact a bit
less for one of them); their load (runnable_avg_sum/runnable_avg_period)
will be 89% each, so the unweighted runnable_load_avg of the CPU will
be 180%.
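
A rough way to see this effect (an illustrative sketch, not the kernel
code) is the steady-state ratio of a periodic runnable pattern, sampled
at the end of the runnable phase, with the usual halving of
contributions every 32 ms:

  #include <stdio.h>
  #include <math.h>

  /* Steady-state runnable ratio of a task (or runqueue) that is runnable
   * for 'run' ms out of every 'period' ms: (1 - y^run) / (1 - y^period),
   * with y^32 = 1/2. */
  static double pelt_ratio(double run, double period)
  {
          const double y = pow(0.5, 1.0 / 32.0);

          return (1.0 - pow(y, run)) / (1.0 - pow(y, period));
  }

  int main(void)
  {
          /* Same 50% duty cycle, very different steady-state ratios. */
          printf(" 20 ms of  40 ms: %4.1f%%\n", 100.0 * pelt_ratio(20, 40));
          printf(" 50 ms of 100 ms: %4.1f%%\n", 100.0 * pelt_ratio(50, 100));
          printf("100 ms of 200 ms: %4.1f%%\n", 100.0 * pelt_ratio(100, 200));
          return 0;
  }

The exact percentages depend on when and how the average is sampled, so
they will not line up exactly with the figures above; the qualitative
point is that the absolute run/idle lengths matter because of the
geometric decay, even though the duty cycle is 50% in all three cases.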

Vincent

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04  9:23                   ` Peter Zijlstra
@ 2014-06-04  9:35                     ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-06-04  9:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On 4 June 2014 11:23, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Jun 04, 2014 at 09:55:42AM +0100, Morten Rasmussen wrote:
>> Both running_avg and runnable_avg are affected by other tasks on the
>> same cpus, but in different ways. They are equal if you only have one
>> task on a cpu. If you have more, running_avg will give you the true
>> requirement of the tasks until the cpu is fully utilized. At which point
>> the task running_avg will drop if you add more tasks (the unweighted sum
>> of task running_avgs remains constant).
>>
>> runnable_avg on the other hand, might be affected as soon as you have
>> two task running on the same cpu if they are runnable at the same time.
>> That isn't necessarily a bad thing for load-balancing purposes, because
>> tasks that are runnable at the same time are likely to be run more
>> efficiently by placing them on different cpus. You might view as at sort
>> of built in concurrency factor, somewhat similar to what Yuyang is
>> proposing. runnable_avg increases rapidly when the cpu is over-utilized.
>
> Agreed.
>
>> > I'm not sure I see how 100% is possible, but yes I agree that runnable
>> > can indeed be inflated due to this queueing effect.
>>
>> You should only be able to get to 75% worst case for runnable_avg for
>> that example. The total running_avg is 50% no matter if the tasks
>> overlaps or not.
>
> Yes, 75% is what I ended up with.

Can you explain how you reach 75%? It depends on the runtime, and a
runtime longer than 345 ms will end up at a 100% load whatever the
idle time was previously.


>
>> f you had five tasks on one cpu that each have a 25% requirement you can
>> get individual task runnable_avgs of up to 100% (cpu unweighted
>> runnable_load_avg can get up 500%, I think), but the task running_avgs
>> would be 20% each (total of 100%).
>
> Yeah, more or less so indeed. I had not considered the queueing effects
> on runnable_avg yesterday, so good that that got raised.
>
> That does indeed invalidate my: runnable - running := extra cpu required
> thing. It ends up being the extra cpu required for 0 latency but gobs of
> idle time, which is something else entirely.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 04/11] sched: Allow all archs to set the power_orig
  2014-05-30 20:50       ` Vincent Guittot
@ 2014-06-04  9:42         ` Dietmar Eggemann
  -1 siblings, 0 replies; 216+ messages in thread
From: Dietmar Eggemann @ 2014-06-04  9:42 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

[...]
>> (1) We assume that the current way (update_cpu_power() calls
>> arch_scale_freq_power() to get the avg power(freq) over the time period
>> since the last call to arch_scale_freq_power()) is suitable
>> for us. Do you have another opinion here?
> 
> Using power (or power_freq as you mentioned below) is probably the
> easiest and more straight forward solution. You can use it to scale
> each element when updating entity runnable.
> Nevertheless, I see to 2 potential issues:
> - is power updated often enough to correctly follow the frequency
> scaling ? we need to compare power update frequency with
> runnable_avg_sum variation speed and the rate at which we will change
> the CPU's frequency.
> - the max value of runnable_avg_sum will be also scaled so a task
> running on a CPU with less capacity could be seen as a "low" load even
> if it's an always running tasks. So we need to find a way to reach the
> max value for such situation

I think I mixed two problems together here:

Firstly, we need to scale cpu power in update_cpu_power() with respect
to uArch, frequency and rt/irq pressure.
Here the freq-related value we get back from arch_scale_freq_power(...,
cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)).

Secondly, to be able to scale the runnable avg sum of a sched entity
(se->avg->runnable_avg_sum), we would preferably have a coefficient
representing uArch differences (cpu_power_orig(cpu)/cpu_power_orig(most
powerful cpu in the system)) and another coefficient for frequency (avg
freq over 'now - sa->last_runnable_update' for that cpu / max_freq(cpu)).
This value would have to be retrieved from the arch in
__update_entity_runnable_avg().
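
As a rough sketch of that second point (illustrative only; the helper
name and all the numbers are made up, this is not the kernel code), each
raw contribution to runnable_avg_sum would be scaled by the two
coefficients before being accumulated:

  #include <stdio.h>

  /* Scale a raw 1 ms contribution (1024) by a uArch coefficient
   * (cpu_power_orig(cpu) / cpu_power_orig(biggest cpu)) and a frequency
   * coefficient (avg freq over the update window / max freq). */
  static unsigned long scale_contrib(unsigned long contrib,
                                     unsigned long power_orig,
                                     unsigned long max_power_orig,
                                     unsigned long avg_freq,
                                     unsigned long max_freq)
  {
          contrib = contrib * power_orig / max_power_orig;  /* uArch */
          contrib = contrib * avg_freq / max_freq;          /* frequency */
          return contrib;
  }

  int main(void)
  {
          /* e.g. a little CPU at half of its own maximum frequency: a
           * 1 ms slice counts for much less than on the big CPU at full
           * speed. */
          printf("scaled contrib = %lu (out of 1024)\n",
                 scale_contrib(1024, 600, 1024, 500000, 1000000));
          return 0;
  }

This also makes Vincent's second concern above visible: once the
contributions are scaled this way, an always-running task on a small or
slowed-down CPU can no longer reach the unscaled maximum.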

>> (2) Is the current layout of update_cpu_power() adequate for this, where
>> we scale power_orig related to freq and then related to rt/(irq):
>>
>>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>>   power = scale_rt(scale_freq(power_orig))
>>
>> or do we need an extra power_freq data member on the rq and do:
>>
>>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>>   power_freq = scale_freq(power_orig))
>>   power = scale_rt(power_orig))
> 
> do you really mean power = scale_rt(power_orig) or power=scale_rt(power_freq) ?

No, I also think that power=scale_rt(power_freq) is correct.
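
Spelled out as a sketch (with placeholder scaling hooks, not the real
update_cpu_power()), the layout being discussed would then be:

  #include <stdio.h>

  #define SCHED_POWER_SCALE 1024UL

  /* Placeholder hooks, made up for illustration. */
  static unsigned long scale_cpu(unsigned long p)  { return p * 600 / 1024; } /* uArch  */
  static unsigned long scale_freq(unsigned long p) { return p * 700 / 1024; } /* DVFS   */
  static unsigned long scale_rt(unsigned long p)   { return p * 900 / 1024; } /* rt/irq */

  int main(void)
  {
          unsigned long power_orig = scale_cpu(SCHED_POWER_SCALE);
          unsigned long power_freq = scale_freq(power_orig); /* new rq field */
          unsigned long power      = scale_rt(power_freq);

          printf("power_orig=%lu power_freq=%lu power=%lu\n",
                 power_orig, power_freq, power);
          return 0;
  }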

>> In other words, do we consider rt/(irq) pressure when calculating freq
>> scale invariant task load or not?
> 
> we should take power_freq which implies a new field
[...]


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04  9:23                   ` Peter Zijlstra
@ 2014-06-04  9:44                     ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-04  9:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, Jun 04, 2014 at 10:23:13AM +0100, Peter Zijlstra wrote:
> > f you had five tasks on one cpu that each have a 25% requirement you can
> > get individual task runnable_avgs of up to 100% (cpu unweighted
> > runnable_load_avg can get up 500%, I think), but the task running_avgs
> > would be 20% each (total of 100%). 
> 
> Yeah, more or less so indeed. I had not considered the queueing effects
> on runnable_avg yesterday, so good that that got raised.
> 
> That does indeed invalidate my: runnable - running := extra cpu required
> thing. It ends up being the extra cpu required for 0 latency but gobs of
> idle time, which is something else entirely.

Agreed, but I think it is still a useful estimate of the required
compute capacity. If there is a significant difference between runnable
and running on a cpu, the current mix of tasks is not good for latency.
However, we need to treat it as a worst case estimate and not necessarily
try to move exactly runnable-running worth of tasks to another cpu.

So far I haven't been able to come up with something better.
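
As a minimal sketch of how such a worst-case estimate might be used (the
function name is made up; this only illustrates the point above):

  #include <stdio.h>

  /* Upper bound on the extra capacity a cpu would need: treat it as a
   * ceiling on how much load to migrate away, not as an exact target. */
  static unsigned long extra_capacity(unsigned long runnable_avg,
                                      unsigned long running_avg)
  {
          return runnable_avg > running_avg ? runnable_avg - running_avg : 0;
  }

  int main(void)
  {
          /* e.g. runnable at 900/1024 vs running at 600/1024 of capacity */
          printf("worst-case extra capacity: %lu\n",
                 extra_capacity(900, 600));
          return 0;
  }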

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04  9:32                 ` Vincent Guittot
@ 2014-06-04 10:00                   ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-04 10:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, Jun 04, 2014 at 10:32:10AM +0100, Vincent Guittot wrote:
> On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
> >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> >> >> Since we may do periodic load-balance every 10 ms or so, we will perform
> >> >> a number of load-balances where runnable_avg_sum will mostly be
> >> >> reflecting the state of the world before a change (new task queued or
> >> >> moved a task to a different cpu). If you had have two tasks continuously
> >> >> on one cpu and your other cpu is idle, and you move one of the tasks to
> >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> >> >> first cpu while it starts from 0 on the other one. 10 ms later it will
> >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> >> >> and we might decide to put more tasks on it because we don't know if
> >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> >> >> task) or if it will continue to rise and eventually get to 47742.
> >> >
> >> > Ah, no, since we track per task, and update the per-cpu ones when we
> >> > migrate tasks, the per-cpu values should be instantly updated.
> >> >
> >> > If we were to increase per task storage, we might as well also track
> >> > running_avg not only runnable_avg.
> >>
> >> I agree that the removed running_avg should give more useful
> >> information about the the load of a CPU.
> >>
> >> The main issue with running_avg is that it's disturbed by other tasks
> >> (as point out previously). As a typical example,  if we have 2 tasks
> >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
> >> in the range of [100% - 50%] depending of the parallelism of the
> >> runtime of the tasks whereas the reality is 50% and the use of
> >> running_avg will return this value
> >
> > I'm not sure I see how 100% is possible, but yes I agree that runnable
> > can indeed be inflated due to this queueing effect.
> 
> In fact, it can be even worse than that because i forgot to take into
> account the geometric series effect which implies that it depends of
> the runtime (idletime) of the task
> 
> Take 3 examples:
> 
> 2 tasks that need to run 10ms  simultaneously each 40ms. If they share
> the same CPU, they will be on the runqueue 20ms (in fact a bit less
> for one of them), Their load (runnable_avg_sum/runnable_avg_period)
> will be 33% each so the unweighted runnable_load_avg of the CPU will
> be 66%

Right, it actually depends on how often you switch between the tasks.
If your sched_tick happens every 10 ms, then in this example one task
will run for 10 ms and be done, while the other one waits for 10 ms and
then runs to completion in 10 ms. The result is that one task is
runnable for 10 ms and the other one is runnable for 20 ms. That gives
you 25% and 50% for a total of 75%.
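
In plain numbers (a toy illustration of the interleaving just described,
ignoring the geometric decay):

  #include <stdio.h>

  int main(void)
  {
          /* Two tasks, each needing 10 ms of CPU every 40 ms, sharing one
           * CPU with a 10 ms quantum: A runs 0-10 ms and is done, B waits
           * 0-10 ms and runs 10-20 ms. */
          double period = 40.0;
          double runnable_a = 10.0;         /* runs immediately, then done */
          double runnable_b = 10.0 + 10.0;  /* waits for A, then runs      */

          printf("A: %.0f%%  B: %.0f%%  sum: %.0f%%\n",
                 100.0 * runnable_a / period,
                 100.0 * runnable_b / period,
                 100.0 * (runnable_a + runnable_b) / period);
          return 0;
  }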

> 
> 2 tasks that need to run 25ms simultaneously each 100ms. If they share
> the same CPU, they will be on the runqueue 50ms (in fact a bit less
> for one of them), Their load (runnable_avg_sum/runnable_avg_period)
> will be 74% each so the unweighted runnable_load_avg of the CPU will
> be 148%
> 
> 2 tasks that need to run 50ms  simultaneously each 200ms. If they
> share the same CPU, they will be on the runqueue 100ms (in fact a bit
> less for one of them), Their load
> (runnable_avg_sum/runnable_avg_period) will be 89% each so the
> unweighted runnable_load_avg of the CPU will be 180%

In this case you switch between the tasks before they complete, so both
tasks spend time waiting and runnable will get much higher than 75%.
You are right. I was thinking about tasks where the busy period is short
enough that they run to completion when they are scheduled.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04  9:32                 ` Vincent Guittot
@ 2014-06-04 10:17                   ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-06-04 10:17 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
> On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
> >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> >> >> Since we may do periodic load-balance every 10 ms or so, we will perform
> >> >> a number of load-balances where runnable_avg_sum will mostly be
> >> >> reflecting the state of the world before a change (new task queued or
> >> >> moved a task to a different cpu). If you had have two tasks continuously
> >> >> on one cpu and your other cpu is idle, and you move one of the tasks to
> >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> >> >> first cpu while it starts from 0 on the other one. 10 ms later it will
> >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> >> >> and we might decide to put more tasks on it because we don't know if
> >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> >> >> task) or if it will continue to rise and eventually get to 47742.
> >> >
> >> > Ah, no, since we track per task, and update the per-cpu ones when we
> >> > migrate tasks, the per-cpu values should be instantly updated.
> >> >
> >> > If we were to increase per task storage, we might as well also track
> >> > running_avg not only runnable_avg.
> >>
> >> I agree that the removed running_avg should give more useful
> >> information about the the load of a CPU.
> >>
> >> The main issue with running_avg is that it's disturbed by other tasks
> >> (as point out previously). As a typical example,  if we have 2 tasks
> >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
> >> in the range of [100% - 50%] depending of the parallelism of the
> >> runtime of the tasks whereas the reality is 50% and the use of
> >> running_avg will return this value
> >
> > I'm not sure I see how 100% is possible, but yes I agree that runnable
> > can indeed be inflated due to this queueing effect.
 
Let me explain the 75%: take any one of the above scenarios. Let's call
the two tasks A and B, and let's for a moment assume A always wins and
runs first, and then B.

So A will be runnable for 25%, B otoh will be runnable the entire time A
is actually running plus its own running time, giving 50%. Together that
makes 75%.

If you release the assumption that A runs first, but instead assume they
equally win the first execution, you get them averaging at 37.5% each,
which combined will still give 75%.

> In fact, it can be even worse than that because i forgot to take into
> account the geometric series effect which implies that it depends of
> the runtime (idletime) of the task
> 
> Take 3 examples:
> 
> 2 tasks that need to run 10ms  simultaneously each 40ms. If they share
> the same CPU, they will be on the runqueue 20ms (in fact a bit less
> for one of them), Their load (runnable_avg_sum/runnable_avg_period)
> will be 33% each so the unweighted runnable_load_avg of the CPU will
> be 66%
> 
> 2 tasks that need to run 25ms simultaneously each 100ms. If they share
> the same CPU, they will be on the runqueue 50ms (in fact a bit less
> for one of them), Their load (runnable_avg_sum/runnable_avg_period)
> will be 74% each so the unweighted runnable_load_avg of the CPU will
> be 148%
> 
> 2 tasks that need to run 50ms  simultaneously each 200ms. If they
> share the same CPU, they will be on the runqueue 100ms (in fact a bit
> less for one of them), Their load
> (runnable_avg_sum/runnable_avg_period) will be 89% each so the
> unweighted runnable_load_avg of the CPU will be 180%

And this is because the running time is 'large' compared to the decay,
so we get hit by the weight of the recent state? Yes, I can see that;
the avg will fluctuate due to the nature of this thing.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04  9:35                     ` Vincent Guittot
@ 2014-06-04 10:25                       ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-04 10:25 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, Jun 04, 2014 at 10:35:15AM +0100, Vincent Guittot wrote:
> On 4 June 2014 11:23, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Wed, Jun 04, 2014 at 09:55:42AM +0100, Morten Rasmussen wrote:
> >> Both running_avg and runnable_avg are affected by other tasks on the
> >> same cpus, but in different ways. They are equal if you only have one
> >> task on a cpu. If you have more, running_avg will give you the true
> >> requirement of the tasks until the cpu is fully utilized. At which point
> >> the task running_avg will drop if you add more tasks (the unweighted sum
> >> of task running_avgs remains constant).
> >>
> >> runnable_avg on the other hand, might be affected as soon as you have
> >> two task running on the same cpu if they are runnable at the same time.
> >> That isn't necessarily a bad thing for load-balancing purposes, because
> >> tasks that are runnable at the same time are likely to be run more
> >> efficiently by placing them on different cpus. You might view as at sort
> >> of built in concurrency factor, somewhat similar to what Yuyang is
> >> proposing. runnable_avg increases rapidly when the cpu is over-utilized.
> >
> > Agreed.
> >
> >> > I'm not sure I see how 100% is possible, but yes I agree that runnable
> >> > can indeed be inflated due to this queueing effect.
> >>
> >> You should only be able to get to 75% worst case for runnable_avg for
> >> that example. The total running_avg is 50% no matter if the tasks
> >> overlaps or not.
> >
> > Yes, 75% is what I ended up with.
> 
> Can you explain how you reach 75% as it depends on the runtime and a
> runtime longer than 345ms will end to a 100% load whatever the
> idletime was previously ?

If the busy period of each task is short enough that the first one to run
runs to completion before the other task is scheduled, you get 25% and
50%. But as I said in my other reply, you can get higher task runnable
values if the tasks' busy period is long enough that you switch between
them before the first one completes.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04 10:17                   ` Peter Zijlstra
@ 2014-06-04 10:36                     ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-04 10:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
> On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
> > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
> > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform
> > >> >> a number of load-balances where runnable_avg_sum will mostly be
> > >> >> reflecting the state of the world before a change (new task queued or
> > >> >> moved a task to a different cpu). If you had have two tasks continuously
> > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to
> > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will
> > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> > >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> > >> >> and we might decide to put more tasks on it because we don't know if
> > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> > >> >> task) or if it will continue to rise and eventually get to 47742.
> > >> >
> > >> > Ah, no, since we track per task, and update the per-cpu ones when we
> > >> > migrate tasks, the per-cpu values should be instantly updated.
> > >> >
> > >> > If we were to increase per task storage, we might as well also track
> > >> > running_avg not only runnable_avg.
> > >>
> > >> I agree that the removed running_avg should give more useful
> > >> information about the the load of a CPU.
> > >>
> > >> The main issue with running_avg is that it's disturbed by other tasks
> > >> (as point out previously). As a typical example,  if we have 2 tasks
> > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
> > >> in the range of [100% - 50%] depending of the parallelism of the
> > >> runtime of the tasks whereas the reality is 50% and the use of
> > >> running_avg will return this value
> > >
> > > I'm not sure I see how 100% is possible, but yes I agree that runnable
> > > can indeed be inflated due to this queueing effect.
>  
> Let me explain the 75%, take any one of the above scenarios. Lets call
> the two tasks A and B, and let for a moment assume A always wins and
> runs first, and then B.
> 
> So A will be runnable for 25%, B otoh will be runnable the entire time A
> is actually running plus its own running time, giving 50%. Together that
> makes 75%.
> 
> If you release the assumption that A runs first, but instead assume they
> equally win the first execution, you get them averaging at 37.5% each,
> which combined will still give 75%.

But that is assuming that the first task gets to run its busy period to
completion. If it uses up its sched_slice and we switch to the other
task, they both get to wait.

For example, if the sched_slice is 5 ms and the busy period is 10 ms,
the execution pattern would be: A, B, A, B, idle, ... In that case A is
runnable for 15 ms and B is for 20 ms. Assuming that the overall period
is 40 ms, A's runnable is 37.5% and B's is 50%.


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04 10:36                     ` Morten Rasmussen
@ 2014-06-04 10:55                       ` Peter Zijlstra
  -1 siblings, 0 replies; 216+ messages in thread
From: Peter Zijlstra @ 2014-06-04 10:55 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, Jun 04, 2014 at 11:36:19AM +0100, Morten Rasmussen wrote:
> On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
> > Let me explain the 75%, take any one of the above scenarios. Lets call
> > the two tasks A and B, and let for a moment assume A always wins and
> > runs first, and then B.
> > 
> > So A will be runnable for 25%, B otoh will be runnable the entire time A
> > is actually running plus its own running time, giving 50%. Together that
> > makes 75%.
> > 
> > If you release the assumption that A runs first, but instead assume they
> > equally win the first execution, you get them averaging at 37.5% each,
> > which combined will still give 75%.
> 
> But that is assuming that the first task gets to run to completion of it
> busy period. If it uses up its sched_slice and we switch to the other
> tasks, they both get to wait.
> 
> For example, if the sched_slice is 5 ms and the busy period is 10 ms,
> the execution pattern would be: A, B, A, B, idle, ... In that case A is
> runnable for 15 ms and B is for 20 ms. Assuming that the overall period
> is 40 ms, the A runnable is 37.5% and B is 50%.

Indeed, with preemption added you can pull this out further; you can
then get infinitesimally close to 100% with this.
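
To see how far this can go, a throwaway sketch (assumptions: both tasks wake
at the same instant, each needs 10 ms of CPU every 40 ms, and they are
scheduled strict round robin with A running first) that prints the runnable
time fractions as the slice shrinks; slice = 10 ms is the run-to-completion
case and slice = 5 ms is the A, B, A, B, idle schedule quoted above:

#include <stdio.h>

/* Two tasks that each need busy_ms of CPU every period_ms, woken together
 * and scheduled round robin (A first) with a given slice.  Each task is
 * runnable from t = 0 until its own busy work is done. */
static void runnable_fraction(int busy_ms, int period_ms, int slice_ms)
{
        int left[2] = { busy_ms, busy_ms };
        int end[2] = { 0, 0 };
        int t = 0, cur = 0;

        while (left[0] || left[1]) {
                int run = left[cur] < slice_ms ? left[cur] : slice_ms;

                if (!run) {             /* cur is done, run the other one */
                        cur ^= 1;
                        continue;
                }
                t += run;
                left[cur] -= run;
                if (!left[cur])
                        end[cur] = t;   /* last moment cur is runnable */
                cur ^= 1;
        }
        printf("slice %2d ms: A %4.1f%%  B %4.1f%%  combined %5.1f%%\n",
               slice_ms, 100.0 * end[0] / period_ms, 100.0 * end[1] / period_ms,
               100.0 * (end[0] + end[1]) / period_ms);
}

int main(void)
{
        int slice;

        for (slice = 10; slice >= 1; slice /= 2)
                runnable_fraction(10, 40, slice);
        return 0;
}

With a 1 ms slice the combined runnable time fraction is already at 97.5%,
which is the point above: finer preemption pushes it towards 100%.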

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04 10:36                     ` Morten Rasmussen
@ 2014-06-04 11:07                       ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-06-04 11:07 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On 4 June 2014 12:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
>> On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
>> > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
>> > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
>> > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
>> > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
>> > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform
>> > >> >> a number of load-balances where runnable_avg_sum will mostly be
>> > >> >> reflecting the state of the world before a change (new task queued or
>> > >> >> moved a task to a different cpu). If you had have two tasks continuously
>> > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to
>> > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
>> > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will
>> > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
>> > >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
>> > >> >> and we might decide to put more tasks on it because we don't know if
>> > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
>> > >> >> task) or if it will continue to rise and eventually get to 47742.
>> > >> >
>> > >> > Ah, no, since we track per task, and update the per-cpu ones when we
>> > >> > migrate tasks, the per-cpu values should be instantly updated.
>> > >> >
>> > >> > If we were to increase per task storage, we might as well also track
>> > >> > running_avg not only runnable_avg.
>> > >>
>> > >> I agree that the removed running_avg should give more useful
>> > >> information about the the load of a CPU.
>> > >>
>> > >> The main issue with running_avg is that it's disturbed by other tasks
>> > >> (as point out previously). As a typical example,  if we have 2 tasks
>> > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
>> > >> in the range of [100% - 50%] depending of the parallelism of the
>> > >> runtime of the tasks whereas the reality is 50% and the use of
>> > >> running_avg will return this value
>> > >
>> > > I'm not sure I see how 100% is possible, but yes I agree that runnable
>> > > can indeed be inflated due to this queueing effect.
>>
>> Let me explain the 75%, take any one of the above scenarios. Lets call
>> the two tasks A and B, and let for a moment assume A always wins and
>> runs first, and then B.
>>
>> So A will be runnable for 25%, B otoh will be runnable the entire time A
>> is actually running plus its own running time, giving 50%. Together that
>> makes 75%.
>>
>> If you release the assumption that A runs first, but instead assume they
>> equally win the first execution, you get them averaging at 37.5% each,
>> which combined will still give 75%.
>
> But that is assuming that the first task gets to run to completion of it
> busy period. If it uses up its sched_slice and we switch to the other
> tasks, they both get to wait.
>
> For example, if the sched_slice is 5 ms and the busy period is 10 ms,
> the execution pattern would be: A, B, A, B, idle, ... In that case A is
> runnable for 15 ms and B is for 20 ms. Assuming that the overall period
> is 40 ms, the A runnable is 37.5% and B is 50%.

The exact values for your scheduling example above are:
A's runnable will be 47% and B's runnable will be 60% (unless I made a
mistake in my computation),
and the CPU's runnable will be 60% too.

Vincent

>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 04/11] sched: Allow all archs to set the power_orig
  2014-06-04  9:42         ` Dietmar Eggemann
@ 2014-06-04 11:15           ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-06-04 11:15 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

On 4 June 2014 11:42, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> [...]
>>> (1) We assume that the current way (update_cpu_power() calls
>>> arch_scale_freq_power() to get the avg power(freq) over the time period
>>> since the last call to arch_scale_freq_power()) is suitable
>>> for us. Do you have another opinion here?
>>
>> Using power (or power_freq as you mentioned below) is probably the
>> easiest and more straight forward solution. You can use it to scale
>> each element when updating entity runnable.
>> Nevertheless, I see to 2 potential issues:
>> - is power updated often enough to correctly follow the frequency
>> scaling ? we need to compare power update frequency with
>> runnable_avg_sum variation speed and the rate at which we will change
>> the CPU's frequency.
>> - the max value of runnable_avg_sum will be also scaled so a task
>> running on a CPU with less capacity could be seen as a "low" load even
>> if it's an always running tasks. So we need to find a way to reach the
>> max value for such situation
>
> I think I mixed two problems together here:
>
> Firstly, we need to scale cpu power in update_cpu_power() regarding
> uArch, frequency and rt/irq pressure.
> Here the freq related value we get back from arch_scale_freq_power(...,
> cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)).
>
> Secondly, to be able to scale the runnable avg sum of a sched entity
> (se->avg->runnable_avg_sum), we preferable have a coefficient
> representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most
> powerful cpu in the system) and another coefficient (avg freq over 'now

AFAICT, the coefficient representing uArch diffs is already taken into
account in power_freq thanks to scale_cpu, isn't it?

> - sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to
> be retrieved from the arch in __update_entity_runnable_avg().
>
>>> (2) Is the current layout of update_cpu_power() adequate for this, where
>>> we scale power_orig related to freq and then related to rt/(irq):
>>>
>>>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>>>   power = scale_rt(scale_freq(power_orig))
>>>
>>> or do we need an extra power_freq data member on the rq and do:
>>>
>>>   power_orig = scale_cpu(SCHED_POWER_SCALE)
>>>   power_freq = scale_freq(power_orig))
>>>   power = scale_rt(power_orig))
>>
>> do you really mean power = scale_rt(power_orig) or power=scale_rt(power_freq) ?
>
> No, I also think that power=scale_rt(power_freq) is correct.
>
>>> In other words, do we consider rt/(irq) pressure when calculating freq
>>> scale invariant task load or not?
>>
>> we should take power_freq which implies a new field
> [...]
>
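
For readers following along, a self-contained toy of the two layouts being
compared above. The scale_* helpers and the numbers in them are made-up
stand-ins, not the actual kernel scaling code; this is only meant to show
where power_orig, power_freq and power sit relative to each other:

#include <stdio.h>

#define SCHED_POWER_SCALE       1024UL

/* Made-up scaling factors: a CPU with ~60% of the biggest CPU's capacity,
 * currently running at 70% of its max frequency, with 10% of its time
 * eaten by rt/irq. */
static unsigned long scale_cpu(unsigned long power)  { return power * 60 / 100; }
static unsigned long scale_freq(unsigned long power) { return power * 70 / 100; }
static unsigned long scale_rt(unsigned long power)   { return power * 90 / 100; }

int main(void)
{
        unsigned long power_orig = scale_cpu(SCHED_POWER_SCALE);

        /* first layout: no extra field */
        unsigned long power_a = scale_rt(scale_freq(power_orig));

        /* second layout: keep power_freq as its own rq field so it can
         * also be used to scale the entity runnable averages */
        unsigned long power_freq = scale_freq(power_orig);
        unsigned long power_b = scale_rt(power_freq);

        printf("power_orig=%lu power_freq=%lu power=%lu (both layouts: %lu)\n",
               power_orig, power_freq, power_a, power_b);
        return 0;
}

Numerically the two layouts agree once power = scale_rt(power_freq); the only
difference is whether power_freq is kept around as a separate field.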

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 04/11] sched: Allow all archs to set the power_orig
  2014-06-03 14:02       ` Vincent Guittot
@ 2014-06-04 11:17         ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-04 11:17 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Tue, Jun 03, 2014 at 03:02:18PM +0100, Vincent Guittot wrote:
> On 3 June 2014 15:22, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote:
> >> power_orig is only changed for system with a SMT sched_domain level in order to
> >> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
> >> original capacity that is different from the default value.
> >>
> >> Create a more generic function arch_scale_cpu_power that can be also used by
> >> non SMT platform to set power_orig.
> >
> > I did a quick test of the patch set with adjusting cpu_power on
> > big.LITTLE (ARM TC2) to reflect the different compute capacities of the
> > A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and
> > without the patches applied, but with non-default cpu_powers.
> >
> > I didn't see any difference in the load-balance. Three tasks ended up on
> > the two A15s and two tasks ended up on two of the three A7s leaving one
> > unused in both cases.
> >
> > Using default cpu_power I get one task on each of the five cpus (best
> > throughput). Unless I messed something up, it seems that setting
> > cpu_power doesn't give me the best throughput with these patches
> > applied.
> 
> That's normal this patchset is necessary but not enough to solve the
> issue you mention. We also need to fix the way the imbalance is
> calculated for such situation. I have planned to push that in another
> patchset in order to not mix too much thing together

Based on the commit messages I was led to believe that this was a
self-contained patch set that also addressed issues related to handling
heterogeneous systems. Maybe it would be worth mentioning somewhere that
this set is only part of the solution?

It is a bit unclear to me how these changes, which appear to mainly
improve factoring rt and irq time into cpu_power, will solve the
cpu_power issues related to heterogeneous systems. Can you share your
plans for the follow up patch set? I think it would be better to review
the solution as a whole.

I absolutely agree that the imbalance calculation needs to be fixed, but I
don't think the current rq runnable_avg_sum is the right choice for that
purpose, for the reasons I pointed out in the other thread.

Morten

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04 11:07                       ` Vincent Guittot
@ 2014-06-04 11:23                         ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-04 11:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, Jun 04, 2014 at 12:07:29PM +0100, Vincent Guittot wrote:
> On 4 June 2014 12:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
> >> On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
> >> > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
> >> > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
> >> > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
> >> > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform
> >> > >> >> a number of load-balances where runnable_avg_sum will mostly be
> >> > >> >> reflecting the state of the world before a change (new task queued or
> >> > >> >> moved a task to a different cpu). If you had have two tasks continuously
> >> > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to
> >> > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
> >> > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will
> >> > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
> >> > >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
> >> > >> >> and we might decide to put more tasks on it because we don't know if
> >> > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
> >> > >> >> task) or if it will continue to rise and eventually get to 47742.
> >> > >> >
> >> > >> > Ah, no, since we track per task, and update the per-cpu ones when we
> >> > >> > migrate tasks, the per-cpu values should be instantly updated.
> >> > >> >
> >> > >> > If we were to increase per task storage, we might as well also track
> >> > >> > running_avg not only runnable_avg.
> >> > >>
> >> > >> I agree that the removed running_avg should give more useful
> >> > >> information about the the load of a CPU.
> >> > >>
> >> > >> The main issue with running_avg is that it's disturbed by other tasks
> >> > >> (as point out previously). As a typical example,  if we have 2 tasks
> >> > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
> >> > >> in the range of [100% - 50%] depending of the parallelism of the
> >> > >> runtime of the tasks whereas the reality is 50% and the use of
> >> > >> running_avg will return this value
> >> > >
> >> > > I'm not sure I see how 100% is possible, but yes I agree that runnable
> >> > > can indeed be inflated due to this queueing effect.
> >>
> >> Let me explain the 75%, take any one of the above scenarios. Lets call
> >> the two tasks A and B, and let for a moment assume A always wins and
> >> runs first, and then B.
> >>
> >> So A will be runnable for 25%, B otoh will be runnable the entire time A
> >> is actually running plus its own running time, giving 50%. Together that
> >> makes 75%.
> >>
> >> If you release the assumption that A runs first, but instead assume they
> >> equally win the first execution, you get them averaging at 37.5% each,
> >> which combined will still give 75%.
> >
> > But that is assuming that the first task gets to run to completion of it
> > busy period. If it uses up its sched_slice and we switch to the other
> > tasks, they both get to wait.
> >
> > For example, if the sched_slice is 5 ms and the busy period is 10 ms,
> > the execution pattern would be: A, B, A, B, idle, ... In that case A is
> > runnable for 15 ms and B is for 20 ms. Assuming that the overall period
> > is 40 ms, the A runnable is 37.5% and B is 50%.
> 
> The exact value for your scheduling example above is:
> A runnable will be 47% and B runnable will be 60% (unless i make a
> mistake in my computation)

I get:

A: 15/40 ms = 37.5%
B: 20/40 ms = 50%

Schedule:

   | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
A:   run     rq     run  ----------- sleeping -------------  run
B:   rq      run    rq    run   ---- sleeping -------------  rq

> and CPU runnable will be 60% too

rq->avg.runnable_avg_sum should be 50%. You have two tasks running for
20 ms every 40 ms.

Right?

Morten

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04 11:23                         ` Morten Rasmussen
@ 2014-06-04 11:52                           ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-06-04 11:52 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On 4 June 2014 13:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, Jun 04, 2014 at 12:07:29PM +0100, Vincent Guittot wrote:
>> On 4 June 2014 12:36, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Wed, Jun 04, 2014 at 11:17:24AM +0100, Peter Zijlstra wrote:
>> >> On Wed, Jun 04, 2014 at 11:32:10AM +0200, Vincent Guittot wrote:
>> >> > On 4 June 2014 10:08, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> > > On Wed, Jun 04, 2014 at 09:47:26AM +0200, Vincent Guittot wrote:
>> >> > >> On 3 June 2014 17:50, Peter Zijlstra <peterz@infradead.org> wrote:
>> >> > >> > On Wed, May 28, 2014 at 04:47:03PM +0100, Morten Rasmussen wrote:
>> >> > >> >> Since we may do periodic load-balance every 10 ms or so, we will perform
>> >> > >> >> a number of load-balances where runnable_avg_sum will mostly be
>> >> > >> >> reflecting the state of the world before a change (new task queued or
>> >> > >> >> moved a task to a different cpu). If you had have two tasks continuously
>> >> > >> >> on one cpu and your other cpu is idle, and you move one of the tasks to
>> >> > >> >> the other cpu, runnable_avg_sum will remain unchanged, 47742, on the
>> >> > >> >> first cpu while it starts from 0 on the other one. 10 ms later it will
>> >> > >> >> have increased a bit, 32 ms later it will be 47742/2, and 345 ms later
>> >> > >> >> it reaches 47742. In the mean time the cpu doesn't appear fully utilized
>> >> > >> >> and we might decide to put more tasks on it because we don't know if
>> >> > >> >> runnable_avg_sum represents a partially utilized cpu (for example a 50%
>> >> > >> >> task) or if it will continue to rise and eventually get to 47742.
>> >> > >> >
>> >> > >> > Ah, no, since we track per task, and update the per-cpu ones when we
>> >> > >> > migrate tasks, the per-cpu values should be instantly updated.
>> >> > >> >
>> >> > >> > If we were to increase per task storage, we might as well also track
>> >> > >> > running_avg not only runnable_avg.
>> >> > >>
>> >> > >> I agree that the removed running_avg should give more useful
>> >> > >> information about the the load of a CPU.
>> >> > >>
>> >> > >> The main issue with running_avg is that it's disturbed by other tasks
>> >> > >> (as point out previously). As a typical example,  if we have 2 tasks
>> >> > >> with a load of 25% on 1 CPU, the unweighted runnable_load_avg will be
>> >> > >> in the range of [100% - 50%] depending of the parallelism of the
>> >> > >> runtime of the tasks whereas the reality is 50% and the use of
>> >> > >> running_avg will return this value
>> >> > >
>> >> > > I'm not sure I see how 100% is possible, but yes I agree that runnable
>> >> > > can indeed be inflated due to this queueing effect.
>> >>
>> >> Let me explain the 75%, take any one of the above scenarios. Lets call
>> >> the two tasks A and B, and let for a moment assume A always wins and
>> >> runs first, and then B.
>> >>
>> >> So A will be runnable for 25%, B otoh will be runnable the entire time A
>> >> is actually running plus its own running time, giving 50%. Together that
>> >> makes 75%.
>> >>
>> >> If you release the assumption that A runs first, but instead assume they
>> >> equally win the first execution, you get them averaging at 37.5% each,
>> >> which combined will still give 75%.
>> >
>> > But that is assuming that the first task gets to run to completion of it
>> > busy period. If it uses up its sched_slice and we switch to the other
>> > tasks, they both get to wait.
>> >
>> > For example, if the sched_slice is 5 ms and the busy period is 10 ms,
>> > the execution pattern would be: A, B, A, B, idle, ... In that case A is
>> > runnable for 15 ms and B is for 20 ms. Assuming that the overall period
>> > is 40 ms, the A runnable is 37.5% and B is 50%.
>>
>> The exact value for your scheduling example above is:
>> A runnable will be 47% and B runnable will be 60% (unless i make a
>> mistake in my computation)
>
> I get:
>
> A: 15/40 ms = 37.5%
> B: 20/40 ms = 50%
>
> Schedule:
>
>    | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> A:   run     rq     run  ----------- sleeping -------------  run
> B:   rq      run    rq    run   ---- sleeping -------------  rq
>
>> and CPU runnable will be 60% too
>
> rq->avg.runnable_avg_sum should be 50%. You have two tasks running for
> 20 ms every 40 ms.
>
> Right?

OK, I see the misunderstanding.
It depends on what we mean by runnable: you take the % of time,
whereas I take runnable_avg_sum/runnable_avg_period.

So A is on_rq 15/40 ms = 37.5% of the time, which gives a
runnable_avg_sum/runnable_avg_period of 47%.
B is on_rq 20/40 ms = 50% of the time, which gives a
runnable_avg_sum/runnable_avg_period of 60%.
And the CPU has a task on its rq 20/40 ms = 50% of the time, which gives a
runnable_avg_sum/runnable_avg_period of 60%.
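
The two readings can be reproduced with a minimal user-space sketch of the
per-millisecond geometric decay behind runnable_avg_sum/runnable_avg_period
(my own illustration, not kernel code): every elapsed ms contributes 1024
while the entity is runnable, and both sum and period decay by y with
y^32 = 1/2. Replaying the 40 ms schedule above until it reaches steady
state and sampling just after each runnable window gives roughly 48% for A
at 15 ms and roughly 61% for B and the rq at 20 ms, in line with the
figures quoted here, while the plain time fractions stay at 37.5% and 50%.

#include <stdio.h>
#include <math.h>

#define PERIOD_MS	40

int main(void)
{
	const double y = pow(0.5, 1.0 / 32);	/* per-ms decay: y^32 = 0.5 */
	double sum_a = 0, sum_b = 0, sum_rq = 0, period = 0;

	for (int t = 0; t < 100 * PERIOD_MS; t++) {
		int ms = t % PERIOD_MS;
		int a_runnable = ms < 15;	/* A: running or on_rq during 0-15 ms */
		int b_runnable = ms < 20;	/* B: running or on_rq during 0-20 ms */

		sum_a  = sum_a  * y + (a_runnable ? 1024 : 0);
		sum_b  = sum_b  * y + (b_runnable ? 1024 : 0);
		sum_rq = sum_rq * y + ((a_runnable || b_runnable) ? 1024 : 0);
		period = period * y + 1024;

		/* sample the last cycle just after A's and B's windows */
		if (t >= 99 * PERIOD_MS && (ms == 14 || ms == 19))
			printf("t=%2d ms: A=%2.0f%% B=%2.0f%% rq=%2.0f%%\n",
			       ms + 1, 100 * sum_a / period,
			       100 * sum_b / period, 100 * sum_rq / period);
	}
	return 0;
}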

Vincent

>
> Morten

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04 11:52                           ` Vincent Guittot
@ 2014-06-04 13:09                             ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-04 13:09 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, Jun 04, 2014 at 12:52:46PM +0100, Vincent Guittot wrote:
> On 4 June 2014 13:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > I get:
> >
> > A: 15/40 ms = 37.5%
> > B: 20/40 ms = 50%
> >
> > Schedule:
> >
> >    | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> > A:   run     rq     run  ----------- sleeping -------------  run
> > B:   rq      run    rq    run   ---- sleeping -------------  rq
> >
> >> and CPU runnable will be 60% too
> >
> > rq->avg.runnable_avg_sum should be 50%. You have two tasks running for
> > 20 ms every 40 ms.
> >
> > Right?
> 
> ok, i see the misunderstood.
> it's depends of what we mean by runnable. You take the % of time
> whereas i take the runnable_avg_sum/period

Right. There is a difference.

> 
> so A is on_rq 15/40 ms = 37.5% of the time which gives a
> runnable_avg_sum/runnable_avg_period of 47%
> B is on_rq  20/40 ms = 50% of the time which gives a
> runnable_avg_sum/runnable_avg_period of 60%
> and CPU has a task on its rq 20/40ms = 50% of the time which gives a
> runnable_avg_sum/runnable_avg_period of 60%

Yes, that seems about right.

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 08/11] sched: get CPU's activity statistic
  2014-06-04 13:09                             ` Morten Rasmussen
@ 2014-06-04 13:23                               ` Morten Rasmussen
  -1 siblings, 0 replies; 216+ messages in thread
From: Morten Rasmussen @ 2014-06-04 13:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, linux-kernel, linux, linux-arm-kernel,
	preeti, efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On Wed, Jun 04, 2014 at 02:09:18PM +0100, Morten Rasmussen wrote:
> On Wed, Jun 04, 2014 at 12:52:46PM +0100, Vincent Guittot wrote:
> > On 4 June 2014 13:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > > I get:
> > >
> > > A: 15/40 ms = 37.5%
> > > B: 20/40 ms = 50%
> > >
> > > Schedule:
> > >
> > >    | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms | 5 ms |
> > > A:   run     rq     run  ----------- sleeping -------------  run
> > > B:   rq      run    rq    run   ---- sleeping -------------  rq
> > >
> > >> and CPU runnable will be 60% too
> > >
> > > rq->avg.runnable_avg_sum should be 50%. You have two tasks running for
> > > 20 ms every 40 ms.
> > >
> > > Right?
> > 
> > ok, i see the misunderstood.
> > it's depends of what we mean by runnable. You take the % of time
> > whereas i take the runnable_avg_sum/period
> 
> Right. There is a difference.
> 
> > 
> > so A is on_rq 15/40 ms = 37.5% of the time which gives a
> > runnable_avg_sum/runnable_avg_period of 47%
> > B is on_rq  20/40 ms = 50% of the time which gives a
> > runnable_avg_sum/runnable_avg_period of 60%
> > and CPU has a task on its rq 20/40ms = 50% of the time which gives a
> > runnable_avg_sum/runnable_avg_period of 60%
> 
> Yes, that seems about right.

If my calculations are right, cfs.runnable_load_avg would peak 15 ms
into the period at about 104%. After 20 ms, A has decayed more than B has
gained, so the sum is slightly lower.
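
For reference, the ~104% figure can also be checked in closed form (a
back-of-the-envelope sketch, not something from the patches): assume the
40 ms pattern has repeated long enough to reach steady state, model the
decay per whole millisecond with y = 2^{-1/32}, and sample just after A's
last runnable ms. With P the saturated runnable_avg_period,

\[
\frac{S_A(15)}{P} = \frac{1 - y^{15}}{1 - y^{40}} \approx 0.48,
\qquad
\frac{S_B(15)}{P} = \left(1 - y^{15}\right) + \frac{y^{35}\left(1 - y^{20}\right)}{1 - y^{40}} \approx 0.56,
\]

so the combined cfs.runnable_load_avg at 15 ms comes out around 104%, and
by 20 ms A's contribution has decayed slightly faster than B's has grown,
matching the description above.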

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 04/11] sched: Allow all archs to set the power_orig
  2014-06-04 11:15           ` Vincent Guittot
@ 2014-06-05  8:59             ` Dietmar Eggemann
  -1 siblings, 0 replies; 216+ messages in thread
From: Dietmar Eggemann @ 2014-06-05  8:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano

[...]
>> Firstly, we need to scale cpu power in update_cpu_power() regarding
>> uArch, frequency and rt/irq pressure.
>> Here the freq related value we get back from arch_scale_freq_power(...,
>> cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)).
>>
>> Secondly, to be able to scale the runnable avg sum of a sched entity
>> (se->avg->runnable_avg_sum), we preferable have a coefficient
>> representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most
>> powerful cpu in the system) and another coefficient (avg freq over 'now
> 
> AFAICT, the coefficient representing uArch diffs is already taken into
> account into power_freq thanks to scale_cpu, isn't it ?

True, but I can't see how the current signatures of
arch_scale_cpu_power() and arch_scale_freq_power() fit into this uArch-
and freq-invariant updating of se->avg->runnable_avg_sum.

>> - sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to
>> be retrieved from the arch in __update_entity_runnable_avg().
>>
[...]
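
One way to write down the two coefficients described above (my paraphrase
of the proposal, not a formula taken from the patch set) is that each delta
accumulated into se->avg->runnable_avg_sum would be scaled by a frequency
ratio averaged over the accumulation window and by a uArch ratio:

\[
\Delta_{scaled} = \Delta t \cdot
\frac{\bar f_{[\,last\_runnable\_update,\;now\,]}(cpu)}{f_{max}(cpu)} \cdot
\frac{cpu\_power\_orig(cpu)}{\max_i\, cpu\_power\_orig(i)}
\]

The open question in this sub-thread is which of the existing hooks,
arch_scale_freq_power() or arch_scale_cpu_power(), would supply which
factor, and at what point in __update_entity_runnable_avg().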


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 04/11] sched: Allow all archs to set the power_orig
  2014-06-04 11:17         ` Morten Rasmussen
@ 2014-06-06  7:01           ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-06-06  7:01 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	efault, nicolas.pitre, linaro-kernel, daniel.lezcano

On 4 June 2014 13:17, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Jun 03, 2014 at 03:02:18PM +0100, Vincent Guittot wrote:
>> On 3 June 2014 15:22, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > On Fri, May 23, 2014 at 04:52:58PM +0100, Vincent Guittot wrote:
>> >> power_orig is only changed for system with a SMT sched_domain level in order to
>> >> reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
>> >> original capacity that is different from the default value.
>> >>
>> >> Create a more generic function arch_scale_cpu_power that can be also used by
>> >> non SMT platform to set power_orig.
>> >
>> > I did a quick test of the patch set with adjusting cpu_power on
>> > big.LITTLE (ARM TC2) to reflect the different compute capacities of the
>> > A15s and A7s. I ran the sysbench cpu benchmark with 5 threads with and
>> > without the patches applied, but with non-default cpu_powers.
>> >
>> > I didn't see any difference in the load-balance. Three tasks ended up on
>> > the two A15s and two tasks ended up on two of the three A7s leaving one
>> > unused in both cases.
>> >
>> > Using default cpu_power I get one task on each of the five cpus (best
>> > throughput). Unless I messed something up, it seems that setting
>> > cpu_power doesn't give me the best throughput with these patches
>> > applied.
>>
>> That's normal this patchset is necessary but not enough to solve the
>> issue you mention. We also need to fix the way the imbalance is
>> calculated for such situation. I have planned to push that in another
>> patchset in order to not mix too much thing together
>
> Based on the commit messages I was just lead to believe that this was a
> self-contained patch set that also addressed issues related to handling
> heterogeneous systems. Maybe it would be worth mentioning that this set
> is only part of the solution somewhere?

So this patch addresses the issue of describing the CPU capacity of an
HMP system, but it doesn't solve the one-task-per-CPU issue on its own.

>
> It is a bit unclear to me how these changes, which appear to mainly
> improve factoring rt and irq time into cpu_power, will solve the
> cpu_power issues related to heterogeneous systems. Can you share your
> plans for the follow up patch set? I think it would be better to review
> the solution as a whole.

I plan to use load_per_task to calculate the imbalance to pull.

>
> I absolutely agree that the imbalance calculation needs to fixed, but I
> don't think the current rq runnable_avg_sum is the right choice for that
> purpose for the reasons I pointed out the in other thread.
>
> Morten

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 04/11] sched: Allow all archs to set the power_orig
  2014-06-05  8:59             ` Dietmar Eggemann
@ 2014-06-16  9:01               ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-06-16  9:01 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, linux-kernel, linux, linux-arm-kernel, preeti,
	Morten Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Mike Turquette

On 5 June 2014 10:59, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> [...]
>>> Firstly, we need to scale cpu power in update_cpu_power() regarding
>>> uArch, frequency and rt/irq pressure.
>>> Here the freq related value we get back from arch_scale_freq_power(...,
>>> cpu) could be an instantaneous value (curr_freq(cpu)/max_freq(cpu)).
>>>
>>> Secondly, to be able to scale the runnable avg sum of a sched entity
>>> (se->avg->runnable_avg_sum), we preferable have a coefficient
>>> representing uArch diffs (cpu_power_orig(cpu)/cpu_power_orig(most
>>> powerful cpu in the system) and another coefficient (avg freq over 'now
>>
>> AFAICT, the coefficient representing uArch diffs is already taken into
>> account into power_freq thanks to scale_cpu, isn't it ?
>
> True, but I can't see how the current signature of
> arch_scale_cpu_power() and arch_scale_freq_power() fit into this uArch
> and freq invariant updating of se->avg->runnable_avg_sum business.

My first assumption is that the update of rq->cpu_power (or
rq->cpu_power_freq, as we discussed earlier) is fast enough compared to
frequency scaling that it can be used to scale the load without major
error. In this case, you can use the cpu_power_orig and cpu_power fields.
The update period of cpu_power is equal to balance_interval, which is
initialized with the weight of the sched_domain (less than or equal to
4 ms for most ARM platforms). Most ARM platforms use 100 Hz jiffies, so
the update period will be aligned to 10 ms (1 tick).

If this update is not fast enough, the second possibility could be to
access the current frequency directly (or something that represents it).
This might become possible once cpufreq has been consolidated into the
scheduler, similarly to what is happening with cpuidle.
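
As a rough sketch of that second option (illustration only: reading cpufreq
state from scheduler context is exactly what is not possible before the
consolidation mentioned above, and SCHED_POWER_SCALE and
cpufreq_quick_get() are used here on the assumption that they are
reachable from this context), an instantaneous arch_scale_freq_power()
could look like:

#include <linux/cpufreq.h>
#include <linux/sched.h>

unsigned long arch_scale_freq_power(struct sched_domain *sd, int cpu)
{
	/* cpufreq_quick_get*() return kHz, or 0 when no data is available */
	unsigned long cur = cpufreq_quick_get(cpu);
	unsigned long max = cpufreq_quick_get_max(cpu);

	if (!cur || !max)
		return SCHED_POWER_SCALE;

	return cur * SCHED_POWER_SCALE / max;
}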

Vincent

>
>>> - sa->last_runnable_update'(cpu)/max_freq(cpu). This value would have to
>>> be retrieved from the arch in __update_entity_runnable_avg().
>>>
> [...]
>

^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 09/11] sched: test the cpu's capacity in wake affine
  2014-05-28 11:15       ` Vincent Guittot
@ 2014-11-24  0:34         ` Wanpeng Li
  -1 siblings, 0 replies; 216+ messages in thread
From: Wanpeng Li @ 2014-11-24  0:34 UTC (permalink / raw)
  To: Vincent Guittot, Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Russell King - ARM Linux, LAK,
	Preeti U Murthy, Morten Rasmussen, Mike Galbraith, Nicolas Pitre,
	linaro-kernel, Daniel Lezcano

Hi Vincent,
On 5/28/14, 7:15 PM, Vincent Guittot wrote:
> On 28 May 2014 12:58, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Fri, May 23, 2014 at 05:53:03PM +0200, Vincent Guittot wrote:
[snip]
>> Now I'm only struggling to understand the rationale for this, its got
>> LLC in there somewhere, but I'm failing to comprehend.
> Ah.. i have probably overestimated the fact that wake_affine was only
> done at MC or SMT level but after reading more deeply the glags
> configuration of all  sched_domain level , my assumption is not true.

At what level should wake_affine happen?

Regards,
Wanpeng Li

> So I need to test sd flag to make sure that their share their cache at
> this level


^ permalink raw reply	[flat|nested] 216+ messages in thread

* Re: [PATCH v2 09/11] sched: test the cpu's capacity in wake affine
  2014-11-24  0:34         ` Wanpeng Li
@ 2014-11-24 13:23           ` Vincent Guittot
  -1 siblings, 0 replies; 216+ messages in thread
From: Vincent Guittot @ 2014-11-24 13:23 UTC (permalink / raw)
  To: Wanpeng Li
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Preeti U Murthy, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano

On 24 November 2014 at 01:34, Wanpeng Li <kernellwp@gmail.com> wrote:
> Hi Vincent,
> On 5/28/14, 7:15 PM, Vincent Guittot wrote:
>>
>> On 28 May 2014 12:58, Peter Zijlstra <peterz@infradead.org> wrote:
>>>
>>> On Fri, May 23, 2014 at 05:53:03PM +0200, Vincent Guittot wrote:
>
> [snip]
>>>
>>> Now I'm only struggling to understand the rationale for this, its got
>>> LLC in there somewhere, but I'm failing to comprehend.
>>
>> Ah.. i have probably overestimated the fact that wake_affine was only
>> done at MC or SMT level but after reading more deeply the glags
>> configuration of all  sched_domain level , my assumption is not true.
>
>
> What's level the wake_affine should happen?

In the 1st version of this patch, it was proposed to test the cpu_capacity
of idle CPUs only if they share their cache.
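
In other words, the gate being discussed is an LLC-sharing test between the
two CPUs before looking at their capacity. A minimal sketch of that
condition (where exactly it sits in wake_affine() is the open question, so
treat the helper name as hypothetical):

#include <linux/sched.h>

/*
 * Only apply the cpu_capacity test when the two CPUs share their
 * last-level cache.  cpus_share_cache() compares the per-cpu sd_llc_id,
 * which is derived from the SD_SHARE_PKG_RESOURCES sched_domain flag,
 * i.e. the "sd flag" test mentioned earlier in the thread.
 */
static inline bool capacity_test_applies(int this_cpu, int prev_cpu)
{
	return cpus_share_cache(this_cpu, prev_cpu);
}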

Regards,
Vincent
>
> Regards,
> Wanpeng Li
>
>> So I need to test sd flag to make sure that their share their cache at
>> this level

^ permalink raw reply	[flat|nested] 216+ messages in thread

end of thread, other threads:[~2014-11-24 13:23 UTC | newest]

Thread overview: 216+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2014-05-23 15:52 [PATCH v2 00/11] sched: consolidation of cpu_power Vincent Guittot
2014-05-23 15:52 ` Vincent Guittot
2014-05-23 15:52 ` [PATCH v2 01/11] sched: fix imbalance flag reset Vincent Guittot
2014-05-23 15:52   ` Vincent Guittot
2014-05-25 10:33   ` Preeti U Murthy
2014-05-25 10:33     ` Preeti U Murthy
2014-05-26  7:49     ` Vincent Guittot
2014-05-26  7:49       ` Vincent Guittot
2014-05-26  9:16       ` Preeti U Murthy
2014-05-26  9:16         ` Preeti U Murthy
2014-05-26 10:14         ` Vincent Guittot
2014-05-26 10:14           ` Vincent Guittot
2014-05-23 15:52 ` [PATCH v2 02/11] sched: remove a wake_affine condition Vincent Guittot
2014-05-23 15:52   ` Vincent Guittot
2014-05-27 12:48   ` Peter Zijlstra
2014-05-27 12:48     ` Peter Zijlstra
2014-05-27 15:19     ` Vincent Guittot
2014-05-27 15:19       ` Vincent Guittot
2014-05-27 15:39       ` Peter Zijlstra
2014-05-27 15:39         ` Peter Zijlstra
2014-05-27 16:14         ` Vincent Guittot
2014-05-27 16:14           ` Vincent Guittot
2014-05-28  6:49           ` Vincent Guittot
2014-05-28  6:49             ` Vincent Guittot
2014-05-28 15:09             ` Dietmar Eggemann
2014-05-28 15:09               ` Dietmar Eggemann
2014-05-28 15:25               ` Vincent Guittot
2014-05-28 15:25                 ` Vincent Guittot
2014-05-27 13:43   ` Peter Zijlstra
2014-05-27 13:43     ` Peter Zijlstra
2014-05-27 13:45   ` Peter Zijlstra
2014-05-27 13:45     ` Peter Zijlstra
2014-05-27 15:20     ` Vincent Guittot
2014-05-27 15:20       ` Vincent Guittot
2014-05-23 15:52 ` [PATCH v2 03/11] sched: fix avg_load computation Vincent Guittot
2014-05-23 15:52   ` Vincent Guittot
2014-05-27 13:48   ` Peter Zijlstra
2014-05-27 13:48     ` Peter Zijlstra
2014-05-23 15:52 ` [PATCH v2 04/11] sched: Allow all archs to set the power_orig Vincent Guittot
2014-05-23 15:52   ` Vincent Guittot
2014-05-30 14:04   ` Dietmar Eggemann
2014-05-30 14:04     ` Dietmar Eggemann
2014-05-30 14:46     ` Peter Zijlstra
2014-05-30 14:46       ` Peter Zijlstra
2014-05-30 20:50     ` Vincent Guittot
2014-05-30 20:50       ` Vincent Guittot
2014-06-04  9:42       ` Dietmar Eggemann
2014-06-04  9:42         ` Dietmar Eggemann
2014-06-04 11:15         ` Vincent Guittot
2014-06-04 11:15           ` Vincent Guittot
2014-06-05  8:59           ` Dietmar Eggemann
2014-06-05  8:59             ` Dietmar Eggemann
2014-06-16  9:01             ` Vincent Guittot
2014-06-16  9:01               ` Vincent Guittot
2014-06-03 13:22   ` Morten Rasmussen
2014-06-03 13:22     ` Morten Rasmussen
2014-06-03 14:02     ` Vincent Guittot
2014-06-03 14:02       ` Vincent Guittot
2014-06-04 11:17       ` Morten Rasmussen
2014-06-04 11:17         ` Morten Rasmussen
2014-06-06  7:01         ` Vincent Guittot
2014-06-06  7:01           ` Vincent Guittot
2014-05-23 15:52 ` [PATCH v2 05/11] ARM: topology: use new cpu_power interface Vincent Guittot
2014-05-23 15:52   ` Vincent Guittot
2014-05-25 13:22   ` Preeti U Murthy
2014-05-25 13:22     ` Preeti U Murthy
2014-05-26  8:25     ` Vincent Guittot
2014-05-26  8:25       ` Vincent Guittot
2014-05-26  9:19       ` Preeti U Murthy
2014-05-26  9:19         ` Preeti U Murthy
2014-05-23 15:53 ` [PATCH v2 06/11] sched: add per rq cpu_power_orig Vincent Guittot
2014-05-23 15:53   ` Vincent Guittot
2014-05-23 15:53 ` [PATCH v2 07/11] Revert "sched: Put rq's sched_avg under CONFIG_FAIR_GROUP_SCHED" Vincent Guittot
2014-05-23 15:53   ` Vincent Guittot
2014-05-23 15:53 ` [PATCH v2 08/11] sched: get CPU's activity statistic Vincent Guittot
2014-05-23 15:53   ` Vincent Guittot
2014-05-27 17:32   ` Peter Zijlstra
2014-05-27 17:32     ` Peter Zijlstra
2014-05-28  7:01     ` Vincent Guittot
2014-05-28  7:01       ` Vincent Guittot
2014-05-28 12:10   ` Morten Rasmussen
2014-05-28 12:10     ` Morten Rasmussen
2014-05-28 13:15     ` Vincent Guittot
2014-05-28 13:15       ` Vincent Guittot
2014-05-28 15:47       ` Morten Rasmussen
2014-05-28 15:47         ` Morten Rasmussen
2014-05-28 16:39         ` Vincent Guittot
2014-05-28 16:39           ` Vincent Guittot
2014-06-03 12:03           ` Morten Rasmussen
2014-06-03 12:03             ` Morten Rasmussen
2014-06-03 15:59             ` Peter Zijlstra
2014-06-03 15:59               ` Peter Zijlstra
2014-06-03 17:41               ` Morten Rasmussen
2014-06-03 17:41                 ` Morten Rasmussen
2014-06-03 15:50         ` Peter Zijlstra
2014-06-03 15:50           ` Peter Zijlstra
2014-06-03 17:20           ` Morten Rasmussen
2014-06-03 17:20             ` Morten Rasmussen
2014-06-04  7:47           ` Vincent Guittot
2014-06-04  7:47             ` Vincent Guittot
2014-06-04  8:08             ` Peter Zijlstra
2014-06-04  8:08               ` Peter Zijlstra
2014-06-04  8:55               ` Morten Rasmussen
2014-06-04  8:55                 ` Morten Rasmussen
2014-06-04  9:23                 ` Peter Zijlstra
2014-06-04  9:23                   ` Peter Zijlstra
2014-06-04  9:35                   ` Vincent Guittot
2014-06-04  9:35                     ` Vincent Guittot
2014-06-04 10:25                     ` Morten Rasmussen
2014-06-04 10:25                       ` Morten Rasmussen
2014-06-04  9:44                   ` Morten Rasmussen
2014-06-04  9:44                     ` Morten Rasmussen
2014-06-04  9:32               ` Vincent Guittot
2014-06-04  9:32                 ` Vincent Guittot
2014-06-04 10:00                 ` Morten Rasmussen
2014-06-04 10:00                   ` Morten Rasmussen
2014-06-04 10:17                 ` Peter Zijlstra
2014-06-04 10:17                   ` Peter Zijlstra
2014-06-04 10:36                   ` Morten Rasmussen
2014-06-04 10:36                     ` Morten Rasmussen
2014-06-04 10:55                     ` Peter Zijlstra
2014-06-04 10:55                       ` Peter Zijlstra
2014-06-04 11:07                     ` Vincent Guittot
2014-06-04 11:07                       ` Vincent Guittot
2014-06-04 11:23                       ` Morten Rasmussen
2014-06-04 11:23                         ` Morten Rasmussen
2014-06-04 11:52                         ` Vincent Guittot
2014-06-04 11:52                           ` Vincent Guittot
2014-06-04 13:09                           ` Morten Rasmussen
2014-06-04 13:09                             ` Morten Rasmussen
2014-06-04 13:23                             ` Morten Rasmussen
2014-06-04 13:23                               ` Morten Rasmussen
2014-05-28 15:17     ` Peter Zijlstra
2014-05-28 15:17       ` Peter Zijlstra
2014-06-03 15:40     ` Peter Zijlstra
2014-06-03 15:40       ` Peter Zijlstra
2014-06-03 17:16       ` Morten Rasmussen
2014-06-03 17:16         ` Morten Rasmussen
2014-06-03 17:37         ` Peter Zijlstra
2014-06-03 17:37           ` Peter Zijlstra
2014-06-03 17:39         ` Peter Zijlstra
2014-06-03 17:39           ` Peter Zijlstra
2014-06-03 23:11       ` Yuyang Du
2014-06-03 23:11         ` Yuyang Du
2014-05-30  9:50   ` Dietmar Eggemann
2014-05-30  9:50     ` Dietmar Eggemann
2014-05-30 19:20     ` Vincent Guittot
2014-05-30 19:20       ` Vincent Guittot
2014-06-01 11:33       ` Dietmar Eggemann
2014-06-01 11:33         ` Dietmar Eggemann
2014-06-02 14:07         ` Vincent Guittot
2014-06-02 14:07           ` Vincent Guittot
2014-05-23 15:53 ` [PATCH v2 09/11] sched: test the cpu's capacity in wake affine Vincent Guittot
2014-05-23 15:53   ` Vincent Guittot
2014-05-28 10:58   ` Peter Zijlstra
2014-05-28 10:58     ` Peter Zijlstra
2014-05-28 11:15     ` Vincent Guittot
2014-05-28 11:15       ` Vincent Guittot
2014-11-24  0:34       ` Wanpeng Li
2014-11-24  0:34         ` Wanpeng Li
2014-11-24 13:23         ` Vincent Guittot
2014-11-24 13:23           ` Vincent Guittot
2014-05-23 15:53 ` [PATCH v2 10/11] sched: move cfs task on a CPU with higher capacity Vincent Guittot
2014-05-23 15:53   ` Vincent Guittot
2014-05-29  9:50   ` Peter Zijlstra
2014-05-29  9:50     ` Peter Zijlstra
2014-05-29 19:37     ` Vincent Guittot
2014-05-29 19:37       ` Vincent Guittot
2014-05-30  6:29       ` Peter Zijlstra
2014-05-30  6:29         ` Peter Zijlstra
2014-05-30 20:05         ` Vincent Guittot
2014-05-30 20:05           ` Vincent Guittot
2014-06-02 17:06         ` Vincent Guittot
2014-06-02 17:06           ` Vincent Guittot
2014-06-03 11:15           ` Peter Zijlstra
2014-06-03 11:15             ` Peter Zijlstra
2014-06-03 12:31             ` Vincent Guittot
2014-06-03 12:31               ` Vincent Guittot
2014-05-29 14:04   ` Peter Zijlstra
2014-05-29 14:04     ` Peter Zijlstra
2014-05-29 19:44     ` Vincent Guittot
2014-05-29 19:44       ` Vincent Guittot
2014-05-30 13:26   ` Dietmar Eggemann
2014-05-30 13:26     ` Dietmar Eggemann
2014-05-30 19:24     ` Vincent Guittot
2014-05-30 19:24       ` Vincent Guittot
2014-05-30 19:45       ` Nicolas Pitre
2014-05-30 19:45         ` Nicolas Pitre
2014-05-30 20:07         ` Vincent Guittot
2014-05-30 20:07           ` Vincent Guittot
2014-05-23 15:53 ` [PATCH v2 11/11] sched: replace capacity by activity Vincent Guittot
2014-05-23 15:53   ` Vincent Guittot
2014-05-29 13:55   ` Peter Zijlstra
2014-05-29 13:55     ` Peter Zijlstra
2014-05-29 19:51     ` Vincent Guittot
2014-05-29 19:51       ` Vincent Guittot
2014-06-02  6:21     ` Preeti U Murthy
2014-06-02  6:21       ` Preeti U Murthy
2014-06-03  9:50       ` Vincent Guittot
2014-06-03  9:50         ` Vincent Guittot
2014-05-29 14:02   ` Peter Zijlstra
2014-05-29 14:02     ` Peter Zijlstra
2014-05-29 19:56     ` Vincent Guittot
2014-05-29 19:56       ` Vincent Guittot
2014-05-30  6:34       ` Peter Zijlstra
2014-05-30  6:34         ` Peter Zijlstra
2014-05-30 19:13         ` Vincent Guittot
2014-05-30 19:13           ` Vincent Guittot
2014-05-26  9:44 ` [PATCH v2 00/11] sched: consolidation of cpu_power Preeti U Murthy
2014-05-26  9:44   ` Preeti U Murthy
2014-05-26 10:04   ` Vincent Guittot
2014-05-26 10:04     ` Vincent Guittot
2014-05-26 15:54     ` Vincent Guittot
2014-05-26 15:54       ` Vincent Guittot
2014-05-27  5:47       ` Preeti U Murthy
2014-05-27  5:47         ` Preeti U Murthy
