* [PATCH v5 00/12] sched: consolidation of cpu_capacity
@ 2014-08-26 11:06 ` Vincent Guittot
  0 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-26 11:06 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

Part of this patchset was previously part of the larger tasks packing patchset
[1]. I have split the latter into (at least) 3 different patchsets to make the
review easier:
-configuration of sched_domain topology [2]
-update and consolidation of cpu_capacity (this patchset)
-tasks packing algorithm

SMT systems are no longer the only systems whose CPUs can have an original
capacity that differs from the default value. We need to extend the use of
(cpu_)capacity_orig to all kinds of platforms so that the scheduler has both
the maximum capacity (cpu_capacity_orig/capacity_orig) and the current
capacity (cpu_capacity/capacity) of CPUs and sched_groups. A new function,
arch_scale_cpu_capacity, has been created and replaces
arch_scale_smt_capacity, which was SMT-specific in its computation of a CPU's
capacity.

During load balancing, the scheduler evaluates the number of tasks that a
group of CPUs can handle. The current method assumes that tasks have a fixed
load of SCHED_LOAD_SCALE and that CPUs have a default capacity of
SCHED_CAPACITY_SCALE. This assumption leads to wrong decisions, creating ghost
cores or removing real ones, when the original capacity of the CPUs differs
from the default SCHED_CAPACITY_SCALE. We no longer try to evaluate the number
of available cores based on the group_capacity; instead, we detect when the
group is fully utilized.

Now that we have the original capacity of the CPUs and their
activity/utilization, we can evaluate more accurately the capacity and the
level of utilization of a group of CPUs.

This patchset mainly replaces the old capacity method with a new one and
keeps the policy almost unchanged, although we could certainly take advantage
of this new statistic in several other places of the load balancer.
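
As an illustration of the ghost/removed cores problem, here is a small
standalone sketch. It is not kernel code, and the Cortex-A7 capacity value
used below is only an assumption for the example; it compares the old
rounding of a group's capacity into a number of "cores" with a
utilization-based check of the kind introduced by this patchset:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL

/* old scheme: turn a group's capacity into a number of "cores" */
static unsigned long capacity_factor(unsigned long group_capacity)
{
	/* DIV_ROUND_CLOSEST(group_capacity, SCHED_CAPACITY_SCALE) */
	return (group_capacity + SCHED_CAPACITY_SCALE / 2) /
		SCHED_CAPACITY_SCALE;
}

/* new scheme: the group is fully utilized when its utilization reaches its
 * capacity, whatever the number of CPUs behind that capacity */
static int group_fully_utilized(unsigned long group_utilization,
				unsigned long group_capacity)
{
	return group_utilization >= group_capacity;
}

int main(void)
{
	/* a group of two Cortex-A7s with an assumed capacity_orig of 606 */
	unsigned long group_capacity = 2 * 606;

	/* the old method rounds 1212/1024 to 1 "core", hiding a real CPU */
	printf("capacity_factor = %lu\n", capacity_factor(group_capacity));

	/* with one always-running task (utilization ~606) the group is not
	 * fully utilized, so the balancer can still pull a second task */
	printf("fully utilized  = %d\n",
	       group_fully_utilized(606, group_capacity));
	return 0;
}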

Test results (done on v4; no test has been done on v5, which is only a
rebase). Below are the results of 4 kinds of tests:
- hackbench -l 500 -s 4096
- perf bench sched pipe -l 400000
- scp of 100MB file on the platform
- ebizzy with various number of threads
on 4 kernels:
- tip = tip/sched/core
- step1 = tip + patches(1-8)
- patchset = tip + whole patchset
- patchset+irq = tip + this patchset + irq accounting

Each test has been run 6 times; the figures below show the stdev and the
diff compared to the tip kernel.

Dual A7				 tip   |  +step1           |  +patchset        |  patchset+irq	
			         stdev |  results    stdev |  results    stdev |  results    stdev
hackbench (lower is better) (+/-)0.64% | -0.19% (+/-)0.73% |  0.58% (+/-)1.29% |  0.20% (+/-)1.00%
perf	(lower is better)   (+/-)0.28% |  1.22% (+/-)0.17% |  1.29% (+/-)0.06% |  2.85% (+/-)0.33%
scp			    (+/-)4.81% |  2.61% (+/-)0.28% |  2.39% (+/-)0.22% | 82.18% (+/-)3.30%
ebizzy	-t 1		    (+/-)2.31% | -1.32% (+/-)1.90% | -0.79% (+/-)2.88% |  3.10% (+/-)2.32%
ebizzy	-t 2		    (+/-)0.70% |  8.29% (+/-)6.66% |  1.93% (+/-)5.47% |  2.72% (+/-)5.72%
ebizzy	-t 4		    (+/-)3.54% |  5.57% (+/-)8.00% |  0.36% (+/-)9.00% |  2.53% (+/-)3.17%
ebizzy	-t 6		    (+/-)2.36% | -0.43% (+/-)3.29% | -1.93% (+/-)3.47% |  0.57% (+/-)0.75%
ebizzy	-t 8		    (+/-)1.65% | -0.45% (+/-)0.93% | -1.95% (+/-)1.52% | -1.18% (+/-)1.61%
ebizzy	-t 10		    (+/-)2.55% | -0.98% (+/-)3.06% | -1.18% (+/-)6.17% | -2.33% (+/-)3.28%
ebizzy	-t 12		    (+/-)6.22% |  0.17% (+/-)5.63% |  2.98% (+/-)7.11% |  1.19% (+/-)4.68%
ebizzy	-t 14		    (+/-)5.38% | -0.14% (+/-)5.33% |  2.49% (+/-)4.93% |  1.43% (+/-)6.55%
															
Quad A15			 tip	| +step1	    | +patchset         | patchset+irq		
			         stdev  | results     stdev | results    stdev  | results    stdev
hackbench (lower is better)  (+/-)0.78% |  0.87% (+/-)1.72% |  0.91% (+/-)2.02% |  3.30% (+/-)2.02%
perf	(lower is better)    (+/-)2.03% | -0.31% (+/-)0.76% | -2.38% (+/-)1.37% |  1.42% (+/-)3.14%
scp			     (+/-)0.04% |  0.51% (+/-)1.37% |  1.79% (+/-)0.84% |  1.77% (+/-)0.38%
ebizzy	-t 1		     (+/-)0.41% |  2.05% (+/-)0.38% |  2.08% (+/-)0.24% |  0.17% (+/-)0.62%
ebizzy	-t 2		     (+/-)0.78% |  0.60% (+/-)0.63% |  0.43% (+/-)0.48% |  1.61% (+/-)0.38%
ebizzy	-t 4		     (+/-)0.58% | -0.10% (+/-)0.97% | -0.65% (+/-)0.76% | -0.75% (+/-)0.86%
ebizzy	-t 6		     (+/-)0.31% |  1.07% (+/-)1.12% | -0.16% (+/-)0.87% | -0.76% (+/-)0.22%
ebizzy	-t 8		     (+/-)0.95% | -0.30% (+/-)0.85% | -0.79% (+/-)0.28% | -1.66% (+/-)0.21%
ebizzy	-t 10		     (+/-)0.31% |  0.04% (+/-)0.97% | -1.44% (+/-)1.54% | -0.55% (+/-)0.62%
ebizzy	-t 12		     (+/-)8.35% | -1.89% (+/-)7.64% |  0.75% (+/-)5.30% | -1.18% (+/-)8.16%
ebizzy	-t 14		    (+/-)13.17% |  6.22% (+/-)4.71% |  5.25% (+/-)9.14% |  5.87% (+/-)5.77%

I haven't been able to fully test the patchset on an SMT system to check that
the regression reported by Preeti has been solved, but the various tests that
I have done don't show any regression so far.
The correction of the SD_PREFER_SIBLING mode and its use at the SMT level
should have fixed the regression.

The usage_avg_contrib is based on the current implementation of load average
tracking. I also have a version of usage_avg_contrib that is based on the new
implementation [3], but I haven't provided the patches and results as [3] is
still under review. I can provide changes on top of [3] to adapt how
usage_avg_contrib is computed to the new mechanism.

Change since V4:
 - rebase to manage conflicts with changes in selection of busiest group [4]

Change since V3:
 - add usage_avg_contrib statistic which sums the running time of tasks on a rq
 - use usage_avg_contrib instead of runnable_avg_sum for cpu_utilization
 - fix the replacement of power by capacity
 - update some comments

Change since V2:
 - rebase on top of capacity renaming
 - fix wake_affine statistic update
 - rework nohz_kick_needed
 - optimize the active migration of a task from CPU with reduced capacity
 - rename group_activity to group_utilization and remove unused total_utilization
 - repair SD_PREFER_SIBLING and use it for SMT level
 - reorder patchset to gather patches with same topics

Change since V1:
 - add 3 fixes
 - correct some commit messages
 - replace capacity computation by activity
 - take into account current cpu capacity

[1] https://lkml.org/lkml/2013/10/18/121
[2] https://lkml.org/lkml/2014/3/19/377
[3] https://lkml.org/lkml/2014/7/18/110
[4] https://lkml.org/lkml/2014/7/25/589

Vincent Guittot (12):
  sched: fix imbalance flag reset
  sched: remove a wake_affine condition
  sched: fix avg_load computation
  sched: Allow all archs to set the capacity_orig
  ARM: topology: use new cpu_capacity interface
  sched: add per rq cpu_capacity_orig
  sched: test the cpu's capacity in wake affine
  sched: move cfs task on a CPU with higher capacity
  sched: add usage_load_avg
  sched: get CPU's utilization statistic
  sched: replace capacity_factor by utilization
  sched: add SD_PREFER_SIBLING for SMT level

 arch/arm/kernel/topology.c |   4 +-
 include/linux/sched.h      |   4 +-
 kernel/sched/core.c        |   3 +-
 kernel/sched/fair.c        | 356 ++++++++++++++++++++++++++-------------------
 kernel/sched/sched.h       |   3 +-
 5 files changed, 211 insertions(+), 159 deletions(-)

-- 
1.9.1


^ permalink raw reply	[flat|nested] 164+ messages in thread

* [PATCH v5 01/12] sched: fix imbalance flag reset
  2014-08-26 11:06 ` Vincent Guittot
@ 2014-08-26 11:06   ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-26 11:06 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

The imbalance flag can stay set even though there is no imbalance.

Let's assume that we have 3 tasks that run on a dual-core/dual-cluster
system. Some idle load balances will be triggered during the tick.
Unfortunately, the tick is also used to queue background work, so we can
reach the situation where a short work item has been queued on a CPU that
already runs a task. The load balance will detect this imbalance (2 tasks on
1 CPU and an idle CPU) and will try to pull the waiting task onto the idle
CPU. The waiting task is a worker thread that is pinned to a CPU, so an
imbalance due to pinned tasks is detected and the imbalance flag is set.
Then we will not be able to clear the flag because we have at most 1 task on
each CPU, but the imbalance flag will trigger useless active load balancing
between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced
state. If all tasks are pinned, we don't consider that a balanced state and
leave the imbalance flag set.

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d3427a8..085e853 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6765,10 +6765,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		if (sd_parent) {
 			int *group_imbalance = &sd_parent->groups->sgc->imbalance;
 
-			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
 				*group_imbalance = 1;
-			} else if (*group_imbalance)
-				*group_imbalance = 0;
 		}
 
 		/* All tasks on this runqueue were pinned by CPU affinity */
@@ -6779,7 +6777,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 				env.loop_break = sched_nr_migrate_break;
 				goto redo;
 			}
-			goto out_balanced;
+			goto out_all_pinned;
 		}
 	}
 
@@ -6853,6 +6851,23 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 	goto out;
 
 out_balanced:
+	/*
+	 * We reach balance although we may have faced some affinity
+	 * constraints. Clear the imbalance flag if it was set.
+	 */
+	if (sd_parent) {
+		int *group_imbalance = &sd_parent->groups->sgc->imbalance;
+
+		if (*group_imbalance)
+			*group_imbalance = 0;
+	}
+
+out_all_pinned:
+	/*
+	 * We reach balance because all tasks are pinned at this level so
+	 * we can't migrate them. Let the imbalance flag set so parent level
+	 * can try to migrate them.
+	 */
 	schedstat_inc(sd, lb_balanced[idle]);
 
 	sd->nr_balance_failed = 0;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH v5 02/12] sched: remove a wake_affine condition
  2014-08-26 11:06 ` Vincent Guittot
@ 2014-08-26 11:06   ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-26 11:06 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

I have tried to understand the meaning of the condition:
 (this_load <= load &&
  this_load + target_load(prev_cpu, idx) <= tl_per_task)
but I failed to find a use case that can take advantage of it, and I haven't
found a clear description in the previous commits' logs.
Furthermore, the comment on the condition refers to the task_hot function
that was used before being replaced by the current condition:
/*
 * This domain has SD_WAKE_AFFINE and
 * p is cache cold in this domain, and
 * there is no bad imbalance.
 */

If we look more deeply at the condition below:
 this_load + target_load(prev_cpu, idx) <= tl_per_task

when sync is clear, we have:
 tl_per_task = runnable_load_avg / nr_running
 this_load = max(runnable_load_avg, cpuload[idx])
 target_load = max(runnable_load_avg', cpuload'[idx])

This implies that runnable_load_avg' == 0 and nr_running <= 1 in order to
match the condition. It in turn implies that runnable_load_avg == 0 too,
because of the condition this_load <= load; but if this_load is null,
balanced is already set and the test is redundant.

If sync is set, it's not as straightforward as above (especially if cgroups
are involved), but the policy should be similar, as we have removed a task
that's going to sleep in order to get more accurate load and this_load
values.

The current conclusion is that these additional conditions don't give any
benefit, so we can remove them.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 30 ++++++------------------------
 1 file changed, 6 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 085e853..87b9dc7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4282,7 +4282,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
 	int idx, this_cpu, prev_cpu;
-	unsigned long tl_per_task;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -4340,32 +4339,15 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		balanced = this_eff_load <= prev_eff_load;
 	} else
 		balanced = true;
-
-	/*
-	 * If the currently running task will sleep within
-	 * a reasonable amount of time then attract this newly
-	 * woken task:
-	 */
-	if (sync && balanced)
-		return 1;
-
 	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
-	tl_per_task = cpu_avg_load_per_task(this_cpu);
 
-	if (balanced ||
-	    (this_load <= load &&
-	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
-		/*
-		 * This domain has SD_WAKE_AFFINE and
-		 * p is cache cold in this domain, and
-		 * there is no bad imbalance.
-		 */
-		schedstat_inc(sd, ttwu_move_affine);
-		schedstat_inc(p, se.statistics.nr_wakeups_affine);
+	if (!balanced)
+		return 0;
 
-		return 1;
-	}
-	return 0;
+	schedstat_inc(sd, ttwu_move_affine);
+	schedstat_inc(p, se.statistics.nr_wakeups_affine);
+
+	return 1;
 }
 
 /*
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH v5 03/12] sched: fix avg_load computation
  2014-08-26 11:06 ` Vincent Guittot
@ 2014-08-26 11:06   ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-26 11:06 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

The computation of avg_load and avg_load_per_task should only take into
account the number of CFS tasks: rq->nr_running also counts RT tasks,
whereas rq->cfs.h_nr_running only counts CFS tasks. Non-CFS tasks are
already taken into account by decreasing the CPU's capacity, and they will
be tracked in the CPU's utilization (group_utilization) added by the next
patches.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 87b9dc7..b85e9f7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu)
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+	unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
 	unsigned long load_avg = rq->cfs.runnable_load_avg;
 
 	if (nr_running)
@@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
-		sgs->sum_nr_running += rq->nr_running;
+		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		if (rq->nr_running > 1)
 			*overload = true;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH v5 04/12] sched: Allow all archs to set the capacity_orig
  2014-08-26 11:06 ` Vincent Guittot
@ 2014-08-26 11:06   ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-26 11:06 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

capacity_orig is only changed for systems with an SMT sched_domain level, in
order to reflect the lower capacity of their CPUs. Heterogeneous systems also
have to reflect an original capacity that is different from the default
value.

Create a more generic function, arch_scale_cpu_capacity, that can also be
used by non-SMT platforms to set capacity_orig.

The weak behavior of arch_scale_cpu_capacity is the previous SMT one, in
order to keep backward compatibility in the use of capacity_orig.

arch_scale_smt_capacity and default_scale_smt_capacity have been removed as
they were not used anywhere other than in arch_scale_cpu_capacity.
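
For illustration, an architecture override of the new weak function could
look like the sketch below. This is only a sketch: patch 05 does something
very similar for ARM, and the per-CPU cpu_scale variable here is an
assumption for the example.

#include <linux/percpu.h>
#include <linux/sched.h>

static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;

unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
	/* original capacity of this CPU, normalized to SCHED_CAPACITY_SCALE */
	return per_cpu(cpu_scale, cpu);
}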

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 25 ++++++-------------------
 1 file changed, 6 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b85e9f7..8176bda 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5705,19 +5705,12 @@ unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 	return default_scale_capacity(sd, cpu);
 }
 
-static unsigned long default_scale_smt_capacity(struct sched_domain *sd, int cpu)
+unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	unsigned long weight = sd->span_weight;
-	unsigned long smt_gain = sd->smt_gain;
+	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+		return sd->smt_gain / sd->span_weight;
 
-	smt_gain /= weight;
-
-	return smt_gain;
-}
-
-unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu)
-{
-	return default_scale_smt_capacity(sd, cpu);
+	return SCHED_CAPACITY_SCALE;
 }
 
 static unsigned long scale_rt_capacity(int cpu)
@@ -5756,18 +5749,12 @@ static unsigned long scale_rt_capacity(int cpu)
 
 static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	unsigned long weight = sd->span_weight;
 	unsigned long capacity = SCHED_CAPACITY_SCALE;
 	struct sched_group *sdg = sd->groups;
 
-	if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
-		if (sched_feat(ARCH_CAPACITY))
-			capacity *= arch_scale_smt_capacity(sd, cpu);
-		else
-			capacity *= default_scale_smt_capacity(sd, cpu);
+	capacity *= arch_scale_cpu_capacity(sd, cpu);
 
-		capacity >>= SCHED_CAPACITY_SHIFT;
-	}
+	capacity >>= SCHED_CAPACITY_SHIFT;
 
 	sdg->sgc->capacity_orig = capacity;
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH v5 05/12] ARM: topology: use new cpu_capacity interface
  2014-08-26 11:06 ` Vincent Guittot
@ 2014-08-26 11:06   ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-26 11:06 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

Use the new arch_scale_cpu_capacity to reflect the original capacity of a
CPU, instead of arch_scale_freq_capacity, which is more about scaling the
capacity with the current frequency.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 arch/arm/kernel/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index e35d880..89cfdd6 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale);
 
-unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
 	return per_cpu(cpu_scale, cpu);
 }
@@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
 	set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
 
 	printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n",
-		cpu, arch_scale_freq_capacity(NULL, cpu));
+		cpu, arch_scale_cpu_capacity(NULL, cpu));
 }
 
 #else
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH v5 06/12] sched: add per rq cpu_capacity_orig
  2014-08-26 11:06 ` Vincent Guittot
@ 2014-08-26 11:06   ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-26 11:06 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

The new field cpu_capacity_orig reflects the original capacity of a CPU,
unlike cpu_capacity, which reflects the current capacity that can be altered
by the frequency and by RT tasks.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 1 +
 kernel/sched/sched.h | 1 +
 3 files changed, 3 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a394f99..7c3b237 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7003,7 +7003,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
-		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+		rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8176bda..17c16cc 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5756,6 +5756,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
+	cpu_rq(cpu)->cpu_capacity_orig = capacity;
 	sdg->sgc->capacity_orig = capacity;
 
 	if (sched_feat(ARCH_CAPACITY))
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index aa0f73b..7c0a74e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -575,6 +575,7 @@ struct rq {
 	struct sched_domain *sd;
 
 	unsigned long cpu_capacity;
+	unsigned long cpu_capacity_orig;
 
 	unsigned char idle_balance;
 	/* For active balancing */
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH v5 07/12] sched: test the cpu's capacity in wake affine
  2014-08-26 11:06 ` Vincent Guittot
@ 2014-08-26 11:06   ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-26 11:06 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

Currently the task always wakes affine on this_cpu if the latter is idle.
Before waking up the task on this_cpu, we now check that this_cpu's capacity
is not significantly reduced because of RT tasks or IRQ activity.

Use cases where the number of IRQs and/or the time spent in IRQ context is
significant will benefit from this, because a task that is woken up by an
IRQ or softirq will not use the same CPU as the IRQ (and softirq) but an
idle one.
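
A rough sketch of the effect (illustrative only; the capacity values and
imbalance_pct = 125 are assumptions): even when this_load == 0, the
comparison is now weighted by each CPU's capacity, so an IRQ-loaded this_cpu
loses the comparison and the task is left on prev_cpu.

#include <stdio.h>

int main(void)
{
	/* assumed values: prev_cpu at full original capacity, this_cpu with
	 * ~30% of its capacity consumed by IRQs */
	unsigned long capacity_prev = 1024, capacity_this = 700;
	unsigned long imbalance_pct = 125;

	unsigned long this_eff_load = 100 * capacity_prev;
	unsigned long prev_eff_load = (100 + (imbalance_pct - 100) / 2) *
					capacity_this;

	/* with this_load == 0 the load terms are skipped, so the check is
	 * purely capacity based: 102400 <= 78400 is false, wake_affine()
	 * returns 0 and the task is not pulled onto the IRQ-loaded this_cpu */
	printf("balanced = %d\n", this_eff_load <= prev_eff_load);
	return 0;
}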

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 17c16cc..18db43e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4281,6 +4281,7 @@ static int wake_wide(struct task_struct *p)
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
+	s64 this_eff_load, prev_eff_load;
 	int idx, this_cpu, prev_cpu;
 	struct task_group *tg;
 	unsigned long weight;
@@ -4324,21 +4325,21 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	 * Otherwise check if either cpus are near enough in load to allow this
 	 * task to be woken on this_cpu.
 	 */
-	if (this_load > 0) {
-		s64 this_eff_load, prev_eff_load;
+	this_eff_load = 100;
+	this_eff_load *= capacity_of(prev_cpu);
+
+	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+	prev_eff_load *= capacity_of(this_cpu);
 
-		this_eff_load = 100;
-		this_eff_load *= capacity_of(prev_cpu);
+	if (this_load > 0) {
 		this_eff_load *= this_load +
 			effective_load(tg, this_cpu, weight, weight);
 
-		prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
-		prev_eff_load *= capacity_of(this_cpu);
 		prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+	}
+
+	balanced = this_eff_load <= prev_eff_load;
 
-		balanced = this_eff_load <= prev_eff_load;
-	} else
-		balanced = true;
 	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
 
 	if (!balanced)
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-08-26 11:06 ` Vincent Guittot
@ 2014-08-26 11:06   ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-26 11:06 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

If the CPU is used for handling a lot of IRQs, trigger a load balance to
check whether it's worth moving its tasks to another CPU that has more
capacity.

As a side note, this will not generate more spurious ilbs, because we
already trigger an ilb if there is more than 1 busy CPU. If this CPU is the
only one that has a task, we will trigger the ilb once to migrate the task.

The nohz_kick_needed function has been cleaned up a bit while adding the new
test.
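
As a minimal sketch of the new trigger (illustrative only; imbalance_pct =
125 is an assumed, typical value that depends on the sched_domain level),
the capacity test used in need_active_balance() and nohz_kick_needed() below
fires once more than ~20% of the original capacity is consumed by RT tasks
or IRQs:

#include <stdbool.h>

static bool capacity_significantly_reduced(unsigned long capacity_orig,
					   unsigned long capacity,
					   unsigned int imbalance_pct)
{
	return capacity_orig * 100 > capacity * imbalance_pct;
}

/*
 * capacity_significantly_reduced(1024, 800, 125): 102400 > 100000 is true,
 * so it is worth kicking a balance to move the CFS task to a CPU that has
 * more available capacity.
 */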

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++----------------
 1 file changed, 49 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 18db43e..60ae1ce 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 			return true;
 	}
 
+	/*
+	 * The group capacity is reduced probably because of activity from other
+	 * sched class or interrupts which use part of the available capacity
+	 */
+	if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
+				env->sd->imbalance_pct))
+		return true;
+
 	return false;
 }
 
@@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env)
 	struct sched_domain *sd = env->sd;
 
 	if (env->idle == CPU_NEWLY_IDLE) {
+		int src_cpu = env->src_cpu;
 
 		/*
 		 * ASYM_PACKING needs to force migrate tasks from busy but
 		 * higher numbered CPUs in order to pack all tasks in the
 		 * lowest numbered CPUs.
 		 */
-		if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
+		if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
+			return 1;
+
+		/*
+		 * If the CPUs share their cache and the src_cpu's capacity is
+		 * reduced because of other sched_class or IRQs, we trig an
+		 * active balance to move the task
+		 */
+		if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
+				sd->imbalance_pct))
 			return 1;
 	}
 
@@ -6643,6 +6661,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
 	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
 
+	env.src_cpu = busiest->cpu;
+
 	ld_moved = 0;
 	if (busiest->nr_running > 1) {
 		/*
@@ -6652,7 +6672,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		 * correctly treated as an imbalance.
 		 */
 		env.flags |= LBF_ALL_PINNED;
-		env.src_cpu   = busiest->cpu;
 		env.src_rq    = busiest;
 		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
@@ -7359,10 +7378,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 
 /*
  * Current heuristic for kicking the idle load balancer in the presence
- * of an idle cpu is the system.
+ * of an idle cpu in the system.
  *   - This rq has more than one task.
- *   - At any scheduler domain level, this cpu's scheduler group has multiple
- *     busy cpu's exceeding the group's capacity.
+ *   - This rq has at least one CFS task and the capacity of the CPU is
+ *     significantly reduced because of RT tasks or IRQs.
+ *   - At parent of LLC scheduler domain level, this cpu's scheduler group has
+ *     multiple busy cpu.
  *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
  *     domain span are idle.
  */
@@ -7372,9 +7393,10 @@ static inline int nohz_kick_needed(struct rq *rq)
 	struct sched_domain *sd;
 	struct sched_group_capacity *sgc;
 	int nr_busy, cpu = rq->cpu;
+	bool kick = false;
 
 	if (unlikely(rq->idle_balance))
-		return 0;
+		return false;
 
        /*
 	* We may be recently in ticked or tickless idle mode. At the first
@@ -7388,38 +7410,45 @@ static inline int nohz_kick_needed(struct rq *rq)
 	 * balancing.
 	 */
 	if (likely(!atomic_read(&nohz.nr_cpus)))
-		return 0;
+		return false;
 
 	if (time_before(now, nohz.next_balance))
-		return 0;
+		return false;
 
 	if (rq->nr_running >= 2)
-		goto need_kick;
+		return true;
 
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_busy, cpu));
-
 	if (sd) {
 		sgc = sd->groups->sgc;
 		nr_busy = atomic_read(&sgc->nr_busy_cpus);
 
-		if (nr_busy > 1)
-			goto need_kick_unlock;
+		if (nr_busy > 1) {
+			kick = true;
+			goto unlock;
+		}
+
 	}
 
-	sd = rcu_dereference(per_cpu(sd_asym, cpu));
+	sd = rcu_dereference(rq->sd);
+	if (sd) {
+		if ((rq->cfs.h_nr_running >= 1) &&
+				((rq->cpu_capacity * sd->imbalance_pct) <
+				(rq->cpu_capacity_orig * 100))) {
+			kick = true;
+			goto unlock;
+		}
+	}
 
+	sd = rcu_dereference(per_cpu(sd_asym, cpu));
 	if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
 				  sched_domain_span(sd)) < cpu))
-		goto need_kick_unlock;
+		kick = true;
 
+unlock:
 	rcu_read_unlock();
-	return 0;
-
-need_kick_unlock:
-	rcu_read_unlock();
-need_kick:
-	return 1;
+	return kick;
 }
 #else
 static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH v5 09/12] sched: add usage_load_avg
  2014-08-26 11:06 ` Vincent Guittot
@ 2014-08-26 11:06   ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-26 11:06 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

Add new statistics which reflect the average time a task is running on the
CPU, stored in the task's usage_avg_contrib, and the sum of these
contributions for the tasks of a runqueue, stored in usage_load_avg.

This patch is based on the usage metric that was proposed in the first
versions of the per-entity load tracking patchset but that had been removed
afterwards. This version differs from the original one in the sense that it
is not linked to task_group.

The rq's usage_load_avg will be used to check whether a rq is overloaded or
not, instead of trying to compute how many tasks a group of CPUs can handle.

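To give a rough feel for the metric, here is a small stand-alone model of the
running and period sums in plain user-space C; the 1024us period, the ~0.5
decay every 32 periods and the 25% duty cycle are assumptions made up for
this example, not values defined by this patch:

/* toy model of the decayed running average behind usage_avg_contrib */
#include <stdio.h>

#define PERIOD_US	1024
#define SCALE		1024	/* stands in for SCHED_LOAD_SCALE */

/* one period of decay: y ~= 0.9785 so that y^32 ~= 0.5 */
static unsigned long decay_one_period(unsigned long val)
{
	return val * 1002 / 1024;
}

int main(void)
{
	unsigned long running_sum = 0, period_sum = 0;
	int i;

	/* a task that runs one period out of every four (25% duty cycle) */
	for (i = 0; i < 128; i++) {
		int running = (i % 4) == 0;

		running_sum = decay_one_period(running_sum);
		period_sum = decay_one_period(period_sum);
		if (running)
			running_sum += PERIOD_US;
		period_sum += PERIOD_US;
	}

	/* same ratio as __update_task_entity_usage() in the patch below */
	printf("usage ~ %lu / %d\n",
	       running_sum * SCALE / (period_sum + 1), SCALE);
	return 0;
}

With these assumptions the printed usage settles close to a quarter of SCALE,
which is what a 25% duty cycle suggests.
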
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 include/linux/sched.h |  4 ++--
 kernel/sched/fair.c   | 47 ++++++++++++++++++++++++++++++++++++++++++-----
 kernel/sched/sched.h  |  2 +-
 3 files changed, 45 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 5c2c885..7dfd584 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1073,10 +1073,10 @@ struct sched_avg {
 	 * above by 1024/(1-y).  Thus we only need a u32 to store them for all
 	 * choices of y < 1-2^(-32)*1024.
 	 */
-	u32 runnable_avg_sum, runnable_avg_period;
+	u32 runnable_avg_sum, runnable_avg_period, running_avg_sum;
 	u64 last_runnable_update;
 	s64 decay_count;
-	unsigned long load_avg_contrib;
+	unsigned long load_avg_contrib, usage_avg_contrib;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 60ae1ce..1fd2131 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -676,7 +676,7 @@ void init_task_runnable_average(struct task_struct *p)
 
 	p->se.avg.decay_count = 0;
 	slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
-	p->se.avg.runnable_avg_sum = slice;
+	p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
 	p->se.avg.runnable_avg_period = slice;
 	__update_task_entity_contrib(&p->se);
 }
@@ -2289,7 +2289,8 @@ static u32 __compute_runnable_contrib(u64 n)
  */
 static __always_inline int __update_entity_runnable_avg(u64 now,
 							struct sched_avg *sa,
-							int runnable)
+							int runnable,
+							int running)
 {
 	u64 delta, periods;
 	u32 runnable_contrib;
@@ -2328,6 +2329,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		delta_w = 1024 - delta_w;
 		if (runnable)
 			sa->runnable_avg_sum += delta_w;
+		if (running)
+			sa->running_avg_sum += delta_w;
 		sa->runnable_avg_period += delta_w;
 
 		delta -= delta_w;
@@ -2338,6 +2341,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 
 		sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
 						  periods + 1);
+		sa->running_avg_sum = decay_load(sa->running_avg_sum,
+						  periods + 1);
 		sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
 						     periods + 1);
 
@@ -2345,12 +2350,16 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		runnable_contrib = __compute_runnable_contrib(periods);
 		if (runnable)
 			sa->runnable_avg_sum += runnable_contrib;
+		if (running)
+			sa->running_avg_sum += runnable_contrib;
 		sa->runnable_avg_period += runnable_contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
 	if (runnable)
 		sa->runnable_avg_sum += delta;
+	if (running)
+		sa->running_avg_sum += delta;
 	sa->runnable_avg_period += delta;
 
 	return decayed;
@@ -2490,6 +2499,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
 	return se->avg.load_avg_contrib - old_contrib;
 }
 
+
+static inline void __update_task_entity_usage(struct sched_entity *se)
+{
+	u32 contrib;
+
+	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
+	contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
+	contrib /= (se->avg.runnable_avg_period + 1);
+	se->avg.usage_avg_contrib = scale_load(contrib);
+}
+
+static long __update_entity_usage_avg_contrib(struct sched_entity *se)
+{
+	long old_contrib = se->avg.usage_avg_contrib;
+
+	if (entity_is_task(se))
+		__update_task_entity_usage(se);
+
+	return se->avg.usage_avg_contrib - old_contrib;
+}
+
 static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
 						 long load_contrib)
 {
@@ -2506,7 +2536,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	long contrib_delta;
+	long contrib_delta, usage_delta;
 	u64 now;
 
 	/*
@@ -2518,16 +2548,20 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
+	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+					cfs_rq->curr == se))
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
+	usage_delta = __update_entity_usage_avg_contrib(se);
 
 	if (!update_cfs_rq)
 		return;
 
-	if (se->on_rq)
+	if (se->on_rq) {
 		cfs_rq->runnable_load_avg += contrib_delta;
+		cfs_rq->usage_load_avg += usage_delta;
+	}
 	else
 		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
 }
@@ -2604,6 +2638,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	}
 
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+	cfs_rq->usage_load_avg += se->avg.usage_avg_contrib;
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
 }
@@ -2622,6 +2657,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 	update_cfs_rq_blocked_load(cfs_rq, !sleep);
 
 	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+	cfs_rq->usage_load_avg -= se->avg.usage_avg_contrib;
 	if (sleep) {
 		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
@@ -2959,6 +2995,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
+		update_entity_load_avg(se, 1);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7c0a74e..d625fbb 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -340,7 +340,7 @@ struct cfs_rq {
 	 * This allows for the description of both thread and group usage (in
 	 * the FAIR_GROUP_SCHED case).
 	 */
-	unsigned long runnable_load_avg, blocked_load_avg;
+	unsigned long runnable_load_avg, blocked_load_avg, usage_load_avg;
 	atomic64_t decay_counter;
 	u64 last_decay;
 	atomic_long_t removed_load;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH v5 10/12] sched: get CPU's utilization statistic
  2014-08-26 11:06 ` Vincent Guittot
@ 2014-08-26 11:06   ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-26 11:06 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

Monitor the utilization level of each group at each sched_domain level. The
utilization is the amount of cpu_capacity that is currently used on a CPU or
group of CPUs. We use the usage_load_avg to evaluate this utilization level.
In the special case where a CPU is fully loaded by more than one task, the
utilization is set above the cpu_capacity in order to reflect the overload of
the CPU.

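As a stand-alone illustration of that scaling, the computation of
get_cpu_utilization() boils down to the following plain C; the capacity of
430 is a made-up value for a small CPU, not taken from a real platform:

#include <stdio.h>

#define SCHED_LOAD_SHIFT	10
#define SCHED_LOAD_SCALE	(1UL << SCHED_LOAD_SHIFT)

/* same scaling as get_cpu_utilization() in the patch below */
static unsigned long cpu_utilization(unsigned long usage_load_avg,
				     unsigned long capacity)
{
	/* fully loaded by more than one task: report above the capacity */
	if (usage_load_avg >= SCHED_LOAD_SCALE)
		return capacity + 1;

	/* scale the usage (0..1024) by the CPU's current capacity */
	return (usage_load_avg * capacity) >> SCHED_LOAD_SHIFT;
}

int main(void)
{
	printf("%lu\n", cpu_utilization(512, 430));	/* 50% busy -> 215 */
	printf("%lu\n", cpu_utilization(1024, 430));	/* overloaded -> 431 */
	return 0;
}

Returning capacity + 1 once the usage reaches SCHED_LOAD_SCALE is what makes
a fully loaded CPU report a utilization above its capacity, as described
above.
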
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 19 ++++++++++++++++++-
 1 file changed, 18 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1fd2131..2f95d1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4126,6 +4126,11 @@ static unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }
 
+static unsigned long capacity_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -4514,6 +4519,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	return target;
 }
 
+static int get_cpu_utilization(int cpu)
+{
+	unsigned long usage = cpu_rq(cpu)->cfs.usage_load_avg;
+	unsigned long capacity = capacity_of(cpu);
+
+	if (usage >= SCHED_LOAD_SCALE)
+		return capacity + 1;
+
+	return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5657,6 +5673,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
+	unsigned long group_utilization; /* Total utilization of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
@@ -6011,6 +6028,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
+		sgs->group_utilization += get_cpu_utilization(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		if (rq->nr_running > 1)
@@ -6188,7 +6206,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
-
 		sg = sg->next;
 	} while (sg != env->sd->groups);
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-08-26 11:06 ` Vincent Guittot
@ 2014-08-26 11:06   ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-26 11:06 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and that a CPU's capacity is
SCHED_CAPACITY_SCALE.
Thanks to the rework of group_capacity_orig and the new group_utilization, we
now have a better idea of the capacity of a group of CPUs and of its level of
utilization, and we can deduce how much capacity is still available.

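A minimal user-space sketch of the new overload test, using assumed numbers
(a two-CPU group with capacity_orig 2048 and an imbalance_pct of 125, the
usual non-SMT default):

#include <stdio.h>

struct sg_stats {
	unsigned long group_capacity_orig;
	unsigned long group_utilization;
	unsigned int sum_nr_running;
	unsigned int group_weight;
};

/* mirrors the group_is_overloaded() test introduced below */
static int sg_is_overloaded(const struct sg_stats *sgs,
			    unsigned int imbalance_pct)
{
	/* not overloaded while each CPU has at most one task */
	if (sgs->sum_nr_running <= sgs->group_weight)
		return 0;

	/* overloaded once utilization gets within imbalance_pct of capacity */
	return (sgs->group_capacity_orig * 100) <
	       (sgs->group_utilization * imbalance_pct);
}

int main(void)
{
	struct sg_stats sgs = {
		.group_capacity_orig	= 2048,
		.group_utilization	= 1900,	/* ~93% of the group */
		.sum_nr_running		= 3,
		.group_weight		= 2,
	};

	/* 2048 * 100 = 204800 < 1900 * 125 = 237500 -> overloaded */
	printf("overloaded: %d\n", sg_is_overloaded(&sgs, 125));
	return 0;
}

With these numbers the group runs more tasks than CPUs and its utilization
exceeds capacity_orig * 100 / imbalance_pct (about 80% here), so it is
reported as overloaded instead of going through the old capacity_factor
arithmetic.
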
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 121 ++++++++++++++++++++++------------------------------
 1 file changed, 51 insertions(+), 70 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2f95d1c..80bd64e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5673,13 +5673,13 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
+	unsigned long group_capacity_orig;
 	unsigned long group_utilization; /* Total utilization of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
-	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_has_free_capacity;
+	int group_out_of_capacity;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -5901,31 +5901,6 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 }
 
 /*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
- */
-static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
-{
-	/*
-	 * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
-	 */
-	if (!(sd->flags & SD_SHARE_CPUCAPACITY))
-		return 0;
-
-	/*
-	 * If ~90% of the cpu_capacity is still there, we're good.
-	 */
-	if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
-		return 1;
-
-	return 0;
-}
-
-/*
  * Group imbalance indicates (and tries to solve) the problem where balancing
  * groups is inadequate due to tsk_cpus_allowed() constraints.
  *
@@ -5959,38 +5934,37 @@ static inline int sg_imbalanced(struct sched_group *group)
 	return group->sgc->imbalance;
 }
 
-/*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
- */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
+			struct lb_env *env)
 {
-	unsigned int capacity_factor, smt, cpus;
-	unsigned int capacity, capacity_orig;
+	if ((sgs->group_capacity_orig * 100) >
+			(sgs->group_utilization * env->sd->imbalance_pct))
+		return 1;
+
+	if (sgs->sum_nr_running < sgs->group_weight)
+		return 1;
 
-	capacity = group->sgc->capacity;
-	capacity_orig = group->sgc->capacity_orig;
-	cpus = group->group_weight;
+	return 0;
+}
 
-	/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
-	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
-	capacity_factor = cpus / smt; /* cores */
+static inline int group_is_overloaded(struct sg_lb_stats *sgs,
+			struct lb_env *env)
+{
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return 0;
 
-	capacity_factor = min_t(unsigned,
-		capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
-	if (!capacity_factor)
-		capacity_factor = fix_small_capacity(env->sd, group);
+	if ((sgs->group_capacity_orig * 100) <
+			(sgs->group_utilization * env->sd->imbalance_pct))
+		return 1;
 
-	return capacity_factor;
+	return 0;
 }
 
 static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+group_classify(struct sched_group *group, struct sg_lb_stats *sgs,
+			struct lb_env *env)
 {
-	if (sgs->sum_nr_running > sgs->group_capacity_factor)
+	if (group_is_overloaded(sgs, env))
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
@@ -6043,6 +6017,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			sgs->idle_cpus++;
 	}
 
+	sgs->group_capacity_orig = group->sgc->capacity_orig;
 	/* Adjust by relative CPU capacity of the group */
 	sgs->group_capacity = group->sgc->capacity;
 	sgs->avg_load = (sgs->group_load*SCHED_CAPACITY_SCALE) / sgs->group_capacity;
@@ -6051,11 +6026,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
 	sgs->group_weight = group->group_weight;
-	sgs->group_capacity_factor = sg_capacity_factor(env, group);
-	sgs->group_type = group_classify(group, sgs);
 
-	if (sgs->group_capacity_factor > sgs->sum_nr_running)
-		sgs->group_has_free_capacity = 1;
+	sgs->group_type = group_classify(group, sgs, env);
+
+	sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
 }
 
 /**
@@ -6185,17 +6159,21 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity factor to one so that we'll try
+		 * first, lower the sg capacity to one so that we'll try
 		 * and move all the excess tasks away. We lower the capacity
 		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks, i.e. nr_running < group_capacity_factor. The
+		 * these excess tasks, i.e. group_capacity > 0. The
 		 * extra check prevents the case where you always pull from the
 		 * heaviest group when it is already under-utilized (possible
 		 * with a large weight task outweighs the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    sds->local_stat.group_has_free_capacity)
-			sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
+		    group_has_free_capacity(&sds->local_stat, env)) {
+			if (sgs->sum_nr_running > 1)
+				sgs->group_out_of_capacity = 1;
+			sgs->group_capacity = min(sgs->group_capacity,
+						SCHED_CAPACITY_SCALE);
+		}
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
 			sds->busiest = sg;
@@ -6373,11 +6351,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity =
-			(busiest->sum_nr_running - busiest->group_capacity_factor);
-
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
-		load_above_capacity /= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running *
+					SCHED_LOAD_SCALE;
+		if (load_above_capacity > busiest->group_capacity)
+			load_above_capacity -= busiest->group_capacity;
+		else
+			load_above_capacity = ~0UL;
 	}
 
 	/*
@@ -6440,6 +6419,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
+	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -6460,8 +6440,9 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE &&
+			group_has_free_capacity(local, env) &&
+			busiest->group_out_of_capacity)
 		goto force_balance;
 
 	/*
@@ -6519,7 +6500,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long capacity, capacity_factor, wl;
+		unsigned long capacity, wl;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -6548,9 +6529,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;
 
 		capacity = capacity_of(i);
-		capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
-		if (!capacity_factor)
-			capacity_factor = fix_small_capacity(env->sd, group);
 
 		wl = weighted_cpuload(i);
 
@@ -6558,7 +6536,10 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu capacity.
 		 */
-		if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+		if (rq->nr_running == 1 && wl > env->imbalance &&
+		    ((capacity * env->sd->imbalance_pct) >=
+				(rq->cpu_capacity_orig * 100)))
 			continue;
 
 		/*
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [PATCH v5 12/12] sched: add SD_PREFER_SIBLING for SMT level
  2014-08-26 11:06 ` Vincent Guittot
@ 2014-08-26 11:06   ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-26 11:06 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

Add the SD_PREFER_SIBLING flag at the SMT level in order to ensure that
the scheduler will put at least one task per core.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 7c3b237..ab5ab60 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6093,6 +6093,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	 */
 
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
+		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 04/12] sched: Allow all archs to set the capacity_orig
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-08-27 13:12     ` Kamalesh Babulal
  -1 siblings, 0 replies; 164+ messages in thread
From: Kamalesh Babulal @ 2014-08-27 13:12 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

* Vincent Guittot <vincent.guittot@linaro.org> [2014-08-26 13:06:47]:

> capacity_orig is only changed for system with a SMT sched_domain level in order
> to reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
> original capacity that is different from the default value.
> 
> Create a more generic function arch_scale_cpu_capacity that can be also used by
> non SMT platform to set capacity_orig.
> 
> The weak behavior of arch_scale_cpu_capacity is the previous SMT one in order to
> keep backward compatibility in the use of capacity_orig.
> 
> arch_scale_smt_capacity and default_scale_smt_capacity have been removed as
> they were not use elsewhere than in arch_scale_cpu_capacity.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 06/12] sched: add per rq cpu_capacity_orig
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-08-27 13:32     ` Kamalesh Babulal
  -1 siblings, 0 replies; 164+ messages in thread
From: Kamalesh Babulal @ 2014-08-27 13:32 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

* Vincent Guittot <vincent.guittot@linaro.org> [2014-08-26 13:06:49]:

> This new field cpu_capacity_orig reflects the available capacity of a CPUs
> unlike the cpu_capacity which reflects the current capacity that can be altered
> by frequency and rt tasks.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 06/12] sched: add per rq cpu_capacity_orig
  2014-08-27 13:32     ` Kamalesh Babulal
@ 2014-08-28  7:34       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-08-28  7:34 UTC (permalink / raw)
  To: Kamalesh Babulal
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 27 August 2014 15:32, Kamalesh Babulal <kamalesh@linux.vnet.ibm.com> wrote:
> * Vincent Guittot <vincent.guittot@linaro.org> [2014-08-26 13:06:49]:
>
>> This new field cpu_capacity_orig reflects the available capacity of a CPUs
>> unlike the cpu_capacity which reflects the current capacity that can be altered
>> by frequency and rt tasks.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>

Thanks

>

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 03/12] sched: fix avg_load computation
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-08-30 12:00     ` Preeti U Murthy
  -1 siblings, 0 replies; 164+ messages in thread
From: Preeti U Murthy @ 2014-08-30 12:00 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

Hi Vincent,

On 08/26/2014 04:36 PM, Vincent Guittot wrote:
> The computation of avg_load and avg_load_per_task should only takes into
> account the number of cfs tasks. The non cfs task are already taken into
> account by decreasing the cpu's capacity and they will be tracked in the
> CPU's utilization (group_utilization) of the next patches
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 87b9dc7..b85e9f7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu)
>  static unsigned long cpu_avg_load_per_task(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> -	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> +	unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
>  	unsigned long load_avg = rq->cfs.runnable_load_avg;
> 
>  	if (nr_running)
> @@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  			load = source_load(i, load_idx);
> 
>  		sgs->group_load += load;
> -		sgs->sum_nr_running += rq->nr_running;
> +		sgs->sum_nr_running += rq->cfs.h_nr_running;
> 
>  		if (rq->nr_running > 1)
>  			*overload = true;
> 

Why do we probe rq->nr_running while we do load balancing? Should we not be
probing cfs_rq->nr_running instead? After all, we are interested in load
balancing fair tasks, right? The reason I ask is that I was wondering
whether we need to make similar changes in more places in load balancing.

To cite examples: the above check says a cpu is overloaded when
rq->nr_running > 1. However, if these tasks happen to be rt tasks, we would
not be able to load balance them anyway. So while looking through this
patch, I noticed this and wanted to cross-check whether we are checking
rq->nr_running on purpose in some places in load balancing; another example
being nohz_kick_needed().

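To make the difference concrete, here is a tiny user-space sketch with
made-up task counts showing how the two counters can disagree:

#include <stdio.h>

struct toy_rq {
	unsigned int nr_running;	/* tasks of all scheduling classes */
	unsigned int cfs_h_nr_running;	/* fair (CFS) tasks only */
};

int main(void)
{
	/* one CFS task plus one RT task queued on the same CPU */
	struct toy_rq rq = { .nr_running = 2, .cfs_h_nr_running = 1 };

	/* the rq->nr_running > 1 test sees this CPU as worth balancing
	 * for CFS even though there is only one fair task to pull */
	printf("all classes: %u, cfs only: %u\n",
	       rq.nr_running, rq.cfs_h_nr_running);
	return 0;
}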

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 04/12] sched: Allow all archs to set the capacity_orig
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-08-30 17:07     ` Preeti U Murthy
  -1 siblings, 0 replies; 164+ messages in thread
From: Preeti U Murthy @ 2014-08-30 17:07 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

Hi Vincent,

On 08/26/2014 04:36 PM, Vincent Guittot wrote:
> capacity_orig is only changed for system with a SMT sched_domain level in order

I think I had asked this before, but why only capacity_orig? The capacity
of a group is also being updated in the same way. This patch fixes the
capacity of a group to reflect the capacity of the heterogeneous CPUs in it,
meaning both the full capacity of the group (capacity_orig) and the capacity
available for the fair tasks. So I feel that in the subject as well as in
the changelog it would suffice to say 'capacity'.

> to reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
> original capacity that is different from the default value.
> 
> Create a more generic function arch_scale_cpu_capacity that can be also used by
> non SMT platform to set capacity_orig.
> 
> The weak behavior of arch_scale_cpu_capacity is the previous SMT one in order to
> keep backward compatibility in the use of capacity_orig.
> 
> arch_scale_smt_capacity and default_scale_smt_capacity have been removed as
> they were not use elsewhere than in arch_scale_cpu_capacity.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 25 ++++++-------------------
>  1 file changed, 6 insertions(+), 19 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b85e9f7..8176bda 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5705,19 +5705,12 @@ unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
>  	return default_scale_capacity(sd, cpu);
>  }
> 
> -static unsigned long default_scale_smt_capacity(struct sched_domain *sd, int cpu)
> +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>  {
> -	unsigned long weight = sd->span_weight;
> -	unsigned long smt_gain = sd->smt_gain;
> +	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
> +		return sd->smt_gain / sd->span_weight;
> 
> -	smt_gain /= weight;
> -
> -	return smt_gain;
> -}
> -
> -unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu)
> -{
> -	return default_scale_smt_capacity(sd, cpu);
> +	return SCHED_CAPACITY_SCALE;
>  }
> 
>  static unsigned long scale_rt_capacity(int cpu)
> @@ -5756,18 +5749,12 @@ static unsigned long scale_rt_capacity(int cpu)
> 
>  static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>  {
> -	unsigned long weight = sd->span_weight;
>  	unsigned long capacity = SCHED_CAPACITY_SCALE;
>  	struct sched_group *sdg = sd->groups;
> 
> -	if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
> -		if (sched_feat(ARCH_CAPACITY))

Aren't you missing this check above? I understand that it is not crucial,
but that would also mean removing the ARCH_CAPACITY sched_feat altogether,
wouldn't it?

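Just to illustrate what the new weak hook buys a heterogeneous platform, an
architecture could provide something like the sketch below; the cpu_scale
table is a made-up example (filled at boot, e.g. from DT) and is not part of
this patch:

/* hypothetical arch override of the weak hook discussed above */
static unsigned long cpu_scale[NR_CPUS];	/* e.g. 1024 for big, 430 for LITTLE */

unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
	return cpu_scale[cpu];
}
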
Regards
Preeti U Murthy
> -			capacity *= arch_scale_smt_capacity(sd, cpu);
> -		else
> -			capacity *= default_scale_smt_capacity(sd, cpu);
> +	capacity *= arch_scale_cpu_capacity(sd, cpu);
> 
> -		capacity >>= SCHED_CAPACITY_SHIFT;
> -	}
> +	capacity >>= SCHED_CAPACITY_SHIFT;
> 
>  	sdg->sgc->capacity_orig = capacity;
> 


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-08-30 17:50     ` Preeti U Murthy
  -1 siblings, 0 replies; 164+ messages in thread
From: Preeti U Murthy @ 2014-08-30 17:50 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

Hi Vincent,

On 08/26/2014 04:36 PM, Vincent Guittot wrote:
> If the CPU is used for handling lot of IRQs, trig a load balance to check if
> it's worth moving its tasks on another CPU that has more capacity.
> 
> As a sidenote, this will note generate more spurious ilb because we already
> trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that
> has a task, we will trig the ilb once for migrating the task.
> 
> The nohz_kick_needed function has been cleaned up a bit while adding the new
> test
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++----------------
>  1 file changed, 49 insertions(+), 20 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 18db43e..60ae1ce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  			return true;
>  	}
> 
> +	/*
> +	 * The group capacity is reduced probably because of activity from other
> +	 * sched class or interrupts which use part of the available capacity
> +	 */
> +	if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
> +				env->sd->imbalance_pct))

Wouldn't the check on avg_load let us know if we are packing more tasks
in this group than its capacity? Isn't that the metric we are more
interested in?

> +		return true;
> +
>  	return false;
>  }
> 
> @@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env)
>  	struct sched_domain *sd = env->sd;
> 
>  	if (env->idle == CPU_NEWLY_IDLE) {
> +		int src_cpu = env->src_cpu;
> 
>  		/*
>  		 * ASYM_PACKING needs to force migrate tasks from busy but
>  		 * higher numbered CPUs in order to pack all tasks in the
>  		 * lowest numbered CPUs.
>  		 */
> -		if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
> +		if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
> +			return 1;
> +
> +		/*
> +		 * If the CPUs share their cache and the src_cpu's capacity is
> +		 * reduced because of other sched_class or IRQs, we trig an
> +		 * active balance to move the task
> +		 */
> +		if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
> +				sd->imbalance_pct))
>  			return 1;
>  	}
> 
> @@ -6643,6 +6661,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
> 
>  	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
> 
> +	env.src_cpu = busiest->cpu;
> +
>  	ld_moved = 0;
>  	if (busiest->nr_running > 1) {
>  		/*
> @@ -6652,7 +6672,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  		 * correctly treated as an imbalance.
>  		 */
>  		env.flags |= LBF_ALL_PINNED;
> -		env.src_cpu   = busiest->cpu;
>  		env.src_rq    = busiest;
>  		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
> 
> @@ -7359,10 +7378,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
> 
>  /*
>   * Current heuristic for kicking the idle load balancer in the presence
> - * of an idle cpu is the system.
> + * of an idle cpu in the system.
>   *   - This rq has more than one task.
> - *   - At any scheduler domain level, this cpu's scheduler group has multiple
> - *     busy cpu's exceeding the group's capacity.
> + *   - This rq has at least one CFS task and the capacity of the CPU is
> + *     significantly reduced because of RT tasks or IRQs.
> + *   - At parent of LLC scheduler domain level, this cpu's scheduler group has
> + *     multiple busy cpu.
>   *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
>   *     domain span are idle.
>   */
> @@ -7372,9 +7393,10 @@ static inline int nohz_kick_needed(struct rq *rq)
>  	struct sched_domain *sd;
>  	struct sched_group_capacity *sgc;
>  	int nr_busy, cpu = rq->cpu;
> +	bool kick = false;
> 
>  	if (unlikely(rq->idle_balance))
> -		return 0;
> +		return false;
> 
>         /*
>  	* We may be recently in ticked or tickless idle mode. At the first
> @@ -7388,38 +7410,45 @@ static inline int nohz_kick_needed(struct rq *rq)
>  	 * balancing.
>  	 */
>  	if (likely(!atomic_read(&nohz.nr_cpus)))
> -		return 0;
> +		return false;
> 
>  	if (time_before(now, nohz.next_balance))
> -		return 0;
> +		return false;
> 
>  	if (rq->nr_running >= 2)

Will this check ^^ not catch those cases which this patch is targeting?

Regards
Preeti U Murthy

> -		goto need_kick;
> +		return true;
> 
>  	rcu_read_lock();
>  	sd = rcu_dereference(per_cpu(sd_busy, cpu));
> -
>  	if (sd) {
>  		sgc = sd->groups->sgc;
>  		nr_busy = atomic_read(&sgc->nr_busy_cpus);
> 
> -		if (nr_busy > 1)
> -			goto need_kick_unlock;
> +		if (nr_busy > 1) {
> +			kick = true;
> +			goto unlock;
> +		}
> +
>  	}
> 
> -	sd = rcu_dereference(per_cpu(sd_asym, cpu));
> +	sd = rcu_dereference(rq->sd);
> +	if (sd) {
> +		if ((rq->cfs.h_nr_running >= 1) &&
> +				((rq->cpu_capacity * sd->imbalance_pct) <
> +				(rq->cpu_capacity_orig * 100))) {
> +			kick = true;
> +			goto unlock;
> +		}
> +	}
> 
> +	sd = rcu_dereference(per_cpu(sd_asym, cpu));
>  	if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
>  				  sched_domain_span(sd)) < cpu))
> -		goto need_kick_unlock;
> +		kick = true;
> 
> +unlock:
>  	rcu_read_unlock();
> -	return 0;
> -
> -need_kick_unlock:
> -	rcu_read_unlock();
> -need_kick:
> -	return 1;
> +	return kick;
>  }
>  #else
>  static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
> 


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 04/12] sched: Allow all archs to set the capacity_orig
  2014-08-30 17:07     ` Preeti U Murthy
@ 2014-09-01  8:05       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-01  8:05 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 30 August 2014 19:07, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi Vincent,
>
> On 08/26/2014 04:36 PM, Vincent Guittot wrote:
>> capacity_orig is only changed for system with a SMT sched_domain level in order
>
> I think I had asked this before, but why only capacity_orig? The
> capacity of a group is also being updated the same way. This patch fixes
> the capacity of a group to reflect the capacity of the heterogeneous
> CPUs in it, this capacity being both the full capacity of the group:
> capacity_orig and the capacity available for the fair tasks. So I feel
> in the subject as well as the changelog it would suffice to say 'capacity'.

IIRC, we discussed that point on a former version. The patch changes
only the behavior of capacity_orig; the behavior of capacity stays
unchanged because all archs can already set the capacity, whereas
capacity_orig was configurable only if SD_SHARE_CPUCAPACITY was set
in the sched_domain. This is no longer the case with this patch,
which creates arch_scale_cpu_capacity for setting capacity_orig.
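
As an illustration of how an arch could use the new hook (the per-CPU
variable below is made up for the example, not the actual ARM
implementation), a strong definition of the function simply overrides
the __weak default:

/* illustrative per-CPU capacity, filled from DT/firmware at boot */
static DEFINE_PER_CPU(unsigned long, arch_cpu_capacity) = SCHED_CAPACITY_SCALE;

unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
	/* e.g. ~430 out of 1024 for a LITTLE core */
	return per_cpu(arch_cpu_capacity, cpu);
}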

>
>> to reflect the lower capacity of CPUs. Heterogenous system also have to reflect an
>> original capacity that is different from the default value.
>>
>> Create a more generic function arch_scale_cpu_capacity that can be also used by
>> non SMT platform to set capacity_orig.
>>
>> The weak behavior of arch_scale_cpu_capacity is the previous SMT one in order to
>> keep backward compatibility in the use of capacity_orig.
>>
>> arch_scale_smt_capacity and default_scale_smt_capacity have been removed as
>> they were not use elsewhere than in arch_scale_cpu_capacity.
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 25 ++++++-------------------
>>  1 file changed, 6 insertions(+), 19 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index b85e9f7..8176bda 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5705,19 +5705,12 @@ unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
>>       return default_scale_capacity(sd, cpu);
>>  }
>>
>> -static unsigned long default_scale_smt_capacity(struct sched_domain *sd, int cpu)
>> +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>>  {
>> -     unsigned long weight = sd->span_weight;
>> -     unsigned long smt_gain = sd->smt_gain;
>> +     if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
>> +             return sd->smt_gain / sd->span_weight;
>>
>> -     smt_gain /= weight;
>> -
>> -     return smt_gain;
>> -}
>> -
>> -unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu)
>> -{
>> -     return default_scale_smt_capacity(sd, cpu);
>> +     return SCHED_CAPACITY_SCALE;
>>  }
>>
>>  static unsigned long scale_rt_capacity(int cpu)
>> @@ -5756,18 +5749,12 @@ static unsigned long scale_rt_capacity(int cpu)
>>
>>  static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>>  {
>> -     unsigned long weight = sd->span_weight;
>>       unsigned long capacity = SCHED_CAPACITY_SCALE;
>>       struct sched_group *sdg = sd->groups;
>>
>> -     if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
>> -             if (sched_feat(ARCH_CAPACITY))
>
> Aren't you missing this check above? I understand that it is not
> crucial, but that would also mean removing ARCH_CAPACITY sched_feat
> altogether, wouldn't it?

Peter has proposed to remove all those checks and to keep only the
default behavior, because no arch uses arch_scale_smt_capacity.
Even then, ARCH_CAPACITY is still used in update_cpu_capacity.

Regards,
Vincent
>
> Regards
> Preeti U Murthy
>> -                     capacity *= arch_scale_smt_capacity(sd, cpu);
>> -             else
>> -                     capacity *= default_scale_smt_capacity(sd, cpu);
>> +     capacity *= arch_scale_cpu_capacity(sd, cpu);
>>
>> -             capacity >>= SCHED_CAPACITY_SHIFT;
>> -     }
>> +     capacity >>= SCHED_CAPACITY_SHIFT;
>>
>>       sdg->sgc->capacity_orig = capacity;
>>
>

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-08-30 17:50     ` Preeti U Murthy
@ 2014-09-01  8:45       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-01  8:45 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 30 August 2014 19:50, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi Vincent,
>
> On 08/26/2014 04:36 PM, Vincent Guittot wrote:
>> If the CPU is used for handling lot of IRQs, trig a load balance to check if
>> it's worth moving its tasks on another CPU that has more capacity.
>>
>> As a sidenote, this will note generate more spurious ilb because we already
>> trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that
>> has a task, we will trig the ilb once for migrating the task.
>>
>> The nohz_kick_needed function has been cleaned up a bit while adding the new
>> test
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 69 +++++++++++++++++++++++++++++++++++++----------------
>>  1 file changed, 49 insertions(+), 20 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 18db43e..60ae1ce 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>                       return true;
>>       }
>>
>> +     /*
>> +      * The group capacity is reduced probably because of activity from other
>> +      * sched class or interrupts which use part of the available capacity
>> +      */
>> +     if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
>> +                             env->sd->imbalance_pct))
>
> Wouldn't the check on avg_load let us know if we are packing more tasks
> in this group than its capacity ? Isn't that the metric we are more
> interested in?

With this patch, we don't want to pack; we prefer to spread the
task to a CPU other than the one handling the interrupts, if the
latter uses a significant part of the CPU capacity.
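
As a rough worked example (assuming the common default imbalance_pct
of 125 and capacity_orig = 1024): the check quoted above becomes
102400 > group_capacity * 125, which fires once group_capacity drops
below ~819, i.e. once IRQs or other sched classes eat more than about
20% of the group's capacity.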

>
>> +             return true;
>> +
>>       return false;
>>  }
>>
>> @@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env)
>>       struct sched_domain *sd = env->sd;
>>
>>       if (env->idle == CPU_NEWLY_IDLE) {
>> +             int src_cpu = env->src_cpu;
>>
>>               /*
>>                * ASYM_PACKING needs to force migrate tasks from busy but
>>                * higher numbered CPUs in order to pack all tasks in the
>>                * lowest numbered CPUs.
>>                */
>> -             if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
>> +             if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
>> +                     return 1;
>> +
>> +             /*
>> +              * If the CPUs share their cache and the src_cpu's capacity is
>> +              * reduced because of other sched_class or IRQs, we trig an
>> +              * active balance to move the task
>> +              */
>> +             if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
>> +                             sd->imbalance_pct))
>>                       return 1;
>>       }
>>
>> @@ -6643,6 +6661,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>
>>       schedstat_add(sd, lb_imbalance[idle], env.imbalance);
>>
>> +     env.src_cpu = busiest->cpu;
>> +
>>       ld_moved = 0;
>>       if (busiest->nr_running > 1) {
>>               /*
>> @@ -6652,7 +6672,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>                * correctly treated as an imbalance.
>>                */
>>               env.flags |= LBF_ALL_PINNED;
>> -             env.src_cpu   = busiest->cpu;
>>               env.src_rq    = busiest;
>>               env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
>>
>> @@ -7359,10 +7378,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>
>>  /*
>>   * Current heuristic for kicking the idle load balancer in the presence
>> - * of an idle cpu is the system.
>> + * of an idle cpu in the system.
>>   *   - This rq has more than one task.
>> - *   - At any scheduler domain level, this cpu's scheduler group has multiple
>> - *     busy cpu's exceeding the group's capacity.
>> + *   - This rq has at least one CFS task and the capacity of the CPU is
>> + *     significantly reduced because of RT tasks or IRQs.
>> + *   - At parent of LLC scheduler domain level, this cpu's scheduler group has
>> + *     multiple busy cpu.
>>   *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
>>   *     domain span are idle.
>>   */
>> @@ -7372,9 +7393,10 @@ static inline int nohz_kick_needed(struct rq *rq)
>>       struct sched_domain *sd;
>>       struct sched_group_capacity *sgc;
>>       int nr_busy, cpu = rq->cpu;
>> +     bool kick = false;
>>
>>       if (unlikely(rq->idle_balance))
>> -             return 0;
>> +             return false;
>>
>>         /*
>>       * We may be recently in ticked or tickless idle mode. At the first
>> @@ -7388,38 +7410,45 @@ static inline int nohz_kick_needed(struct rq *rq)
>>        * balancing.
>>        */
>>       if (likely(!atomic_read(&nohz.nr_cpus)))
>> -             return 0;
>> +             return false;
>>
>>       if (time_before(now, nohz.next_balance))
>> -             return 0;
>> +             return false;
>>
>>       if (rq->nr_running >= 2)
>
> Will this check ^^ not catch those cases which this patch is targeting?

This patch is not about how many tasks are running but about whether
the capacity of the CPU is reduced because of side activity like
interrupts, which is only visible in the capacity value (with the
IRQ_TIME_ACCOUNTING config) but not in nr_running.
Even if the capacity is reduced because of RT tasks, nothing ensures
that an RT task will be running when the tick fires.
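
To make that dependency explicit, here is a simplified sketch of the
mechanism relied on (not the exact scale_rt_capacity() code): RT time
and, with CONFIG_IRQ_TIME_ACCOUNTING, IRQ time accumulate in
rq->rt_avg, and only the remaining fraction of capacity_orig is
reported as cpu_capacity.

static unsigned long cfs_capacity_sketch(unsigned long capacity_orig,
					 unsigned long rt_irq_time,
					 unsigned long period)
{
	/* nearly everything is eaten by RT/IRQ activity */
	if (rt_irq_time >= period)
		return 1;

	/* fraction of the period left for CFS tasks */
	return capacity_orig * (period - rt_irq_time) / period;
}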

Regards,
Vincent
>
> Regards
> Preeti U Murthy
>
>> -             goto need_kick;
>> +             return true;
>>
>>       rcu_read_lock();
>>       sd = rcu_dereference(per_cpu(sd_busy, cpu));
>> -
>>       if (sd) {
>>               sgc = sd->groups->sgc;
>>               nr_busy = atomic_read(&sgc->nr_busy_cpus);
>>
>> -             if (nr_busy > 1)
>> -                     goto need_kick_unlock;
>> +             if (nr_busy > 1) {
>> +                     kick = true;
>> +                     goto unlock;
>> +             }
>> +
>>       }
>>
>> -     sd = rcu_dereference(per_cpu(sd_asym, cpu));
>> +     sd = rcu_dereference(rq->sd);
>> +     if (sd) {
>> +             if ((rq->cfs.h_nr_running >= 1) &&
>> +                             ((rq->cpu_capacity * sd->imbalance_pct) <
>> +                             (rq->cpu_capacity_orig * 100))) {
>> +                     kick = true;
>> +                     goto unlock;
>> +             }
>> +     }
>>
>> +     sd = rcu_dereference(per_cpu(sd_asym, cpu));
>>       if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
>>                                 sched_domain_span(sd)) < cpu))
>> -             goto need_kick_unlock;
>> +             kick = true;
>>
>> +unlock:
>>       rcu_read_unlock();
>> -     return 0;
>> -
>> -need_kick_unlock:
>> -     rcu_read_unlock();
>> -need_kick:
>> -     return 1;
>> +     return kick;
>>  }
>>  #else
>>  static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
>>
>

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 04/12] sched: Allow all archs to set the capacity_orig
  2014-09-01  8:05       ` Vincent Guittot
@ 2014-09-03  8:41         ` Preeti U Murthy
  -1 siblings, 0 replies; 164+ messages in thread
From: Preeti U Murthy @ 2014-09-03  8:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 09/01/2014 01:35 PM, Vincent Guittot wrote:
> On 30 August 2014 19:07, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> Hi Vincent,
>>
>> On 08/26/2014 04:36 PM, Vincent Guittot wrote:
>>> capacity_orig is only changed for system with a SMT sched_domain level in order
>>
>> I think I had asked this before, but why only capacity_orig? The
>> capacity of a group is also being updated the same way. This patch fixes
>> the capacity of a group to reflect the capacity of the heterogeneous
>> CPUs in it, this capacity being both the full capacity of the group:
>> capacity_orig and the capacity available for the fair tasks. So I feel
>> in the subject as well as the changelog it would suffice to say 'capacity'.
> 
> IIRC, we discussed that point on a former version. The patch changes
> only the behavior of capacity_orig but the behavior of capacity stays
> unchanged as all archs can already set the capacity whereas the
> capacity_orig was configurable only if the SD_SHARE_CPUCAPACITY was
> set in the sched_domain. This is no more the case with this patch
> which creates arch_scale_cpu_capacity for setting capacity_orig.

Yes, sorry I overlooked that.

Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-09-01  8:45       ` Vincent Guittot
@ 2014-09-03  9:11         ` Preeti U Murthy
  -1 siblings, 0 replies; 164+ messages in thread
From: Preeti U Murthy @ 2014-09-03  9:11 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 09/01/2014 02:15 PM, Vincent Guittot wrote:
> On 30 August 2014 19:50, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> Hi Vincent,
>>> index 18db43e..60ae1ce 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>                       return true;
>>>       }
>>>
>>> +     /*
>>> +      * The group capacity is reduced probably because of activity from other
>>> +      * sched class or interrupts which use part of the available capacity
>>> +      */
>>> +     if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
>>> +                             env->sd->imbalance_pct))
>>
>> Wouldn't the check on avg_load let us know if we are packing more tasks
>> in this group than its capacity ? Isn't that the metric we are more
>> interested in?
> 
> With  this patch, we don't want to pack but we prefer to spread the
> task on another CPU than the one which handles the interruption if
> latter uses a significant part of the CPU capacity.
> 
>>
>>> +             return true;
>>> +
>>>       return false;
>>>  }
>>>
>>> @@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env)
>>>       struct sched_domain *sd = env->sd;
>>>
>>>       if (env->idle == CPU_NEWLY_IDLE) {
>>> +             int src_cpu = env->src_cpu;
>>>
>>>               /*
>>>                * ASYM_PACKING needs to force migrate tasks from busy but
>>>                * higher numbered CPUs in order to pack all tasks in the
>>>                * lowest numbered CPUs.
>>>                */
>>> -             if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
>>> +             if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
>>> +                     return 1;
>>> +
>>> +             /*
>>> +              * If the CPUs share their cache and the src_cpu's capacity is
>>> +              * reduced because of other sched_class or IRQs, we trig an
>>> +              * active balance to move the task
>>> +              */
>>> +             if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
>>> +                             sd->imbalance_pct))
>>>                       return 1;
>>>       }
>>>
>>> @@ -6643,6 +6661,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>>
>>>       schedstat_add(sd, lb_imbalance[idle], env.imbalance);
>>>
>>> +     env.src_cpu = busiest->cpu;
>>> +
>>>       ld_moved = 0;
>>>       if (busiest->nr_running > 1) {
>>>               /*
>>> @@ -6652,7 +6672,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>>                * correctly treated as an imbalance.
>>>                */
>>>               env.flags |= LBF_ALL_PINNED;
>>> -             env.src_cpu   = busiest->cpu;
>>>               env.src_rq    = busiest;
>>>               env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
>>>
>>> @@ -7359,10 +7378,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>>
>>>  /*
>>>   * Current heuristic for kicking the idle load balancer in the presence
>>> - * of an idle cpu is the system.
>>> + * of an idle cpu in the system.
>>>   *   - This rq has more than one task.
>>> - *   - At any scheduler domain level, this cpu's scheduler group has multiple
>>> - *     busy cpu's exceeding the group's capacity.
>>> + *   - This rq has at least one CFS task and the capacity of the CPU is
>>> + *     significantly reduced because of RT tasks or IRQs.
>>> + *   - At parent of LLC scheduler domain level, this cpu's scheduler group has
>>> + *     multiple busy cpu.
>>>   *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
>>>   *     domain span are idle.
>>>   */
>>> @@ -7372,9 +7393,10 @@ static inline int nohz_kick_needed(struct rq *rq)
>>>       struct sched_domain *sd;
>>>       struct sched_group_capacity *sgc;
>>>       int nr_busy, cpu = rq->cpu;
>>> +     bool kick = false;
>>>
>>>       if (unlikely(rq->idle_balance))
>>> -             return 0;
>>> +             return false;
>>>
>>>         /*
>>>       * We may be recently in ticked or tickless idle mode. At the first
>>> @@ -7388,38 +7410,45 @@ static inline int nohz_kick_needed(struct rq *rq)
>>>        * balancing.
>>>        */
>>>       if (likely(!atomic_read(&nohz.nr_cpus)))
>>> -             return 0;
>>> +             return false;
>>>
>>>       if (time_before(now, nohz.next_balance))
>>> -             return 0;
>>> +             return false;
>>>
>>>       if (rq->nr_running >= 2)
>>
>> Will this check ^^ not catch those cases which this patch is targeting?
> 
> This patch is not about how many tasks are running but if the capacity
> of the CPU is reduced because of side activity like interruptions
> which are only visible in the capacity value (with IRQ_TIME_ACCOUNTING
> config) but not in nr_running.
> Even if the capacity is reduced because of RT tasks, nothing ensures
> that the RT task will be running when the tick fires
> 
> Regards,
> Vincent
>>
>> Regards
>> Preeti U Murthy
>>
>>> -             goto need_kick;
>>> +             return true;
>>>
>>>       rcu_read_lock();
>>>       sd = rcu_dereference(per_cpu(sd_busy, cpu));
>>> -
>>>       if (sd) {
>>>               sgc = sd->groups->sgc;
>>>               nr_busy = atomic_read(&sgc->nr_busy_cpus);
>>>
>>> -             if (nr_busy > 1)
>>> -                     goto need_kick_unlock;
>>> +             if (nr_busy > 1) {
>>> +                     kick = true;
>>> +                     goto unlock;
>>> +             }
>>> +
>>>       }
>>>
>>> -     sd = rcu_dereference(per_cpu(sd_asym, cpu));
>>> +     sd = rcu_dereference(rq->sd);
>>> +     if (sd) {
>>> +             if ((rq->cfs.h_nr_running >= 1) &&
>>> +                             ((rq->cpu_capacity * sd->imbalance_pct) <
>>> +                             (rq->cpu_capacity_orig * 100))) {

OK, I understand your explanation above. But I was wondering if you
would need to add this rq->cfs.h_nr_running >= 1 check in the above
two cases as well.

I have actually raised this concern about whether we should be using
rq->nr_running or cfs_rq->nr_running during load balancing, in my
reply to your patch 3. While all our load measurements are about the
cfs_rq load, we use rq->nr_running, which may include tasks from
other sched classes as well. We divide the load by it to get the
average load per task during load balancing, which is wrong, isn't
it? Similarly, in nohz_kick_needed() we trigger load balancing based
on rq->nr_running again.

In this patch too, even if the cpu is dominated by tasks that do not
belong to the cfs class, you would be triggering a futile load
balance if there are no fair tasks to move.
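
To make that concrete with a made-up scenario: a CPU running one RT
task and one CFS task has rq->nr_running == 2 but
rq->cfs.h_nr_running == 1, while a CPU running two RT tasks and no
CFS task also has rq->nr_running == 2 yet offers nothing for the fair
load balancer to pull.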

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 03/12] sched: fix avg_load computation
  2014-08-30 12:00     ` Preeti U Murthy
@ 2014-09-03 11:09       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-03 11:09 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 30 August 2014 14:00, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi Vincent,
>
> On 08/26/2014 04:36 PM, Vincent Guittot wrote:
>> The computation of avg_load and avg_load_per_task should only takes into
>> account the number of cfs tasks. The non cfs task are already taken into
>> account by decreasing the cpu's capacity and they will be tracked in the
>> CPU's utilization (group_utilization) of the next patches
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 4 ++--
>>  1 file changed, 2 insertions(+), 2 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 87b9dc7..b85e9f7 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu)
>>  static unsigned long cpu_avg_load_per_task(int cpu)
>>  {
>>       struct rq *rq = cpu_rq(cpu);
>> -     unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
>> +     unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
>>       unsigned long load_avg = rq->cfs.runnable_load_avg;
>>
>>       if (nr_running)
>> @@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>                       load = source_load(i, load_idx);
>>
>>               sgs->group_load += load;
>> -             sgs->sum_nr_running += rq->nr_running;
>> +             sgs->sum_nr_running += rq->cfs.h_nr_running;
>>
>>               if (rq->nr_running > 1)
>>                       *overload = true;
>>
>
> Why do we probe rq->nr_running while we do load balancing? Should not we
> be probing cfs_rq->nr_running instead? We are interested after all in
> load balancing fair tasks right? The reason I ask this is, I was
> wondering if we need to make the above similar change in more places in
> load balancing.

Hi Preeti,

Yes, we should probably add the test rq->cfs.h_nr_running > 0 before
setting overload.

Sorry for this late answer, the email was lost in my messy inbox

Vincent

>
> To cite examples: The above check says a cpu is overloaded when
> rq->nr_running > 1. However if these tasks happen to be rt tasks, we
> would anyway not be able to load balance. So while I was looking through
> this patch, I noticed this and wanted to cross verify if we are checking
> rq->nr_running on purpose in some places in load balancing; another
> example being in nohz_kick_needed().
>
>
> Regards
> Preeti U Murthy
>

* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-09-03  9:11         ` Preeti U Murthy
@ 2014-09-03 11:44           ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-03 11:44 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 3 September 2014 11:11, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> On 09/01/2014 02:15 PM, Vincent Guittot wrote:
>> On 30 August 2014 19:50, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>>> Hi Vincent,
>>>> index 18db43e..60ae1ce 100644
>>>> --- a/kernel/sched/fair.c
>>>> +++ b/kernel/sched/fair.c
>>>> @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>>                       return true;
>>>>       }
>>>>
>>>> +     /*
>>>> +      * The group capacity is reduced probably because of activity from other
>>>> +      * sched class or interrupts which use part of the available capacity
>>>> +      */
>>>> +     if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
>>>> +                             env->sd->imbalance_pct))
>>>
>>> Wouldn't the check on avg_load let us know if we are packing more tasks
>>> in this group than its capacity ? Isn't that the metric we are more
>>> interested in?
>>
>> With  this patch, we don't want to pack but we prefer to spread the
>> task on another CPU than the one which handles the interruption if
>> latter uses a significant part of the CPU capacity.
>>
>>>
>>>> +             return true;
>>>> +
>>>>       return false;
>>>>  }
>>>>
>>>> @@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env)
>>>>       struct sched_domain *sd = env->sd;
>>>>
>>>>       if (env->idle == CPU_NEWLY_IDLE) {
>>>> +             int src_cpu = env->src_cpu;
>>>>
>>>>               /*
>>>>                * ASYM_PACKING needs to force migrate tasks from busy but
>>>>                * higher numbered CPUs in order to pack all tasks in the
>>>>                * lowest numbered CPUs.
>>>>                */
>>>> -             if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
>>>> +             if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
>>>> +                     return 1;
>>>> +
>>>> +             /*
>>>> +              * If the CPUs share their cache and the src_cpu's capacity is
>>>> +              * reduced because of other sched_class or IRQs, we trig an
>>>> +              * active balance to move the task
>>>> +              */
>>>> +             if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
>>>> +                             sd->imbalance_pct))
>>>>                       return 1;
>>>>       }
>>>>
>>>> @@ -6643,6 +6661,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>>>
>>>>       schedstat_add(sd, lb_imbalance[idle], env.imbalance);
>>>>
>>>> +     env.src_cpu = busiest->cpu;
>>>> +
>>>>       ld_moved = 0;
>>>>       if (busiest->nr_running > 1) {
>>>>               /*
>>>> @@ -6652,7 +6672,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>>>                * correctly treated as an imbalance.
>>>>                */
>>>>               env.flags |= LBF_ALL_PINNED;
>>>> -             env.src_cpu   = busiest->cpu;
>>>>               env.src_rq    = busiest;
>>>>               env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
>>>>
>>>> @@ -7359,10 +7378,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>>>
>>>>  /*
>>>>   * Current heuristic for kicking the idle load balancer in the presence
>>>> - * of an idle cpu is the system.
>>>> + * of an idle cpu in the system.
>>>>   *   - This rq has more than one task.
>>>> - *   - At any scheduler domain level, this cpu's scheduler group has multiple
>>>> - *     busy cpu's exceeding the group's capacity.
>>>> + *   - This rq has at least one CFS task and the capacity of the CPU is
>>>> + *     significantly reduced because of RT tasks or IRQs.
>>>> + *   - At parent of LLC scheduler domain level, this cpu's scheduler group has
>>>> + *     multiple busy cpu.
>>>>   *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
>>>>   *     domain span are idle.
>>>>   */
>>>> @@ -7372,9 +7393,10 @@ static inline int nohz_kick_needed(struct rq *rq)
>>>>       struct sched_domain *sd;
>>>>       struct sched_group_capacity *sgc;
>>>>       int nr_busy, cpu = rq->cpu;
>>>> +     bool kick = false;
>>>>
>>>>       if (unlikely(rq->idle_balance))
>>>> -             return 0;
>>>> +             return false;
>>>>
>>>>         /*
>>>>       * We may be recently in ticked or tickless idle mode. At the first
>>>> @@ -7388,38 +7410,45 @@ static inline int nohz_kick_needed(struct rq *rq)
>>>>        * balancing.
>>>>        */
>>>>       if (likely(!atomic_read(&nohz.nr_cpus)))
>>>> -             return 0;
>>>> +             return false;
>>>>
>>>>       if (time_before(now, nohz.next_balance))
>>>> -             return 0;
>>>> +             return false;
>>>>
>>>>       if (rq->nr_running >= 2)
>>>
>>> Will this check ^^ not catch those cases which this patch is targeting?
>>
>> This patch is not about how many tasks are running but if the capacity
>> of the CPU is reduced because of side activity like interruptions
>> which are only visible in the capacity value (with IRQ_TIME_ACCOUNTING
>> config) but not in nr_running.
>> Even if the capacity is reduced because of RT tasks, nothing ensures
>> that the RT task will be running when the tick fires
>>
>> Regards,
>> Vincent
>>>
>>> Regards
>>> Preeti U Murthy
>>>
>>>> -             goto need_kick;
>>>> +             return true;
>>>>
>>>>       rcu_read_lock();
>>>>       sd = rcu_dereference(per_cpu(sd_busy, cpu));
>>>> -
>>>>       if (sd) {
>>>>               sgc = sd->groups->sgc;
>>>>               nr_busy = atomic_read(&sgc->nr_busy_cpus);
>>>>
>>>> -             if (nr_busy > 1)
>>>> -                     goto need_kick_unlock;
>>>> +             if (nr_busy > 1) {
>>>> +                     kick = true;
>>>> +                     goto unlock;
>>>> +             }
>>>> +
>>>>       }
>>>>
>>>> -     sd = rcu_dereference(per_cpu(sd_asym, cpu));
>>>> +     sd = rcu_dereference(rq->sd);
>>>> +     if (sd) {
>>>> +             if ((rq->cfs.h_nr_running >= 1) &&
>>>> +                             ((rq->cpu_capacity * sd->imbalance_pct) <
>>>> +                             (rq->cpu_capacity_orig * 100))) {
>
> Ok I understand your explanation above. But I was wondering if you would
> need to add this check around rq->cfs.h_nr_running >= 1 in the above two
> cases as well.

Yes, you're right for the test if (rq->nr_running >= 2).

It's not so straightforward for nr_busy_cpus, which reflects how many
CPUs have not stopped their tick. The scheduler assumes that the
latter are busy with cfs tasks.

>
> I have actually raised this concern over whether we should be using
> rq->nr_running or cfs_rq->nr_running while we do load balancing in reply
> to your patch3. While all our load measurements are about the cfs_rq

I have just replied to your comments on patch 3. Sorry for the delay

> load, we use rq->nr_running, which may include tasks from other sched
> classes as well. We divide them to get average load per task during load
> balancing which is wrong, isn't it? Similarly during nohz_kick_needed(),
> we trigger load balancing based on rq->nr_running again.
>
> In this patch too, even if you know that the cpu is being dominated by
> tasks that do not belong to cfs class, you would be triggering a futile
> load balance if there are no fair tasks to move.
This patch adds one additional condition that is based on
rq->cfs.h_nr_running, so it should not trigger any futile load balance.
Then, I have also taken advantage of this patch to clean up
nohz_kick_needed as proposed by Peter, but the conditions are the same
as previously (except the one with rq->cfs.h_nr_running).
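
To make the capacity part of that condition concrete, here is a minimal
standalone sketch (plain C; the capacity and imbalance_pct values are
invented, this is not the kernel code):

/* Toy check of: (cpu_capacity * imbalance_pct) < (cpu_capacity_orig * 100) */
#include <stdbool.h>
#include <stdio.h>

int main(void)
{
	unsigned long cpu_capacity_orig = 1024;	/* full capacity of the CPU */
	unsigned long cpu_capacity = 760;	/* what is left after IRQ/rt pressure */
	unsigned long imbalance_pct = 125;	/* example sd->imbalance_pct */
	unsigned long cfs_h_nr_running = 1;	/* at least one cfs task to move */

	bool kick = (cfs_h_nr_running >= 1) &&
		    (cpu_capacity * imbalance_pct <
		     cpu_capacity_orig * 100);

	printf("kick = %d\n", kick);	/* 1: 760*125 < 1024*100 */
	return 0;
}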

I can prepare another patchset that will solve the concerns that you
raised for nohz_kick_needed and in patch 3, but I would prefer not to
include them in this patchset, which is large enough and whose subject
is a bit different.
Does that seem ok to you?

Regards,
Vincent
>
> Regards
> Preeti U Murthy
>

* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-09-03 11:44           ` Vincent Guittot
@ 2014-09-03 12:26             ` Preeti U Murthy
  -1 siblings, 0 replies; 164+ messages in thread
From: Preeti U Murthy @ 2014-09-03 12:26 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 09/03/2014 05:14 PM, Vincent Guittot wrote:
> On 3 September 2014 11:11, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> On 09/01/2014 02:15 PM, Vincent Guittot wrote:
>>> On 30 August 2014 19:50, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>>>> Hi Vincent,
>>>>> index 18db43e..60ae1ce 100644
>>>>> --- a/kernel/sched/fair.c
>>>>> +++ b/kernel/sched/fair.c
>>>>> @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>>>>                       return true;
>>>>>       }
>>>>>
>>>>> +     /*
>>>>> +      * The group capacity is reduced probably because of activity from other
>>>>> +      * sched class or interrupts which use part of the available capacity
>>>>> +      */
>>>>> +     if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
>>>>> +                             env->sd->imbalance_pct))
>>>>
>>>> Wouldn't the check on avg_load let us know if we are packing more tasks
>>>> in this group than its capacity ? Isn't that the metric we are more
>>>> interested in?
>>>
>>> With  this patch, we don't want to pack but we prefer to spread the
>>> task on another CPU than the one which handles the interruption if
>>> latter uses a significant part of the CPU capacity.
>>>
>>>>
>>>>> +             return true;
>>>>> +
>>>>>       return false;
>>>>>  }
>>>>>
>>>>> @@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env)
>>>>>       struct sched_domain *sd = env->sd;
>>>>>
>>>>>       if (env->idle == CPU_NEWLY_IDLE) {
>>>>> +             int src_cpu = env->src_cpu;
>>>>>
>>>>>               /*
>>>>>                * ASYM_PACKING needs to force migrate tasks from busy but
>>>>>                * higher numbered CPUs in order to pack all tasks in the
>>>>>                * lowest numbered CPUs.
>>>>>                */
>>>>> -             if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
>>>>> +             if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
>>>>> +                     return 1;
>>>>> +
>>>>> +             /*
>>>>> +              * If the CPUs share their cache and the src_cpu's capacity is
>>>>> +              * reduced because of other sched_class or IRQs, we trig an
>>>>> +              * active balance to move the task
>>>>> +              */
>>>>> +             if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
>>>>> +                             sd->imbalance_pct))
>>>>>                       return 1;
>>>>>       }
>>>>>
>>>>> @@ -6643,6 +6661,8 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>>>>
>>>>>       schedstat_add(sd, lb_imbalance[idle], env.imbalance);
>>>>>
>>>>> +     env.src_cpu = busiest->cpu;
>>>>> +
>>>>>       ld_moved = 0;
>>>>>       if (busiest->nr_running > 1) {
>>>>>               /*
>>>>> @@ -6652,7 +6672,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>>>>                * correctly treated as an imbalance.
>>>>>                */
>>>>>               env.flags |= LBF_ALL_PINNED;
>>>>> -             env.src_cpu   = busiest->cpu;
>>>>>               env.src_rq    = busiest;
>>>>>               env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
>>>>>
>>>>> @@ -7359,10 +7378,12 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>>>>>
>>>>>  /*
>>>>>   * Current heuristic for kicking the idle load balancer in the presence
>>>>> - * of an idle cpu is the system.
>>>>> + * of an idle cpu in the system.
>>>>>   *   - This rq has more than one task.
>>>>> - *   - At any scheduler domain level, this cpu's scheduler group has multiple
>>>>> - *     busy cpu's exceeding the group's capacity.
>>>>> + *   - This rq has at least one CFS task and the capacity of the CPU is
>>>>> + *     significantly reduced because of RT tasks or IRQs.
>>>>> + *   - At parent of LLC scheduler domain level, this cpu's scheduler group has
>>>>> + *     multiple busy cpu.
>>>>>   *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
>>>>>   *     domain span are idle.
>>>>>   */
>>>>> @@ -7372,9 +7393,10 @@ static inline int nohz_kick_needed(struct rq *rq)
>>>>>       struct sched_domain *sd;
>>>>>       struct sched_group_capacity *sgc;
>>>>>       int nr_busy, cpu = rq->cpu;
>>>>> +     bool kick = false;
>>>>>
>>>>>       if (unlikely(rq->idle_balance))
>>>>> -             return 0;
>>>>> +             return false;
>>>>>
>>>>>         /*
>>>>>       * We may be recently in ticked or tickless idle mode. At the first
>>>>> @@ -7388,38 +7410,45 @@ static inline int nohz_kick_needed(struct rq *rq)
>>>>>        * balancing.
>>>>>        */
>>>>>       if (likely(!atomic_read(&nohz.nr_cpus)))
>>>>> -             return 0;
>>>>> +             return false;
>>>>>
>>>>>       if (time_before(now, nohz.next_balance))
>>>>> -             return 0;
>>>>> +             return false;
>>>>>
>>>>>       if (rq->nr_running >= 2)
>>>>
>>>> Will this check ^^ not catch those cases which this patch is targeting?
>>>
>>> This patch is not about how many tasks are running but if the capacity
>>> of the CPU is reduced because of side activity like interruptions
>>> which are only visible in the capacity value (with IRQ_TIME_ACCOUNTING
>>> config) but not in nr_running.
>>> Even if the capacity is reduced because of RT tasks, nothing ensures
>>> that the RT task will be running when the tick fires
>>>
>>> Regards,
>>> Vincent
>>>>
>>>> Regards
>>>> Preeti U Murthy
>>>>
>>>>> -             goto need_kick;
>>>>> +             return true;
>>>>>
>>>>>       rcu_read_lock();
>>>>>       sd = rcu_dereference(per_cpu(sd_busy, cpu));
>>>>> -
>>>>>       if (sd) {
>>>>>               sgc = sd->groups->sgc;
>>>>>               nr_busy = atomic_read(&sgc->nr_busy_cpus);
>>>>>
>>>>> -             if (nr_busy > 1)
>>>>> -                     goto need_kick_unlock;
>>>>> +             if (nr_busy > 1) {
>>>>> +                     kick = true;
>>>>> +                     goto unlock;
>>>>> +             }
>>>>> +
>>>>>       }
>>>>>
>>>>> -     sd = rcu_dereference(per_cpu(sd_asym, cpu));
>>>>> +     sd = rcu_dereference(rq->sd);
>>>>> +     if (sd) {
>>>>> +             if ((rq->cfs.h_nr_running >= 1) &&
>>>>> +                             ((rq->cpu_capacity * sd->imbalance_pct) <
>>>>> +                             (rq->cpu_capacity_orig * 100))) {
>>
>> Ok I understand your explanation above. But I was wondering if you would
>> need to add this check around rq->cfs.h_nr_running >= 1 in the above two
>> cases as well.
> 
> yes you're right for the test if (rq->nr_running >= 2).
> 
> It's not so straight forward for nr_busy_cpus which reflects how many
> CPUs have not stopped their tick. The scheduler assumes that the
> latter are busy with cfs tasks
> 
>>
>> I have actually raised this concern over whether we should be using
>> rq->nr_running or cfs_rq->nr_running while we do load balancing in reply
>> to your patch3. While all our load measurements are about the cfs_rq
> 
> I have just replied to your comments on patch 3. Sorry for the delay
> 
>> load, we use rq->nr_running, which may include tasks from other sched
>> classes as well. We divide them to get average load per task during load
>> balancing which is wrong, isn't it? Similarly during nohz_kick_needed(),
>> we trigger load balancing based on rq->nr_running again.
>>
>> In this patch too, even if you know that the cpu is being dominated by
>> tasks that do not belong to cfs class, you would be triggering a futile
>> load balance if there are no fair tasks to move.
> This patch adds one additional condition that is based on
> rq->cfs.h_nr_running so it should not trigger any futile load balance.
> Then, I have also take advantage of this patch to clean up
> nohz_kick_needed as proposed by Peter but the conditions are the same
> than previously (except the one with rq->cfs.h_nr_running)
> 
> I can prepare another patchset that will solve the concerns that you
> raised for nohz_kick_needed and in patch 3 but i would prefer not
> include them in this patchset which is large enough and which subject
> is a bit different.
> Does it seem ok for you ?

Sure Vincent, thanks! I have in fact sent out a mail raising my concern
over rq->nr_running. If others agree that the issue exists, maybe
we can work on a next patchset that cleans this up in all the places
necessary and not just in nohz_kick_needed().

Regards
Preeti U Murthy
> 
> Regards,
> Vincent
>>
>> Regards
>> Preeti U Murthy
>>
> 


* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-09-03 12:26             ` Preeti U Murthy
@ 2014-09-03 12:49               ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-03 12:49 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 3 September 2014 14:26, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> On 09/03/2014 05:14 PM, Vincent Guittot wrote:
>> On 3 September 2014 11:11, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>>> On 09/01/2014 02:15 PM, Vincent Guittot wrote:

[snip]

>>>
>>> Ok I understand your explanation above. But I was wondering if you would
>>> need to add this check around rq->cfs.h_nr_running >= 1 in the above two
>>> cases as well.
>>
>> yes you're right for the test if (rq->nr_running >= 2).
>>
>> It's not so straight forward for nr_busy_cpus which reflects how many
>> CPUs have not stopped their tick. The scheduler assumes that the
>> latter are busy with cfs tasks
>>
>>>
>>> I have actually raised this concern over whether we should be using
>>> rq->nr_running or cfs_rq->nr_running while we do load balancing in reply
>>> to your patch3. While all our load measurements are about the cfs_rq
>>
>> I have just replied to your comments on patch 3. Sorry for the delay
>>
>>> load, we use rq->nr_running, which may include tasks from other sched
>>> classes as well. We divide them to get average load per task during load
>>> balancing which is wrong, isn't it? Similarly during nohz_kick_needed(),
>>> we trigger load balancing based on rq->nr_running again.
>>>
>>> In this patch too, even if you know that the cpu is being dominated by
>>> tasks that do not belong to cfs class, you would be triggering a futile
>>> load balance if there are no fair tasks to move.
>> This patch adds one additional condition that is based on
>> rq->cfs.h_nr_running so it should not trigger any futile load balance.
>> Then, I have also take advantage of this patch to clean up
>> nohz_kick_needed as proposed by Peter but the conditions are the same
>> than previously (except the one with rq->cfs.h_nr_running)
>>
>> I can prepare another patchset that will solve the concerns that you
>> raised for nohz_kick_needed and in patch 3 but i would prefer not
>> include them in this patchset which is large enough and which subject
>> is a bit different.
>> Does it seem ok for you ?
>
> Sure Vincent, thanks! I have in fact sent out a mail raising my concern
> over rq->nr_running. If others agree on the issue to be existing, maybe
> we can work on this next patchset that can clean this up in all places
> necessary and not just in nohz_kick_needed().

Ok, let's continue this discussion on the other thread.

Regards,
Vincent

>
> Regards
> Preeti U Murthy
>>
>> Regards,
>> Vincent
>>>
>>> Regards
>>> Preeti U Murthy
>>>
>>
>

* Re: [PATCH v5 03/12] sched: fix avg_load computation
  2014-09-03 11:09       ` Vincent Guittot
@ 2014-09-03 23:43         ` Tim Chen
  -1 siblings, 0 replies; 164+ messages in thread
From: Tim Chen @ 2014-09-03 23:43 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Preeti U Murthy, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On Wed, 2014-09-03 at 13:09 +0200, Vincent Guittot wrote:
> On 30 August 2014 14:00, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> > Hi Vincent,
> >
> > On 08/26/2014 04:36 PM, Vincent Guittot wrote:
> >> The computation of avg_load and avg_load_per_task should only takes into
> >> account the number of cfs tasks. The non cfs task are already taken into
> >> account by decreasing the cpu's capacity and they will be tracked in the
> >> CPU's utilization (group_utilization) of the next patches
> >>
> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> ---
> >>  kernel/sched/fair.c | 4 ++--
> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >>
> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> index 87b9dc7..b85e9f7 100644
> >> --- a/kernel/sched/fair.c
> >> +++ b/kernel/sched/fair.c
> >> @@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu)
> >>  static unsigned long cpu_avg_load_per_task(int cpu)
> >>  {
> >>       struct rq *rq = cpu_rq(cpu);
> >> -     unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> >> +     unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
> >>       unsigned long load_avg = rq->cfs.runnable_load_avg;
> >>
> >>       if (nr_running)
> >> @@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >>                       load = source_load(i, load_idx);
> >>
> >>               sgs->group_load += load;
> >> -             sgs->sum_nr_running += rq->nr_running;
> >> +             sgs->sum_nr_running += rq->cfs.h_nr_running;
> >>
> >>               if (rq->nr_running > 1)
> >>                       *overload = true;
> >>
> >
> > Why do we probe rq->nr_running while we do load balancing? Should not we
> > be probing cfs_rq->nr_running instead? We are interested after all in
> > load balancing fair tasks right? The reason I ask this is, I was
> > wondering if we need to make the above similar change in more places in
> > load balancing.
> 
> Hi Preeti,
> 
> Yes, we should probably the test   rq->cfs.h_nr_running > 0 before
> setting overload.
> 

The overload indicator is used for knowing when we can totally avoid
load balancing to a cpu that is about to go idle.
We can avoid load balancing when no cpu has more than 1 task.  So if you
have, say, just one fair task and multiple deadline tasks on a cpu,
and another cpu about to go idle, you should turn on normal load
balancing in the idle path by setting overload to true.

So overload should be set based on rq->nr_running and not on
rq->cfs.h_nr_running.
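
As a standalone illustration of that rule (toy C code with invented
counts, not the kernel's update_sg_lb_stats()):

/* Overload is raised as soon as any rq has more than one task, whatever
 * the class, so an idle CPU will still try to pull the single fair task.
 */
#include <stdbool.h>
#include <stdio.h>

struct toy_rq {
	unsigned int nr_running;	/* all sched classes */
	unsigned int cfs_h_nr_running;	/* cfs tasks only */
};

int main(void)
{
	struct toy_rq rq[2] = {
		{ 4, 1 },	/* one fair task + three deadline tasks */
		{ 0, 0 },	/* CPU about to go idle */
	};
	bool overload = false;
	unsigned int i;

	for (i = 0; i < 2; i++)
		if (rq[i].nr_running > 1)
			overload = true;

	printf("overload = %d\n", overload);	/* 1 */
	return 0;
}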

Thanks.
 
Tim



* Re: [PATCH v5 03/12] sched: fix avg_load computation
  2014-09-03 23:43         ` Tim Chen
@ 2014-09-04  7:17           ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-04  7:17 UTC (permalink / raw)
  To: Tim Chen
  Cc: Preeti U Murthy, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 4 September 2014 01:43, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> On Wed, 2014-09-03 at 13:09 +0200, Vincent Guittot wrote:
>> On 30 August 2014 14:00, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
>> > Hi Vincent,
>> >
>> > On 08/26/2014 04:36 PM, Vincent Guittot wrote:
>> >> The computation of avg_load and avg_load_per_task should only takes into
>> >> account the number of cfs tasks. The non cfs task are already taken into
>> >> account by decreasing the cpu's capacity and they will be tracked in the
>> >> CPU's utilization (group_utilization) of the next patches
>> >>
>> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> >> ---
>> >>  kernel/sched/fair.c | 4 ++--
>> >>  1 file changed, 2 insertions(+), 2 deletions(-)
>> >>
>> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> >> index 87b9dc7..b85e9f7 100644
>> >> --- a/kernel/sched/fair.c
>> >> +++ b/kernel/sched/fair.c
>> >> @@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu)
>> >>  static unsigned long cpu_avg_load_per_task(int cpu)
>> >>  {
>> >>       struct rq *rq = cpu_rq(cpu);
>> >> -     unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
>> >> +     unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
>> >>       unsigned long load_avg = rq->cfs.runnable_load_avg;
>> >>
>> >>       if (nr_running)
>> >> @@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>> >>                       load = source_load(i, load_idx);
>> >>
>> >>               sgs->group_load += load;
>> >> -             sgs->sum_nr_running += rq->nr_running;
>> >> +             sgs->sum_nr_running += rq->cfs.h_nr_running;
>> >>
>> >>               if (rq->nr_running > 1)
>> >>                       *overload = true;
>> >>
>> >
>> > Why do we probe rq->nr_running while we do load balancing? Should not we
>> > be probing cfs_rq->nr_running instead? We are interested after all in
>> > load balancing fair tasks right? The reason I ask this is, I was
>> > wondering if we need to make the above similar change in more places in
>> > load balancing.
>>
>> Hi Preeti,
>>
>> Yes, we should probably the test   rq->cfs.h_nr_running > 0 before
>> setting overload.
>>
>
> The overload indicator is used for knowing when we can totally avoid
> load balancing to a cpu that is about to go idle.
> We can avoid load balancing when no cpu has more than 1 task.  So if you
> have say just one fair task and multiple deadline tasks on a cpu,
> and another cpu about to go idle, you should turn on normal load
> balancing in the idle path by setting overload to true.

The newly idle load balancing can only affect CFS tasks, so triggering
a load_balance because a cpu is overloaded by rt tasks only will not
change anything.

>
> So setting overload should be set based on rq->nr_running and not on
> rq->cfs.h_nr_running.

We should probably use both values, like below:

if ((rq->nr_running > 1) && (rq->cfs.h_nr_running > 0))
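
A quick standalone check of how that combined test behaves (invented
counts, not kernel code):

#include <stdbool.h>
#include <stdio.h>

/* Overload only when more than one task is present AND at least one of
 * them is a cfs task that an idle CPU could actually pull.
 */
static bool overloaded(unsigned int nr_running, unsigned int cfs_h_nr_running)
{
	return nr_running > 1 && cfs_h_nr_running > 0;
}

int main(void)
{
	printf("3 rt, 0 cfs: %d\n", overloaded(3, 0));	/* 0: nothing to pull */
	printf("2 rt, 1 cfs: %d\n", overloaded(3, 1));	/* 1: worth balancing */
	return 0;
}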

Regards,
Vincent
>
> Thanks.
>
> Tim
>
>

* Re: [PATCH v5 09/11] sched: add usage_load_avg
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-04  7:34     ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-04  7:34 UTC (permalink / raw)
  To: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann, Vincent Guittot

The update of the update_rq_runnable_avg() interface is missing for the
CONFIG_FAIR_GROUP_SCHED case in the original patch
"[PATCH v5 09/12] sched: add usage_load_avg"

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---

Hi Peter,

Do you prefer that I send a new version of the
"[PATCH v5 09/12] sched: add usage_load_avg" patch with the fix, or is this fix
enough?

Regards,
Vincent

 kernel/sched/fair.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1fd2131..cbe3eff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2462,7 +2462,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
+			runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 164+ messages in thread
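
A note on the fix above: the rq-level caller passes the same value for both the runnable and the new running argument, presumably because a runqueue is running whenever it has runnable tasks. Below is the resulting function once the diff is applied, with that reading spelled out in comments (the comments are an interpretation, not part of the patch):

static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
{
	/*
	 * For a whole runqueue "runnable" and "running" coincide: the rq
	 * is executing something whenever it has runnable tasks, so the
	 * same value feeds both runnable_avg_sum and the new
	 * running_avg_sum.
	 */
	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
			runnable);
	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
}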

* Re: [PATCH v5 03/12] sched: fix avg_load computation
  2014-09-04  7:17           ` Vincent Guittot
@ 2014-09-04 16:26             ` Tim Chen
  -1 siblings, 0 replies; 164+ messages in thread
From: Tim Chen @ 2014-09-04 16:26 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Preeti U Murthy, Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On Thu, 2014-09-04 at 09:17 +0200, Vincent Guittot wrote:
> On 4 September 2014 01:43, Tim Chen <tim.c.chen@linux.intel.com> wrote:
> > On Wed, 2014-09-03 at 13:09 +0200, Vincent Guittot wrote:
> >> On 30 August 2014 14:00, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> >> > Hi Vincent,
> >> >
> >> > On 08/26/2014 04:36 PM, Vincent Guittot wrote:
> >> >> The computation of avg_load and avg_load_per_task should only takes into
> >> >> account the number of cfs tasks. The non cfs task are already taken into
> >> >> account by decreasing the cpu's capacity and they will be tracked in the
> >> >> CPU's utilization (group_utilization) of the next patches
> >> >>
> >> >> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> >> >> ---
> >> >>  kernel/sched/fair.c | 4 ++--
> >> >>  1 file changed, 2 insertions(+), 2 deletions(-)
> >> >>
> >> >> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >> >> index 87b9dc7..b85e9f7 100644
> >> >> --- a/kernel/sched/fair.c
> >> >> +++ b/kernel/sched/fair.c
> >> >> @@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu)
> >> >>  static unsigned long cpu_avg_load_per_task(int cpu)
> >> >>  {
> >> >>       struct rq *rq = cpu_rq(cpu);
> >> >> -     unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> >> >> +     unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
> >> >>       unsigned long load_avg = rq->cfs.runnable_load_avg;
> >> >>
> >> >>       if (nr_running)
> >> >> @@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
> >> >>                       load = source_load(i, load_idx);
> >> >>
> >> >>               sgs->group_load += load;
> >> >> -             sgs->sum_nr_running += rq->nr_running;
> >> >> +             sgs->sum_nr_running += rq->cfs.h_nr_running;
> >> >>
> >> >>               if (rq->nr_running > 1)
> >> >>                       *overload = true;
> >> >>
> >> >
> >> > Why do we probe rq->nr_running while we do load balancing? Should not we
> >> > be probing cfs_rq->nr_running instead? We are interested after all in
> >> > load balancing fair tasks right? The reason I ask this is, I was
> >> > wondering if we need to make the above similar change in more places in
> >> > load balancing.
> >>
> >> Hi Preeti,
> >>
> >> Yes, we should probably the test   rq->cfs.h_nr_running > 0 before
> >> setting overload.
> >>
> >
> > The overload indicator is used for knowing when we can totally avoid
> > load balancing to a cpu that is about to go idle.
> > We can avoid load balancing when no cpu has more than 1 task.  So if you
> > have say just one fair task and multiple deadline tasks on a cpu,
> > and another cpu about to go idle, you should turn on normal load
> > balancing in the idle path by setting overload to true.
> 
> The newly idle load balancing can only affect CFS tasks, so triggering
> a load_balance because a cpu is overloaded only by rt tasks will not
> change anything.
> 
> >
> > So setting overload should be set based on rq->nr_running and not on
> > rq->cfs.h_nr_running.
> 
> We should probably use both values, like below:
>
> if ((rq->nr_running > 1) && (rq->cfs.h_nr_running > 0))

Yes, this modification is the correct one that takes care of the
condition I objected to previously.

Thanks.

Tim



^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 03/12] sched: fix avg_load computation
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-05 11:10     ` Preeti U Murthy
  -1 siblings, 0 replies; 164+ messages in thread
From: Preeti U Murthy @ 2014-09-05 11:10 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

On 08/26/2014 04:36 PM, Vincent Guittot wrote:
> The computation of avg_load and avg_load_per_task should only takes into
> account the number of cfs tasks. The non cfs task are already taken into
> account by decreasing the cpu's capacity and they will be tracked in the
> CPU's utilization (group_utilization) of the next patches
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 87b9dc7..b85e9f7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4092,7 +4092,7 @@ static unsigned long capacity_of(int cpu)
>  static unsigned long cpu_avg_load_per_task(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> -	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
> +	unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
>  	unsigned long load_avg = rq->cfs.runnable_load_avg;
> 
>  	if (nr_running)
> @@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  			load = source_load(i, load_idx);
> 
>  		sgs->group_load += load;
> -		sgs->sum_nr_running += rq->nr_running;
> +		sgs->sum_nr_running += rq->cfs.h_nr_running;

Yes, this was one of the concerns I had around the usage of
rq->nr_running. Looks good to me.

> 
>  		if (rq->nr_running > 1)
>  			*overload = true;
> 
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>


^ permalink raw reply	[flat|nested] 164+ messages in thread
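
A small worked example of the cpu_avg_load_per_task() hunk reviewed above, assuming one CFS task with runnable_load_avg = 1024 sharing the CPU with one rt task, so rq->nr_running = 2 but rq->cfs.h_nr_running = 1:

	before the patch:  load_avg / rq->nr_running       = 1024 / 2 = 512
	after the patch:   load_avg / rq->cfs.h_nr_running  = 1024 / 1 = 1024

The rt task no longer halves the apparent per-CFS-task load; as the changelog says, its cost shows up instead through the reduced cpu capacity.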

* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-05 12:06     ` Preeti U Murthy
  -1 siblings, 0 replies; 164+ messages in thread
From: Preeti U Murthy @ 2014-09-05 12:06 UTC (permalink / raw)
  To: Vincent Guittot, peterz, mingo, linux-kernel, linux, linux-arm-kernel
  Cc: riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

Hi Vincent,

On 08/26/2014 04:36 PM, Vincent Guittot wrote:
> If the CPU is used for handling lot of IRQs, trig a load balance to check if
> it's worth moving its tasks on another CPU that has more capacity.
> 
> As a sidenote, this will note generate more spurious ilb because we already
> trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that
> has a task, we will trig the ilb once for migrating the task.
> 
> The nohz_kick_needed function has been cleaned up a bit while adding the new
> test
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

So I see that there are added checks in your previous patches on whether the
cpu capacity available for CFS tasks is good enough to run tasks on the cpu. My
concern is that, although they appear sensible, they could significantly
increase the number of times we load balance.

Ebizzy would not test this aspect, right? There are no real-time
tasks/interrupts being generated.

Besides, what is the column that says patchset+irq? What is the irq
accounting patchset that you refer to in your cover letter?

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-09-05 12:06     ` Preeti U Murthy
@ 2014-09-05 12:24       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-05 12:24 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 5 September 2014 14:06, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> Hi Vincent,
>
> On 08/26/2014 04:36 PM, Vincent Guittot wrote:
>> If the CPU is used for handling lot of IRQs, trig a load balance to check if
>> it's worth moving its tasks on another CPU that has more capacity.
>>
>> As a sidenote, this will note generate more spurious ilb because we already
>> trig an ilb if there is more than 1 busy cpu. If this cpu is the only one that
>> has a task, we will trig the ilb once for migrating the task.
>>
>> The nohz_kick_needed function has been cleaned up a bit while adding the new
>> test
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> So I see that there are added checks in your previous patches on whether the
> cpu capacity available for CFS tasks is good enough to run tasks on the cpu. My
> concern is that, although they appear sensible, they could significantly
> increase the number of times we load balance.
>
> Ebizzy would not test this aspect, right? There are no real-time
> tasks/interrupts being generated.

Yes, ebizzy doesn't test this part, but it checks for non-regression.

The scp test is the one that I use to check this patch and the
previous one, but a test with some cfs and rt tasks should also do the
job. We can see an increase of 82% for the dual core when
CONFIG_IRQ_TIME_ACCOUNTING is enabled.

>
> Besides, what is the column that says patchset+irq? What is the irq
> accounting patchset that you refer to in your cover letter?

It refers to CONFIG_IRQ_TIME_ACCOUNTING, which includes the time spent
in interrupt context when computing scale_rt_capacity.

Regards,
Vincent

>
> Regards
> Preeti U Murthy
>

^ permalink raw reply	[flat|nested] 164+ messages in thread
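
To make the CONFIG_IRQ_TIME_ACCOUNTING point concrete: with that option set, time spent in interrupt context is folded into rq->rt_avg alongside rt activity, and scale_rt_capacity() then removes that share from the CPU's capacity. A simplified sketch of the idea follows; the in-tree function has additional clamping and ageing of rt_avg, so treat this as an approximation rather than the real code:

static unsigned long scale_rt_capacity_sketch(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	u64 total = sched_avg_period() + (rq_clock(rq) - rq->age_stamp);
	u64 used  = rq->rt_avg;	/* rt activity, plus irq time when accounted */

	if (used >= total)
		return 1;	/* almost nothing left for CFS */

	/* fraction of the period left for CFS, in SCHED_CAPACITY_SCALE units */
	return div_u64((total - used) << SCHED_CAPACITY_SHIFT, total);
}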

* Re: [PATCH v5 04/12] sched: Allow all archs to set the capacity_orig
  2014-08-30 17:07     ` Preeti U Murthy
@ 2014-09-10 13:50       ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-10 13:50 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

On Sat, Aug 30, 2014 at 10:37:40PM +0530, Preeti U Murthy wrote:
> > -	if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
> > -		if (sched_feat(ARCH_CAPACITY))
> 
> Aren't you missing this check above? I understand that it is not
> crucial, but that would also mean removing ARCH_CAPACITY sched_feat
> altogether, wouldn't it?

Yes he's missing that, I added the below bit on top.

So the argument last time:
lkml.kernel.org/r/20140709105721.GT19379@twins.programming.kicks-ass.net
was that you cannot put sched_feat(ARCH_CAPACITY) inside a weak arch_*
function. The test has to be outside, seeing how it needs to decide to
call the arch function at all (or revert to the default implementation).

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5700,7 +5700,7 @@ unsigned long __weak arch_scale_freq_cap
 	return default_scale_capacity(sd, cpu);
 }
 
-unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
 	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
 		return sd->smt_gain / sd->span_weight;
@@ -5708,6 +5708,11 @@ unsigned long __weak arch_scale_cpu_capa
 	return SCHED_CAPACITY_SCALE;
 }
 
+unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+	return default_scale_cpu_capacity(sd, cpu);
+}
+
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -5747,7 +5752,10 @@ static void update_cpu_capacity(struct s
 	unsigned long capacity = SCHED_CAPACITY_SCALE;
 	struct sched_group *sdg = sd->groups;
 
-	capacity *= arch_scale_cpu_capacity(sd, cpu);
+	if (sched_feat(ARCH_CAPACITY))
+		capacity *= arch_scale_cpu_capacity(sd, cpu);
+	else
+		capacity *= default_scale_cpu_capacity(sd, cpu);
 
 	capacity >>= SCHED_CAPACITY_SHIFT;
 

^ permalink raw reply	[flat|nested] 164+ messages in thread
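
For completeness, the weak hook above is what lets an architecture report per-cpu capacities. A hypothetical override might look like the following; the per-cpu variable and its values are made up for illustration and are not taken from the ARM patches:

/* hypothetical arch code, values for illustration only */
static DEFINE_PER_CPU(unsigned long, cpu_scale) = SCHED_CAPACITY_SCALE;

unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
	/* e.g. a little core could report ~60% of SCHED_CAPACITY_SCALE */
	return per_cpu(cpu_scale, cpu);
}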

* Re: [PATCH v5 06/12] sched: add per rq cpu_capacity_orig
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-10 13:53     ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-10 13:53 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, linux, linux-arm-kernel, riel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

On Tue, Aug 26, 2014 at 01:06:49PM +0200, Vincent Guittot wrote:
> This new field cpu_capacity_orig reflects the available capacity of a CPUs
> unlike the cpu_capacity which reflects the current capacity that can be altered
> by frequency and rt tasks.
> 

No real objection to the patch as such, but the Changelog fails to tell
us why you want this.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 07/12] sched: test the cpu's capacity in wake affine
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-10 14:19     ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-10 14:19 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, linux, linux-arm-kernel, riel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

On Tue, Aug 26, 2014 at 01:06:50PM +0200, Vincent Guittot wrote:
> Currently the task always wakes affine on this_cpu if the latter is idle.
> Before waking up the task on this_cpu, we check that this_cpu capacity is not
> significantly reduced because of RT tasks or irq activity.
> 
> Use case where the number of irq and/or the time spent under irq is important
> will take benefit of this because the task that is woken up by irq or softirq
> will not use the same CPU than irq (and softirq) but a idle one.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Rik, you've been 'fighting' wake_affine, any objections?

^ permalink raw reply	[flat|nested] 164+ messages in thread
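
A rough sketch of the kind of test the changelog describes, i.e. refusing the affine wakeup when this_cpu's capacity has been significantly reduced by rt or irq activity. The shape of the condition is borrowed from the other patches in this series; the actual hunk in 07/12 may differ:

	/* in wake_affine(), before deciding to pull the task to this_cpu */
	if (capacity_of(this_cpu) * sd->imbalance_pct <
	    capacity_orig_of(this_cpu) * 100)
		return 0;	/* this_cpu is busy with rt/irq work, don't wake affine */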

* Re: [PATCH v5 06/12] sched: add per rq cpu_capacity_orig
  2014-09-10 13:53     ` Peter Zijlstra
@ 2014-09-10 14:19       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-10 14:19 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 10 September 2014 15:53, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Aug 26, 2014 at 01:06:49PM +0200, Vincent Guittot wrote:
>> This new field cpu_capacity_orig reflects the available capacity of a CPUs
>> unlike the cpu_capacity which reflects the current capacity that can be altered
>> by frequency and rt tasks.
>>
>
> No real objection to the patch as such, but the Changelog fails to tell
> us why you want this.

cpu_capacity_orig is used in several places in the following
patches to detect when a CPU's capacity has been noticeably reduced, so
we can trigger a load balance to look for a CPU with full capacity.
As an example, we can detect when a CPU handles a significant amount
of irq (with CONFIG_IRQ_TIME_ACCOUNTING) but is seen as idle by the
scheduler whereas really idle CPUs are available.

^ permalink raw reply	[flat|nested] 164+ messages in thread
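
To put a number on "noticeably reduced": the later patches compare capacity_orig against capacity scaled by the domain's imbalance_pct. Taking imbalance_pct = 125 purely as an example value:

	capacity_orig * 100 > capacity * 125
	<=>  capacity < capacity_orig * 100 / 125 = 0.8 * capacity_orig

so the CPU is flagged once more than roughly 20% of its original capacity is eaten by rt or irq activity; a smaller imbalance_pct makes the detection correspondingly more sensitive.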

* Re: [PATCH v5 04/12] sched: Allow all archs to set the capacity_orig
  2014-09-10 13:50       ` Peter Zijlstra
@ 2014-09-10 14:22         ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-10 14:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Preeti U Murthy, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 10 September 2014 15:50, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sat, Aug 30, 2014 at 10:37:40PM +0530, Preeti U Murthy wrote:
>> > -   if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
>> > -           if (sched_feat(ARCH_CAPACITY))
>>
>> Aren't you missing this check above? I understand that it is not
>> crucial, but that would also mean removing ARCH_CAPACITY sched_feat
>> altogether, wouldn't it?
>
> Yes he's missing that, I added the below bit on top.

FWIW, your changes are fine for me

>
> So the argument last time:
> lkml.kernel.org/r/20140709105721.GT19379@twins.programming.kicks-ass.net
> was that you cannot put sched_feat(ARCH_CAPACITY) inside a weak arch_*
> function. The test has to be outside, seeing how it needs to decide to
> call the arch function at all (or revert to the default implementation).
>
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5700,7 +5700,7 @@ unsigned long __weak arch_scale_freq_cap
>         return default_scale_capacity(sd, cpu);
>  }
>
> -unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> +static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>  {
>         if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
>                 return sd->smt_gain / sd->span_weight;
> @@ -5708,6 +5708,11 @@ unsigned long __weak arch_scale_cpu_capa
>         return SCHED_CAPACITY_SCALE;
>  }
>
> +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> +{
> +       return default_scale_cpu_capacity(sd, cpu);
> +}
> +
>  static unsigned long scale_rt_capacity(int cpu)
>  {
>         struct rq *rq = cpu_rq(cpu);
> @@ -5747,7 +5752,10 @@ static void update_cpu_capacity(struct s
>         unsigned long capacity = SCHED_CAPACITY_SCALE;
>         struct sched_group *sdg = sd->groups;
>
> -       capacity *= arch_scale_cpu_capacity(sd, cpu);
> +       if (sched_feat(ARCH_CAPACITY))
> +               capacity *= arch_scale_cpu_capacity(sd, cpu);
> +       else
> +               capacity *= default_scale_cpu_capacity(sd, cpu);
>
>         capacity >>= SCHED_CAPACITY_SHIFT;
>

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-09-03 12:26             ` Preeti U Murthy
@ 2014-09-11  9:27               ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11  9:27 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On Wed, Sep 03, 2014 at 05:56:04PM +0530, Preeti U Murthy wrote:
> On 09/03/2014 05:14 PM, Vincent Guittot wrote:
> > On 3 September 2014 11:11, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> >> On 09/01/2014 02:15 PM, Vincent Guittot wrote:
> >>> On 30 August 2014 19:50, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> >>>> Hi Vincent,
> >>>>> index 18db43e..60ae1ce 100644

Guys, really, trim excess crap. This email had over 150 lines of garbage
and 5 pointless quote levels or so before there was anything useful.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-11 10:07     ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 10:07 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, linux, linux-arm-kernel, riel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

On Tue, Aug 26, 2014 at 01:06:51PM +0200, Vincent Guittot wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 18db43e..60ae1ce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  			return true;
>  	}
>  
> +	/*
> +	 * The group capacity is reduced probably because of activity from other
> +	 * sched class or interrupts which use part of the available capacity
> +	 */
> +	if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
> +				env->sd->imbalance_pct))
> +		return true;
> +
>  	return false;
>  }

This is unlikely to have worked as intended. You will never reach this,
except on PowerPC >= 7. All other machines will have bailed at the
!ASYM_PACKING check above this.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-11 10:13     ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 10:13 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, linux, linux-arm-kernel, riel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

On Tue, Aug 26, 2014 at 01:06:51PM +0200, Vincent Guittot wrote:

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 18db43e..60ae1ce 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>  			return true;
>  	}
>  
> +	/*
> +	 * The group capacity is reduced probably because of activity from other
> +	 * sched class or interrupts which use part of the available capacity
> +	 */
> +	if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
> +				env->sd->imbalance_pct))
> +		return true;
> +
>  	return false;
>  }
>  
> @@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env)
>  	struct sched_domain *sd = env->sd;
>  
>  	if (env->idle == CPU_NEWLY_IDLE) {
> +		int src_cpu = env->src_cpu;
>  
>  		/*
>  		 * ASYM_PACKING needs to force migrate tasks from busy but
>  		 * higher numbered CPUs in order to pack all tasks in the
>  		 * lowest numbered CPUs.
>  		 */
> -		if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
> +		if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
> +			return 1;
> +
> +		/*
> +		 * If the CPUs share their cache and the src_cpu's capacity is
> +		 * reduced because of other sched_class or IRQs, we trig an
> +		 * active balance to move the task
> +		 */
> +		if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
> +				sd->imbalance_pct))
>  			return 1;
>  	}

Should you not also check -- in both cases -- that the destination is
any better?

Also, there's some obvious repetition going on there, maybe add a
helper?

Also, both sites should probably ensure they're operating in the
non-saturated/overloaded scenario. Because as soon as we're completely
saturated we want SMP nice etc. and that all already works right
(presumably).

^ permalink raw reply	[flat|nested] 164+ messages in thread
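
Along the lines of the helper suggested above, one possible shape; the name is made up here, and the comparison is lifted from the quoted need_active_balance() hunk (a group-level variant could do the same with sgc->capacity_orig and group_capacity):

static inline bool capacity_reduced(int cpu, struct sched_domain *sd)
{
	/*
	 * True when rt or irq activity has eaten more of the CPU's
	 * original capacity than the domain's imbalance_pct tolerates.
	 */
	return (capacity_orig_of(cpu) * 100) >
	       (capacity_of(cpu) * sd->imbalance_pct);
}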

* Re: [PATCH v5 04/12] sched: Allow all archs to set the capacity_orig
  2014-09-10 13:50       ` Peter Zijlstra
@ 2014-09-11 10:36         ` Preeti U Murthy
  -1 siblings, 0 replies; 164+ messages in thread
From: Preeti U Murthy @ 2014-09-11 10:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, linux, linux-arm-kernel,
	riel, Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

On 09/10/2014 07:20 PM, Peter Zijlstra wrote:
> On Sat, Aug 30, 2014 at 10:37:40PM +0530, Preeti U Murthy wrote:
>>> -	if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
>>> -		if (sched_feat(ARCH_CAPACITY))
>>
>> Aren't you missing this check above? I understand that it is not
>> crucial, but that would also mean removing ARCH_CAPACITY sched_feat
>> altogether, wouldn't it?
> 
> Yes he's missing that, I added the below bit on top.
> 
> So the argument last time:
> lkml.kernel.org/r/20140709105721.GT19379@twins.programming.kicks-ass.net
> was that you cannot put sched_feat(ARCH_CAPACITY) inside a weak arch_*
> function. The test has to be outside, seeing how it needs to decide to
> call the arch function at all (or revert to the default implementation).
> 
> ---
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5700,7 +5700,7 @@ unsigned long __weak arch_scale_freq_cap
>  	return default_scale_capacity(sd, cpu);
>  }
> 
> -unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> +static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>  {
>  	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
>  		return sd->smt_gain / sd->span_weight;
> @@ -5708,6 +5708,11 @@ unsigned long __weak arch_scale_cpu_capa
>  	return SCHED_CAPACITY_SCALE;
>  }
> 
> +unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> +{
> +	return default_scale_cpu_capacity(sd, cpu);
> +}
> +
>  static unsigned long scale_rt_capacity(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);
> @@ -5747,7 +5752,10 @@ static void update_cpu_capacity(struct s
>  	unsigned long capacity = SCHED_CAPACITY_SCALE;
>  	struct sched_group *sdg = sd->groups;
> 
> -	capacity *= arch_scale_cpu_capacity(sd, cpu);
> +	if (sched_feat(ARCH_CAPACITY))
> +		capacity *= arch_scale_cpu_capacity(sd, cpu);
> +	else
> +		capacity *= default_scale_cpu_capacity(sd, cpu);
> 
>  	capacity >>= SCHED_CAPACITY_SHIFT;
> 
> 
Alright, I see now. Thanks!

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 09/12] sched: add usage_load_avg
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-11 11:17     ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 11:17 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, linux, linux-arm-kernel, riel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

On Tue, Aug 26, 2014 at 01:06:52PM +0200, Vincent Guittot wrote:
> index 5c2c885..7dfd584 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1073,10 +1073,10 @@ struct sched_avg {
>  	 * above by 1024/(1-y).  Thus we only need a u32 to store them for all
>  	 * choices of y < 1-2^(-32)*1024.
>  	 */
> -	u32 runnable_avg_sum, runnable_avg_period;
> +	u32 runnable_avg_sum, runnable_avg_period, running_avg_sum;

Seeing how we use runnable_avg_period for both runnable and running,
does it make sense to remove the runnable part of it from the name?

Also, 4 byte hole here, not sure we've got anything useful to stuff in
it though.

>  	u64 last_runnable_update;
>  	s64 decay_count;
> -	unsigned long load_avg_contrib;
> +	unsigned long load_avg_contrib, usage_avg_contrib;
>  };
>  

Man, I should go look at Yuyang's rewrite of this all again. I just
tried to figure out the decay stuff and my head hurts ;-)

^ permalink raw reply	[flat|nested] 164+ messages in thread
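
The 4-byte hole Peter mentions comes from alignment: the three u32 counters occupy 12 bytes and the following u64 must start on an 8-byte boundary, so 4 bytes of padding are inserted before it on a typical 64-bit ABI. A standalone snippet to see it (simplified field types, not the kernel struct itself):

#include <stdio.h>
#include <stddef.h>
#include <stdint.h>

struct sched_avg_like {
	uint32_t runnable_avg_sum, runnable_avg_period, running_avg_sum;
	uint64_t last_runnable_update;	/* 8-byte alignment -> 4-byte hole before it */
	int64_t decay_count;
	unsigned long load_avg_contrib, usage_avg_contrib;
};

int main(void)
{
	/* prints 16 rather than 12 on common 64-bit ABIs */
	printf("offsetof(last_runnable_update) = %zu\n",
	       offsetof(struct sched_avg_like, last_runnable_update));
	return 0;
}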

* Re: [PATCH v5 09/11] sched: add usage_load_avg
  2014-09-04  7:34     ` Vincent Guittot
@ 2014-09-11 11:17       ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 11:17 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, linux, linux-arm-kernel, riel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

On Thu, Sep 04, 2014 at 09:34:39AM +0200, Vincent Guittot wrote:
> The update of the update_rq_runnable_avg() interface is missing for the
> CONFIG_FAIR_GROUP_SCHED case in the original patch
> "[PATCH v5 09/12] sched: add usage_load_avg"
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
> 
> Hi Peter,
> 
> Do you prefer that I send a new version of the
> "[PATCH v5 09/12] sched: add usage_load_avg" patch with the fix, or is this fix
> enough?

This is fine, I hand edited the previous patch.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-09-11 10:07     ` Peter Zijlstra
@ 2014-09-11 11:20       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-11 11:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 11 September 2014 12:07, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Aug 26, 2014 at 01:06:51PM +0200, Vincent Guittot wrote:
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 18db43e..60ae1ce 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>                       return true;
>>       }
>>
>> +     /*
>> +      * The group capacity is reduced probably because of activity from other
>> +      * sched class or interrupts which use part of the available capacity
>> +      */
>> +     if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
>> +                             env->sd->imbalance_pct))
>> +             return true;
>> +
>>       return false;
>>  }
>
> This is unlikely to have worked as intended. You will never reach this,
> except on PowerPC >= 7. All other machines will have bailed at the
> !ASYM_PACKING check above this.

Ah yes, I missed that change while rebasing on Rik's patches. My use case
falls into this wider test now that we always select a busiest group.

^ permalink raw reply	[flat|nested] 164+ messages in thread
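
A reduced, standalone model of the control flow under discussion (the real
update_sd_pick_busiest() takes scheduler structures; here everything is
trimmed down to the two branches that matter, with illustrative types and
values):

#include <stdbool.h>
#include <stdio.h>

#define SD_ASYM_PACKING 0x1

struct env_model   { unsigned int sd_flags, imbalance_pct; };
struct stats_model { unsigned long capacity_orig, capacity; };

static bool pick_busiest_model(const struct env_model *env,
                               const struct stats_model *sgs)
{
        /* ... group_type / avg_load comparisons elided ... */

        if (!(env->sd_flags & SD_ASYM_PACKING))
                return true;    /* every non-ASYM_PACKING machine exits here */

        /* ... ASYM_PACKING handling elided ... */

        /*
         * The check added by the patch sits here, after the early return
         * above, so it can only ever run on ASYM_PACKING domains
         * (POWER7 and later), which is Peter's point.
         */
        if (sgs->capacity_orig * 100 > sgs->capacity * env->imbalance_pct)
                return true;

        return false;
}

int main(void)
{
        struct env_model e   = { .sd_flags = 0, .imbalance_pct = 125 };
        struct stats_model s = { .capacity_orig = 1024, .capacity = 700 };

        /* Returns true at the ASYM_PACKING test; the capacity check is dead. */
        printf("%d\n", pick_busiest_model(&e, &s));
        return 0;
}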

* [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
@ 2014-09-11 11:20       ` Vincent Guittot
  0 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-11 11:20 UTC (permalink / raw)
  To: linux-arm-kernel

On 11 September 2014 12:07, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Aug 26, 2014 at 01:06:51PM +0200, Vincent Guittot wrote:
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 18db43e..60ae1ce 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>                       return true;
>>       }
>>
>> +     /*
>> +      * The group capacity is reduced probably because of activity from other
>> +      * sched class or interrupts which use part of the available capacity
>> +      */
>> +     if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
>> +                             env->sd->imbalance_pct))
>> +             return true;
>> +
>>       return false;
>>  }
>
> This is unlikely to have worked as intended. You will never reach this,
> except on PowerPC >= 7. All other machines will have bailed at the
> !ASYM_PACKING check above this.

Ah yes, I missed that change while rebasing on Rik's patches. My use case
falls into this wider test now that we always select a busiest group.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-11 11:54     ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 11:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, linux, linux-arm-kernel, riel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

On Tue, Aug 26, 2014 at 01:06:51PM +0200, Vincent Guittot wrote:
> +		if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
> +				sd->imbalance_pct))

Note that capacity_orig_of() is introduced a lot later on in the series.
Patch 10 iirc, so this will not actually compile.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
@ 2014-09-11 11:54     ` Peter Zijlstra
  0 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 11:54 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Aug 26, 2014 at 01:06:51PM +0200, Vincent Guittot wrote:
> +		if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
> +				sd->imbalance_pct))

Note that capacity_orig_of() is introduced a lot later on in the series.
Patch 10 iirc, so this will not actually compile.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
  2014-09-11 10:13     ` Peter Zijlstra
@ 2014-09-11 12:14       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-11 12:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 11 September 2014 12:13, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Aug 26, 2014 at 01:06:51PM +0200, Vincent Guittot wrote:
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 18db43e..60ae1ce 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>                       return true;
>>       }
>>
>> +     /*
>> +      * The group capacity is reduced probably because of activity from other
>> +      * sched class or interrupts which use part of the available capacity
>> +      */
>> +     if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
>> +                             env->sd->imbalance_pct))
>> +             return true;
>> +
>>       return false;
>>  }
>>
>> @@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env)
>>       struct sched_domain *sd = env->sd;
>>
>>       if (env->idle == CPU_NEWLY_IDLE) {
>> +             int src_cpu = env->src_cpu;
>>
>>               /*
>>                * ASYM_PACKING needs to force migrate tasks from busy but
>>                * higher numbered CPUs in order to pack all tasks in the
>>                * lowest numbered CPUs.
>>                */
>> -             if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
>> +             if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
>> +                     return 1;
>> +
>> +             /*
>> +              * If the CPUs share their cache and the src_cpu's capacity is
>> +              * reduced because of other sched_class or IRQs, we trig an
>> +              * active balance to move the task
>> +              */
>> +             if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
>> +                             sd->imbalance_pct))
>>                       return 1;
>>       }
>
> Should you not also check -- in both cases -- that the destination is
> any better?

That case should have been handled earlier, when calculating the
imbalance, which should be null if the destination is worse than the
source.

But I haven't formally checked that calculate_imbalance correctly
handles that case.

>
> Also, there's some obvious repetition going on there, maybe add a
> helper?

yes

>
> Also, both sites should probably ensure they're operating in the
> non-saturated/overloaded scenario. Because as soon as we're completely
> saturated we want SMP nice etc. and that all already works right
> (presumably).

If both are overloaded, calculate_imbalance will cap the max load
that can be pulled, so the busiest_group will not become idle.

^ permalink raw reply	[flat|nested] 164+ messages in thread
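
One possible shape for the helper Peter asks for, as a standalone sketch (the
name and types are illustrative, not what was eventually merged): the test
fires once the current capacity has dropped below capacity_orig * 100 /
imbalance_pct, i.e. by more than roughly 20% for a typical imbalance_pct of
125.

#include <stdbool.h>
#include <stdio.h>

struct cpu_cap { unsigned long capacity, capacity_orig; };

/* True when RT/deadline tasks or IRQ time have eaten a significant part of
 * the CPU's original capacity. */
static bool capacity_reduced(const struct cpu_cap *c, unsigned int imbalance_pct)
{
        return c->capacity_orig * 100 > c->capacity * imbalance_pct;
}

int main(void)
{
        struct cpu_cap irq_loaded = { .capacity =  800, .capacity_orig = 1024 };
        struct cpu_cap quiet      = { .capacity = 1020, .capacity_orig = 1024 };

        /* 102400 > 100000: capacity is considered significantly reduced. */
        printf("irq_loaded: %d\n", capacity_reduced(&irq_loaded, 125));
        /* 102400 > 127500 is false: nothing to do. */
        printf("quiet:      %d\n", capacity_reduced(&quiet, 125));
        return 0;
}

Both call sites in the quoted hunks could then use the same predicate, once
on a single CPU and once on a group's aggregate numbers.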

* [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity
@ 2014-09-11 12:14       ` Vincent Guittot
  0 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-11 12:14 UTC (permalink / raw)
  To: linux-arm-kernel

On 11 September 2014 12:13, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Aug 26, 2014 at 01:06:51PM +0200, Vincent Guittot wrote:
>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 18db43e..60ae1ce 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6049,6 +6049,14 @@ static bool update_sd_pick_busiest(struct lb_env *env,
>>                       return true;
>>       }
>>
>> +     /*
>> +      * The group capacity is reduced probably because of activity from other
>> +      * sched class or interrupts which use part of the available capacity
>> +      */
>> +     if ((sg->sgc->capacity_orig * 100) > (sgs->group_capacity *
>> +                             env->sd->imbalance_pct))
>> +             return true;
>> +
>>       return false;
>>  }
>>
>> @@ -6534,13 +6542,23 @@ static int need_active_balance(struct lb_env *env)
>>       struct sched_domain *sd = env->sd;
>>
>>       if (env->idle == CPU_NEWLY_IDLE) {
>> +             int src_cpu = env->src_cpu;
>>
>>               /*
>>                * ASYM_PACKING needs to force migrate tasks from busy but
>>                * higher numbered CPUs in order to pack all tasks in the
>>                * lowest numbered CPUs.
>>                */
>> -             if ((sd->flags & SD_ASYM_PACKING) && env->src_cpu > env->dst_cpu)
>> +             if ((sd->flags & SD_ASYM_PACKING) && src_cpu > env->dst_cpu)
>> +                     return 1;
>> +
>> +             /*
>> +              * If the CPUs share their cache and the src_cpu's capacity is
>> +              * reduced because of other sched_class or IRQs, we trig an
>> +              * active balance to move the task
>> +              */
>> +             if ((capacity_orig_of(src_cpu) * 100) > (capacity_of(src_cpu) *
>> +                             sd->imbalance_pct))
>>                       return 1;
>>       }
>
> Should you not also check -- in both cases -- that the destination is
> any better?

That case should have been handled earlier, when calculating the
imbalance, which should be null if the destination is worse than the
source.

But I haven't formally checked that calculate_imbalance correctly
handles that case.

>
> Also, there's some obvious repetition going on there, maybe add a
> helper?

yes

>
> Also, both sites should probably ensure they're operating in the
> non-saturated/overloaded scenario. Because as soon as we're completely
> saturated we want SMP nice etc. and that all already works right
> (presumably).

If both are overloaded, calculate_imbalance will cap the max load
that can be pulled, so the busiest_group will not become idle.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 09/12] sched: add usage_load_avg
  2014-09-11 11:17     ` Peter Zijlstra
@ 2014-09-11 12:18       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-11 12:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 11 September 2014 13:17, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Aug 26, 2014 at 01:06:52PM +0200, Vincent Guittot wrote:
>> index 5c2c885..7dfd584 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1073,10 +1073,10 @@ struct sched_avg {
>>        * above by 1024/(1-y).  Thus we only need a u32 to store them for all
>>        * choices of y < 1-2^(-32)*1024.
>>        */
>> -     u32 runnable_avg_sum, runnable_avg_period;
>> +     u32 runnable_avg_sum, runnable_avg_period, running_avg_sum;
>
> Seeing how we use runnable_avg_period for both runnable and running,
> does it make sense to remove the runnable part of it from the name?

That's right.

>
> Also, 4 byte hole here, not sure we've got anything useful to stuff in
> it though.

I can move all the u32 declarations to the end of the struct, unless they
were put before the u64 fields for a good reason.

>
>>       u64 last_runnable_update;
>>       s64 decay_count;
>> -     unsigned long load_avg_contrib;
>> +     unsigned long load_avg_contrib, usage_avg_contrib;
>>  };
>>
>
> Man, I should go look at Yuyang's rewrite of this all again. I just
> tried to figure out the decay stuff and my head hurts ;-)

^ permalink raw reply	[flat|nested] 164+ messages in thread
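
For the "decay stuff" referenced in the quoted comment: each period
contributes at most 1024, older periods are decayed geometrically by a
factor y chosen so that y^32 = 1/2, and the sums are therefore bounded by
the geometric series 1024/(1-y). A quick standalone check of that bound
(build with -lm):

#include <math.h>
#include <stdio.h>

int main(void)
{
        double y = pow(0.5, 1.0 / 32.0);        /* y^32 == 0.5 */
        double bound = 1024.0 / (1.0 - y);

        printf("y          = %f\n", y);         /* ~0.9786 */
        printf("1024/(1-y) = %.0f\n", bound);   /* ~47788, well below 2^32 */
        return 0;
}

So the quoted comment's claim holds: a u32 is ample for runnable_avg_sum,
the period counter and the new running_avg_sum.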

* [PATCH v5 09/12] sched: add usage_load_avg
@ 2014-09-11 12:18       ` Vincent Guittot
  0 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-11 12:18 UTC (permalink / raw)
  To: linux-arm-kernel

On 11 September 2014 13:17, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Aug 26, 2014 at 01:06:52PM +0200, Vincent Guittot wrote:
>> index 5c2c885..7dfd584 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -1073,10 +1073,10 @@ struct sched_avg {
>>        * above by 1024/(1-y).  Thus we only need a u32 to store them for all
>>        * choices of y < 1-2^(-32)*1024.
>>        */
>> -     u32 runnable_avg_sum, runnable_avg_period;
>> +     u32 runnable_avg_sum, runnable_avg_period, running_avg_sum;
>
> Seeing how we use runnable_avg_period for both runnable and running,
> does it make sense to remove the runnable part of it from the name?

That's right.

>
> Also, 4 byte hole here, not sure we've got anything useful to stuff in
> it though.

I can move all the u32 declarations to the end of the struct, unless they
were put before the u64 fields for a good reason.

>
>>       u64 last_runnable_update;
>>       s64 decay_count;
>> -     unsigned long load_avg_contrib;
>> +     unsigned long load_avg_contrib, usage_avg_contrib;
>>  };
>>
>
> Man, I should go look at Yuyang's rewrite of this all again. I just
> tried to figure out the decay stuff and my head hurts ;-)

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 09/12] sched: add usage_load_avg
  2014-09-11 11:17     ` Peter Zijlstra
@ 2014-09-11 12:20       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-11 12:20 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 11 September 2014 13:17, Peter Zijlstra <peterz@infradead.org> wrote:

>>  };
>>
>
> Man, I should go look at Yuyang's rewrite of this all again. I just
> tried to figure out the decay stuff and my head hurts ;-)

Regarding Yuyang's rewrite, I had a patch on top of his series to add the
running figure, which I haven't sent yet.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* [PATCH v5 09/12] sched: add usage_load_avg
@ 2014-09-11 12:20       ` Vincent Guittot
  0 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-11 12:20 UTC (permalink / raw)
  To: linux-arm-kernel

On 11 September 2014 13:17, Peter Zijlstra <peterz@infradead.org> wrote:

>>  };
>>
>
> Man, I should go look at Yuyang's rewrite of this all again. I just
> tried to figure out the decay stuff and my head hurts ;-)

Regarding Yuyang's rewrite, I had a patch on top of his series to add the
running figure, which I haven't sent yet.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 10/12] sched: get CPU's utilization statistic
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-11 12:34     ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 12:34 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, linux, linux-arm-kernel, riel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

On Tue, Aug 26, 2014 at 01:06:53PM +0200, Vincent Guittot wrote:
> Monitor the utilization level of each group of each sched_domain level. The
> utilization is the amount of cpu_capacity that is currently used on a CPU or
> group of CPUs. We use the usage_load_avg to evaluate this utilization level.
> In the special use case where the CPU is fully loaded by more than 1 task,
> the activity level is set above the cpu_capacity in order to reflect the overload
> of the CPU
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1fd2131..2f95d1c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4126,6 +4126,11 @@ static unsigned long capacity_of(int cpu)
>  	return cpu_rq(cpu)->cpu_capacity;
>  }
>  
> +static unsigned long capacity_orig_of(int cpu)
> +{
> +	return cpu_rq(cpu)->cpu_capacity_orig;
> +}
> +
>  static unsigned long cpu_avg_load_per_task(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);

This hunk should probably go into patch 6.

> @@ -4514,6 +4519,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  	return target;
>  }
>  
> +static int get_cpu_utilization(int cpu)
> +{
> +	unsigned long usage = cpu_rq(cpu)->cfs.usage_load_avg;
> +	unsigned long capacity = capacity_of(cpu);
> +
> +	if (usage >= SCHED_LOAD_SCALE)
> +		return capacity + 1;
> +
> +	return (usage * capacity) >> SCHED_LOAD_SHIFT;
> +}

So if I understood patch 9 correct, your changelog is iffy.
usage_load_avg should never get > 1 (of whatever unit), no matter how
many tasks are on the rq. You can only maximally run all the time.

Therefore I can only interpret the if (usage >= SCHED_LOAD_SCALE) as
numerical error handling, nothing more.

Also I'm not entirely sure I like the usage, utilization names/metrics.
I would suggest to reverse them. Call the pure running number
'utilization' and this scaled with capacity 'usage' or so.

> @@ -6188,7 +6206,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		/* Now, start updating sd_lb_stats */
>  		sds->total_load += sgs->group_load;
>  		sds->total_capacity += sgs->group_capacity;
> -
>  		sg = sg->next;
>  	} while (sg != env->sd->groups);
>  

I like that extra line of whitespace, it separates the body from the
loop itself.

^ permalink raw reply	[flat|nested] 164+ messages in thread
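
A standalone sketch of the arithmetic in get_cpu_utilization(), assuming the
usual SCHED_LOAD_SCALE of 1024 (names and numbers here are illustrative). As
noted above, the early return is only there to keep a transient numerical
overshoot of the average from producing a nonsense result:

#include <stdio.h>

#define LOAD_SHIFT 10
#define LOAD_SCALE (1UL << LOAD_SHIFT)

static unsigned long cpu_usage_sketch(unsigned long usage_avg,
                                      unsigned long capacity)
{
        if (usage_avg >= LOAD_SCALE)
                return capacity + 1;    /* flag the CPU as fully used */

        return (usage_avg * capacity) >> LOAD_SHIFT;
}

int main(void)
{
        /* A full-size core at half utilization vs a small core nearly full. */
        printf("%lu\n", cpu_usage_sketch(512, 1024));   /* 512 */
        printf("%lu\n", cpu_usage_sketch(1000, 430));   /* 419 */
        printf("%lu\n", cpu_usage_sketch(1100, 430));   /* 431 (clamped) */
        return 0;
}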

* [PATCH v5 10/12] sched: get CPU's utilization statistic
@ 2014-09-11 12:34     ` Peter Zijlstra
  0 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 12:34 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Aug 26, 2014 at 01:06:53PM +0200, Vincent Guittot wrote:
> Monitor the utilization level of each group of each sched_domain level. The
> utilization is the amount of cpu_capacity that is currently used on a CPU or
> group of CPUs. We use the usage_load_avg to evaluate this utilization level.
> In the special use case where the CPU is fully loaded by more than 1 task,
> the activity level is set above the cpu_capacity in order to reflect the overload
> of the CPU
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/fair.c | 19 ++++++++++++++++++-
>  1 file changed, 18 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1fd2131..2f95d1c 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4126,6 +4126,11 @@ static unsigned long capacity_of(int cpu)
>  	return cpu_rq(cpu)->cpu_capacity;
>  }
>  
> +static unsigned long capacity_orig_of(int cpu)
> +{
> +	return cpu_rq(cpu)->cpu_capacity_orig;
> +}
> +
>  static unsigned long cpu_avg_load_per_task(int cpu)
>  {
>  	struct rq *rq = cpu_rq(cpu);

This hunk should probably go into patch 6.

> @@ -4514,6 +4519,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  	return target;
>  }
>  
> +static int get_cpu_utilization(int cpu)
> +{
> +	unsigned long usage = cpu_rq(cpu)->cfs.usage_load_avg;
> +	unsigned long capacity = capacity_of(cpu);
> +
> +	if (usage >= SCHED_LOAD_SCALE)
> +		return capacity + 1;
> +
> +	return (usage * capacity) >> SCHED_LOAD_SHIFT;
> +}

So if I understood patch 9 correct, your changelog is iffy.
usage_load_avg should never get > 1 (of whatever unit), no matter how
many tasks are on the rq. You can only maximally run all the time.

Therefore I can only interpret the if (usage >= SCHED_LOAD_SCALE) as
numerical error handling, nothing more.

Also I'm not entirely sure I like the usage, utilization names/metrics.
I would suggest to reverse them. Call the pure running number
'utilization' and this scaled with capacity 'usage' or so.

> @@ -6188,7 +6206,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		/* Now, start updating sd_lb_stats */
>  		sds->total_load += sgs->group_load;
>  		sds->total_capacity += sgs->group_capacity;
> -
>  		sg = sg->next;
>  	} while (sg != env->sd->groups);
>  

I like that extra line of whitespace, it separates the body from the
loop itself.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 10/12] sched: get CPU's utilization statistic
  2014-09-11 12:34     ` Peter Zijlstra
@ 2014-09-11 13:07       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-11 13:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 11 September 2014 14:34, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Aug 26, 2014 at 01:06:53PM +0200, Vincent Guittot wrote:
>> Monitor the utilization level of each group of each sched_domain level. The
>> utilization is the amount of cpu_capacity that is currently used on a CPU or
>> group of CPUs. We use the usage_load_avg to evaluate this utilization level.
>> In the special use case where the CPU is fully loaded by more than 1 task,
>> the activity level is set above the cpu_capacity in order to reflect the overload
>> of the CPU
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 19 ++++++++++++++++++-
>>  1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1fd2131..2f95d1c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4126,6 +4126,11 @@ static unsigned long capacity_of(int cpu)
>>       return cpu_rq(cpu)->cpu_capacity;
>>  }
>>
>> +static unsigned long capacity_orig_of(int cpu)
>> +{
>> +     return cpu_rq(cpu)->cpu_capacity_orig;
>> +}
>> +
>>  static unsigned long cpu_avg_load_per_task(int cpu)
>>  {
>>       struct rq *rq = cpu_rq(cpu);
>
> This hunk should probably go into patch 6.
>
>> @@ -4514,6 +4519,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
>>       return target;
>>  }
>>
>> +static int get_cpu_utilization(int cpu)
>> +{
>> +     unsigned long usage = cpu_rq(cpu)->cfs.usage_load_avg;
>> +     unsigned long capacity = capacity_of(cpu);
>> +
>> +     if (usage >= SCHED_LOAD_SCALE)
>> +             return capacity + 1;
>> +
>> +     return (usage * capacity) >> SCHED_LOAD_SHIFT;
>> +}
>
> So if I understood patch 9 correct, your changelog is iffy.
> usage_load_avg should never get > 1 (of whatever unit), no matter how
> many tasks are on the rq. You can only maximally run all the time.
>
> Therefore I can only interpret the if (usage >= SCHED_LOAD_SCALE) as
> numerical error handling, nothing more.

yes

>
> Also I'm not entirely sure I like the usage, utilization names/metrics.
> I would suggest to reverse them. Call the pure running number
> 'utilization' and this scaled with capacity 'usage' or so.

ok. i can invert 'usage' and 'utilization', which will give

s/get_cpu_utilization/get_cpu_usage/
s/sgs->group_utilization/sgs->group_usage/
s/cfs.usage_load_avg/cfs.utilization_load_avg/
s/se->avg.usage_avg_contrib/se->avg.utilization_avg_contrib
s/__update_task_entity_usage/__update_task_entity_utilization
s/__update_entity_usage_avg_contrib/__update_entity_utilization_avg_contrib

>
>> @@ -6188,7 +6206,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>               /* Now, start updating sd_lb_stats */
>>               sds->total_load += sgs->group_load;
>>               sds->total_capacity += sgs->group_capacity;
>> -
>>               sg = sg->next;
>>       } while (sg != env->sd->groups);
>>
>
> I like that extra line of whitespace, it separates the body from the
> loop itself.

Sorry, I don't know why I removed it.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* [PATCH v5 10/12] sched: get CPU's utilization statistic
@ 2014-09-11 13:07       ` Vincent Guittot
  0 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-11 13:07 UTC (permalink / raw)
  To: linux-arm-kernel

On 11 September 2014 14:34, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Aug 26, 2014 at 01:06:53PM +0200, Vincent Guittot wrote:
>> Monitor the utilization level of each group of each sched_domain level. The
>> utilization is the amount of cpu_capacity that is currently used on a CPU or
>> group of CPUs. We use the usage_load_avg to evaluate this utilization level.
>> In the special use case where the CPU is fully loaded by more than 1 task,
>> the activity level is set above the cpu_capacity in order to reflect the overload
>> of the CPU
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/fair.c | 19 ++++++++++++++++++-
>>  1 file changed, 18 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1fd2131..2f95d1c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4126,6 +4126,11 @@ static unsigned long capacity_of(int cpu)
>>       return cpu_rq(cpu)->cpu_capacity;
>>  }
>>
>> +static unsigned long capacity_orig_of(int cpu)
>> +{
>> +     return cpu_rq(cpu)->cpu_capacity_orig;
>> +}
>> +
>>  static unsigned long cpu_avg_load_per_task(int cpu)
>>  {
>>       struct rq *rq = cpu_rq(cpu);
>
> This hunk should probably go into patch 6.
>
>> @@ -4514,6 +4519,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
>>       return target;
>>  }
>>
>> +static int get_cpu_utilization(int cpu)
>> +{
>> +     unsigned long usage = cpu_rq(cpu)->cfs.usage_load_avg;
>> +     unsigned long capacity = capacity_of(cpu);
>> +
>> +     if (usage >= SCHED_LOAD_SCALE)
>> +             return capacity + 1;
>> +
>> +     return (usage * capacity) >> SCHED_LOAD_SHIFT;
>> +}
>
> So if I understood patch 9 correct, your changelog is iffy.
> usage_load_avg should never get > 1 (of whatever unit), no matter how
> many tasks are on the rq. You can only maximally run all the time.
>
> Therefore I can only interpret the if (usage >= SCHED_LOAD_SCALE) as
> numerical error handling, nothing more.

yes

>
> Also I'm not entirely sure I like the usage, utilization names/metrics.
> I would suggest to reverse them. Call the pure running number
> 'utilization' and this scaled with capacity 'usage' or so.

ok. i can invert 'usage' and 'utilization', which will give

s/get_cpu_utilization/get_cpu_usage/
s/sgs->group_utilization/sgs->group_usage/
s/cfs.usage_load_avg/cfs.utilization_load_avg/
s/se->avg.usage_avg_contrib/se->avg.utilization_avg_contrib
s/__update_task_entity_usage/__update_task_entity_utilization
s/__update_entity_usage_avg_contrib/__update_entity_utilization_avg_contrib

>
>> @@ -6188,7 +6206,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>>               /* Now, start updating sd_lb_stats */
>>               sds->total_load += sgs->group_load;
>>               sds->total_capacity += sgs->group_capacity;
>> -
>>               sg = sg->next;
>>       } while (sg != env->sd->groups);
>>
>
> I like that extra line of whitespace, it separates the body from the
> loop itself.

Sorry, I don't know why I removed it.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 10/12] sched: get CPU's utilization statistic
  2014-09-11 13:07       ` Vincent Guittot
@ 2014-09-11 14:04         ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 14:04 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On Thu, Sep 11, 2014 at 03:07:52PM +0200, Vincent Guittot wrote:
> > Also I'm not entirely sure I like the usage, utilization names/metrics.
> > I would suggest to reverse them. Call the pure running number
> > 'utilization' and this scaled with capacity 'usage' or so.
> 
> ok. i can invert 'usage' and 'utilization', which will give
> 
> s/get_cpu_utilization/get_cpu_usage/
> s/sgs->group_utilization/sgs->group_usage/
> s/cfs.usage_load_avg/cfs.utilization_load_avg/
> s/se->avg.usage_avg_contrib/se->avg.utilization_avg_contrib
> s/__update_task_entity_usage/__update_task_entity_utilization
> s/__update_entity_usage_avg_contrib/__update_entity_utilization_avg_contrib
> 

Any other opinions before Vince goes and applies sed on patches? ;-)

^ permalink raw reply	[flat|nested] 164+ messages in thread

* [PATCH v5 10/12] sched: get CPU's utilization statistic
@ 2014-09-11 14:04         ` Peter Zijlstra
  0 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 14:04 UTC (permalink / raw)
  To: linux-arm-kernel

On Thu, Sep 11, 2014 at 03:07:52PM +0200, Vincent Guittot wrote:
> > Also I'm not entirely sure I like the usage, utilization names/metrics.
> > I would suggest to reverse them. Call the pure running number
> > 'utilization' and this scaled with capacity 'usage' or so.
> 
> ok. i can invert 'usage' and 'utilization', which will give
> 
> s/get_cpu_utilization/get_cpu_usage/
> s/sgs->group_utilization/sgs->group_usage/
> s/cfs.usage_load_avg/cfs.utilization_load_avg/
> s/se->avg.usage_avg_contrib/se->avg.utilization_avg_contrib
> s/__update_task_entity_usage/__update_task_entity_utilization
> s/__update_entity_usage_avg_contrib/__update_entity_utilization_avg_contrib
> 

Any other opinions before Vince goes and applies sed on patches? ;-)

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-11 15:39     ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 15:39 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, linux, linux-arm-kernel, riel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
> The scheduler tries to compute how many tasks a group of CPUs can handle by
> assuming that a task's load is SCHED_LOAD_SCALE and a CPU capacity is
> SCHED_CAPACITY_SCALE.
> We can now have a better idea of the capacity of a group of CPUs and of the
> utilization of this group thanks to the rework of group_capacity_orig and the
> group_utilization. We can now deduce how much capacity is still available.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---

A few minor changes I did while going through it.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5669,7 +5669,7 @@ struct sg_lb_stats {
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_out_of_capacity;
+	int group_no_capacity;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -5931,37 +5931,37 @@ static inline int sg_imbalanced(struct s
 	return group->sgc->imbalance;
 }
 
-static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
-			struct lb_env *env)
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 {
 	if ((sgs->group_capacity_orig * 100) >
 			(sgs->group_utilization * env->sd->imbalance_pct))
-		return 1;
+		return true;
 
 	if (sgs->sum_nr_running < sgs->group_weight)
-		return 1;
+		return true;
 
-	return 0;
+	return false;
 }
 
-static inline int group_is_overloaded(struct sg_lb_stats *sgs,
-			struct lb_env *env)
+static inline bool
+group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 {
 	if (sgs->sum_nr_running <= sgs->group_weight)
-		return 0;
+		return false;
 
 	if ((sgs->group_capacity_orig * 100) <
 			(sgs->group_utilization * env->sd->imbalance_pct))
-		return 1;
+		return true;
 
-	return 0;
+	return false;
 }
 
-static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs,
-			struct lb_env *env)
+static enum group_type group_classify(struct lb_env *env,
+				      struct sched_group *group,
+				      struct sg_lb_stats *sgs)
 {
-	if (group_is_overloaded(sgs, env))
+	if (group_is_overloaded(env, sgs))
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
@@ -6024,9 +6024,8 @@ static inline void update_sg_lb_stats(st
 
 	sgs->group_weight = group->group_weight;
 
-	sgs->group_type = group_classify(group, sgs, env);
-
-	sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
+	sgs->group_type = group_classify(env, group, sgs);
+	sgs->group_no_capacity = group_is_overloaded(env, sgs);
 }
 
 /**
@@ -6157,9 +6156,9 @@ static inline void update_sd_lb_stats(st
 		 * with a large weight task outweighs the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    group_has_free_capacity(&sds->local_stat, env)) {
+		    group_has_capacity(env, &sds->local_stat)) {
 			if (sgs->sum_nr_running > 1)
-				sgs->group_out_of_capacity = 1;
+				sgs->group_no_capacity = 1;
 			sgs->group_capacity = min(sgs->group_capacity,
 						SCHED_CAPACITY_SCALE);
 		}
@@ -6430,9 +6429,8 @@ static struct sched_group *find_busiest_
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE &&
-			group_has_free_capacity(local, env) &&
-			busiest->group_out_of_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
+	    busiest->group_no_capacity)
 		goto force_balance;
 
 	/*

^ permalink raw reply	[flat|nested] 164+ messages in thread
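
A small standalone check, using the logic from the diff above with made-up
numbers, that group_has_capacity() and group_is_overloaded() are not simple
complements: a group with exactly as many running tasks as CPUs and a
utilization past the imbalance_pct margin is reported as neither.

#include <stdbool.h>
#include <stdio.h>

struct stats {
        unsigned long capacity_orig, utilization;
        unsigned int nr_running, weight;
};

/* Same tests as in the diff, with imbalance_pct passed directly. */
static bool has_capacity(const struct stats *s, unsigned int imb)
{
        if (s->capacity_orig * 100 > s->utilization * imb)
                return true;
        if (s->nr_running < s->weight)
                return true;
        return false;
}

static bool is_overloaded(const struct stats *s, unsigned int imb)
{
        if (s->nr_running <= s->weight)
                return false;
        if (s->capacity_orig * 100 < s->utilization * imb)
                return true;
        return false;
}

int main(void)
{
        /* 4 CPUs, 4 running tasks, utilization high enough that the
         * imbalance_pct margin is used up. */
        struct stats s = {
                .capacity_orig = 4096, .utilization = 4000,
                .nr_running = 4, .weight = 4,
        };

        printf("has_capacity:  %d\n", has_capacity(&s, 115));   /* 0 */
        printf("is_overloaded: %d\n", is_overloaded(&s, 115));  /* 0 */
        return 0;
}

The two predicates thus leave room for a middle state: busy enough that
group_has_capacity() is false, but not pushed into group_overloaded.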

* [PATCH v5 11/12] sched: replace capacity_factor by utilization
@ 2014-09-11 15:39     ` Peter Zijlstra
  0 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 15:39 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
> The scheduler tries to compute how many tasks a group of CPUs can handle by
> assuming that a task's load is SCHED_LOAD_SCALE and a CPU capacity is
> SCHED_CAPACITY_SCALE.
> We can now have a better idea of the capacity of a group of CPUs and of the
> utilization of this group thanks to the rework of group_capacity_orig and the
> group_utilization. We can now deduce how much capacity is still available.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---

A few minor changes I did while going through it.

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5669,7 +5669,7 @@ struct sg_lb_stats {
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_out_of_capacity;
+	int group_no_capacity;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -5931,37 +5931,37 @@ static inline int sg_imbalanced(struct s
 	return group->sgc->imbalance;
 }
 
-static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
-			struct lb_env *env)
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 {
 	if ((sgs->group_capacity_orig * 100) >
 			(sgs->group_utilization * env->sd->imbalance_pct))
-		return 1;
+		return true;
 
 	if (sgs->sum_nr_running < sgs->group_weight)
-		return 1;
+		return true;
 
-	return 0;
+	return false;
 }
 
-static inline int group_is_overloaded(struct sg_lb_stats *sgs,
-			struct lb_env *env)
+static inline bool
+group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 {
 	if (sgs->sum_nr_running <= sgs->group_weight)
-		return 0;
+		return false;
 
 	if ((sgs->group_capacity_orig * 100) <
 			(sgs->group_utilization * env->sd->imbalance_pct))
-		return 1;
+		return true;
 
-	return 0;
+	return false;
 }
 
-static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs,
-			struct lb_env *env)
+static enum group_type group_classify(struct lb_env *env,
+				      struct sched_group *group,
+				      struct sg_lb_stats *sgs)
 {
-	if (group_is_overloaded(sgs, env))
+	if (group_is_overloaded(env, sgs))
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
@@ -6024,9 +6024,8 @@ static inline void update_sg_lb_stats(st
 
 	sgs->group_weight = group->group_weight;
 
-	sgs->group_type = group_classify(group, sgs, env);
-
-	sgs->group_out_of_capacity = group_is_overloaded(sgs, env);
+	sgs->group_type = group_classify(env, group, sgs);
+	sgs->group_no_capacity = group_is_overloaded(env, sgs);
 }
 
 /**
@@ -6157,9 +6156,9 @@ static inline void update_sd_lb_stats(st
 		 * with a large weight task outweighs the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    group_has_free_capacity(&sds->local_stat, env)) {
+		    group_has_capacity(env, &sds->local_stat)) {
 			if (sgs->sum_nr_running > 1)
-				sgs->group_out_of_capacity = 1;
+				sgs->group_no_capacity = 1;
 			sgs->group_capacity = min(sgs->group_capacity,
 						SCHED_CAPACITY_SCALE);
 		}
@@ -6430,9 +6429,8 @@ static struct sched_group *find_busiest_
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE &&
-			group_has_free_capacity(local, env) &&
-			busiest->group_out_of_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
+	    busiest->group_no_capacity)
 		goto force_balance;
 
 	/*

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-11 16:15     ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 16:15 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: mingo, linux-kernel, preeti, linux, linux-arm-kernel, riel,
	Morten.Rasmussen, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, dietmar.eggemann

On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
> +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
> +			struct lb_env *env)
>  {
> +	if ((sgs->group_capacity_orig * 100) >
> +			(sgs->group_utilization * env->sd->imbalance_pct))
> +		return 1;
> +
> +	if (sgs->sum_nr_running < sgs->group_weight)
> +		return 1;
>  
> +	return 0;
> +}
>  
> +static inline int group_is_overloaded(struct sg_lb_stats *sgs,
> +			struct lb_env *env)
> +{
> +	if (sgs->sum_nr_running <= sgs->group_weight)
> +		return 0;
>  
> +	if ((sgs->group_capacity_orig * 100) <
> +			(sgs->group_utilization * env->sd->imbalance_pct))
> +		return 1;
>  
> +	return 0;
>  }

I'm confused about the utilization vs capacity_orig. I see how we should
maybe scale things with the capacity when comparing between CPUs/groups,
but not on the same CPU/group.

I would have expected something simple like:

static inline bool group_has_capacity()
{
	/* Is there a spare cycle? */
	if (sgs->group_utilization < sgs->group_weight * SCHED_LOAD_SCALE)
		return true;

	/* Are there less tasks than logical CPUs? */
	if (sgs->sum_nr_running < sgs->group_weight)
		return true;

	return false;
}

Where group_utilization is a pure sum of running_avg.

Now this has a hole when there are RT tasks on the system, in that case
the utilization will never hit 1, but we could fix that another way. I
don't think the capacity_orig thing is right.

^ permalink raw reply	[flat|nested] 164+ messages in thread
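
A self-contained version of the variant sketched above, assuming
SCHED_LOAD_SCALE = 1024 (an illustration, not a drop-in kernel change):

#include <stdbool.h>
#include <stdio.h>

#define SCHED_LOAD_SCALE 1024UL

struct stats {
        unsigned long group_utilization;        /* pure sum of running_avg */
        unsigned int sum_nr_running, group_weight;
};

static bool group_has_capacity(const struct stats *sgs)
{
        /* Is there a spare cycle? */
        if (sgs->group_utilization < sgs->group_weight * SCHED_LOAD_SCALE)
                return true;

        /* Are there fewer tasks than logical CPUs? */
        if (sgs->sum_nr_running < sgs->group_weight)
                return true;

        return false;
}

int main(void)
{
        struct stats half_busy = { .group_utilization = 2048,
                                   .sum_nr_running = 2, .group_weight = 4 };
        struct stats saturated = { .group_utilization = 4096,
                                   .sum_nr_running = 5, .group_weight = 4 };

        printf("half_busy: %d\n", group_has_capacity(&half_busy));     /* 1 */
        printf("saturated: %d\n", group_has_capacity(&saturated));     /* 0 */
        return 0;
}

The RT-task hole mentioned at the end shows up here: time consumed by RT
tasks never appears in group_utilization, so the "spare cycle" test can keep
returning true on a CPU that has little left for CFS.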

* [PATCH v5 11/12] sched: replace capacity_factor by utilization
@ 2014-09-11 16:15     ` Peter Zijlstra
  0 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-11 16:15 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
> +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
> +			struct lb_env *env)
>  {
> +	if ((sgs->group_capacity_orig * 100) >
> +			(sgs->group_utilization * env->sd->imbalance_pct))
> +		return 1;
> +
> +	if (sgs->sum_nr_running < sgs->group_weight)
> +		return 1;
>  
> +	return 0;
> +}
>  
> +static inline int group_is_overloaded(struct sg_lb_stats *sgs,
> +			struct lb_env *env)
> +{
> +	if (sgs->sum_nr_running <= sgs->group_weight)
> +		return 0;
>  
> +	if ((sgs->group_capacity_orig * 100) <
> +			(sgs->group_utilization * env->sd->imbalance_pct))
> +		return 1;
>  
> +	return 0;
>  }

I'm confused about the utilization vs capacity_orig. I see how we should
maybe scale things with the capacity when comparing between CPUs/groups,
but not on the same CPU/group.

I would have expected something simple like:

static inline bool group_has_capacity()
{
	/* Is there a spare cycle? */
	if (sgs->group_utilization < sgs->group_weight * SCHED_LOAD_SCALE)
		return true;

	/* Are there less tasks than logical CPUs? */
	if (sgs->sum_nr_running < sgs->group_weight)
		return true;

	return false;
}

Where group_utilization is a pure sum of running_avg.

Now this has a hole when there are RT tasks on the system, in that case
the utilization will never hit 1, but we could fix that another way. I
don't think the capacity_orig thing is right.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-11 16:15     ` Peter Zijlstra
@ 2014-09-11 17:26       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-11 17:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
>> +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
>> +                     struct lb_env *env)
>>  {
>> +     if ((sgs->group_capacity_orig * 100) >
>> +                     (sgs->group_utilization * env->sd->imbalance_pct))
>> +             return 1;
>> +
>> +     if (sgs->sum_nr_running < sgs->group_weight)
>> +             return 1;
>>
>> +     return 0;
>> +}
>>
>> +static inline int group_is_overloaded(struct sg_lb_stats *sgs,
>> +                     struct lb_env *env)
>> +{
>> +     if (sgs->sum_nr_running <= sgs->group_weight)
>> +             return 0;
>>
>> +     if ((sgs->group_capacity_orig * 100) <
>> +                     (sgs->group_utilization * env->sd->imbalance_pct))
>> +             return 1;
>>
>> +     return 0;
>>  }
>
> I'm confused about the utilization vs capacity_orig. I see how we should

The 1st point is that I should compare utilization vs capacity, not
capacity_orig.
I should have replaced capacity_orig with capacity in the functions
above when I moved the utilization statistic from
rq->avg.runnable_avg_sum to cfs.usage_load_avg.
rq->avg.runnable_avg_sum was measuring all activity on the CPU, whereas
cfs.usage_load_avg integrates only cfs tasks.

With this change, we don't need sgs->group_capacity_orig anymore, but
only sgs->group_capacity. So sgs->group_capacity_orig can be removed,
as it's no longer used in the code now that sg_capacity_factor has been
removed.

> maybe scale things with the capacity when comparing between CPUs/groups,
> but not on the same CPU/group.
>
> I would have expected something simple like:
>
> static inline bool group_has_capacity()
> {
>         /* Is there a spare cycle? */
>         if (sgs->group_utilization < sgs->group_weight * SCHED_LOAD_SCALE)
>                 return true;
>
>         /* Are there less tasks than logical CPUs? */
>         if (sgs->sum_nr_running < sgs->group_weight)
>                 return true;
>
>         return false;
> }
>
> Where group_utilization is a pure sum of running_avg.
>
> Now this has a hole when there are RT tasks on the system, in that case
> the utilization will never hit 1, but we could fix that another way. I
> don't think the capacity_orig thing is right.

^ permalink raw reply	[flat|nested] 164+ messages in thread
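
An illustration of this point with made-up numbers: cfs.usage_load_avg no
longer contains RT/IRQ activity, so it has to be compared against the
capacity left over for CFS (capacity), not the full capacity_orig.

#include <stdbool.h>
#include <stdio.h>

struct cpu_model {
        unsigned long capacity_orig;    /* e.g. 1024 for a full-speed core */
        unsigned long capacity;         /* capacity_orig minus RT/IRQ pressure */
        unsigned long cfs_utilization;  /* CFS tasks only */
};

static bool cfs_saturated(const struct cpu_model *c, unsigned int imb_pct)
{
        return c->capacity * 100 < c->cfs_utilization * imb_pct;
}

int main(void)
{
        /* RT work eats roughly 40% of the CPU; CFS uses most of the rest. */
        struct cpu_model c = { .capacity_orig = 1024, .capacity = 614,
                               .cfs_utilization = 580 };

        /* 61400 < 66700: no room left for CFS. */
        printf("vs capacity:      %d\n", cfs_saturated(&c, 115));
        /* Against capacity_orig the same CPU still looks like it has room:
         * 102400 < 66700 is false. */
        printf("vs capacity_orig: %d\n",
               (int)(c.capacity_orig * 100 < c.cfs_utilization * 115));
        return 0;
}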

* [PATCH v5 11/12] sched: replace capacity_factor by utilization
@ 2014-09-11 17:26       ` Vincent Guittot
  0 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-11 17:26 UTC (permalink / raw)
  To: linux-arm-kernel

On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
>> +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
>> +                     struct lb_env *env)
>>  {
>> +     if ((sgs->group_capacity_orig * 100) >
>> +                     (sgs->group_utilization * env->sd->imbalance_pct))
>> +             return 1;
>> +
>> +     if (sgs->sum_nr_running < sgs->group_weight)
>> +             return 1;
>>
>> +     return 0;
>> +}
>>
>> +static inline int group_is_overloaded(struct sg_lb_stats *sgs,
>> +                     struct lb_env *env)
>> +{
>> +     if (sgs->sum_nr_running <= sgs->group_weight)
>> +             return 0;
>>
>> +     if ((sgs->group_capacity_orig * 100) <
>> +                     (sgs->group_utilization * env->sd->imbalance_pct))
>> +             return 1;
>>
>> +     return 0;
>>  }
>
> I'm confused about the utilization vs capacity_orig. I see how we should

The 1st point is that I should compare utilization vs capacity, not
capacity_orig.
I should have replaced capacity_orig with capacity in the functions
above when I moved the utilization statistic from
rq->avg.runnable_avg_sum to cfs.usage_load_avg.
rq->avg.runnable_avg_sum was measuring all activity on the CPU, whereas
cfs.usage_load_avg integrates only cfs tasks.

With this change, we don't need sgs->group_capacity_orig anymore, but
only sgs->group_capacity. So sgs->group_capacity_orig can be removed,
as it's no longer used in the code now that sg_capacity_factor has been
removed.

> maybe scale things with the capacity when comparing between CPUs/groups,
> but not on the same CPU/group.
>
> I would have expected something simple like:
>
> static inline bool group_has_capacity()
> {
>         /* Is there a spare cycle? */
>         if (sgs->group_utilization < sgs->group_weight * SCHED_LOAD_SCALE)
>                 return true;
>
>         /* Are there less tasks than logical CPUs? */
>         if (sgs->sum_nr_running < sgs->group_weight)
>                 return true;
>
>         return false;
> }
>
> Where group_utilization is a pure sum of running_avg.
>
> Now this has a hole when there are RT tasks on the system, in that case
> the utilization will never hit 1, but we could fix that another way. I
> don't think the capacity_orig thing is right.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 05/12] ARM: topology: use new cpu_capacity interface
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-11 18:52     ` Nicolas Pitre
  -1 siblings, 0 replies; 164+ messages in thread
From: Nicolas Pitre @ 2014-09-11 18:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, Morten.Rasmussen, efault, linaro-kernel, daniel.lezcano,
	dietmar.eggemann

On Tue, 26 Aug 2014, Vincent Guittot wrote:

> Use the new arch_scale_cpu_capacity in order to reflect the original capacity
> of a CPU instead of arch_scale_freq_capacity, which is more about scaling
> the capacity with the frequency.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Acked-by: Nicolas Pitre <nico@linaro.org>

> ---
>  arch/arm/kernel/topology.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index e35d880..89cfdd6 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -42,7 +42,7 @@
>   */
>  static DEFINE_PER_CPU(unsigned long, cpu_scale);
>  
> -unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> +unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>  {
>  	return per_cpu(cpu_scale, cpu);
>  }
> @@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
>  	set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
>  
>  	printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n",
> -		cpu, arch_scale_freq_capacity(NULL, cpu));
> +		cpu, arch_scale_cpu_capacity(NULL, cpu));
>  }
>  
>  #else
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 164+ messages in thread

* [PATCH v5 05/12] ARM: topology: use new cpu_capacity interface
@ 2014-09-11 18:52     ` Nicolas Pitre
  0 siblings, 0 replies; 164+ messages in thread
From: Nicolas Pitre @ 2014-09-11 18:52 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 26 Aug 2014, Vincent Guittot wrote:

> Use the new arch_scale_cpu_capacity in order to reflect the original capacity
> of a CPU instead of arch_scale_freq_capacity, which is more about scaling
> the capacity with the frequency.
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

Acked-by: Nicolas Pitre <nico@linaro.org>

> ---
>  arch/arm/kernel/topology.c | 4 ++--
>  1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index e35d880..89cfdd6 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -42,7 +42,7 @@
>   */
>  static DEFINE_PER_CPU(unsigned long, cpu_scale);
>  
> -unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> +unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>  {
>  	return per_cpu(cpu_scale, cpu);
>  }
> @@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
>  	set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
>  
>  	printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n",
> -		cpu, arch_scale_freq_capacity(NULL, cpu));
> +		cpu, arch_scale_cpu_capacity(NULL, cpu));
>  }
>  
>  #else
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 06/12] sched: add per rq cpu_capacity_orig
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-11 19:02     ` Nicolas Pitre
  -1 siblings, 0 replies; 164+ messages in thread
From: Nicolas Pitre @ 2014-09-11 19:02 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, Morten.Rasmussen, efault, linaro-kernel, daniel.lezcano,
	dietmar.eggemann

On Tue, 26 Aug 2014, Vincent Guittot wrote:

> This new field cpu_capacity_orig reflects the available capacity of a CPUs

s/a CPUs/a CPU/

> unlike the cpu_capacity which reflects the current capacity that can be altered
> by frequency and rt tasks.

Shouldn't this be described as the "highest possible capacity" instead
of only "available capacity" to make it clearer?  In which case,
shouldn't it be called capacity_max instead of capacity_orig?  With
"orig" one may read "origin", i.e. the capacity that was available at boot
time or the like.  If, for example, the CPU was booted with a low P-state,
then its max capacity is different from the original one.


> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/core.c  | 2 +-
>  kernel/sched/fair.c  | 1 +
>  kernel/sched/sched.h | 1 +
>  3 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a394f99..7c3b237 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7003,7 +7003,7 @@ void __init sched_init(void)
>  #ifdef CONFIG_SMP
>  		rq->sd = NULL;
>  		rq->rd = NULL;
> -		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
> +		rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
>  		rq->post_schedule = 0;
>  		rq->active_balance = 0;
>  		rq->next_balance = jiffies;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8176bda..17c16cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5756,6 +5756,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>  
>  	capacity >>= SCHED_CAPACITY_SHIFT;
>  
> +	cpu_rq(cpu)->cpu_capacity_orig = capacity;
>  	sdg->sgc->capacity_orig = capacity;
>  
>  	if (sched_feat(ARCH_CAPACITY))
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index aa0f73b..7c0a74e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -575,6 +575,7 @@ struct rq {
>  	struct sched_domain *sd;
>  
>  	unsigned long cpu_capacity;
> +	unsigned long cpu_capacity_orig;
>  
>  	unsigned char idle_balance;
>  	/* For active balancing */
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 164+ messages in thread

* [PATCH v5 06/12] sched: add per rq cpu_capacity_orig
@ 2014-09-11 19:02     ` Nicolas Pitre
  0 siblings, 0 replies; 164+ messages in thread
From: Nicolas Pitre @ 2014-09-11 19:02 UTC (permalink / raw)
  To: linux-arm-kernel

On Tue, 26 Aug 2014, Vincent Guittot wrote:

> This new field cpu_capacity_orig reflects the available capacity of a CPUs

s/a CPUs/a CPU/

> unlike the cpu_capacity which reflects the current capacity that can be altered
> by frequency and rt tasks.

Shouldn't this be described as the "highest possible capacity" instead
of only "available capacity" to make it clearer?  In which case,
shouldn't it be called capacity_max instead of capacity_orig?  With
"orig" one may read "origin", i.e. the capacity that was available at boot
time or the like.  If, for example, the CPU was booted with a low P-state,
then its max capacity is different from the original one.


> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> ---
>  kernel/sched/core.c  | 2 +-
>  kernel/sched/fair.c  | 1 +
>  kernel/sched/sched.h | 1 +
>  3 files changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index a394f99..7c3b237 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -7003,7 +7003,7 @@ void __init sched_init(void)
>  #ifdef CONFIG_SMP
>  		rq->sd = NULL;
>  		rq->rd = NULL;
> -		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
> +		rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
>  		rq->post_schedule = 0;
>  		rq->active_balance = 0;
>  		rq->next_balance = jiffies;
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8176bda..17c16cc 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5756,6 +5756,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>  
>  	capacity >>= SCHED_CAPACITY_SHIFT;
>  
> +	cpu_rq(cpu)->cpu_capacity_orig = capacity;
>  	sdg->sgc->capacity_orig = capacity;
>  
>  	if (sched_feat(ARCH_CAPACITY))
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index aa0f73b..7c0a74e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -575,6 +575,7 @@ struct rq {
>  	struct sched_domain *sd;
>  
>  	unsigned long cpu_capacity;
> +	unsigned long cpu_capacity_orig;
>  
>  	unsigned char idle_balance;
>  	/* For active balancing */
> -- 
> 1.9.1
> 
> 

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 10/12] sched: get CPU's utilization statistic
  2014-09-11 14:04         ` Peter Zijlstra
@ 2014-09-11 19:17           ` Nicolas Pitre
  -1 siblings, 0 replies; 164+ messages in thread
From: Nicolas Pitre @ 2014-09-11 19:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, linaro-kernel, Daniel Lezcano, Dietmar Eggemann

On Thu, 11 Sep 2014, Peter Zijlstra wrote:

> On Thu, Sep 11, 2014 at 03:07:52PM +0200, Vincent Guittot wrote:
> > > Also I'm not entirely sure I like the usage, utilization names/metrics.
> > > I would suggest to reverse them. Call the pure running number
> > > 'utilization' and this scaled with capacity 'usage' or so.
> > 
> > ok. i can invert 'usage' and 'utilization', which will give
> > 
> > s/get_cpu_utilization/get_cpu_usage/
> > s/sgs->group_utilization/sgs->group_usage/
> > s/cfs.usage_load_avg/cfs.utilization_load_avg/
> > s/se->avg.usage_avg_contrib/se->avg.utilization_avg_contrib
> > s/__update_task_entity_usage/__update_task_entity_utilization
> > s/__update_entity_usage_avg_contrib/__update_entity_utilization_avg_contrib
> > 
> 
> Any other opinions before Vince goes and applies sed on patches? ;-)

I don't mind either way, but for sure someone (possibly me) is going to 
confuse the two soon enough.

Please include in the code some formal definition in the context of the 
scheduler.  A comment block right before the corresponding get_cpu_* 
accessors should be good enough.


Nicolas

^ permalink raw reply	[flat|nested] 164+ messages in thread
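
The comment block Nicolas asks for could look something like the following
(an editor's sketch: the accessor mirrors the code quoted earlier in the
thread with the usage/utilization swap applied, but the wording of the
definitions is an assumption, not taken from a later posting):

/*
 * Terminology used below:
 *
 *  - utilization: the geometrically decayed sum of the running time of the
 *    CFS tasks on a CPU, in SCHED_LOAD_SCALE units; it does not depend on
 *    the CPU's capacity.
 *
 *  - usage: that utilization scaled by the CPU's current capacity, i.e. the
 *    share of cpu_capacity actually consumed by CFS, clamped to capacity + 1
 *    when the average transiently reports the CPU as more than fully busy.
 */
static int get_cpu_usage(int cpu)
{
        unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
        unsigned long capacity = capacity_of(cpu);

        if (usage >= SCHED_LOAD_SCALE)
                return capacity + 1;

        return (usage * capacity) >> SCHED_LOAD_SHIFT;
}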

* Re: [PATCH v5 10/12] sched: get CPU's utilization statistic
  2014-09-11 19:17           ` Nicolas Pitre
@ 2014-09-12  7:41             ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-12  7:41 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, linaro-kernel, Daniel Lezcano, Dietmar Eggemann

On 11 September 2014 21:17, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
>>
>> Any other opinions before Vince goes and applies sed on patches? ;-)
>
> I don't mind either way, but for sure someone (possibly me) is going to
> confuse the two soon enough.
>
> Please include in the code some formal definition in the context of the
> scheduler.  A comment block right before the corresponding get_cpu_*
> accessors should be good enough.

ok

>
>
> Nicolas

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-11 17:26       ` Vincent Guittot
@ 2014-09-14 19:41         ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-14 19:41 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
> On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
> >> +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
> >> +                     struct lb_env *env)
> >>  {
> >> +     if ((sgs->group_capacity_orig * 100) >
> >> +                     (sgs->group_utilization * env->sd->imbalance_pct))
> >> +             return 1;
> >> +
> >> +     if (sgs->sum_nr_running < sgs->group_weight)
> >> +             return 1;
> >>
> >> +     return 0;
> >> +}
> >>
> >> +static inline int group_is_overloaded(struct sg_lb_stats *sgs,
> >> +                     struct lb_env *env)
> >> +{
> >> +     if (sgs->sum_nr_running <= sgs->group_weight)
> >> +             return 0;
> >>
> >> +     if ((sgs->group_capacity_orig * 100) <
> >> +                     (sgs->group_utilization * env->sd->imbalance_pct))
> >> +             return 1;
> >>
> >> +     return 0;
> >>  }
> >
> > I'm confused about the utilization vs capacity_orig. I see how we should
> 
> 1st point is that I should compare utilization vs capacity and not
> capacity_orig.
> I should have replaced capacity_orig by capacity in the functions
> above when i move the utilization statistic from
> rq->avg.runnable_avg_sum to cfs.usage_load_avg.
> rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
> cfs.usage_load_avg integrates only cfs tasks
> 
> With this change, we don't need sgs->group_capacity_orig anymore but
> only sgs->group_capacity. So sgs->group_capacity_orig can be removed
> as it's no more used in the code as sg_capacity_factor has been
> removed

Yes, but.. so I suppose we need to add DVFS accounting and remove
cpufreq from the capacity thing. Otherwise I don't see it make sense.
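
Just to put illustrative numbers on the two checks quoted above (made-up
values, assuming the usual imbalance_pct of 125 and a 4-CPU group at full
original capacity):

  group_capacity_orig = 4 * 1024           = 4096
  overload threshold  = 4096 * 100 / 125   ~ 3277   (about 80% of the group)

group_is_overloaded() fires once group_utilization climbs past ~3277 *and*
sum_nr_running exceeds group_weight; below that threshold, or with fewer
running tasks than CPUs, group_has_free_capacity() still holds.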

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-14 19:41         ` Peter Zijlstra
@ 2014-09-14 19:51           ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-14 19:51 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
> > On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
> > > On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
> > >> +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
> > >> +                     struct lb_env *env)
> > >>  {
> > >> +     if ((sgs->group_capacity_orig * 100) >
> > >> +                     (sgs->group_utilization * env->sd->imbalance_pct))
> > >> +             return 1;
> > >> +
> > >> +     if (sgs->sum_nr_running < sgs->group_weight)
> > >> +             return 1;
> > >>
> > >> +     return 0;
> > >> +}
> > >>
> > >> +static inline int group_is_overloaded(struct sg_lb_stats *sgs,
> > >> +                     struct lb_env *env)
> > >> +{
> > >> +     if (sgs->sum_nr_running <= sgs->group_weight)
> > >> +             return 0;
> > >>
> > >> +     if ((sgs->group_capacity_orig * 100) <
> > >> +                     (sgs->group_utilization * env->sd->imbalance_pct))
> > >> +             return 1;
> > >>
> > >> +     return 0;
> > >>  }
> > >
> > > I'm confused about the utilization vs capacity_orig. I see how we should
> > 
> > 1st point is that I should compare utilization vs capacity and not
> > capacity_orig.
> > I should have replaced capacity_orig by capacity in the functions
> > above when i move the utilization statistic from
> > rq->avg.runnable_avg_sum to cfs.usage_load_avg.
> > rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
> > cfs.usage_load_avg integrates only cfs tasks
> > 
> > With this change, we don't need sgs->group_capacity_orig anymore but
> > only sgs->group_capacity. So sgs->group_capacity_orig can be removed
> > as it's no more used in the code as sg_capacity_factor has been
> > removed
> 
> Yes, but.. so I suppose we need to add DVFS accounting and remove
> cpufreq from the capacity thing. Otherwise I don't see it make sense.

That is to say, please also explain these details in the changelog, and
preferably also in a XXX/FIXME/TODO comment near wherever so we don't
forget about it.
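
Something along these lines, perhaps (only a sketch of the kind of marker
being asked for, not wording from the patch):

/*
 * XXX: usage is currently compared against a capacity that is already
 * scaled by cpufreq and rt pressure; once frequency(DVFS)-invariant
 * accounting is added, the cpufreq part of this scaling should go away.
 */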

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-14 19:41         ` Peter Zijlstra
@ 2014-09-15 11:42           ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-15 11:42 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
> On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
> > On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:

> > > I'm confused about the utilization vs capacity_orig. I see how we should
> > 
> > 1st point is that I should compare utilization vs capacity and not
> > capacity_orig.
> > I should have replaced capacity_orig by capacity in the functions
> > above when i move the utilization statistic from
> > rq->avg.runnable_avg_sum to cfs.usage_load_avg.
> > rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
> > cfs.usage_load_avg integrates only cfs tasks
> > 
> > With this change, we don't need sgs->group_capacity_orig anymore but
> > only sgs->group_capacity. So sgs->group_capacity_orig can be removed
> > as it's no more used in the code as sg_capacity_factor has been
> > removed
> 
> Yes, but.. so I suppose we need to add DVFS accounting and remove
> cpufreq from the capacity thing. Otherwise I don't see it make sense.

OK, I've reconsidered _again_, I still don't get it.

So fundamentally I think its wrong to scale with the capacity; it just
doesn't make any sense. Consider big.little stuff, their CPUs are
inherently asymmetric in capacity, but that doesn't matter one whit for
utilization numbers. If a core is fully consumed its fully consumed, no
matter how much work it can or can not do.


So the only thing that needs correcting is the fact that these
statistics are based on clock_task and some of that time can end up in
other scheduling classes, at which point we'll never get 100% even
though we're 'saturated'. But correcting for that using capacity doesn't
'work'.
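
To put made-up numbers on that last point: suppose RT work eats 30% of
clock_task time on a CPU whose CFS runqueue is otherwise saturated. Then,
roughly:

  CFS running sum / period  ~ 70%  ->  utilization ~ 0.70 * 1024 ~ 717

i.e. a CPU that is 100% busy never reports anywhere near SCHED_LOAD_SCALE
worth of CFS utilization.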

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-15 11:42           ` Peter Zijlstra
@ 2014-09-15 19:07             ` Nicolas Pitre
  -1 siblings, 0 replies; 164+ messages in thread
From: Nicolas Pitre @ 2014-09-15 19:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, linaro-kernel, Daniel Lezcano, Dietmar Eggemann

On Mon, 15 Sep 2014, Peter Zijlstra wrote:

> On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
> > On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
> > > On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > > I'm confused about the utilization vs capacity_orig. I see how we should
> > > 
> > > 1st point is that I should compare utilization vs capacity and not
> > > capacity_orig.
> > > I should have replaced capacity_orig by capacity in the functions
> > > above when i move the utilization statistic from
> > > rq->avg.runnable_avg_sum to cfs.usage_load_avg.
> > > rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
> > > cfs.usage_load_avg integrates only cfs tasks
> > > 
> > > With this change, we don't need sgs->group_capacity_orig anymore but
> > > only sgs->group_capacity. So sgs->group_capacity_orig can be removed
> > > as it's no more used in the code as sg_capacity_factor has been
> > > removed
> > 
> > Yes, but.. so I suppose we need to add DVFS accounting and remove
> > cpufreq from the capacity thing. Otherwise I don't see it make sense.
> 
> OK, I've reconsidered _again_, I still don't get it.
> 
> So fundamentally I think its wrong to scale with the capacity; it just
> doesn't make any sense. Consider big.little stuff, their CPUs are
> inherently asymmetric in capacity, but that doesn't matter one whit for
> utilization numbers. If a core is fully consumed its fully consumed, no
> matter how much work it can or can not do.

Let's suppose a task running on a 1GHz CPU producing a load of 100.

The same task on a 100MHz CPU would produce a load of 1000 because that 
CPU is 10x slower.  So to properly evaluate the load of a task when 
moving it around, we want to normalize its load based on the CPU 
performance.  In this case the correction factor would be 0.1.

Given those normalized loads, we need to scale CPU capacity as well.  If 
the 1GHz CPU can handle 50 of those tasks it has a capacity of 5000.

In theory the 100MHz CPU could handle only 5 of those tasks, meaning it 
has a normalized capacity of 500, but only if the load metric is already 
normalized as well.

Or am I completely missing the point here?


Nicolas





> 
> 
> So the only thing that needs correcting is the fact that these
> statistics are based on clock_task and some of that time can end up in
> other scheduling classes, at which point we'll never get 100% even
> though we're 'saturated'. But correcting for that using capacity doesn't
> 'work'.
> 
> 

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 09/12] sched: add usage_load_avg
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-15 19:15     ` Morten Rasmussen
  -1 siblings, 0 replies; 164+ messages in thread
From: Morten Rasmussen @ 2014-09-15 19:15 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	Dietmar Eggemann

On Tue, Aug 26, 2014 at 12:06:52PM +0100, Vincent Guittot wrote:
> Add new statistics which reflect the average time a task is running on the
> CPU and the sum of the tasks' running on a runqueue. The latter is named
> usage_avg_contrib.
> 
> This patch is based on the usage metric that was proposed in the 1st
> versions of the per-entity load tracking patchset but that has be removed
> afterward. This version differs from the original one in the sense that it's
> not linked to task_group.
> 
> The rq's usage_avg_contrib will be used to check if a rq is overloaded or not
> instead of trying to compute how many task a group of CPUs can handle
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>

I should have read this patch before I did almost the same patch as part
of a series to introduce scale-invariance which I am about to post :(

The only difference I see is slightly different naming and that, AFAICT,
task group usage is not accounted for in this patch. Can we add the
support for task groups as well? I can provide a patch based on this one
if you want.

Also, since more than half this patch comes directly from PJT's original
patch I would add "by Paul Turner <pjt@google.com>" somewhere in the
text above.

[...]

> +static inline void __update_task_entity_usage(struct sched_entity *se)
> +{
> +	u32 contrib;
> +
> +	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
> +	contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
> +	contrib /= (se->avg.runnable_avg_period + 1);
> +	se->avg.usage_avg_contrib = scale_load(contrib);
> +}
> +
> +static long __update_entity_usage_avg_contrib(struct sched_entity *se)
> +{
> +	long old_contrib = se->avg.usage_avg_contrib;
> +
> +	if (entity_is_task(se))
> +		__update_task_entity_usage(se);

Groups are never updated?

As said above. I have code that does it.

Morten

> +
> +	return se->avg.usage_avg_contrib - old_contrib;
> +}
> +
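
Putting illustrative numbers on the quoted __update_task_entity_usage()
above (assuming scale_load_down(SCHED_LOAD_SCALE) resolves to 1024):

  task runs half of its tracked time:
    running_avg_sum    ~ runnable_avg_period / 2
    contrib            ~ (period / 2) * 1024 / (period + 1) ~ 512
    usage_avg_contrib  ~ 512, i.e. about 50% of SCHED_LOAD_SCALE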


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 10/12] sched: get CPU's utilization statistic
  2014-09-11 12:34     ` Peter Zijlstra
@ 2014-09-15 19:28       ` Morten Rasmussen
  -1 siblings, 0 replies; 164+ messages in thread
From: Morten Rasmussen @ 2014-09-15 19:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, linux-kernel, preeti, linux,
	linux-arm-kernel, riel, efault, nicolas.pitre, linaro-kernel,
	daniel.lezcano, Dietmar Eggemann

On Thu, Sep 11, 2014 at 01:34:12PM +0100, Peter Zijlstra wrote:
> > @@ -4514,6 +4519,17 @@ static int select_idle_sibling(struct task_struct *p, int target)
> >  	return target;
> >  }
> >  
> > +static int get_cpu_utilization(int cpu)
> > +{
> > +	unsigned long usage = cpu_rq(cpu)->cfs.usage_load_avg;
> > +	unsigned long capacity = capacity_of(cpu);
> > +
> > +	if (usage >= SCHED_LOAD_SCALE)
> > +		return capacity + 1;
> > +
> > +	return (usage * capacity) >> SCHED_LOAD_SHIFT;
> > +}
> 
> So if I understood patch 9 correct, your changelog is iffy.
> usage_load_avg should never get > 1 (of whatever unit), no matter how
> many tasks are on the rq. You can only maximally run all the time.
> 
> Therefore I can only interpret the if (usage >= SCHED_LOAD_SCALE) as
> numerical error handling, nothing more.

That is not entirely true, unless you also classify transient usage
spikes due to task migrations as numerical errors.

Since each task sched_entity carries around 350ms worth of execution
history with it between different cpus, and cpu utilization is based on
the sum of the task entities' usage_avg_contrib on the runqueue, you may
get cfs.usage_load_avg > 1 temporarily after task migrations. It will
eventually converge to 1.

The same goes for new tasks, which are initialized to have a
usage_avg_contrib of 1 and may be queued on a cpu with tasks already
running. In that case cfs.usage_load_avg is temporarily unbounded.
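
As a concrete (made-up) example of such a spike, and why the clamp in the
quoted get_cpu_utilization() matters:

  task A, already on the CPU:               usage_avg_contrib ~ 600
  task B, just migrated in, still carrying
  its old history:                          usage_avg_contrib ~ 600
  cfs.usage_load_avg                        ~ 1200 > SCHED_LOAD_SCALE (1024)

  -> without the "usage >= SCHED_LOAD_SCALE" clamp, the scaled result
     would exceed capacity_of(cpu) until the averages converge.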

> Also I'm not entirely sure I like the usage, utilization names/metrics.
> I would suggest to reverse them. Call the pure running number
> 'utilization' and this scaled with capacity 'usage' or so.

I can agree with calling the running number 'utilization', but I'm not
convinced about capacity. What does it exactly cover here? I'm confused
and jetlagged.

Morten


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 10/12] sched: get CPU's utilization statistic
  2014-09-11 14:04         ` Peter Zijlstra
@ 2014-09-15 19:45           ` Morten Rasmussen
  -1 siblings, 0 replies; 164+ messages in thread
From: Morten Rasmussen @ 2014-09-15 19:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Mike Galbraith,
	Nicolas Pitre, linaro-kernel, Daniel Lezcano, Dietmar Eggemann

On Thu, Sep 11, 2014 at 03:04:44PM +0100, Peter Zijlstra wrote:
> On Thu, Sep 11, 2014 at 03:07:52PM +0200, Vincent Guittot wrote:
> > > Also I'm not entirely sure I like the usage, utilization names/metrics.
> > > I would suggest to reverse them. Call the pure running number
> > > 'utilization' and this scaled with capacity 'usage' or so.
> > 
> > ok. i can invert 'usage' and 'utilization', which will give
> > 
> > s/get_cpu_utilization/get_cpu_usage/
> > s/sgs->group_utilization/sgs->group_usage/

The confusion will have new dimensions added when we introduce
scale-invariance too. Then the running number is already scaled by the
current P-state compute capacity. But I don't have any better
suggestions. 

> > s/cfs.usage_load_avg/cfs.utilization_load_avg/

I don't like using "load" for unweighted metrics. I associate load with
something that may be weighted by priority like load_avg_contrib, and
utilization with pure cpu utilization, as in how many cycles are spent on
a particular task. I called it "usage_util_avg" in my own patches, but
"util_avg" might be better if we agree that utilization == usage.

> > s/se->avg.usage_avg_contrib/se->avg.utilization_avg_contrib

util_avg_contrib maybe to keep it shorter.

> > s/__update_task_entity_usage/__update_task_entity_utilization
> > s/__update_entity_usage_avg_contrib/__update_entity_utilization_avg_contrib

Maybe use "util" here as well?

Morten


^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-15 19:07             ` Nicolas Pitre
@ 2014-09-15 20:01               ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-15 20:01 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Vincent Guittot, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, linaro-kernel, Daniel Lezcano, Dietmar Eggemann

On Mon, Sep 15, 2014 at 03:07:44PM -0400, Nicolas Pitre wrote:
> On Mon, 15 Sep 2014, Peter Zijlstra wrote:

> Let's suppose a task running on a 1GHz CPU producing a load of 100.
> 
> The same task on a 100MHz CPU would produce a load of 1000 because that 
> CPU is 10x slower.  So to properly evaluate the load of a task when 
> moving it around, we want to normalize its load based on the CPU 
> performance.  In this case the correction factor would be 0.1.
> 
> Given those normalized loads, we need to scale CPU capacity as well.  If 
> the 1GHz CPU can handle 50 of those tasks it has a capacity of 5000.
> 
> In theory the 100MHz CPU could handle only 5 of those tasks, meaning it 
> has a normalized capacity of 500, but only if the load metric is already 
> normalized as well.
> 
> Or am I completely missing the point here?

So I was thinking of the usage as per the next patch; where we decide if
a cpu is 'full' or not based on the utilization measure. For this
measure we're not interested in inter CPU relations at all, and any use
of capacity scaling simply doesn't make sense.

But I think you asking this question shows a 'bigger' problem in that
the Changelogs are entirely failing at describing the actual problem and
proposed solution. Because if that were clear, I don't think we would be
having this particular discussion.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 06/12] sched: add per rq cpu_capacity_orig
  2014-09-11 19:02     ` Nicolas Pitre
@ 2014-09-15 21:22       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-15 21:22 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, linaro-kernel, Daniel Lezcano, Dietmar Eggemann

On 11 September 2014 21:02, Nicolas Pitre <nicolas.pitre@linaro.org> wrote:
> On Tue, 26 Aug 2014, Vincent Guittot wrote:
>
>> This new field cpu_capacity_orig reflects the available capacity of a CPUs
>
> s/a CPUs/a CPU/

good catch

>
>> unlike the cpu_capacity which reflects the current capacity that can be altered
>> by frequency and rt tasks.
>
> Shouldn't this be described as the "highest possible capacity" instead
> of only "available capacity" to make it clearer?  In which case,
> shouldn't it be called capacity_max instead of capacity_orig?  With
> "orig" we may think "origin" as the capacity that was available at boot
> time or the like.  If for example the CPU was booted with a low P-state
> then its max capacity is different from the original one.

I have used the orig suffix to be aligned with capacity_orig, which
is a sum of cpu_capacity_orig.
I agree that orig stands for original, in the sense that it's the
original capacity before being altered by scale_freq and scale_rt.
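
Very roughly, and only as a sketch of the relationship described above
(the 0.75 figure is made up):

  cpu_capacity_orig : full capacity of the CPU (micro-arch + max frequency),
                      e.g. 1024 for a big core at its highest P-state
  cpu_capacity      ~ cpu_capacity_orig scaled down by scale_freq and
                      scale_rt, e.g. 1024 * 0.75 ~ 768 when rt/irq work
                      leaves only 75% of the CPU to CFS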

Vincent
>
>
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> ---
>>  kernel/sched/core.c  | 2 +-
>>  kernel/sched/fair.c  | 1 +
>>  kernel/sched/sched.h | 1 +
>>  3 files changed, 3 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
>> index a394f99..7c3b237 100644
>> --- a/kernel/sched/core.c
>> +++ b/kernel/sched/core.c
>> @@ -7003,7 +7003,7 @@ void __init sched_init(void)
>>  #ifdef CONFIG_SMP
>>               rq->sd = NULL;
>>               rq->rd = NULL;
>> -             rq->cpu_capacity = SCHED_CAPACITY_SCALE;
>> +             rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
>>               rq->post_schedule = 0;
>>               rq->active_balance = 0;
>>               rq->next_balance = jiffies;
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8176bda..17c16cc 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -5756,6 +5756,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>>
>>       capacity >>= SCHED_CAPACITY_SHIFT;
>>
>> +     cpu_rq(cpu)->cpu_capacity_orig = capacity;
>>       sdg->sgc->capacity_orig = capacity;
>>
>>       if (sched_feat(ARCH_CAPACITY))
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index aa0f73b..7c0a74e 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -575,6 +575,7 @@ struct rq {
>>       struct sched_domain *sd;
>>
>>       unsigned long cpu_capacity;
>> +     unsigned long cpu_capacity_orig;
>>
>>       unsigned char idle_balance;
>>       /* For active balancing */
>> --
>> 1.9.1
>>
>>

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-15 11:42           ` Peter Zijlstra
@ 2014-09-15 22:14             ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-15 22:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 15 September 2014 13:42, Peter Zijlstra <peterz@infradead.org> wrote:
> On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
>> On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
>> > On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> > > I'm confused about the utilization vs capacity_orig. I see how we should
>> >
>> > 1st point is that I should compare utilization vs capacity and not
>> > capacity_orig.
>> > I should have replaced capacity_orig by capacity in the functions
>> > above when i move the utilization statistic from
>> > rq->avg.runnable_avg_sum to cfs.usage_load_avg.
>> > rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
>> > cfs.usage_load_avg integrates only cfs tasks
>> >
>> > With this change, we don't need sgs->group_capacity_orig anymore but
>> > only sgs->group_capacity. So sgs->group_capacity_orig can be removed
>> > as it's no more used in the code as sg_capacity_factor has been
>> > removed
>>
>> Yes, but.. so I suppose we need to add DVFS accounting and remove
>> cpufreq from the capacity thing. Otherwise I don't see it make sense.
>
> OK, I've reconsidered _again_, I still don't get it.
>
> So fundamentally I think its wrong to scale with the capacity; it just
> doesn't make any sense. Consider big.little stuff, their CPUs are
> inherently asymmetric in capacity, but that doesn't matter one whit for
> utilization numbers. If a core is fully consumed its fully consumed, no
> matter how much work it can or can not do.
>
>
> So the only thing that needs correcting is the fact that these
> statistics are based on clock_task and some of that time can end up in
> other scheduling classes, at which point we'll never get 100% even
> though we're 'saturated'. But correcting for that using capacity doesn't
> 'work'.

I'm not sure I catch your last point, because capacity is the only
figure that takes into account the "time" consumed by other classes.
Have you got in mind another way to take the other classes into
account?

So we have cpu_capacity, which is the capacity that can currently be
used by the cfs class.
We have cfs.usage_load_avg, which is the sum of the running time of cfs
tasks on the CPU and reflects the % of usage of this CPU by CFS tasks.
We have to use the same metric to compare the available capacity for CFS
and the current cfs usage.

Now we have to use the same unit, so we can either weight
cpu_capacity_orig with cfs.usage_load_avg and compare it with
cpu_capacity,
or divide cpu_capacity by cpu_capacity_orig and scale it into the
SCHED_LOAD_SCALE range. Is that what you are proposing?
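
Spelling the two options out with hypothetical helper variables (sketch
only, per-CPU form, ignoring the group-level summation, and reusing the
imbalance_pct margin from the quoted code; both express the same
inequality, just in different units):

	/* (a) bring usage into capacity units, compare with cpu_capacity */
	usage_cap  = cpu_capacity_orig * cfs.usage_load_avg / SCHED_LOAD_SCALE;
	overloaded = usage_cap * imbalance_pct > cpu_capacity * 100;

	/* (b) bring capacity into SCHED_LOAD_SCALE units, compare with usage */
	cap_scaled = cpu_capacity * SCHED_LOAD_SCALE / cpu_capacity_orig;
	overloaded = cfs.usage_load_avg * imbalance_pct > cap_scaled * 100;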

Vincent

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-15 22:14             ` Vincent Guittot
@ 2014-09-15 22:18               ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-15 22:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 16 September 2014 00:14, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 15 September 2014 13:42, Peter Zijlstra <peterz@infradead.org> wrote:
>> On Sun, Sep 14, 2014 at 09:41:56PM +0200, Peter Zijlstra wrote:
>>> On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
>>> > On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
>>
>>> > > I'm confused about the utilization vs capacity_orig. I see how we should
>>> >
>>> > 1st point is that I should compare utilization vs capacity and not
>>> > capacity_orig.
>>> > I should have replaced capacity_orig by capacity in the functions
>>> > above when i move the utilization statistic from
>>> > rq->avg.runnable_avg_sum to cfs.usage_load_avg.
>>> > rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
>>> > cfs.usage_load_avg integrates only cfs tasks
>>> >
>>> > With this change, we don't need sgs->group_capacity_orig anymore but
>>> > only sgs->group_capacity. So sgs->group_capacity_orig can be removed
>>> > as it's no more used in the code as sg_capacity_factor has been
>>> > removed
>>>
>>> Yes, but.. so I suppose we need to add DVFS accounting and remove
>>> cpufreq from the capacity thing. Otherwise I don't see it make sense.
>>
>> OK, I've reconsidered _again_, I still don't get it.
>>
>> So fundamentally I think its wrong to scale with the capacity; it just
>> doesn't make any sense. Consider big.little stuff, their CPUs are
>> inherently asymmetric in capacity, but that doesn't matter one whit for
>> utilization numbers. If a core is fully consumed its fully consumed, no
>> matter how much work it can or can not do.
>>
>>
>> So the only thing that needs correcting is the fact that these
>> statistics are based on clock_task and some of that time can end up in
>> other scheduling classes, at which point we'll never get 100% even
>> though we're 'saturated'. But correcting for that using capacity doesn't
>> 'work'.
>
> I'm not sure I catch your last point, because capacity is the only
> figure that takes into account the "time" consumed by other classes.
> Have you got in mind another way to take the other classes into
> account?
>
> So we have cpu_capacity, which is the capacity that can currently be
> used by the cfs class.
> We have cfs.usage_load_avg, which is the sum of the running time of cfs
> tasks on the CPU and reflects the % of usage of this CPU by CFS tasks.
> We have to use the same metric to compare the available capacity for CFS
> and the current cfs usage.
>
> Now we have to use the same unit, so we can either weight
> cpu_capacity_orig with cfs.usage_load_avg and compare it with
> cpu_capacity,
> or divide cpu_capacity by cpu_capacity_orig and scale it into the
> SCHED_LOAD_SCALE range. Is that what you are proposing?

For the latter, we need to keep sgs->group_capacity_orig in order
to check if a group is overloaded, whereas the 1st solution doesn't need
it anymore (once the correction I mentioned previously is applied).

Vincent
>
> Vincent

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 09/12] sched: add usage_load_avg
  2014-09-15 19:15     ` Morten Rasmussen
@ 2014-09-15 22:33       ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-15 22:33 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, linux-kernel, preeti, linux, linux-arm-kernel,
	riel, efault, nicolas.pitre, linaro-kernel, daniel.lezcano,
	Dietmar Eggemann

On 15 September 2014 21:15, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Aug 26, 2014 at 12:06:52PM +0100, Vincent Guittot wrote:
>> Add new statistics which reflect the average time a task is running on the
>> CPU and the sum of the tasks' running on a runqueue. The latter is named
>> usage_avg_contrib.
>>
>> This patch is based on the usage metric that was proposed in the 1st
>> versions of the per-entity load tracking patchset but that has be removed
>> afterward. This version differs from the original one in the sense that it's
>> not linked to task_group.
>>
>> The rq's usage_avg_contrib will be used to check if a rq is overloaded or not
>> instead of trying to compute how many task a group of CPUs can handle
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>
> I should have read this patch before I did almost the same patch as
> part of a series to introduce scale-invariance which I am about to post :(
>
> The only difference I see is slightly different naming and that, AFAICT,
> task group usage is not accounted for in this patch. Can we add the
> support for task groups as well? I can provide a patch based on this one
> if you want.
>
> Also, since more than half this patch comes directly from PJT's original
> patch I would add "by Paul Turner <pjt@google.com>" somewhere in the
> text above.

Sorry, it was obvious to me that the per-entity load tracking patchset
was done by PJT, but I can surely replace
"the per-entity load tracking patchset" with "the per-entity load
tracking patchset by Paul Turner <pjt@google.com>"

>
> [...]
>
>> +static inline void __update_task_entity_usage(struct sched_entity *se)
>> +{
>> +     u32 contrib;
>> +
>> +     /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
>> +     contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
>> +     contrib /= (se->avg.runnable_avg_period + 1);
>> +     se->avg.usage_avg_contrib = scale_load(contrib);
>> +}
>> +
>> +static long __update_entity_usage_avg_contrib(struct sched_entity *se)
>> +{
>> +     long old_contrib = se->avg.usage_avg_contrib;
>> +
>> +     if (entity_is_task(se))
>> +             __update_task_entity_usage(se);
>
> Groups are never updated?
>
> As said above. I have code that does it.

Yes, feel free to send a patch on top of this one that adds group support.

Vincent
>
> Morten
>
>> +
>> +     return se->avg.usage_avg_contrib - old_contrib;
>> +}
>> +
>
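
For reference, a quick worked example of the quoted __update_task_entity_usage()
computation, with made-up numbers and assuming SCHED_LOAD_SCALE == 1024 (no
extra load resolution):

	/* Made-up numbers: a task that ran ~50% of its tracked period. */
	unsigned int running_avg_sum     = 23000;
	unsigned int runnable_avg_period = 46000;
	unsigned int contrib = running_avg_sum * 1024 / (runnable_avg_period + 1);
	/* contrib == 511, so usage_avg_contrib lands around half of
	 * SCHED_LOAD_SCALE, i.e. ~50% usage of the CPU. */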

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-14 19:41         ` Peter Zijlstra
@ 2014-09-16 17:00           ` Dietmar Eggemann
  -1 siblings, 0 replies; 164+ messages in thread
From: Dietmar Eggemann @ 2014-09-16 17:00 UTC (permalink / raw)
  To: Peter Zijlstra, Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano

On 14/09/14 12:41, Peter Zijlstra wrote:
> On Thu, Sep 11, 2014 at 07:26:48PM +0200, Vincent Guittot wrote:
>> On 11 September 2014 18:15, Peter Zijlstra <peterz@infradead.org> wrote:
>>> On Tue, Aug 26, 2014 at 01:06:54PM +0200, Vincent Guittot wrote:
>>>> +static inline int group_has_free_capacity(struct sg_lb_stats *sgs,
>>>> +                     struct lb_env *env)
>>>>   {
>>>> +     if ((sgs->group_capacity_orig * 100) >
>>>> +                     (sgs->group_utilization * env->sd->imbalance_pct))
>>>> +             return 1;
>>>> +
>>>> +     if (sgs->sum_nr_running < sgs->group_weight)
>>>> +             return 1;
>>>>
>>>> +     return 0;
>>>> +}
>>>>
>>>> +static inline int group_is_overloaded(struct sg_lb_stats *sgs,
>>>> +                     struct lb_env *env)
>>>> +{
>>>> +     if (sgs->sum_nr_running <= sgs->group_weight)
>>>> +             return 0;
>>>>
>>>> +     if ((sgs->group_capacity_orig * 100) <
>>>> +                     (sgs->group_utilization * env->sd->imbalance_pct))
>>>> +             return 1;
>>>>
>>>> +     return 0;
>>>>   }
>>>
>>> I'm confused about the utilization vs capacity_orig. I see how we should
>>
>> 1st point is that I should compare utilization vs capacity and not
>> capacity_orig.
>> I should have replaced capacity_orig by capacity in the functions
>> above when i move the utilization statistic from
>> rq->avg.runnable_avg_sum to cfs.usage_load_avg.
>> rq->avg.runnable_avg_sum was measuring all activity on the cpu whereas
>> cfs.usage_load_avg integrates only cfs tasks
>>
>> With this change, we don't need sgs->group_capacity_orig anymore but
>> only sgs->group_capacity. So sgs->group_capacity_orig can be removed
>> as it's no more used in the code as sg_capacity_factor has been
>> removed
>
> Yes, but.. so I suppose we need to add DVFS accounting and remove
> cpufreq from the capacity thing. Otherwise I don't see it make sense.

My understanding is that uArch scaling of capacity_orig (and therefore of
capacity too) is done by calling arch_scale_cpu_capacity, and frequency
scaling of capacity is done by calling arch_scale_freq_capacity in
update_cpu_capacity. I understand that this patch-set does not provide
an implementation of arch_scale_freq_capacity though.

The uArch and frequency scaling of load & utilization will be added 
later. I know that Morten is currently working on a 'uArch and frequency 
invariant load & utilization tracking' patch-set.

Although I don't know exactly what you mean by DVFS accounting and
removing cpufreq from the capacity here.
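
For reference, this is roughly how I read the combination of the two hooks in
update_cpu_capacity (a simplified sketch, not the exact patch code):

static void update_cpu_capacity(struct sched_domain *sd, int cpu)
{
	/* uArch scaling: the original capacity of this CPU. */
	unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);

	cpu_rq(cpu)->cpu_capacity_orig = capacity;

	/* Frequency scaling (the default implementation returns
	 * SCHED_CAPACITY_SCALE, i.e. it is a no-op). */
	capacity *= arch_scale_freq_capacity(sd, cpu);
	capacity >>= SCHED_CAPACITY_SHIFT;

	/* Remove the share of clock_task time consumed by rt/irq activity. */
	capacity *= scale_rt_capacity(cpu);
	capacity >>= SCHED_CAPACITY_SHIFT;

	if (!capacity)
		capacity = 1;

	cpu_rq(cpu)->cpu_capacity = capacity;
	sd->groups->sgc->capacity = capacity;
}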

-- Dietmar

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 10/12] sched: get CPU's utilization statistic
  2014-09-15 19:45           ` Morten Rasmussen
@ 2014-09-16 22:43             ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-16 22:43 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Mike Galbraith,
	Nicolas Pitre, linaro-kernel, Daniel Lezcano, Dietmar Eggemann

On 15 September 2014 12:45, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Thu, Sep 11, 2014 at 03:04:44PM +0100, Peter Zijlstra wrote:
>> On Thu, Sep 11, 2014 at 03:07:52PM +0200, Vincent Guittot wrote:
>> > > Also I'm not entirely sure I like the usage, utilization names/metrics.
>> > > I would suggest to reverse them. Call the pure running number
>> > > 'utilization' and this scaled with capacity 'usage' or so.
>> >
>> > ok. i can invert 'usage' and 'utilization', which will give
>> >
>> > s/get_cpu_utilization/get_cpu_usage/
>> > s/sgs->group_utilization/sgs->group_usage/
>
> The confusion will have new dimensions added when we introduce
> scale-invariance too. Then the running number is already scaled by the
> current P-state compute capacity. But I don't have any better
> suggestions.
>
>> > s/cfs.usage_load_avg/cfs.utilization_load_avg/
>
> I don't like using "load" for unweighted metrics. I associate load with
> something that may be weighted by priority like load_avg_contrib, and
> utilization with pure cpu utilization as in how many cycles is spend on
> a particular task. I called it "usage_util_avg" in my own patches, but
> "util_avg" might be better if we agree that utilization == usage.

OK, so I don't have the same definition as you. IMHO, "load" should be
used for figures that are built from the average of the geometric series
used in per-entity load tracking, rather than implying that the figures
are weighted by priority.

>
>> > s/se->avg.usage_avg_contrib/se->avg.utilization_avg_contrib
>
> util_avg_contrib maybe to keep it shorter.
>
>> > s/__update_task_entity_usage/__update_task_entity_utilization
>> > s/__update_entity_usage_avg_contrib/__update_entity_utilization_avg_contrib
>
> Maybe use "util" here as well?

I agree that "utilization" can be a bit too long, but "util" sounds a bit
too short and we lose the "utilization" meaning, so we could use
"activity" instead of "utilization".

Nevertheless, the most important thing is that we find a common naming convention.

Would the following proposal be ok?

s/get_cpu_utilization/get_cpu_usage/
s/sgs->group_utilization/sgs->group_usage/
s/cfs.usage_load_avg/cfs.activity_load_avg/
s/se->avg.usage_avg_contrib/se->avg.activity_avg_contrib
s/__update_task_entity_usage/__update_task_entity_activity
s/__update_entity_usage_avg_contrib/__update_entity_activity_avg_contrib

Vincent
>
> Morten
>

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-15 20:01               ` Peter Zijlstra
@ 2014-09-17 18:45                 ` Morten Rasmussen
  -1 siblings, 0 replies; 164+ messages in thread
From: Morten Rasmussen @ 2014-09-17 18:45 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicolas Pitre, Vincent Guittot, Ingo Molnar, linux-kernel,
	Preeti U Murthy, Russell King - ARM Linux, LAK, Rik van Riel,
	Mike Galbraith, linaro-kernel, Daniel Lezcano, Dietmar Eggemann

On Mon, Sep 15, 2014 at 09:01:59PM +0100, Peter Zijlstra wrote:
> On Mon, Sep 15, 2014 at 03:07:44PM -0400, Nicolas Pitre wrote:
> > On Mon, 15 Sep 2014, Peter Zijlstra wrote:
> 
> > Let's suppose a task running on a 1GHz CPU producing a load of 100.
> > 
> > The same task on a 100MHz CPU would produce a load of 1000 because that 
> > CPU is 10x slower.  So to properly evaluate the load of a task when 
> > moving it around, we want to normalize its load based on the CPU 
> > performance.  In this case the correction factor would be 0.1.
> > 
> > Given those normalized loads, we need to scale CPU capacity as well.  If 
> > the 1GHz CPU can handle 50 of those tasks it has a capacity of 5000.
> > 
> > In theory the 100MHz CPU could handle only 5 of those tasks, meaning it 
> > has a normalized capacity of 500, but only if the load metric is already 
> > normalized as well.
> > 
> > Or am I completely missing the point here?
> 
> So I was thinking of the usage as per the next patch; where we decide if
> a cpu is 'full' or not based on the utilization measure. For this
> measure we're not interested in inter CPU relations at all, and any use
> of capacity scaling simply doesn't make sense.

Right. You don't need to scale capacity to determine whether a cpu is
full or not if you don't have DVFS, but I don't think it hurts if it is
done right. We need the scaling to figure out how much capacity is
available.

> But I think you asking this question shows a 'bigger' problem in that
> the Changelogs are entirely failing at describing the actual problem and
> proposed solution. Because if that were clear, I don't think we would be
> having this particular discussion.

Yes, the bigger problem of scaling things with DVFS and taking
big.LITTLE into account is not addressed in this patch set. This is the
scale-invariance problem that we discussed at Ksummit.

I will send a few patches shortly that provide the usage
scale-invariance. The last bit missing is then to scale capacity too (in
the right places), such that we can determine whether a cpu is full or
not, and how much spare capacity it has, by comparing scaled usage
(utilization) with scaled capacity. Like in Nico's example above.
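
As a rough sketch of that comparison, reusing Nico's made-up numbers (the
helper below is only for illustration):

/* Everything normalized to the 1GHz CPU's scale: the task's usage is 100,
 * the 1GHz CPU's capacity is 5000, the 100MHz CPU's capacity is 500. */
static unsigned long spare_capacity(unsigned long capacity, unsigned long usage)
{
	return capacity > usage ? capacity - usage : 0;
}
/* spare_capacity(5000, 100) == 4900 -> plenty of headroom on the 1GHz CPU    */
/* spare_capacity(500, 100)  == 400  -> room for ~4 more such tasks at 100MHz */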

Morten

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-17 18:45                 ` Morten Rasmussen
@ 2014-09-17 18:58                   ` Morten Rasmussen
  -1 siblings, 0 replies; 164+ messages in thread
From: Morten Rasmussen @ 2014-09-17 18:58 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Nicolas Pitre, Vincent Guittot, Ingo Molnar, linux-kernel,
	Preeti U Murthy, Russell King - ARM Linux, LAK, Rik van Riel,
	Mike Galbraith, linaro-kernel, Daniel Lezcano, Dietmar Eggemann

On Wed, Sep 17, 2014 at 07:45:27PM +0100, Morten Rasmussen wrote:
> On Mon, Sep 15, 2014 at 09:01:59PM +0100, Peter Zijlstra wrote:
> > On Mon, Sep 15, 2014 at 03:07:44PM -0400, Nicolas Pitre wrote:
> > > On Mon, 15 Sep 2014, Peter Zijlstra wrote:
> > 
> > > Let's suppose a task running on a 1GHz CPU producing a load of 100.
> > > 
> > > The same task on a 100MHz CPU would produce a load of 1000 because that 
> > > CPU is 10x slower.  So to properly evaluate the load of a task when 
> > > moving it around, we want to normalize its load based on the CPU 
> > > performance.  In this case the correction factor would be 0.1.
> > > 
> > > Given those normalized loads, we need to scale CPU capacity as well.  If 
> > > the 1GHz CPU can handle 50 of those tasks it has a capacity of 5000.
> > > 
> > > In theory the 100MHz CPU could handle only 5 of those tasks, meaning it 
> > > has a normalized capacity of 500, but only if the load metric is already 
> > > normalized as well.
> > > 
> > > Or am I completely missing the point here?
> > 
> > So I was thinking of the usage as per the next patch; where we decide if
> > a cpu is 'full' or not based on the utilization measure. For this
> > measure we're not interested in inter CPU relations at all, and any use
> > of capacity scaling simply doesn't make sense.
> 
> Right. You don't need to scale capacity to determine whether a cpu is
> full or not if you don't have DVFS, but I don't think it hurts if it is
> done right. We need the scaling to figure out how much capacity is
> available.
> 
> > But I think you asking this question shows a 'bigger' problem in that
> > the Changelogs are entirely failing at describing the actual problem and
> > proposed solution. Because if that were clear, I don't think we would be
> > having this particular discussion.
> 
> Yes, the bigger problem of scaling things with DVFS and taking
> big.LITTLE into account is not addressed in this patch set. This is the
> scale-invariance problem that we discussed at Ksummit.

big.LITTLE is factored in this patch set, but DVFS is not. My bad.

Morten

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-15 22:14             ` Vincent Guittot
@ 2014-09-17 22:25               ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-17 22:25 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On Tue, Sep 16, 2014 at 12:14:54AM +0200, Vincent Guittot wrote:
> On 15 September 2014 13:42, Peter Zijlstra <peterz@infradead.org> wrote:

> > OK, I've reconsidered _again_, I still don't get it.
> >
> > So fundamentally I think its wrong to scale with the capacity; it just
> > doesn't make any sense. Consider big.little stuff, their CPUs are
> > inherently asymmetric in capacity, but that doesn't matter one whit for
> > utilization numbers. If a core is fully consumed its fully consumed, no
> > matter how much work it can or can not do.
> >
> >
> > So the only thing that needs correcting is the fact that these
> > statistics are based on clock_task and some of that time can end up in
> > other scheduling classes, at which point we'll never get 100% even
> > though we're 'saturated'. But correcting for that using capacity doesn't
> > 'work'.
> 
> I'm not sure to catch your last point because the capacity is the only
> figures that take into account the "time" consumed by other classes.
> Have you got in mind another way to take into account the other
> classes ?

So that was the entire point of stuffing capacity in? Note that that
point was not at all clear.

This is very much like 'all we have is a hammer, and therefore
everything is a nail'. The rt fraction is a 'small' part of what the
capacity is.

> So we have cpu_capacity that is the capacity that can be currently
> used by cfs class
> We have cfs.usage_load_avg that is the sum of running time of cfs
> tasks on the CPU and reflect the % of usage of this CPU by CFS tasks
> We have to use the same metrics to compare available capacity for CFS
> and current cfs usage

-ENOPARSE

> Now we have to use the same unit so we can either weight the
> cpu_capacity_orig with the cfs.usage_load_avg and compare it with
> cpu_capacity
> or with divide cpu_capacity by cpu_capacity_orig and scale it into the
> SCHED_LOAD_SCALE range. Is It what you are proposing ?

I'm so not getting it; orig vs capacity still includes
arch_scale_freq_capacity(), so that is not enough to isolate the rt
fraction.
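
Spelling that out with the capacity update proposed in this series (a sketch
only, ignoring rounding):

/*
 *	cpu_capacity / cpu_capacity_orig ~= freq_scale * cfs_fraction
 *
 * where freq_scale comes from arch_scale_freq_capacity() and cfs_fraction is
 * the share of clock_task time left after rt/irq activity, so the ratio mixes
 * the two terms and cannot isolate the rt fraction on its own.
 */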

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-17 18:45                 ` Morten Rasmussen
@ 2014-09-17 23:03                   ` Peter Zijlstra
  -1 siblings, 0 replies; 164+ messages in thread
From: Peter Zijlstra @ 2014-09-17 23:03 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Nicolas Pitre, Vincent Guittot, Ingo Molnar, linux-kernel,
	Preeti U Murthy, Russell King - ARM Linux, LAK, Rik van Riel,
	Mike Galbraith, linaro-kernel, Daniel Lezcano, Dietmar Eggemann

On Wed, Sep 17, 2014 at 07:45:27PM +0100, Morten Rasmussen wrote:
> Right. You don't need to scale capacity to determine whether a cpu is
> full or not if you don't have DVFS, but I don't think it hurts if it is
> done right. We need the scaling to figure out how much capacity is
> available.

Maybe I'm particularly dense, but I'm not seeing how the proposed thing
is 'right' in any way.

There are no words on what specifically is addressed and how it is
achieved.

^ permalink raw reply	[flat|nested] 164+ messages in thread

* Re: [PATCH v5 11/12] sched: replace capacity_factor by utilization
  2014-09-17 22:25               ` Peter Zijlstra
@ 2014-09-18  1:32                 ` Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: Vincent Guittot @ 2014-09-18  1:32 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Ingo Molnar, linux-kernel, Preeti U Murthy,
	Russell King - ARM Linux, LAK, Rik van Riel, Morten Rasmussen,
	Mike Galbraith, Nicolas Pitre, linaro-kernel, Daniel Lezcano,
	Dietmar Eggemann

On 17 September 2014 15:25, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Sep 16, 2014 at 12:14:54AM +0200, Vincent Guittot wrote:
>> On 15 September 2014 13:42, Peter Zijlstra <peterz@infradead.org> wrote:
>
>> > OK, I've reconsidered _again_, I still don't get it.
>> >
>> > So fundamentally I think its wrong to scale with the capacity; it just
>> > doesn't make any sense. Consider big.little stuff, their CPUs are
>> > inherently asymmetric in capacity, but that doesn't matter one whit for
>> > utilization numbers. If a core is fully consumed its fully consumed, no
>> > matter how much work it can or can not do.
>> >
>> >
>> > So the only thing that needs correcting is the fact that these
>> > statistics are based on clock_task and some of that time can end up in
>> > other scheduling classes, at which point we'll never get 100% even
>> > though we're 'saturated'. But correcting for that using capacity doesn't
>> > 'work'.
>>
>> I'm not sure to catch your last point because the capacity is the only
>> figures that take into account the "time" consumed by other classes.
>> Have you got in mind another way to take into account the other
>> classes ?
>
> So that was the entire point of stuffing capacity in? Note that that
> point was not at all clear.
>
> This is very much like 'all we have is a hammer, and therefore
> everything is a nail'. The rt fraction is a 'small' part of what the
> capacity is.
>
>> So we have cpu_capacity that is the capacity that can be currently
>> used by cfs class
>> We have cfs.usage_load_avg that is the sum of running time of cfs
>> tasks on the CPU and reflect the % of usage of this CPU by CFS tasks
>> We have to use the same metrics to compare available capacity for CFS
>> and current cfs usage
>
> -ENOPARSE
>
>> Now we have to use the same unit so we can either weight the
>> cpu_capacity_orig with the cfs.usage_load_avg and compare it with
>> cpu_capacity
>> or with divide cpu_capacity by cpu_capacity_orig and scale it into the
>> SCHED_LOAD_SCALE range. Is It what you are proposing ?
>
> I'm so not getting it; orig vs capacity still includes
> arch_scale_freq_capacity(), so that is not enough to isolate the rt
> fraction.

This patch does not try to solve any scale invariance issue. This
patch removes capacity_factor because it rarely works correctly.
capacity_factor tries to compute how many tasks a group of CPUs can
handle at the time we are doing the load balance. The capacity_factor
hardly works for SMT systems: it sometimes works for big cores but
fails to do the right thing for little cores.

Below are two examples to illustrate the problem that this patch solves:

capacity_factor makes the assumption that the max capacity of a CPU is
SCHED_CAPACITY_SCALE and that the load of a thread is always
SCHED_LOAD_SCALE. It compares the figure derived from these values with the
sum of nr_running to decide if a group is overloaded or not.

But if the default capacity of a CPU is less than SCHED_CAPACITY_SCALE
(640 as an example), a group of 3 CPUs will have a max capacity_factor
of 2 (div_round_closest(3*640/1024) = 2), which means that it will be
seen as overloaded even if we have only one task per CPU.

Then, if the default capacity of a CPU is greater than
SCHED_CAPACITY_SCALE (1512 as an example), a group of 4 CPUs will have
a capacity_factor of 4 (at max, thanks to the fix[0] for SMT systems
that prevents the apparition of ghost CPUs), but if one CPU is fully
used by an rt task (and its capacity is reduced to nearly nothing), the
capacity_factor of the group will still be 4
(div_round_closest(3*1512/1024) = 4).
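
A minimal sketch of the two computations above (the capacities are just the
example values; DIV_ROUND_CLOSEST behaves like the kernel macro for positive
values):

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024
#define DIV_ROUND_CLOSEST(x, d)	(((x) + (d) / 2) / (d))

int main(void)
{
	/* 3 little CPUs of capacity 640: factor 2, so one task per CPU
	 * already looks overloaded. */
	printf("little group: %d\n",
	       DIV_ROUND_CLOSEST(3 * 640, SCHED_CAPACITY_SCALE));	/* 2 */

	/* 4 big CPUs of capacity 1512, one of them eaten by an rt task:
	 * the factor stays 4 although only 3 CPUs are usable by CFS. */
	printf("big group: %d\n",
	       DIV_ROUND_CLOSEST(3 * 1512, SCHED_CAPACITY_SCALE));	/* 4 */

	return 0;
}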

So, this patch tries to solve this issue by removing capacity_factor
and replacing it with the 2 following metrics:
-the available CPU capacity for CFS tasks, which is the one currently
used by load_balance
-the capacity that is effectively used by CFS tasks on the CPU. For
that, I have re-introduced usage_avg_contrib, which stays in the range
[0..SCHED_LOAD_SCALE] whatever the capacity of the CPU on which the task
is running. This usage_avg_contrib doesn't solve the scale
invariance problem, so I have to scale the usage with the original
capacity in get_cpu_utilization (which will become get_cpu_usage in the
next version) in order to compare it with the available capacity.

Once scale invariance has been added to usage_avg_contrib,
we can remove the scaling by cpu_capacity_orig in get_cpu_utilization.
But the scale invariance will come in another patchset.
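
For reference, a sketch of what such a helper could look like under that
description (not the exact patch code; the early return for a saturated CPU is
just a guess):

static unsigned long get_cpu_usage(int cpu)
{
	unsigned long usage = cpu_rq(cpu)->cfs.usage_load_avg;
	unsigned long capacity_orig = cpu_rq(cpu)->cpu_capacity_orig;

	/* Scale the cfs usage, which is in [0..SCHED_LOAD_SCALE], by the
	 * original capacity of the CPU so it can be compared with
	 * cpu_capacity. */
	if (usage >= SCHED_LOAD_SCALE)
		return capacity_orig;

	return (usage * capacity_orig) >> SCHED_LOAD_SHIFT;
}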

Hope that this explanation makes the goal of this patchset clearer.
I can add this explanation to the commit log if you find it clear
enough.

Vincent

[0] https://lkml.org/lkml/2013/8/28/194

^ permalink raw reply	[flat|nested] 164+ messages in thread

* [tip:sched/core] sched: Fix imbalance flag reset
  2014-08-26 11:06   ` Vincent Guittot
  (?)
@ 2014-09-19 11:47   ` tip-bot for Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: tip-bot for Vincent Guittot @ 2014-09-19 11:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, preeti,
	vincent.guittot, tglx

Commit-ID:  afdeee0510db918b31bb4aba47452df2ddbdbcf2
Gitweb:     http://git.kernel.org/tip/afdeee0510db918b31bb4aba47452df2ddbdbcf2
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 26 Aug 2014 13:06:44 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Sep 2014 12:35:24 +0200

sched: Fix imbalance flag reset

The imbalance flag can stay set even though there is no imbalance.

Let's assume that we have 3 tasks that run on a dual core/dual cluster system.
We will have some idle load balances which are triggered during the tick.
Unfortunately, the tick is also used to queue background work, so we can reach
the situation where short work has been queued on a CPU which already runs a
task. The load balance will detect this imbalance (2 tasks on 1 CPU and an idle
CPU) and will try to pull the waiting task onto the idle CPU. The waiting task is
a worker thread that is pinned to a CPU, so an imbalance due to a pinned task is
detected and the imbalance flag is set.

Then, we will not be able to clear the flag because we have at most 1 task on
each CPU, but the imbalance flag will trigger useless active load balancing
between the idle CPU and the busy CPU.

We need to reset the imbalance flag as soon as we have reached a balanced
state. If all tasks are pinned, we don't consider that a balanced state and
leave the imbalance flag set.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409051215-16788-2-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 23 +++++++++++++++++++----
 1 file changed, 19 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9807a99..01856a8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6765,10 +6765,8 @@ more_balance:
 		if (sd_parent) {
 			int *group_imbalance = &sd_parent->groups->sgc->imbalance;
 
-			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0) {
+			if ((env.flags & LBF_SOME_PINNED) && env.imbalance > 0)
 				*group_imbalance = 1;
-			} else if (*group_imbalance)
-				*group_imbalance = 0;
 		}
 
 		/* All tasks on this runqueue were pinned by CPU affinity */
@@ -6779,7 +6777,7 @@ more_balance:
 				env.loop_break = sched_nr_migrate_break;
 				goto redo;
 			}
-			goto out_balanced;
+			goto out_all_pinned;
 		}
 	}
 
@@ -6853,6 +6851,23 @@ more_balance:
 	goto out;
 
 out_balanced:
+	/*
+	 * We reach balance although we may have faced some affinity
+	 * constraints. Clear the imbalance flag if it was set.
+	 */
+	if (sd_parent) {
+		int *group_imbalance = &sd_parent->groups->sgc->imbalance;
+
+		if (*group_imbalance)
+			*group_imbalance = 0;
+	}
+
+out_all_pinned:
+	/*
+	 * We reach balance because all tasks are pinned at this level so
+	 * we can't migrate them. Let the imbalance flag set so parent level
+	 * can try to migrate them.
+	 */
 	schedstat_inc(sd, lb_balanced[idle]);
 
 	sd->nr_balance_failed = 0;

^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [tip:sched/core] sched: Remove a wake_affine() condition
  2014-08-26 11:06   ` Vincent Guittot
  (?)
@ 2014-09-19 11:47   ` tip-bot for Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: tip-bot for Vincent Guittot @ 2014-09-19 11:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, vincent.guittot, tglx

Commit-ID:  05bfb65f52cbdabe26ebb629959416a6cffb034d
Gitweb:     http://git.kernel.org/tip/05bfb65f52cbdabe26ebb629959416a6cffb034d
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 26 Aug 2014 13:06:45 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Sep 2014 12:35:25 +0200

sched: Remove a wake_affine() condition

In wake_affine() I have tried to understand the meaning of the condition:

 (this_load <= load &&
  this_load + target_load(prev_cpu, idx) <= tl_per_task)

but I failed to find a use case that can take advantage of it and I haven't
found a clear description in the previous commit's log.

Furthermore, the comment on the condition refers to the task_hot function that
was used before being replaced by the current condition:

/*
 * This domain has SD_WAKE_AFFINE and
 * p is cache cold in this domain, and
 * there is no bad imbalance.
 */

If we look more deeply at the condition below:

 this_load + target_load(prev_cpu, idx) <= tl_per_task

When sync is clear, we have:

 tl_per_task = runnable_load_avg / nr_running
 this_load = max(runnable_load_avg, cpuload[idx])
 target_load =  max(runnable_load_avg', cpuload'[idx])

It implies that runnable_load_avg == 0 and nr_running <= 1 in order to match the
condition. This implies that runnable_load_avg == 0 too because of the
condition: this_load <= load.

But if this_load is null, 'balanced' is already set and the test is redundant.

If sync is set, it's not as straightforward as above (especially if cgroups
are involved), but the policy should be similar, as we have removed a task that's
going to sleep in order to get more accurate load and this_load values.

The current conclusion is that these additional conditions don't give any benefit,
so we can remove them.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409051215-16788-3-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 30 ++++++------------------------
 1 file changed, 6 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 01856a8..391eaf2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4285,7 +4285,6 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
 	int idx, this_cpu, prev_cpu;
-	unsigned long tl_per_task;
 	struct task_group *tg;
 	unsigned long weight;
 	int balanced;
@@ -4343,32 +4342,15 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 		balanced = this_eff_load <= prev_eff_load;
 	} else
 		balanced = true;
-
-	/*
-	 * If the currently running task will sleep within
-	 * a reasonable amount of time then attract this newly
-	 * woken task:
-	 */
-	if (sync && balanced)
-		return 1;
-
 	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
-	tl_per_task = cpu_avg_load_per_task(this_cpu);
 
-	if (balanced ||
-	    (this_load <= load &&
-	     this_load + target_load(prev_cpu, idx) <= tl_per_task)) {
-		/*
-		 * This domain has SD_WAKE_AFFINE and
-		 * p is cache cold in this domain, and
-		 * there is no bad imbalance.
-		 */
-		schedstat_inc(sd, ttwu_move_affine);
-		schedstat_inc(p, se.statistics.nr_wakeups_affine);
+	if (!balanced)
+		return 0;
 
-		return 1;
-	}
-	return 0;
+	schedstat_inc(sd, ttwu_move_affine);
+	schedstat_inc(p, se.statistics.nr_wakeups_affine);
+
+	return 1;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 164+ messages in thread

* [tip:sched/core] sched: Fix avg_load computation
  2014-08-26 11:06   ` Vincent Guittot
                     ` (2 preceding siblings ...)
  (?)
@ 2014-09-19 11:47   ` tip-bot for Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: tip-bot for Vincent Guittot @ 2014-09-19 11:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, preeti,
	vincent.guittot, tglx

Commit-ID:  65fdac08c264506ff95ee1e34ae066e308c9e6e3
Gitweb:     http://git.kernel.org/tip/65fdac08c264506ff95ee1e34ae066e308c9e6e3
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 26 Aug 2014 13:06:46 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Sep 2014 12:35:26 +0200

sched: Fix avg_load computation

The computation of avg_load and avg_load_per_task should only take into
account the number of CFS tasks. The non-CFS tasks are already taken into
account by decreasing the CPU's capacity, and they will be tracked in the
CPU's utilization (group_utilization) added by the next patches.

Reviewed-by: Preeti U Murthy <preeti@linux.vnet.ibm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409051215-16788-4-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 391eaf2..eb87229 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4095,7 +4095,7 @@ static unsigned long capacity_of(int cpu)
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	unsigned long nr_running = ACCESS_ONCE(rq->nr_running);
+	unsigned long nr_running = ACCESS_ONCE(rq->cfs.h_nr_running);
 	unsigned long load_avg = rq->cfs.runnable_load_avg;
 
 	if (nr_running)
@@ -5985,7 +5985,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
-		sgs->sum_nr_running += rq->nr_running;
+		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		if (rq->nr_running > 1)
 			*overload = true;

* [tip:sched/core] sched: Allow all architectures to set 'capacity_orig'
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-19 11:47   ` tip-bot for Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: tip-bot for Vincent Guittot @ 2014-09-19 11:47 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, kamalesh, torvalds, peterz, preeti,
	vincent.guittot, tglx

Commit-ID:  26bc3c50d3b3984564c270da86f1fbbfb774dbcd
Gitweb:     http://git.kernel.org/tip/26bc3c50d3b3984564c270da86f1fbbfb774dbcd
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 26 Aug 2014 13:06:47 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Sep 2014 12:35:27 +0200

sched: Allow all architectures to set 'capacity_orig'

'capacity_orig' is only changed for systems with an SMT sched_domain level, in order
to reflect the lower capacity of SMT sibling CPUs. Heterogeneous systems also have to
reflect an original capacity that is different from the default value.

Create a more generic function, arch_scale_cpu_capacity(), that can also be used by
non-SMT platforms to set capacity_orig.

The __weak implementation of arch_scale_cpu_capacity() keeps the previous SMT
behaviour, in order to stay backward compatible with the existing use of capacity_orig.

arch_scale_smt_capacity() and default_scale_smt_capacity() have been removed, as
they were not used anywhere other than in arch_scale_cpu_capacity().
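
For illustration only, not part of the patch: a user-space sketch that mirrors
the shape of the new default hook, using a simplified stand-in for sched_domain
(sd_model) and assuming the usual smt_gain of 1178:

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1UL << SCHED_CAPACITY_SHIFT)
#define SD_SHARE_CPUCAPACITY	0x01	/* stand-in flag value */

/* Simplified stand-in for the few sched_domain fields the hook looks at. */
struct sd_model {
	int flags;
	unsigned long span_weight;
	unsigned long smt_gain;
};

/* Same shape as default_scale_cpu_capacity() in the patch. */
static unsigned long scale_cpu_capacity(const struct sd_model *sd)
{
	if ((sd->flags & SD_SHARE_CPUCAPACITY) && sd->span_weight > 1)
		return sd->smt_gain / sd->span_weight;

	return SCHED_CAPACITY_SCALE;
}

int main(void)
{
	struct sd_model smt = { SD_SHARE_CPUCAPACITY, 2, 1178 };
	struct sd_model mc  = { 0, 4, 0 };

	printf("SMT sibling capacity_orig: %lu\n", scale_cpu_capacity(&smt)); /* 589 */
	printf("non-SMT CPU capacity_orig: %lu\n", scale_cpu_capacity(&mc));  /* 1024 */
	return 0;
}

An architecture with better information (e.g. a big.LITTLE platform) can instead
provide its own arch_scale_cpu_capacity() returning a per-CPU value, as the ARM
patch later in the series does.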

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
[ Added default_scale_cpu_capacity() back. ]
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409051215-16788-5-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 27 +++++++++++----------------
 1 file changed, 11 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index eb87229..be530e4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5705,19 +5705,17 @@ unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 	return default_scale_capacity(sd, cpu);
 }
 
-static unsigned long default_scale_smt_capacity(struct sched_domain *sd, int cpu)
+static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	unsigned long weight = sd->span_weight;
-	unsigned long smt_gain = sd->smt_gain;
+	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+		return sd->smt_gain / sd->span_weight;
 
-	smt_gain /= weight;
-
-	return smt_gain;
+	return SCHED_CAPACITY_SCALE;
 }
 
-unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu)
+unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	return default_scale_smt_capacity(sd, cpu);
+	return default_scale_cpu_capacity(sd, cpu);
 }
 
 static unsigned long scale_rt_capacity(int cpu)
@@ -5756,18 +5754,15 @@ static unsigned long scale_rt_capacity(int cpu)
 
 static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	unsigned long weight = sd->span_weight;
 	unsigned long capacity = SCHED_CAPACITY_SCALE;
 	struct sched_group *sdg = sd->groups;
 
-	if ((sd->flags & SD_SHARE_CPUCAPACITY) && weight > 1) {
-		if (sched_feat(ARCH_CAPACITY))
-			capacity *= arch_scale_smt_capacity(sd, cpu);
-		else
-			capacity *= default_scale_smt_capacity(sd, cpu);
+	if (sched_feat(ARCH_CAPACITY))
+		capacity *= arch_scale_cpu_capacity(sd, cpu);
+	else
+		capacity *= default_scale_cpu_capacity(sd, cpu);
 
-		capacity >>= SCHED_CAPACITY_SHIFT;
-	}
+	capacity >>= SCHED_CAPACITY_SHIFT;
 
 	sdg->sgc->capacity_orig = capacity;
 

* [tip:sched/core] ARM: topology: Use the new cpu_capacity interface
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-19 11:48   ` tip-bot for Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: tip-bot for Vincent Guittot @ 2014-09-19 11:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, grant.likely, hpa, mingo, broonie, torvalds,
	peterz, nicolas.pitre, linux, linux, robh+dt, vincent.guittot,
	nico, tglx

Commit-ID:  d3bfca1a7b028a57d648dbc0985492c6a4466ccf
Gitweb:     http://git.kernel.org/tip/d3bfca1a7b028a57d648dbc0985492c6a4466ccf
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 26 Aug 2014 13:06:48 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Sep 2014 12:35:27 +0200

ARM: topology: Use the new cpu_capacity interface

Use the new arch_scale_cpu_capacity() scheduler facility to reflect the
original capacity of a CPU, instead of arch_scale_freq_capacity(), which is
rather meant to scale the capacity with the current frequency.
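
For illustration only: a simplified model with made-up values, where
cpu_capacity_model() and freq_capacity_model() are hypothetical stand-ins, not
kernel functions. The two hooks answer different questions, and capacity_orig
wants the first one:

#include <stdio.h>

#define SCHED_CAPACITY_SCALE	1024UL

/* Max capacity of a CPU relative to the fastest CPU in the system. */
static unsigned long cpu_capacity_model(int cpu)
{
	return cpu == 0 ? 606 : SCHED_CAPACITY_SCALE;	/* CPU0: a little core */
}

/* Capacity available at the CPU's *current* frequency (DVFS). */
static unsigned long freq_capacity_model(int cpu)
{
	return 512;	/* running at half of the max frequency */
}

int main(void)
{
	int cpu = 0;
	unsigned long orig = cpu_capacity_model(cpu);
	unsigned long cur = orig * freq_capacity_model(cpu) / SCHED_CAPACITY_SCALE;

	printf("cpu%d: capacity_orig=%lu capacity at current freq=%lu\n",
	       cpu, orig, cur);	/* 606 and 303 */
	return 0;
}

Reporting cpu_scale through arch_scale_freq_capacity() conflated the two;
exposing it through arch_scale_cpu_capacity() keeps capacity_orig about the CPU
itself rather than its current frequency.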

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Nicolas Pitre <nico@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Grant Likely <grant.likely@linaro.org>
Cc: Guenter Roeck <linux@roeck-us.net>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mark Brown <broonie@linaro.org>
Cc: Nicolas Pitre <nicolas.pitre@linaro.org>
Cc: Rob Herring <robh+dt@kernel.org>
Cc: Russell King <linux@arm.linux.org.uk>
Cc: Vincent Guittot <vincent.guittot@linaro.org>
Cc: devicetree@vger.kernel.org
Cc: linux-arm-kernel@lists.infradead.org
Link: http://lkml.kernel.org/r/1409051215-16788-6-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 arch/arm/kernel/topology.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index e35d880..89cfdd6 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale);
 
-unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
 	return per_cpu(cpu_scale, cpu);
 }
@@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
 	set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
 
 	printk(KERN_INFO "CPU%u: update cpu_capacity %lu\n",
-		cpu, arch_scale_freq_capacity(NULL, cpu));
+		cpu, arch_scale_cpu_capacity(NULL, cpu));
 }
 
 #else

* [tip:sched/core] sched: Test the CPU's capacity in wake_affine()
  2014-08-26 11:06   ` Vincent Guittot
@ 2014-09-19 11:48   ` tip-bot for Vincent Guittot
  -1 siblings, 0 replies; 164+ messages in thread
From: tip-bot for Vincent Guittot @ 2014-09-19 11:48 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: linux-kernel, hpa, mingo, torvalds, peterz, vincent.guittot, tglx

Commit-ID:  bd61c98f9b3f142cd63f9e15acfe203bec9e5f5a
Gitweb:     http://git.kernel.org/tip/bd61c98f9b3f142cd63f9e15acfe203bec9e5f5a
Author:     Vincent Guittot <vincent.guittot@linaro.org>
AuthorDate: Tue, 26 Aug 2014 13:06:50 +0200
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Fri, 19 Sep 2014 12:35:28 +0200

sched: Test the CPU's capacity in wake_affine()

Currently the task always wakes affine on this_cpu if the latter is idle.
Before waking up the task on this_cpu, we now check that this_cpu's capacity is
not significantly reduced because of RT tasks or irq activity.

Use cases where the number of irqs and/or the time spent under irq is
significant will benefit from this, because a task woken up by an irq or
softirq will not run on the same CPU as that irq (and softirq) but on an
idle one.
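
For illustration only, not part of the patch: a user-space sketch with made-up
numbers, assuming the common imbalance_pct of 125. With this_cpu idle for CFS
but most of its capacity consumed by irq/RT, the capacity terms alone now make
the wakeup non-affine:

#include <stdio.h>

int main(void)
{
	unsigned long imbalance_pct = 125;	/* typical sd->imbalance_pct */
	unsigned long capacity_prev = 1024;	/* prev_cpu fully available */
	unsigned long capacity_this = 400;	/* this_cpu mostly eaten by irq/RT */

	/*
	 * this_load == 0 here (this_cpu is idle for CFS), so the load terms
	 * are skipped and only the capacities are compared.
	 */
	unsigned long this_eff_load = 100 * capacity_prev;
	unsigned long prev_eff_load = (100 + (imbalance_pct - 100) / 2) * capacity_this;

	int balanced = this_eff_load <= prev_eff_load;

	printf("this_eff=%lu prev_eff=%lu -> wake affine: %s\n",
	       this_eff_load, prev_eff_load, balanced ? "yes" : "no");
	/* 102400 vs 44800 -> not balanced, the task stays on prev_cpu */
	return 0;
}

Previously, an idle this_cpu (this_load == 0) made the wakeup affine
unconditionally, regardless of how much of its capacity was left after RT/irq
accounting.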

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: preeti@linux.vnet.ibm.com
Cc: riel@redhat.com
Cc: Morten.Rasmussen@arm.com
Cc: efault@gmx.de
Cc: nicolas.pitre@linaro.org
Cc: daniel.lezcano@linaro.org
Cc: dietmar.eggemann@arm.com
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Link: http://lkml.kernel.org/r/1409051215-16788-8-git-send-email-vincent.guittot@linaro.org
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 19 ++++++++++---------
 1 file changed, 10 insertions(+), 9 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index be530e4..74fa2c2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4284,6 +4284,7 @@ static int wake_wide(struct task_struct *p)
 static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 {
 	s64 this_load, load;
+	s64 this_eff_load, prev_eff_load;
 	int idx, this_cpu, prev_cpu;
 	struct task_group *tg;
 	unsigned long weight;
@@ -4327,21 +4328,21 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 	 * Otherwise check if either cpus are near enough in load to allow this
 	 * task to be woken on this_cpu.
 	 */
-	if (this_load > 0) {
-		s64 this_eff_load, prev_eff_load;
+	this_eff_load = 100;
+	this_eff_load *= capacity_of(prev_cpu);
+
+	prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
+	prev_eff_load *= capacity_of(this_cpu);
 
-		this_eff_load = 100;
-		this_eff_load *= capacity_of(prev_cpu);
+	if (this_load > 0) {
 		this_eff_load *= this_load +
 			effective_load(tg, this_cpu, weight, weight);
 
-		prev_eff_load = 100 + (sd->imbalance_pct - 100) / 2;
-		prev_eff_load *= capacity_of(this_cpu);
 		prev_eff_load *= load + effective_load(tg, prev_cpu, 0, weight);
+	}
+
+	balanced = this_eff_load <= prev_eff_load;
 
-		balanced = this_eff_load <= prev_eff_load;
-	} else
-		balanced = true;
 	schedstat_inc(p, se.statistics.nr_wakeups_affine_attempts);
 
 	if (!balanced)

end of thread, other threads:[~2014-09-19 11:52 UTC | newest]

Thread overview: 164+ messages
2014-08-26 11:06 [PATCH v5 00/12] sched: consolidation of cpu_capacity Vincent Guittot
2014-08-26 11:06 ` Vincent Guittot
2014-08-26 11:06 ` [PATCH v5 01/12] sched: fix imbalance flag reset Vincent Guittot
2014-08-26 11:06   ` Vincent Guittot
2014-09-19 11:47   ` [tip:sched/core] sched: Fix " tip-bot for Vincent Guittot
2014-08-26 11:06 ` [PATCH v5 02/12] sched: remove a wake_affine condition Vincent Guittot
2014-08-26 11:06   ` Vincent Guittot
2014-09-19 11:47   ` [tip:sched/core] sched: Remove a wake_affine() condition tip-bot for Vincent Guittot
2014-08-26 11:06 ` [PATCH v5 03/12] sched: fix avg_load computation Vincent Guittot
2014-08-26 11:06   ` Vincent Guittot
2014-08-30 12:00   ` Preeti U Murthy
2014-08-30 12:00     ` Preeti U Murthy
2014-09-03 11:09     ` Vincent Guittot
2014-09-03 11:09       ` Vincent Guittot
2014-09-03 23:43       ` Tim Chen
2014-09-03 23:43         ` Tim Chen
2014-09-04  7:17         ` Vincent Guittot
2014-09-04  7:17           ` Vincent Guittot
2014-09-04 16:26           ` Tim Chen
2014-09-04 16:26             ` Tim Chen
2014-09-05 11:10   ` Preeti U Murthy
2014-09-05 11:10     ` Preeti U Murthy
2014-09-19 11:47   ` [tip:sched/core] sched: Fix " tip-bot for Vincent Guittot
2014-08-26 11:06 ` [PATCH v5 04/12] sched: Allow all archs to set the capacity_orig Vincent Guittot
2014-08-26 11:06   ` Vincent Guittot
2014-08-27 13:12   ` Kamalesh Babulal
2014-08-27 13:12     ` Kamalesh Babulal
2014-08-30 17:07   ` Preeti U Murthy
2014-08-30 17:07     ` Preeti U Murthy
2014-09-01  8:05     ` Vincent Guittot
2014-09-01  8:05       ` Vincent Guittot
2014-09-03  8:41       ` Preeti U Murthy
2014-09-03  8:41         ` Preeti U Murthy
2014-09-10 13:50     ` Peter Zijlstra
2014-09-10 13:50       ` Peter Zijlstra
2014-09-10 14:22       ` Vincent Guittot
2014-09-10 14:22         ` Vincent Guittot
2014-09-11 10:36       ` Preeti U Murthy
2014-09-11 10:36         ` Preeti U Murthy
2014-09-19 11:47   ` [tip:sched/core] sched: Allow all architectures to set 'capacity_orig' tip-bot for Vincent Guittot
2014-08-26 11:06 ` [PATCH v5 05/12] ARM: topology: use new cpu_capacity interface Vincent Guittot
2014-08-26 11:06   ` Vincent Guittot
2014-09-11 18:52   ` Nicolas Pitre
2014-09-11 18:52     ` Nicolas Pitre
2014-09-19 11:48   ` [tip:sched/core] ARM: topology: Use the " tip-bot for Vincent Guittot
2014-08-26 11:06 ` [PATCH v5 06/12] sched: add per rq cpu_capacity_orig Vincent Guittot
2014-08-26 11:06   ` Vincent Guittot
2014-08-27 13:32   ` Kamalesh Babulal
2014-08-27 13:32     ` Kamalesh Babulal
2014-08-28  7:34     ` Vincent Guittot
2014-08-28  7:34       ` Vincent Guittot
2014-09-10 13:53   ` Peter Zijlstra
2014-09-10 13:53     ` Peter Zijlstra
2014-09-10 14:19     ` Vincent Guittot
2014-09-10 14:19       ` Vincent Guittot
2014-09-11 19:02   ` Nicolas Pitre
2014-09-11 19:02     ` Nicolas Pitre
2014-09-15 21:22     ` Vincent Guittot
2014-09-15 21:22       ` Vincent Guittot
2014-08-26 11:06 ` [PATCH v5 07/12] sched: test the cpu's capacity in wake affine Vincent Guittot
2014-08-26 11:06   ` Vincent Guittot
2014-09-10 14:19   ` Peter Zijlstra
2014-09-10 14:19     ` Peter Zijlstra
2014-09-19 11:48   ` [tip:sched/core] sched: Test the CPU's capacity in wake_affine() tip-bot for Vincent Guittot
2014-08-26 11:06 ` [PATCH v5 08/12] sched: move cfs task on a CPU with higher capacity Vincent Guittot
2014-08-26 11:06   ` Vincent Guittot
2014-08-30 17:50   ` Preeti U Murthy
2014-08-30 17:50     ` Preeti U Murthy
2014-09-01  8:45     ` Vincent Guittot
2014-09-01  8:45       ` Vincent Guittot
2014-09-03  9:11       ` Preeti U Murthy
2014-09-03  9:11         ` Preeti U Murthy
2014-09-03 11:44         ` Vincent Guittot
2014-09-03 11:44           ` Vincent Guittot
2014-09-03 12:26           ` Preeti U Murthy
2014-09-03 12:26             ` Preeti U Murthy
2014-09-03 12:49             ` Vincent Guittot
2014-09-03 12:49               ` Vincent Guittot
2014-09-11  9:27             ` Peter Zijlstra
2014-09-11  9:27               ` Peter Zijlstra
2014-09-05 12:06   ` Preeti U Murthy
2014-09-05 12:06     ` Preeti U Murthy
2014-09-05 12:24     ` Vincent Guittot
2014-09-05 12:24       ` Vincent Guittot
2014-09-11 10:07   ` Peter Zijlstra
2014-09-11 10:07     ` Peter Zijlstra
2014-09-11 11:20     ` Vincent Guittot
2014-09-11 11:20       ` Vincent Guittot
2014-09-11 10:13   ` Peter Zijlstra
2014-09-11 10:13     ` Peter Zijlstra
2014-09-11 12:14     ` Vincent Guittot
2014-09-11 12:14       ` Vincent Guittot
2014-09-11 11:54   ` Peter Zijlstra
2014-09-11 11:54     ` Peter Zijlstra
2014-08-26 11:06 ` [PATCH v5 09/12] sched: add usage_load_avg Vincent Guittot
2014-08-26 11:06   ` Vincent Guittot
2014-09-04  7:34   ` [PATCH v5 09/11] " Vincent Guittot
2014-09-04  7:34     ` Vincent Guittot
2014-09-11 11:17     ` Peter Zijlstra
2014-09-11 11:17       ` Peter Zijlstra
2014-09-11 11:17   ` [PATCH v5 09/12] " Peter Zijlstra
2014-09-11 11:17     ` Peter Zijlstra
2014-09-11 12:18     ` Vincent Guittot
2014-09-11 12:18       ` Vincent Guittot
2014-09-11 12:20     ` Vincent Guittot
2014-09-11 12:20       ` Vincent Guittot
2014-09-15 19:15   ` Morten Rasmussen
2014-09-15 19:15     ` Morten Rasmussen
2014-09-15 22:33     ` Vincent Guittot
2014-09-15 22:33       ` Vincent Guittot
2014-08-26 11:06 ` [PATCH v5 10/12] sched: get CPU's utilization statistic Vincent Guittot
2014-08-26 11:06   ` Vincent Guittot
2014-09-11 12:34   ` Peter Zijlstra
2014-09-11 12:34     ` Peter Zijlstra
2014-09-11 13:07     ` Vincent Guittot
2014-09-11 13:07       ` Vincent Guittot
2014-09-11 14:04       ` Peter Zijlstra
2014-09-11 14:04         ` Peter Zijlstra
2014-09-11 19:17         ` Nicolas Pitre
2014-09-11 19:17           ` Nicolas Pitre
2014-09-12  7:41           ` Vincent Guittot
2014-09-12  7:41             ` Vincent Guittot
2014-09-15 19:45         ` Morten Rasmussen
2014-09-15 19:45           ` Morten Rasmussen
2014-09-16 22:43           ` Vincent Guittot
2014-09-16 22:43             ` Vincent Guittot
2014-09-15 19:28     ` Morten Rasmussen
2014-09-15 19:28       ` Morten Rasmussen
2014-08-26 11:06 ` [PATCH v5 11/12] sched: replace capacity_factor by utilization Vincent Guittot
2014-08-26 11:06   ` Vincent Guittot
2014-09-11 15:39   ` Peter Zijlstra
2014-09-11 15:39     ` Peter Zijlstra
2014-09-11 16:15   ` Peter Zijlstra
2014-09-11 16:15     ` Peter Zijlstra
2014-09-11 17:26     ` Vincent Guittot
2014-09-11 17:26       ` Vincent Guittot
2014-09-14 19:41       ` Peter Zijlstra
2014-09-14 19:41         ` Peter Zijlstra
2014-09-14 19:51         ` Peter Zijlstra
2014-09-14 19:51           ` Peter Zijlstra
2014-09-15 11:42         ` Peter Zijlstra
2014-09-15 11:42           ` Peter Zijlstra
2014-09-15 19:07           ` Nicolas Pitre
2014-09-15 19:07             ` Nicolas Pitre
2014-09-15 20:01             ` Peter Zijlstra
2014-09-15 20:01               ` Peter Zijlstra
2014-09-17 18:45               ` Morten Rasmussen
2014-09-17 18:45                 ` Morten Rasmussen
2014-09-17 18:58                 ` Morten Rasmussen
2014-09-17 18:58                   ` Morten Rasmussen
2014-09-17 23:03                 ` Peter Zijlstra
2014-09-17 23:03                   ` Peter Zijlstra
2014-09-15 22:14           ` Vincent Guittot
2014-09-15 22:14             ` Vincent Guittot
2014-09-15 22:18             ` Vincent Guittot
2014-09-15 22:18               ` Vincent Guittot
2014-09-17 22:25             ` Peter Zijlstra
2014-09-17 22:25               ` Peter Zijlstra
2014-09-18  1:32               ` Vincent Guittot
2014-09-18  1:32                 ` Vincent Guittot
2014-09-16 17:00         ` Dietmar Eggemann
2014-09-16 17:00           ` Dietmar Eggemann
2014-08-26 11:06 ` [PATCH v5 12/12] sched: add SD_PREFER_SIBLING for SMT level Vincent Guittot
2014-08-26 11:06   ` Vincent Guittot
