* [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking
@ 2015-08-14 16:23 Morten Rasmussen
  2015-08-14 16:23 ` [PATCH 1/6] sched/fair: Make load tracking frequency scale-invariant Morten Rasmussen
                   ` (7 more replies)
  0 siblings, 8 replies; 97+ messages in thread
From: Morten Rasmussen @ 2015-08-14 16:23 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, daniel.lezcano, Dietmar Eggemann, yuyang.du,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel, Morten Rasmussen

Per-entity load tracking currently compensates for frequency scaling only in
utilization tracking. This patch set extends this compensation to load as well,
and adds compute capacity invariance (different microarchitectures and/or max
frequency/P-state) to utilization. The former prevents suboptimal
load-balancing decisions when cpus run at different frequencies, while the
latter ensures that utilization (sched_avg.util_avg) can be compared across
cpus and that utilization can be compared directly to cpu capacity to determine
if the cpu is overloaded.
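
To give a rough feel for what the two corrections do, here is a sketch with
made-up numbers, using the scale() helper introduced in patch 1:

	/*
	 * A task busy for delta = 512 out of 1024 time units on a cpu running
	 * at half its max frequency (scale_freq = 512), where that cpu also
	 * has half the capacity of the fastest cpu (scale_cpu = 512):
	 */
	freq_inv = scale(512, 512);      /* = 256: frequency-invariant contribution */
	cpu_inv  = scale(freq_inv, 512); /* = 128: additionally capacity-invariant  */

Load only receives the frequency-invariant scaling; utilization receives both,
which is what makes it comparable across cpus and against cpu capacity.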

Note that this posting only contains the scheduler patches; the architecture-
specific implementations of arch_scale_{freq,cpu}_capacity() will be posted
separately later.

The patches have been posted several times before, most recently as part of the
energy-model driven scheduling RFCv5 patch set [1] (patches #2, 4, 6, 8-12). That
RFC also contains patches for the architecture-specific side. In this posting
the commit messages have been updated and the patches have been rebased on a
more recent tip/sched/core that includes Yuyang's rewrite, which made some of
the previously posted patches redundant.

Target: ARM TC2 A7-only (x3)
Test: hackbench -g 25 --threads -l 10000

Before	After
315.545	313.408	-0.68%

Target: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
Test: hackbench -g 25 --threads -l 1000 (avg of 10)

Before	After
6.4643	6.395	-1.07%

[1] http://www.kernelhub.org/?p=2&msg=787634

Dietmar Eggemann (4):
  sched/fair: Make load tracking frequency scale-invariant
  sched/fair: Make utilization tracking cpu scale-invariant
  sched/fair: Name utilization related data and functions consistently
  sched/fair: Get rid of scaling utilization by capacity_orig

Morten Rasmussen (2):
  sched/fair: Convert arch_scale_cpu_capacity() from weak function to
    #define
  sched/fair: Initialize task load and utilization before placing task
    on rq

 include/linux/sched.h   |   8 ++--
 kernel/sched/core.c     |   4 +-
 kernel/sched/fair.c     | 109 +++++++++++++++++++++++-------------------------
 kernel/sched/features.h |   5 ---
 kernel/sched/sched.h    |  11 +++++
 5 files changed, 69 insertions(+), 68 deletions(-)

-- 
1.9.1


^ permalink raw reply	[flat|nested] 97+ messages in thread

* [PATCH 1/6] sched/fair: Make load tracking frequency scale-invariant
  2015-08-14 16:23 [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking Morten Rasmussen
@ 2015-08-14 16:23 ` Morten Rasmussen
  2015-09-13 11:03   ` [tip:sched/core] " tip-bot for Dietmar Eggemann
  2015-08-14 16:23 ` [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define Morten Rasmussen
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 97+ messages in thread
From: Morten Rasmussen @ 2015-08-14 16:23 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, daniel.lezcano, Dietmar Eggemann, yuyang.du,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel, Dietmar Eggemann, Morten Rasmussen

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Apply a frequency scaling correction factor to per-entity load tracking to
make it frequency invariant. Currently, load appears bigger when the cpu
is running slower, which affects load-balancing decisions.

Each segment of the sched_avg.load_sum geometric series is now scaled by
the current frequency so that the sched_avg.load_avg of each sched entity
will be invariant from frequency scaling.

Moreover, cfs_rq.runnable_load_sum is scaled by the current frequency as
well.
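
Expressed as a formula (a sketch; d_i is the runnable time accrued in segment
i and f_curr,i the frequency the cpu ran at during that segment):

	load_sum ~= weight * sum_i( d_i * f_curr,i / f_max * y^i )

so an entity no longer appears to carry more load merely because its cpu
happens to run at a lower frequency.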

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h |  6 +++---
 kernel/sched/fair.c   | 27 +++++++++++++++++----------
 2 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 44dca5b..a153051 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1177,9 +1177,9 @@ struct load_weight {
 
 /*
  * The load_avg/util_avg accumulates an infinite geometric series.
- * 1) load_avg factors the amount of time that a sched_entity is
- * runnable on a rq into its weight. For cfs_rq, it is the aggregated
- * such weights of all runnable and blocked sched_entities.
+ * 1) load_avg factors frequency scaling into the amount of time that a
+ * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
+ * aggregated such weights of all runnable and blocked sched_entities.
  * 2) util_avg factors frequency scaling into the amount of time
  * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
  * For cfs_rq, it is the aggregated such times of all runnable and
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 858b94a..1626410 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2515,6 +2515,8 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
+#define scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -2547,9 +2549,9 @@ static __always_inline int
 __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, periods;
+	u64 delta, scaled_delta, periods;
 	u32 contrib;
-	int delta_w, decayed = 0;
+	int delta_w, scaled_delta_w, decayed = 0;
 	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
 
 	delta = now - sa->last_update_time;
@@ -2585,13 +2587,16 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
+		scaled_delta_w = scale(delta_w, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * delta_w;
-			if (cfs_rq)
-				cfs_rq->runnable_load_sum += weight * delta_w;
+			sa->load_sum += weight * scaled_delta_w;
+			if (cfs_rq) {
+				cfs_rq->runnable_load_sum +=
+						weight * scaled_delta_w;
+			}
 		}
 		if (running)
-			sa->util_sum += delta_w * scale_freq >> SCHED_CAPACITY_SHIFT;
+			sa->util_sum += scaled_delta_w;
 
 		delta -= delta_w;
 
@@ -2608,23 +2613,25 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		contrib = __compute_runnable_contrib(periods);
+		contrib = scale(contrib, scale_freq);
 		if (weight) {
 			sa->load_sum += weight * contrib;
 			if (cfs_rq)
 				cfs_rq->runnable_load_sum += weight * contrib;
 		}
 		if (running)
-			sa->util_sum += contrib * scale_freq >> SCHED_CAPACITY_SHIFT;
+			sa->util_sum += contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
+	scaled_delta = scale(delta, scale_freq);
 	if (weight) {
-		sa->load_sum += weight * delta;
+		sa->load_sum += weight * scaled_delta;
 		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * delta;
+			cfs_rq->runnable_load_sum += weight * scaled_delta;
 	}
 	if (running)
-		sa->util_sum += delta * scale_freq >> SCHED_CAPACITY_SHIFT;
+		sa->util_sum += scaled_delta;
 
 	sa->period_contrib += delta;
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define
  2015-08-14 16:23 [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking Morten Rasmussen
  2015-08-14 16:23 ` [PATCH 1/6] sched/fair: Make load tracking frequency scale-invariant Morten Rasmussen
@ 2015-08-14 16:23 ` Morten Rasmussen
  2015-09-02  9:31   ` Vincent Guittot
  2015-09-13 11:03   ` [tip:sched/core] " tip-bot for Morten Rasmussen
  2015-08-14 16:23 ` [PATCH 3/6] sched/fair: Make utilization tracking cpu scale-invariant Morten Rasmussen
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 97+ messages in thread
From: Morten Rasmussen @ 2015-08-14 16:23 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, daniel.lezcano, Dietmar Eggemann, yuyang.du,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel, Morten Rasmussen

Bring arch_scale_cpu_capacity() in line with the recent change of its
arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched:
Optimize freq invariant accounting") from weak function to #define to
allow inlining of the function.

While at it, remove the ARCH_CAPACITY sched_feature as well. With the
change to #define there isn't a straightforward way to allow runtime
switch between an arch implementation and the default implementation of
arch_scale_cpu_capacity() using sched_feature. The default was to use
the arch-specific implementation, but only the arm architecture provides
one and that is essentially equivalent to the default implementation.
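
For reference, with the #define approach an architecture opts in along these
lines (sketch only; the actual arm wiring is discussed later in this thread):

	/* arch/<arch>/include/asm/topology.h -- sketch */
	#define arch_scale_cpu_capacity scale_cpu_capacity
	struct sched_domain;
	extern unsigned long scale_cpu_capacity(struct sched_domain *sd, int cpu);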

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c     | 22 +---------------------
 kernel/sched/features.h |  5 -----
 kernel/sched/sched.h    | 11 +++++++++++
 3 files changed, 12 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1626410..c72223a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6016,19 +6016,6 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
 	return load_idx;
 }
 
-static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
-	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
-		return sd->smt_gain / sd->span_weight;
-
-	return SCHED_CAPACITY_SCALE;
-}
-
-unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
-	return default_scale_cpu_capacity(sd, cpu);
-}
-
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -6058,16 +6045,9 @@ static unsigned long scale_rt_capacity(int cpu)
 
 static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	unsigned long capacity = SCHED_CAPACITY_SCALE;
+	unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);
 	struct sched_group *sdg = sd->groups;
 
-	if (sched_feat(ARCH_CAPACITY))
-		capacity *= arch_scale_cpu_capacity(sd, cpu);
-	else
-		capacity *= default_scale_cpu_capacity(sd, cpu);
-
-	capacity >>= SCHED_CAPACITY_SHIFT;
-
 	cpu_rq(cpu)->cpu_capacity_orig = capacity;
 
 	capacity *= scale_rt_capacity(cpu);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 83a50e7..6565eac 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -36,11 +36,6 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)
 
-/*
- * Use arch dependent cpu capacity functions
- */
-SCHED_FEAT(ARCH_CAPACITY, true)
-
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(DOUBLE_TICK, false)
 SCHED_FEAT(LB_BIAS, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 22ccc55..7e6f250 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1402,6 +1402,17 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 }
 #endif
 
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+		return sd->smt_gain / sd->span_weight;
+
+	return SCHED_CAPACITY_SCALE;
+}
+#endif
+
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH 3/6] sched/fair: Make utilization tracking cpu scale-invariant
  2015-08-14 16:23 [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking Morten Rasmussen
  2015-08-14 16:23 ` [PATCH 1/6] sched/fair: Make load tracking frequency scale-invariant Morten Rasmussen
  2015-08-14 16:23 ` [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define Morten Rasmussen
@ 2015-08-14 16:23 ` Morten Rasmussen
  2015-08-14 23:04   ` Dietmar Eggemann
  2015-08-14 16:23 ` [PATCH 4/6] sched/fair: Name utilization related data and functions consistently Morten Rasmussen
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 97+ messages in thread
From: Morten Rasmussen @ 2015-08-14 16:23 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, daniel.lezcano, Dietmar Eggemann, yuyang.du,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel, Dietmar Eggemann, Morten Rasmussen

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Besides the existing frequency scale-invariance correction factor, apply
cpu scale-invariance correction factor to utilization tracking to
compensate for any differences in compute capacity. This could be due to
micro-architectural differences (i.e. instructions per second) between
cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
maximum frequency supported by individual cpus in SMP systems. In the
existing implementation utilization isn't comparable between cpus as it
is relative to the capacity of each individual cpu.

Each segment of the sched_avg.util_sum geometric series is now scaled
by the cpu performance factor too so the sched_avg.util_avg of each
sched entity will be invariant from the particular cpu of the HMP/SMP
system on which the sched entity is scheduled.

With this patch, the utilization of a cpu stays relative to the max cpu
performance of the fastest cpu in the system.

In contrast to utilization (sched_avg.util_sum), load
(sched_avg.load_sum) should not be scaled by compute capacity. The
utilization metric is based on running time which only makes sense when
cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
more tasks are added), where load is runnable time which isn't limited
by the capacity of the cpu and therefore is a better metric for
overloaded scenarios. If we run two nice-0 busy loops on two cpus with
different compute capacity their load should be similar since their
compute demands are the same. We have to assume that the compute demand
of any task running on a fully utilized cpu (no spare cycles = 100%
utilization) is high and the same regardless of the compute capacity of
its current cpu, hence we shouldn't scale load by cpu capacity.
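
As a rough numeric sketch (hypothetical capacities, scale() from patch 1):
two always-running nice-0 tasks, one on a big cpu (scale_cpu = 1024) and one
on a little cpu (scale_cpu = 512), both at their max frequency
(scale_freq = 1024), accrue per 1024-unit segment:

	util, big:    scale(scale(1024, 1024), 1024) = 1024  /* 100% of system-wide max */
	util, little: scale(scale(1024, 1024),  512) =  512  /*  50% of system-wide max */
	load, either: weight * scale(1024, 1024)     = weight * 1024

i.e. utilization reflects the absolute compute delivered while load stays the
same on both cpus, matching the equal compute demand.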

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h | 2 +-
 kernel/sched/fair.c   | 7 ++++---
 kernel/sched/sched.h  | 2 +-
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a153051..78a93d7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1180,7 +1180,7 @@ struct load_weight {
  * 1) load_avg factors frequency scaling into the amount of time that a
  * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
  * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency scaling into the amount of time
+ * 2) util_avg factors frequency and cpu scaling into the amount of time
  * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
  * For cfs_rq, it is the aggregated such times of all runnable and
  * blocked sched_entities.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c72223a..63be5a5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2553,6 +2553,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	u32 contrib;
 	int delta_w, scaled_delta_w, decayed = 0;
 	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
+	unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2596,7 +2597,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 			}
 		}
 		if (running)
-			sa->util_sum += scaled_delta_w;
+			sa->util_sum = scale(scaled_delta_w, scale_cpu);
 
 		delta -= delta_w;
 
@@ -2620,7 +2621,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 				cfs_rq->runnable_load_sum += weight * contrib;
 		}
 		if (running)
-			sa->util_sum += contrib;
+			sa->util_sum += scale(contrib, scale_cpu);
 	}
 
 	/* Remainder of delta accrued against u_0` */
@@ -2631,7 +2632,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 			cfs_rq->runnable_load_sum += weight * scaled_delta;
 	}
 	if (running)
-		sa->util_sum += scaled_delta;
+		sa->util_sum += scale(scaled_delta, scale_cpu);
 
 	sa->period_contrib += delta;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7e6f250..50836a9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1406,7 +1406,7 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 static __always_inline
 unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
 		return sd->smt_gain / sd->span_weight;
 
 	return SCHED_CAPACITY_SCALE;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH 4/6] sched/fair: Name utilization related data and functions consistently
  2015-08-14 16:23 [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking Morten Rasmussen
                   ` (2 preceding siblings ...)
  2015-08-14 16:23 ` [PATCH 3/6] sched/fair: Make utilization tracking cpu scale-invariant Morten Rasmussen
@ 2015-08-14 16:23 ` Morten Rasmussen
  2015-09-04  9:08   ` Vincent Guittot
  2015-09-13 11:04   ` [tip:sched/core] " tip-bot for Dietmar Eggemann
  2015-08-14 16:23 ` [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig Morten Rasmussen
                   ` (3 subsequent siblings)
  7 siblings, 2 replies; 97+ messages in thread
From: Morten Rasmussen @ 2015-08-14 16:23 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, daniel.lezcano, Dietmar Eggemann, yuyang.du,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel, Dietmar Eggemann, Morten Rasmussen

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Take advantage of the per-entity load-tracking rewrite to streamline the
naming of utilization-related data and functions by using
{prefix_}util{_suffix} consistently. Moreover, call both signals
({se,cfs}.avg.util_avg) utilization.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 63be5a5..4cc3050 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4825,31 +4825,32 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	return target;
 }
 /*
- * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * cpu_util returns the amount of capacity of a CPU that is used by CFS
  * tasks. The unit of the return value must be the one of capacity so we can
- * compare the usage with the capacity of the CPU that is available for CFS
- * task (ie cpu_capacity).
+ * compare the utilization with the capacity of the CPU that is available for
+ * CFS task (ie cpu_capacity).
  * cfs.avg.util_avg is the sum of running time of runnable tasks on a
  * CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE].  The usage of a CPU can't be higher than the full
- * capacity of the CPU because it's about the running time on this CPU.
+ * [0..SCHED_LOAD_SCALE]. The utilization of a CPU can't be higher than the
+ * full capacity of the CPU because it's about the running time on this CPU.
  * Nevertheless, cfs.avg.util_avg can be higher than SCHED_LOAD_SCALE
  * because of unfortunate rounding in util_avg or just
  * after migrating tasks until the average stabilizes with the new running
- * time. So we need to check that the usage stays into the range
+ * time. So we need to check that the utilization stays into the range
  * [0..cpu_capacity_orig] and cap if necessary.
- * Without capping the usage, a group could be seen as overloaded (CPU0 usage
- * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
+ * Without capping the utilization, a group could be seen as overloaded (CPU0
+ * utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
+ * available capacity.
  */
-static int get_cpu_usage(int cpu)
+static int cpu_util(int cpu)
 {
-	unsigned long usage = cpu_rq(cpu)->cfs.avg.util_avg;
+	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
 	unsigned long capacity = capacity_orig_of(cpu);
 
-	if (usage >= SCHED_LOAD_SCALE)
+	if (util >= SCHED_LOAD_SCALE)
 		return capacity;
 
-	return (usage * capacity) >> SCHED_LOAD_SHIFT;
+	return (util * capacity) >> SCHED_LOAD_SHIFT;
 }
 
 /*
@@ -5941,7 +5942,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
-	unsigned long group_usage; /* Total usage of the group */
+	unsigned long group_util; /* Total utilization of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int idle_cpus;
 	unsigned int group_weight;
@@ -6174,8 +6175,8 @@ static inline int sg_imbalanced(struct sched_group *group)
  * group_has_capacity returns true if the group has spare capacity that could
  * be used by some tasks.
  * We consider that a group has spare capacity if the  * number of task is
- * smaller than the number of CPUs or if the usage is lower than the available
- * capacity for CFS tasks.
+ * smaller than the number of CPUs or if the utilization is lower than the
+ * available capacity for CFS tasks.
  * For the latter, we use a threshold to stabilize the state, to take into
  * account the variance of the tasks' load and to return true if the available
  * capacity in meaningful for the load balancer.
@@ -6189,7 +6190,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 		return true;
 
 	if ((sgs->group_capacity * 100) >
-			(sgs->group_usage * env->sd->imbalance_pct))
+			(sgs->group_util * env->sd->imbalance_pct))
 		return true;
 
 	return false;
@@ -6210,7 +6211,7 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 		return false;
 
 	if ((sgs->group_capacity * 100) <
-			(sgs->group_usage * env->sd->imbalance_pct))
+			(sgs->group_util * env->sd->imbalance_pct))
 		return true;
 
 	return false;
@@ -6258,7 +6259,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
-		sgs->group_usage += get_cpu_usage(i);
+		sgs->group_util += cpu_util(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		if (rq->nr_running > 1)
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-08-14 16:23 [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking Morten Rasmussen
                   ` (3 preceding siblings ...)
  2015-08-14 16:23 ` [PATCH 4/6] sched/fair: Name utilization related data and functions consistently Morten Rasmussen
@ 2015-08-14 16:23 ` Morten Rasmussen
  2015-09-03 23:51   ` Steve Muckle
  2015-08-14 16:23 ` [PATCH 6/6] sched/fair: Initialize task load and utilization before placing task on rq Morten Rasmussen
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 97+ messages in thread
From: Morten Rasmussen @ 2015-08-14 16:23 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, daniel.lezcano, Dietmar Eggemann, yuyang.du,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel, Dietmar Eggemann, Morten Rasmussen

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Utilization is currently scaled by capacity_orig, but since we now have
frequency and cpu invariant cfs_rq.avg.util_avg, frequency and cpu scaling
now happens as part of the utilization tracking itself.
So cfs_rq.avg.util_avg should no longer be scaled in cpu_util().
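
A small worked example of why the rescale can go away (hypothetical little-cpu
capacity_orig = 430, cpu fully busy):

	before: util_avg ~= 1024 (relative to this cpu), so
	        cpu_util() = (1024 * 430) >> SCHED_LOAD_SHIFT = 430
	after:  util_avg ~= 430 (already frequency and cpu invariant), so
	        cpu_util() = min(430, 430) = 430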

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4cc3050..34e24181 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4824,33 +4824,39 @@ static int select_idle_sibling(struct task_struct *p, int target)
 done:
 	return target;
 }
+
 /*
  * cpu_util returns the amount of capacity of a CPU that is used by CFS
  * tasks. The unit of the return value must be the one of capacity so we can
  * compare the utilization with the capacity of the CPU that is available for
  * CFS task (ie cpu_capacity).
- * cfs.avg.util_avg is the sum of running time of runnable tasks on a
- * CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE]. The utilization of a CPU can't be higher than the
- * full capacity of the CPU because it's about the running time on this CPU.
- * Nevertheless, cfs.avg.util_avg can be higher than SCHED_LOAD_SCALE
- * because of unfortunate rounding in util_avg or just
- * after migrating tasks until the average stabilizes with the new running
- * time. So we need to check that the utilization stays into the range
- * [0..cpu_capacity_orig] and cap if necessary.
- * Without capping the utilization, a group could be seen as overloaded (CPU0
- * utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
- * available capacity.
+ *
+ * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
+ * recent utilization of currently non-runnable tasks on a CPU. It represents
+ * the amount of utilization of a CPU in the range [0..capacity_orig] where
+ * capacity_orig is the cpu_capacity available at * the highest frequency
+ * (arch_scale_freq_capacity()).
+ * The utilization of a CPU converges towards a sum equal to or less than the
+ * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
+ * the running time on this CPU scaled by capacity_curr.
+ *
+ * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
+ * higher than capacity_orig because of unfortunate rounding in
+ * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
+ * the average stabilizes with the new running time. We need to check that the
+ * utilization stays within the range of [0..capacity_orig] and cap it if
+ * necessary. Without utilization capping, a group could be seen as overloaded
+ * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
+ * available capacity. We allow utilization to overshoot capacity_curr (but not
+ * capacity_orig) as it useful for predicting the capacity required after task
+ * migrations (scheduler-driven DVFS).
  */
 static int cpu_util(int cpu)
 {
 	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
 	unsigned long capacity = capacity_orig_of(cpu);
 
-	if (util >= SCHED_LOAD_SCALE)
-		return capacity;
-
-	return (util * capacity) >> SCHED_LOAD_SHIFT;
+	return (util >= capacity) ? capacity : util;
 }
 
 /*
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [PATCH 6/6] sched/fair: Initialize task load and utilization before placing task on rq
  2015-08-14 16:23 [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking Morten Rasmussen
                   ` (4 preceding siblings ...)
  2015-08-14 16:23 ` [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig Morten Rasmussen
@ 2015-08-14 16:23 ` Morten Rasmussen
  2015-09-13 11:05   ` [tip:sched/core] " tip-bot for Morten Rasmussen
  2015-08-16 20:46 ` [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking Peter Zijlstra
  2015-08-31  9:24 ` Peter Zijlstra
  7 siblings, 1 reply; 97+ messages in thread
From: Morten Rasmussen @ 2015-08-14 16:23 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, daniel.lezcano, Dietmar Eggemann, yuyang.du,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel, Morten Rasmussen

Task load or utilization is not currently considered in
select_task_rq_fair(), but if we want that in the future we should make
sure it is not zero for new tasks.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b11f624..ae9fca32 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2294,6 +2294,8 @@ void wake_up_new_task(struct task_struct *p)
 	struct rq *rq;
 
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	/* Initialize new task's runnable average */
+	init_entity_runnable_average(&p->se);
 #ifdef CONFIG_SMP
 	/*
 	 * Fork balancing, do it here and not earlier because:
@@ -2303,8 +2305,6 @@ void wake_up_new_task(struct task_struct *p)
 	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 
-	/* Initialize new task's runnable average */
-	init_entity_runnable_average(&p->se);
 	rq = __task_rq_lock(p);
 	activate_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_QUEUED;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH 3/6] sched/fair: Make utilization tracking cpu scale-invariant
  2015-08-14 16:23 ` [PATCH 3/6] sched/fair: Make utilization tracking cpu scale-invariant Morten Rasmussen
@ 2015-08-14 23:04   ` Dietmar Eggemann
  2015-09-04  7:52     ` Vincent Guittot
  2015-09-13 11:04     ` [tip:sched/core] sched/fair: Make utilization tracking CPU scale-invariant tip-bot for Dietmar Eggemann
  0 siblings, 2 replies; 97+ messages in thread
From: Dietmar Eggemann @ 2015-08-14 23:04 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: vincent.guittot, daniel.lezcano, yuyang.du, mturquette, rjw,
	Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

On 14/08/15 17:23, Morten Rasmussen wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>

[...]

> @@ -2596,7 +2597,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  			}
>  		}
>  		if (running)
> -			sa->util_sum += scaled_delta_w;
> +			sa->util_sum = scale(scaled_delta_w, scale_cpu);


There is a small issue (using = instead of +=) with fatal consequences for
the utilization signal: util_sum gets overwritten instead of accumulated
whenever a period boundary is crossed, wiping out the history accrued so far.

-- >8 --

Subject: [PATCH] sched/fair: Make utilization tracking cpu scale-invariant

Besides the existing frequency scale-invariance correction factor, apply
cpu scale-invariance correction factor to utilization tracking to
compensate for any differences in compute capacity. This could be due to
micro-architectural differences (i.e. instructions per second) between
cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
maximum frequency supported by individual cpus in SMP systems. In the
existing implementation utilization isn't comparable between cpus as it
is relative to the capacity of each individual cpu.

Each segment of the sched_avg.util_sum geometric series is now scaled
by the cpu performance factor too so the sched_avg.util_avg of each
sched entity will be invariant from the particular cpu of the HMP/SMP
system on which the sched entity is scheduled.

With this patch, the utilization of a cpu stays relative to the max cpu
performance of the fastest cpu in the system.

In contrast to utilization (sched_avg.util_sum), load
(sched_avg.load_sum) should not be scaled by compute capacity. The
utilization metric is based on running time which only makes sense when
cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
more tasks are added), where load is runnable time which isn't limited
by the capacity of the cpu and therefore is a better metric for
overloaded scenarios. If we run two nice-0 busy loops on two cpus with
different compute capacity their load should be similar since their
compute demands are the same. We have to assume that the compute demand
of any task running on a fully utilized cpu (no spare cycles = 100%
utilization) is high and the same regardless of the compute capacity of
its current cpu, hence we shouldn't scale load by cpu capacity.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h | 2 +-
 kernel/sched/fair.c   | 7 ++++---
 kernel/sched/sched.h  | 2 +-
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a15305117ace..78a93d716fcb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1180,7 +1180,7 @@ struct load_weight {
  * 1) load_avg factors frequency scaling into the amount of time that a
  * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
  * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency scaling into the amount of time
+ * 2) util_avg factors frequency and cpu scaling into the amount of time
  * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
  * For cfs_rq, it is the aggregated such times of all runnable and
  * blocked sched_entities.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c72223a299a8..3321eb13e422 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2553,6 +2553,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	u32 contrib;
 	int delta_w, scaled_delta_w, decayed = 0;
 	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
+	unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2596,7 +2597,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 			}
 		}
 		if (running)
-			sa->util_sum += scaled_delta_w;
+			sa->util_sum += scale(scaled_delta_w, scale_cpu);
 
 		delta -= delta_w;
 
@@ -2620,7 +2621,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 				cfs_rq->runnable_load_sum += weight * contrib;
 		}
 		if (running)
-			sa->util_sum += contrib;
+			sa->util_sum += scale(contrib, scale_cpu);
 	}
 
 	/* Remainder of delta accrued against u_0` */
@@ -2631,7 +2632,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 			cfs_rq->runnable_load_sum += weight * scaled_delta;
 	}
 	if (running)
-		sa->util_sum += scaled_delta;
+		sa->util_sum += scale(scaled_delta, scale_cpu);
 
 	sa->period_contrib += delta;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 7e6f2506a402..50836a9301f9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1406,7 +1406,7 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 static __always_inline
 unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
 		return sd->smt_gain / sd->span_weight;
 
 	return SCHED_CAPACITY_SCALE;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking
  2015-08-14 16:23 [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking Morten Rasmussen
                   ` (5 preceding siblings ...)
  2015-08-14 16:23 ` [PATCH 6/6] sched/fair: Initialize task load and utilization before placing task on rq Morten Rasmussen
@ 2015-08-16 20:46 ` Peter Zijlstra
  2015-08-17 11:29   ` Morten Rasmussen
  2015-08-31  9:24 ` Peter Zijlstra
  7 siblings, 1 reply; 97+ messages in thread
From: Peter Zijlstra @ 2015-08-16 20:46 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, daniel.lezcano, Dietmar Eggemann,
	yuyang.du, mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On Fri, Aug 14, 2015 at 05:23:08PM +0100, Morten Rasmussen wrote:
> Target: ARM TC2 A7-only (x3)
> Test: hackbench -g 25 --threads -l 10000
> 
> Before	After
> 315.545	313.408	-0.68%
> 
> Target: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
> Test: hackbench -g 25 --threads -l 1000 (avg of 10)
> 
> Before	After
> 6.4643	6.395	-1.07%

Yeah, so that is a problem.

I'm taking it some of the new scaling stuff doesn't compile away, can we
look at fixing that?

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking
  2015-08-16 20:46 ` [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking Peter Zijlstra
@ 2015-08-17 11:29   ` Morten Rasmussen
  2015-08-17 11:48     ` Peter Zijlstra
  0 siblings, 1 reply; 97+ messages in thread
From: Morten Rasmussen @ 2015-08-17 11:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, daniel.lezcano, Dietmar Eggemann,
	yuyang.du, mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On Sun, Aug 16, 2015 at 10:46:05PM +0200, Peter Zijlstra wrote:
> On Fri, Aug 14, 2015 at 05:23:08PM +0100, Morten Rasmussen wrote:
> > Target: ARM TC2 A7-only (x3)
> > Test: hackbench -g 25 --threads -l 10000
> > 
> > Before	After
> > 315.545	313.408	-0.68%
> > 
> > Target: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
> > Test: hackbench -g 25 --threads -l 1000 (avg of 10)
> > 
> > Before	After
> > 6.4643	6.395	-1.07%
> 
> Yeah, so that is a problem.

Maybe I'm totally wrong, but doesn't hackbench report execution time, so less
is better? In that case -1.07% means we are doing better with the
patches applied (after time < before time). In any case, I should have
indicated whether the change is good or bad for performance.

> I'm taking it some of the new scaling stuff doesn't compile away, can we
> look at fixing that?

I will double-check that the stuff goes away as expected. I'm pretty
sure it does on ARM.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking
  2015-08-17 11:29   ` Morten Rasmussen
@ 2015-08-17 11:48     ` Peter Zijlstra
  0 siblings, 0 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-08-17 11:48 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, daniel.lezcano, Dietmar Eggemann,
	yuyang.du, mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On Mon, Aug 17, 2015 at 12:29:51PM +0100, Morten Rasmussen wrote:
> On Sun, Aug 16, 2015 at 10:46:05PM +0200, Peter Zijlstra wrote:
> > On Fri, Aug 14, 2015 at 05:23:08PM +0100, Morten Rasmussen wrote:
> > > Target: ARM TC2 A7-only (x3)
> > > Test: hackbench -g 25 --threads -l 10000
> > > 
> > > Before	After
> > > 315.545	313.408	-0.68%
> > > 
> > > Target: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
> > > Test: hackbench -g 25 --threads -l 1000 (avg of 10)
> > > 
> > > Before	After
> > > 6.4643	6.395	-1.07%
> > 
> > Yeah, so that is a problem.
> 
> Maybe I'm totally wrong, but doesn't hackbench report execution so less
> is better? In that case -1.07% means we are doing better with the
> patches applied (after time < before time). In any case, I should have
> indicated whether the change is good or bad for performance.
> 
> > I'm taking it some of the new scaling stuff doesn't compile away, can we
> > look at fixing that?
> 
> I will double-check that the stuff goes away as expected. I'm pretty
> sure it does on ARM.

Ah, uhm.. you have a point there ;-) I'll run the numbers when I'm back
home again.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking
  2015-08-14 16:23 [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking Morten Rasmussen
                   ` (6 preceding siblings ...)
  2015-08-16 20:46 ` [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking Peter Zijlstra
@ 2015-08-31  9:24 ` Peter Zijlstra
  2015-09-02  9:51   ` Dietmar Eggemann
  2015-09-07 12:42   ` Peter Zijlstra
  7 siblings, 2 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-08-31  9:24 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, daniel.lezcano, Dietmar Eggemann,
	yuyang.du, mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On Fri, Aug 14, 2015 at 05:23:08PM +0100, Morten Rasmussen wrote:
> Target: ARM TC2 A7-only (x3)
> Test: hackbench -g 25 --threads -l 10000
> 
> Before	After
> 315.545	313.408	-0.68%
> 
> Target: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
> Test: hackbench -g 25 --threads -l 1000 (avg of 10)
> 
> Before	After
> 6.4643	6.395	-1.07%
> 

A quick run here gives:

IVB-EP (2*20*2):

perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 5000

Before:				After:
5.484170711 ( +-  0.74% )	5.590001145 ( +-  0.45% )

Which is an almost 2% slowdown :/

I've yet to look at what happens.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define
  2015-08-14 16:23 ` [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define Morten Rasmussen
@ 2015-09-02  9:31   ` Vincent Guittot
  2015-09-02 12:41     ` Vincent Guittot
  2015-09-03 19:58     ` Dietmar Eggemann
  2015-09-13 11:03   ` [tip:sched/core] " tip-bot for Morten Rasmussen
  1 sibling, 2 replies; 97+ messages in thread
From: Vincent Guittot @ 2015-09-02  9:31 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Daniel Lezcano, Dietmar Eggemann,
	Yuyang Du, Michael Turquette, rjw, Juri Lelli,
	Sai Charan Gurrappadi, pang.xunlei, linux-kernel

Hi Morten,

On 14 August 2015 at 18:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> Bring arch_scale_cpu_capacity() in line with the recent change of its
> arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched:
> Optimize freq invariant accounting") from weak function to #define to
> allow inlining of the function.
>
> While at it, remove the ARCH_CAPACITY sched_feature as well. With the
> change to #define there isn't a straightforward way to allow runtime
> switch between an arch implementation and the default implementation of
> arch_scale_cpu_capacity() using sched_feature. The default was to use
> the arch-specific implementation, but only the arm architecture provides
> one and that is essentially equivalent to the default implementation.
>
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c     | 22 +---------------------
>  kernel/sched/features.h |  5 -----
>  kernel/sched/sched.h    | 11 +++++++++++
>  3 files changed, 12 insertions(+), 26 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1626410..c72223a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6016,19 +6016,6 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
>         return load_idx;
>  }
>
> -static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> -{
> -       if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
> -               return sd->smt_gain / sd->span_weight;
> -
> -       return SCHED_CAPACITY_SCALE;
> -}
> -
> -unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> -{
> -       return default_scale_cpu_capacity(sd, cpu);
> -}
> -
>  static unsigned long scale_rt_capacity(int cpu)
>  {
>         struct rq *rq = cpu_rq(cpu);
> @@ -6058,16 +6045,9 @@ static unsigned long scale_rt_capacity(int cpu)
>
>  static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>  {
> -       unsigned long capacity = SCHED_CAPACITY_SCALE;
> +       unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);
>         struct sched_group *sdg = sd->groups;
>
> -       if (sched_feat(ARCH_CAPACITY))
> -               capacity *= arch_scale_cpu_capacity(sd, cpu);
> -       else
> -               capacity *= default_scale_cpu_capacity(sd, cpu);
> -
> -       capacity >>= SCHED_CAPACITY_SHIFT;
> -
>         cpu_rq(cpu)->cpu_capacity_orig = capacity;
>
>         capacity *= scale_rt_capacity(cpu);
> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
> index 83a50e7..6565eac 100644
> --- a/kernel/sched/features.h
> +++ b/kernel/sched/features.h
> @@ -36,11 +36,6 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
>   */
>  SCHED_FEAT(WAKEUP_PREEMPTION, true)
>
> -/*
> - * Use arch dependent cpu capacity functions
> - */
> -SCHED_FEAT(ARCH_CAPACITY, true)
> -
>  SCHED_FEAT(HRTICK, false)
>  SCHED_FEAT(DOUBLE_TICK, false)
>  SCHED_FEAT(LB_BIAS, true)
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 22ccc55..7e6f250 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1402,6 +1402,17 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
>  }
>  #endif
>
> +#ifndef arch_scale_cpu_capacity
> +static __always_inline
> +unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> +{
> +       if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
> +               return sd->smt_gain / sd->span_weight;
> +
> +       return SCHED_CAPACITY_SCALE;
> +}
> +#endif
> +

So you change the way arch_scale_cpu_capacity is declared, but I don't
see the update of the arm arch, which declares an
arch_scale_cpu_capacity, to reflect this change in your series.

Regards,
Vincent

>  static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
>  {
>         rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking
  2015-08-31  9:24 ` Peter Zijlstra
@ 2015-09-02  9:51   ` Dietmar Eggemann
  2015-09-07 12:42   ` Peter Zijlstra
  1 sibling, 0 replies; 97+ messages in thread
From: Dietmar Eggemann @ 2015-09-02  9:51 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: mingo, vincent.guittot, daniel.lezcano, yuyang.du, mturquette,
	rjw, Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

On 08/31/2015 11:24 AM, Peter Zijlstra wrote:
> On Fri, Aug 14, 2015 at 05:23:08PM +0100, Morten Rasmussen wrote:
>> Target: ARM TC2 A7-only (x3)
>> Test: hackbench -g 25 --threads -l 10000
>>
>> Before	After
>> 315.545	313.408	-0.68%
>>
>> Target: Intel(R) Core(TM) i5 CPU M 520 @ 2.40GHz
>> Test: hackbench -g 25 --threads -l 1000 (avg of 10)
>>
>> Before	After
>> 6.4643	6.395	-1.07%
>>
>
> A quick run here gives:
>
> IVB-EP (2*20*2):
>
> perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 5000
>
> Before:				After:
> 5.484170711 ( +-  0.74% )	5.590001145 ( +-  0.45% )
>
> Which is an almost 2% slowdown :/
>
> I've yet to look at what happens.
>

I tested the patch-set on top of tip:

ff277d4250fe - sched/deadline: Fix comment in enqueue_task_dl()

on a 2 cluster IVB-EP (2 clusters * 10 cores * 2 HW threads) = 40 
logical cpus w/ (SMT, MC, NUMA sd's).

model name : Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz

perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 5000

Before:				After:
5.049361160 ( +- 1.26% )	5.014980654 ( +- 1.20% )

Even running this test multiple times I never saw anything like a 2%
slowdown.

It's a vanilla ubuntu 15.04 system which might explain the slightly 
higher stddev.

We could optimize the changes we did in __update_load_avg() by only 
calculating the additional scaled values [scaled_delta_w, contrib, 
scaled_delta] in case the function is called w/ 'weight !=0 && running 
!=0'. This is also true for the initialization of scale_freq and scale_cpu.
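
A minimal sketch of that idea for the first (partial) window, computing the
frequency-scaled delta only when one of the consumers needs it:

	delta_w = 1024 - delta_w;
	if (weight || running)
		scaled_delta_w = scale(delta_w, scale_freq);
	if (weight) {
		sa->load_sum += weight * scaled_delta_w;
		if (cfs_rq)
			cfs_rq->runnable_load_sum += weight * scaled_delta_w;
	}
	if (running)
		sa->util_sum += scale(scaled_delta_w, scale_cpu);

The same pattern would apply to 'contrib' and the remaining 'delta'.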


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define
  2015-09-02  9:31   ` Vincent Guittot
@ 2015-09-02 12:41     ` Vincent Guittot
  2015-09-03 19:58     ` Dietmar Eggemann
  1 sibling, 0 replies; 97+ messages in thread
From: Vincent Guittot @ 2015-09-02 12:41 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Daniel Lezcano, Dietmar Eggemann,
	Yuyang Du, Michael Turquette, rjw, Juri Lelli,
	Sai Charan Gurrappadi, pang.xunlei, linux-kernel

On 2 September 2015 at 11:31, Vincent Guittot
<vincent.guittot@linaro.org> wrote:
> Hi Morten,
>
> On 14 August 2015 at 18:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> Bring arch_scale_cpu_capacity() in line with the recent change of its
>> arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched:
>> Optimize freq invariant accounting") from weak function to #define to
>> allow inlining of the function.
>>
>> While at it, remove the ARCH_CAPACITY sched_feature as well. With the
>> change to #define there isn't a straightforward way to allow runtime
>> switch between an arch implementation and the default implementation of
>> arch_scale_cpu_capacity() using sched_feature. The default was to use
>> the arch-specific implementation, but only the arm architecture provides
>> one and that is essentially equivalent to the default implementation.
>>
>> cc: Ingo Molnar <mingo@redhat.com>
>> cc: Peter Zijlstra <peterz@infradead.org>
>>
>> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> ---
>>  kernel/sched/fair.c     | 22 +---------------------
>>  kernel/sched/features.h |  5 -----
>>  kernel/sched/sched.h    | 11 +++++++++++
>>  3 files changed, 12 insertions(+), 26 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 1626410..c72223a 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6016,19 +6016,6 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
>>         return load_idx;
>>  }
>>
>> -static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>> -{
>> -       if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
>> -               return sd->smt_gain / sd->span_weight;
>> -
>> -       return SCHED_CAPACITY_SCALE;
>> -}
>> -
>> -unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>> -{
>> -       return default_scale_cpu_capacity(sd, cpu);
>> -}
>> -
>>  static unsigned long scale_rt_capacity(int cpu)
>>  {
>>         struct rq *rq = cpu_rq(cpu);
>> @@ -6058,16 +6045,9 @@ static unsigned long scale_rt_capacity(int cpu)
>>
>>  static void update_cpu_capacity(struct sched_domain *sd, int cpu)
>>  {
>> -       unsigned long capacity = SCHED_CAPACITY_SCALE;
>> +       unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);
>>         struct sched_group *sdg = sd->groups;
>>
>> -       if (sched_feat(ARCH_CAPACITY))
>> -               capacity *= arch_scale_cpu_capacity(sd, cpu);
>> -       else
>> -               capacity *= default_scale_cpu_capacity(sd, cpu);
>> -
>> -       capacity >>= SCHED_CAPACITY_SHIFT;
>> -
>>         cpu_rq(cpu)->cpu_capacity_orig = capacity;
>>
>>         capacity *= scale_rt_capacity(cpu);
>> diff --git a/kernel/sched/features.h b/kernel/sched/features.h
>> index 83a50e7..6565eac 100644
>> --- a/kernel/sched/features.h
>> +++ b/kernel/sched/features.h
>> @@ -36,11 +36,6 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
>>   */
>>  SCHED_FEAT(WAKEUP_PREEMPTION, true)
>>
>> -/*
>> - * Use arch dependent cpu capacity functions
>> - */
>> -SCHED_FEAT(ARCH_CAPACITY, true)
>> -
>>  SCHED_FEAT(HRTICK, false)
>>  SCHED_FEAT(DOUBLE_TICK, false)
>>  SCHED_FEAT(LB_BIAS, true)
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 22ccc55..7e6f250 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -1402,6 +1402,17 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
>>  }
>>  #endif
>>
>> +#ifndef arch_scale_cpu_capacity
>> +static __always_inline
>> +unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>> +{
>> +       if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
>> +               return sd->smt_gain / sd->span_weight;
>> +
>> +       return SCHED_CAPACITY_SCALE;
>> +}
>> +#endif
>> +
>
> So you change the way to declare arch_scale_cpu_capacity but i don't
> see the update of the arm arch which declare a
> arch_scale_cpu_capacity to reflect this change in your series.

You mentioned in the cover letter that the arch part will be posted
separately later but i'm not sure that it's a good idea to separate
the change of an interface and the update of the archs that implement
it. IMHO, it's better to keep all modification together in one
patchset

Regards,
Vincent
>
> Regards,
> Vincent
>
>>  static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
>>  {
>>         rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
>> --
>> 1.9.1
>>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define
  2015-09-02  9:31   ` Vincent Guittot
  2015-09-02 12:41     ` Vincent Guittot
@ 2015-09-03 19:58     ` Dietmar Eggemann
  2015-09-04  7:26       ` Vincent Guittot
  1 sibling, 1 reply; 97+ messages in thread
From: Dietmar Eggemann @ 2015-09-03 19:58 UTC (permalink / raw)
  To: Vincent Guittot, Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Daniel Lezcano, Yuyang Du,
	Michael Turquette, rjw, Juri Lelli, Sai Charan Gurrappadi,
	pang.xunlei, linux-kernel

Hi Vincent,

On 02/09/15 10:31, Vincent Guittot wrote:
> Hi Morten,
> 
> On 14 August 2015 at 18:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> Bring arch_scale_cpu_capacity() in line with the recent change of its
>> arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched:
>> Optimize freq invariant accounting") from weak function to #define to
>> allow inlining of the function.
>>
>> While at it, remove the ARCH_CAPACITY sched_feature as well. With the
>> change to #define there isn't a straightforward way to allow runtime
>> switch between an arch implementation and the default implementation of
>> arch_scale_cpu_capacity() using sched_feature. The default was to use
>> the arch-specific implementation, but only the arm architecture provides
>> one and that is essentially equivalent to the default implementation.

[...]

> 
> So you change the way to declare arch_scale_cpu_capacity but i don't
> see the update of the arm arch which declare a
> arch_scale_cpu_capacity to reflect this change in your series.

We were reluctant to do this because this functionality only makes sense
for ARCH=arm big.LITTLE systems w/ cortex-a{15|7} cores, and only if the
clock-frequency property is set in the dts file. 

Are you planning to push for a 'struct cpu_efficiency/clock-frequency
property' solution for ARCH=arm64 as well?

I'm asking because for ARCH=arm64 systems today (JUNO, Hi6220) we use the
capacity value of the last entry of the capacity_state vector for the cores
(e.g. cortex-a{57|53}). 
    
To connect the cpu invariant engine (scale_cpu_capacity()
[arch/arm/kernel/topology.c]) with the scheduler, something like this is
missing:

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 370f7a732900..17c6b3243196 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -24,6 +24,10 @@ void init_cpu_topology(void);
 void store_cpu_topology(unsigned int cpuid);
 const struct cpumask *cpu_coregroup_mask(int cpu);
 
+#define arch_scale_cpu_capacity scale_cpu_capacity
+struct sched_domain;
+extern unsigned long scale_cpu_capacity(struct sched_domain *sd, int cpu);
+
 #else
 
 static inline void init_cpu_topology(void) { }
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 08b7847bf912..907e0d2d9b82 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -42,7 +42,7 @@
  */
 static DEFINE_PER_CPU(unsigned long, cpu_scale);
 
-unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+unsigned long scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
        return per_cpu(cpu_scale, cpu);
 }
@@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
        set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
 
        pr_info("CPU%u: update cpu_capacity %lu\n",
-               cpu, arch_scale_cpu_capacity(NULL, cpu));
+               cpu, scale_cpu_capacity(NULL, cpu));
 }

-- Dietmar

[...]


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-08-14 16:23 ` [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig Morten Rasmussen
@ 2015-09-03 23:51   ` Steve Muckle
  2015-09-07 15:37     ` Dietmar Eggemann
  0 siblings, 1 reply; 97+ messages in thread
From: Steve Muckle @ 2015-09-03 23:51 UTC (permalink / raw)
  To: Morten Rasmussen, Dietmar Eggemann
  Cc: peterz, mingo, vincent.guittot, daniel.lezcano, yuyang.du,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

Hi Morten, Dietmar,

On 08/14/2015 09:23 AM, Morten Rasmussen wrote:
...
> + * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
> + * recent utilization of currently non-runnable tasks on a CPU. It represents
> + * the amount of utilization of a CPU in the range [0..capacity_orig] where

I see util_sum is scaled by SCHED_LOAD_SHIFT at the end of
__update_load_avg(). If there is now an assumption that util_avg may be
used directly as a capacity value, should it be changed to
SCHED_CAPACITY_SHIFT? These are equal right now, not sure if they will
always be or if they can be combined.
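
For reference, the line in question is presumably this one at the end of
__update_load_avg() (it is quoted verbatim further down the thread):

	sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;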

> + * capacity_orig is the cpu_capacity available at * the highest frequency

spurious *

thanks,
Steve


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define
  2015-09-03 19:58     ` Dietmar Eggemann
@ 2015-09-04  7:26       ` Vincent Guittot
  2015-09-07 13:25         ` Dietmar Eggemann
  2015-09-11 13:21         ` Dietmar Eggemann
  0 siblings, 2 replies; 97+ messages in thread
From: Vincent Guittot @ 2015-09-04  7:26 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Morten Rasmussen, Peter Zijlstra, mingo, Daniel Lezcano,
	Yuyang Du, Michael Turquette, rjw, Juri Lelli,
	Sai Charan Gurrappadi, pang.xunlei, linux-kernel

On 3 September 2015 at 21:58, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> Hi Vincent,
>
> On 02/09/15 10:31, Vincent Guittot wrote:
>> Hi Morten,
>>
>> On 14 August 2015 at 18:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>>> Bring arch_scale_cpu_capacity() in line with the recent change of its
>>> arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched:
>>> Optimize freq invariant accounting") from weak function to #define to
>>> allow inlining of the function.
>>>
>>> While at it, remove the ARCH_CAPACITY sched_feature as well. With the
>>> change to #define there isn't a straightforward way to allow runtime
>>> switch between an arch implementation and the default implementation of
>>> arch_scale_cpu_capacity() using sched_feature. The default was to use
>>> the arch-specific implementation, but only the arm architecture provides
>>> one and that is essentially equivalent to the default implementation.
>
> [...]
>
>>
>> So you change the way to declare arch_scale_cpu_capacity but i don't
>> see the update of the arm arch which declare a
>> arch_scale_cpu_capacity to reflect this change in your series.
>
> We were reluctant to do this because this functionality makes only sense
> for ARCH=arm big.Little systems w/ cortex-a{15|7} cores and only if the
> clock-frequency property is set in the dts file.

IMO, we should maintain compatibility with the current implementation
instead of breaking the link and creating dead code.
Your proposal below fits that requirement.

>
> Are you planning to push for a 'struct cpu_efficiency/clock-frequency
> property' solution for ARCH=arm64 as well?

I know that there have been some discussions around that, but I didn't
follow the thread in detail.

>
> I'm asking because for ARCH=arm64 systems today (JUNO, Hi6220) we use the
> capacity value of the last entry of the capacity_state vector for the cores
> (e.g. cortex-a{57|53).

Is this a struct of the EAS feature? I'm not sure we should link the
definition of the cpu capacity to an internal struct of a feature; DT
seems a better way to define it.
So if you want to revisit the way we set the capacity of CPUs for arm
and/or arm64, I'm fully open to the discussion, but that should happen
in a thread other than this one, whose only purpose is aligning the
arch_scale_cpu_capacity() interface declaration with the
arch_scale_freq_capacity() one.

So, with the patch below that updates the arm definition of
arch_scale_cpu_capacity, you can add my Acked-by: Vincent Guittot
<vincent.guittot@linaro.org> on this patch and the additional one
below

Regards,
Vincent

>
> To connect the cpu invariant engine (scale_cpu_capacity()
> [arch/arm/kernel/topology.c]) with the scheduler, something like this is
> missing:
>
> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
> index 370f7a732900..17c6b3243196 100644
> --- a/arch/arm/include/asm/topology.h
> +++ b/arch/arm/include/asm/topology.h
> @@ -24,6 +24,10 @@ void init_cpu_topology(void);
>  void store_cpu_topology(unsigned int cpuid);
>  const struct cpumask *cpu_coregroup_mask(int cpu);
>
> +#define arch_scale_cpu_capacity scale_cpu_capacity
> +struct sched_domain;
> +extern unsigned long scale_cpu_capacity(struct sched_domain *sd, int cpu);
> +
>  #else
>
>  static inline void init_cpu_topology(void) { }
> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
> index 08b7847bf912..907e0d2d9b82 100644
> --- a/arch/arm/kernel/topology.c
> +++ b/arch/arm/kernel/topology.c
> @@ -42,7 +42,7 @@
>   */
>  static DEFINE_PER_CPU(unsigned long, cpu_scale);
>
> -unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
> +unsigned long scale_cpu_capacity(struct sched_domain *sd, int cpu)
>  {
>         return per_cpu(cpu_scale, cpu);
>  }
> @@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
>         set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
>
>         pr_info("CPU%u: update cpu_capacity %lu\n",
> -               cpu, arch_scale_cpu_capacity(NULL, cpu));
> +               cpu, scale_cpu_capacity(NULL, cpu));
>  }
>
> -- Dietmar
>
> [...]
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 3/6] sched/fair: Make utilization tracking cpu scale-invariant
  2015-08-14 23:04   ` Dietmar Eggemann
@ 2015-09-04  7:52     ` Vincent Guittot
  2015-09-13 11:04     ` [tip:sched/core] sched/fair: Make utilization tracking CPU scale-invariant tip-bot for Dietmar Eggemann
  1 sibling, 0 replies; 97+ messages in thread
From: Vincent Guittot @ 2015-09-04  7:52 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Morten Rasmussen, peterz, mingo, daniel.lezcano, yuyang.du,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On 15 August 2015 at 01:04, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 14/08/15 17:23, Morten Rasmussen wrote:
>> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>
> [...]
>
>> @@ -2596,7 +2597,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>                       }
>>               }
>>               if (running)
>> -                     sa->util_sum += scaled_delta_w;
>> +                     sa->util_sum = scale(scaled_delta_w, scale_cpu);
>
>
> There is a small issue (using = instead of +=) with fatal consequences
> for the utilization signal.
>
> -- >8 --
>
> Subject: [PATCH] sched/fair: Make utilization tracking cpu scale-invariant
>
> Besides the existing frequency scale-invariance correction factor, apply
> cpu scale-invariance correction factor to utilization tracking to
> compensate for any differences in compute capacity. This could be due to
> micro-architectural differences (i.e. instructions per seconds) between
> cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
> maximum frequency supported by individual cpus in SMP systems. In the
> existing implementation utilization isn't comparable between cpus as it
> is relative to the capacity of each individual cpu.
>
> Each segment of the sched_avg.util_sum geometric series is now scaled
> by the cpu performance factor too so the sched_avg.util_avg of each
> sched entity will be invariant from the particular cpu of the HMP/SMP
> system on which the sched entity is scheduled.
>
> With this patch, the utilization of a cpu stays relative to the max cpu
> performance of the fastest cpu in the system.
>
> In contrast to utilization (sched_avg.util_sum), load
> (sched_avg.load_sum) should not be scaled by compute capacity. The
> utilization metric is based on running time which only makes sense when
> cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
> more tasks are added), where load is runnable time which isn't limited
> by the capacity of the cpu and therefore is a better metric for
> overloaded scenarios. If we run two nice-0 busy loops on two cpus with
> different compute capacity their load should be similar since their
> compute demands are the same. We have to assume that the compute demand
> of any task running on a fully utilized cpu (no spare cycles = 100%
> utilization) is high and the same no matter of the compute capacity of
> its current cpu, hence we shouldn't scale load by cpu capacity.
>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  include/linux/sched.h | 2 +-
>  kernel/sched/fair.c   | 7 ++++---
>  kernel/sched/sched.h  | 2 +-
>  3 files changed, 6 insertions(+), 5 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index a15305117ace..78a93d716fcb 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -1180,7 +1180,7 @@ struct load_weight {
>   * 1) load_avg factors frequency scaling into the amount of time that a
>   * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
>   * aggregated such weights of all runnable and blocked sched_entities.
> - * 2) util_avg factors frequency scaling into the amount of time
> + * 2) util_avg factors frequency and cpu scaling into the amount of time
>   * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
>   * For cfs_rq, it is the aggregated such times of all runnable and
>   * blocked sched_entities.
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index c72223a299a8..3321eb13e422 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2553,6 +2553,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>         u32 contrib;
>         int delta_w, scaled_delta_w, decayed = 0;
>         unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
> +       unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
>
>         delta = now - sa->last_update_time;
>         /*
> @@ -2596,7 +2597,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>                         }
>                 }
>                 if (running)
> -                       sa->util_sum += scaled_delta_w;
> +                       sa->util_sum += scale(scaled_delta_w, scale_cpu);
>
>                 delta -= delta_w;
>
> @@ -2620,7 +2621,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>                                 cfs_rq->runnable_load_sum += weight * contrib;
>                 }
>                 if (running)
> -                       sa->util_sum += contrib;
> +                       sa->util_sum += scale(contrib, scale_cpu);
>         }
>
>         /* Remainder of delta accrued against u_0` */
> @@ -2631,7 +2632,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>                         cfs_rq->runnable_load_sum += weight * scaled_delta;
>         }
>         if (running)
> -               sa->util_sum += scaled_delta;
> +               sa->util_sum += scale(scaled_delta, scale_cpu);
>
>         sa->period_contrib += delta;
>
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 7e6f2506a402..50836a9301f9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1406,7 +1406,7 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
>  static __always_inline
>  unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>  {
> -       if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
> +       if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
>                 return sd->smt_gain / sd->span_weight;
>
>         return SCHED_CAPACITY_SCALE;

FWIW, you can add my Acked-by: Vincent Guittot
<vincent.guittot@linaro.org> on this corrected version
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 4/6] sched/fair: Name utilization related data and functions consistently
  2015-08-14 16:23 ` [PATCH 4/6] sched/fair: Name utilization related data and functions consistently Morten Rasmussen
@ 2015-09-04  9:08   ` Vincent Guittot
  2015-09-11 16:35     ` Dietmar Eggemann
  2015-09-13 11:04   ` [tip:sched/core] " tip-bot for Dietmar Eggemann
  1 sibling, 1 reply; 97+ messages in thread
From: Vincent Guittot @ 2015-09-04  9:08 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Daniel Lezcano, Dietmar Eggemann,
	Yuyang Du, Michael Turquette, rjw, Juri Lelli,
	Sai Charan Gurrappadi, pang.xunlei, linux-kernel

On 14 August 2015 at 18:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>
> Use the advent of the per-entity load tracking rewrite to streamline the
> naming of utilization related data and functions by using
> {prefix_}util{_suffix} consistently. Moreover call both signals
> ({se,cfs}.avg.util_avg) utilization.

I don't have a strong opinion about the naming of this variable, but I
remember a discussion about this topic:
https://lkml.org/lkml/2014/9/11/474 : "Call the pure running number
'utilization' and this scaled with capacity 'usage' "

Utilization has been shortened to util with the rewrite of the PELT code,
so the current use of usage in get_cpu_usage still follows this rule.

So why do you want to change that now?
Furthermore, cfs.avg.util_avg is a load whereas sgs->group_util is a
capacity. They don't use the same unit or the same range, which can be
confusing when you read the code.

Regards,
Vincent

>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 37 +++++++++++++++++++------------------
>  1 file changed, 19 insertions(+), 18 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 63be5a5..4cc3050 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4825,31 +4825,32 @@ static int select_idle_sibling(struct task_struct *p, int target)
>         return target;
>  }
>  /*
> - * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
> + * cpu_util returns the amount of capacity of a CPU that is used by CFS
>   * tasks. The unit of the return value must be the one of capacity so we can
> - * compare the usage with the capacity of the CPU that is available for CFS
> - * task (ie cpu_capacity).
> + * compare the utilization with the capacity of the CPU that is available for
> + * CFS task (ie cpu_capacity).
>   * cfs.avg.util_avg is the sum of running time of runnable tasks on a
>   * CPU. It represents the amount of utilization of a CPU in the range
> - * [0..SCHED_LOAD_SCALE].  The usage of a CPU can't be higher than the full
> - * capacity of the CPU because it's about the running time on this CPU.
> + * [0..SCHED_LOAD_SCALE]. The utilization of a CPU can't be higher than the
> + * full capacity of the CPU because it's about the running time on this CPU.
>   * Nevertheless, cfs.avg.util_avg can be higher than SCHED_LOAD_SCALE
>   * because of unfortunate rounding in util_avg or just
>   * after migrating tasks until the average stabilizes with the new running
> - * time. So we need to check that the usage stays into the range
> + * time. So we need to check that the utilization stays into the range
>   * [0..cpu_capacity_orig] and cap if necessary.
> - * Without capping the usage, a group could be seen as overloaded (CPU0 usage
> - * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
> + * Without capping the utilization, a group could be seen as overloaded (CPU0
> + * utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
> + * available capacity.
>   */
> -static int get_cpu_usage(int cpu)
> +static int cpu_util(int cpu)
>  {
> -       unsigned long usage = cpu_rq(cpu)->cfs.avg.util_avg;
> +       unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
>         unsigned long capacity = capacity_orig_of(cpu);
>
> -       if (usage >= SCHED_LOAD_SCALE)
> +       if (util >= SCHED_LOAD_SCALE)
>                 return capacity;
>
> -       return (usage * capacity) >> SCHED_LOAD_SHIFT;
> +       return (util * capacity) >> SCHED_LOAD_SHIFT;
>  }
>
>  /*
> @@ -5941,7 +5942,7 @@ struct sg_lb_stats {
>         unsigned long sum_weighted_load; /* Weighted load of group's tasks */
>         unsigned long load_per_task;
>         unsigned long group_capacity;
> -       unsigned long group_usage; /* Total usage of the group */
> +       unsigned long group_util; /* Total utilization of the group */
>         unsigned int sum_nr_running; /* Nr tasks running in the group */
>         unsigned int idle_cpus;
>         unsigned int group_weight;
> @@ -6174,8 +6175,8 @@ static inline int sg_imbalanced(struct sched_group *group)
>   * group_has_capacity returns true if the group has spare capacity that could
>   * be used by some tasks.
>   * We consider that a group has spare capacity if the  * number of task is
> - * smaller than the number of CPUs or if the usage is lower than the available
> - * capacity for CFS tasks.
> + * smaller than the number of CPUs or if the utilization is lower than the
> + * available capacity for CFS tasks.
>   * For the latter, we use a threshold to stabilize the state, to take into
>   * account the variance of the tasks' load and to return true if the available
>   * capacity in meaningful for the load balancer.
> @@ -6189,7 +6190,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
>                 return true;
>
>         if ((sgs->group_capacity * 100) >
> -                       (sgs->group_usage * env->sd->imbalance_pct))
> +                       (sgs->group_util * env->sd->imbalance_pct))
>                 return true;
>
>         return false;
> @@ -6210,7 +6211,7 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
>                 return false;
>
>         if ((sgs->group_capacity * 100) <
> -                       (sgs->group_usage * env->sd->imbalance_pct))
> +                       (sgs->group_util * env->sd->imbalance_pct))
>                 return true;
>
>         return false;
> @@ -6258,7 +6259,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>                         load = source_load(i, load_idx);
>
>                 sgs->group_load += load;
> -               sgs->group_usage += get_cpu_usage(i);
> +               sgs->group_util += cpu_util(i);
>                 sgs->sum_nr_running += rq->cfs.h_nr_running;
>
>                 if (rq->nr_running > 1)
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking
  2015-08-31  9:24 ` Peter Zijlstra
  2015-09-02  9:51   ` Dietmar Eggemann
@ 2015-09-07 12:42   ` Peter Zijlstra
  2015-09-07 13:21     ` Peter Zijlstra
                       ` (2 more replies)
  1 sibling, 3 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-07 12:42 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, daniel.lezcano, Dietmar Eggemann,
	yuyang.du, mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On Mon, Aug 31, 2015 at 11:24:49AM +0200, Peter Zijlstra wrote:

> A quick run here gives:
> 
> IVB-EP (2*20*2):

As noted by someone, that should be 2*10*2, for a total of 40 cpus in
this machine.

> 
> perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 5000
> 
> Before:				After:
> 5.484170711 ( +-  0.74% )	5.590001145 ( +-  0.45% )
> 
> Which is an almost 2% slowdown :/
> 
> I've yet to look at what happens.

OK, so it appears this is link-order nonsense. When I compared profiles
between the series, the one function that showed a significant change was
skb_release_data(), which doesn't make much sense.

If I do a 'make clean' in front of each build, I get a repeatable
improvement with this patch set (although how much of that is due to the
patches themselves or just to code movement is as yet undetermined).

I'm of a mind to apply these patches; with two patches on top, which
I'll post shortly.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking
  2015-09-07 12:42   ` Peter Zijlstra
@ 2015-09-07 13:21     ` Peter Zijlstra
  2015-09-07 13:23     ` Peter Zijlstra
  2015-09-07 14:44     ` Dietmar Eggemann
  2 siblings, 0 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-07 13:21 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, daniel.lezcano, Dietmar Eggemann,
	yuyang.du, mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On Mon, Sep 07, 2015 at 02:42:20PM +0200, Peter Zijlstra wrote:
> I'm of a mind to apply these patches; with two patches on top, which
> I'll post shortly.

---
Subject: sched: Rename scale()
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Sep 7 15:05:42 CEST 2015

Rename scale() to cap_scale() to better reflect its purpose; it is,
after all, not a general-purpose scale function, since it has
SCHED_CAPACITY_SHIFT hardcoded in it.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |   14 +++++++-------
 1 file changed, 7 insertions(+), 7 deletions(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2515,7 +2515,7 @@ static u32 __compute_runnable_contrib(u6
 	return contrib + runnable_avg_yN_sum[n];
 }
 
-#define scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 /*
  * We can represent the historical contribution to runnable average as the
@@ -2588,7 +2588,7 @@ __update_load_avg(u64 now, int cpu, stru
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
-		scaled_delta_w = scale(delta_w, scale_freq);
+		scaled_delta_w = cap_scale(delta_w, scale_freq);
 		if (weight) {
 			sa->load_sum += weight * scaled_delta_w;
 			if (cfs_rq) {
@@ -2597,7 +2597,7 @@ __update_load_avg(u64 now, int cpu, stru
 			}
 		}
 		if (running)
-			sa->util_sum += scale(scaled_delta_w, scale_cpu);
+			sa->util_sum += cap_scale(scaled_delta_w, scale_cpu);
 
 		delta -= delta_w;
 
@@ -2614,25 +2614,25 @@ __update_load_avg(u64 now, int cpu, stru
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		contrib = __compute_runnable_contrib(periods);
-		contrib = scale(contrib, scale_freq);
+		contrib = cap_scale(contrib, scale_freq);
 		if (weight) {
 			sa->load_sum += weight * contrib;
 			if (cfs_rq)
 				cfs_rq->runnable_load_sum += weight * contrib;
 		}
 		if (running)
-			sa->util_sum += scale(contrib, scale_cpu);
+			sa->util_sum += cap_scale(contrib, scale_cpu);
 	}
 
 	/* Remainder of delta accrued against u_0` */
-	scaled_delta = scale(delta, scale_freq);
+	scaled_delta = cap_scale(delta, scale_freq);
 	if (weight) {
 		sa->load_sum += weight * scaled_delta;
 		if (cfs_rq)
 			cfs_rq->runnable_load_sum += weight * scaled_delta;
 	}
 	if (running)
-		sa->util_sum += scale(scaled_delta, scale_cpu);
+		sa->util_sum += cap_scale(scaled_delta, scale_cpu);
 
 	sa->period_contrib += delta;
 

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking
  2015-09-07 12:42   ` Peter Zijlstra
  2015-09-07 13:21     ` Peter Zijlstra
@ 2015-09-07 13:23     ` Peter Zijlstra
  2015-09-07 14:44     ` Dietmar Eggemann
  2 siblings, 0 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-07 13:23 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, daniel.lezcano, Dietmar Eggemann,
	yuyang.du, mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On Mon, Sep 07, 2015 at 02:42:20PM +0200, Peter Zijlstra wrote:
> I'm of a mind to apply these patches; with two patches on top, which
> I'll post shortly.

---
Subject: sched: Optimize __update_load_avg()
From: Peter Zijlstra <peterz@infradead.org>
Date: Mon Sep  7 15:09:15 CEST 2015

Prior to this patch, the line:

	scaled_delta_w = (delta_w * 1024) >> 10;

which is the result of the default arch_scale_freq_capacity()
function, turns into:

    1b03:	49 89 d1             	mov    %rdx,%r9
    1b06:	49 c1 e1 0a          	shl    $0xa,%r9
    1b0a:	49 c1 e9 0a          	shr    $0xa,%r9

Which is silly; when made unsigned int, GCC recognises these as
pointless ops and does not emit them (confirmed on 4.9.3 and 5.1.1).

Furthermore, afaict unsigned is actually the correct type for these
fields anyway, as we've explicitly ruled out negative deltas earlier
in this function.

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 kernel/sched/fair.c |    2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2551,7 +2551,7 @@ __update_load_avg(u64 now, int cpu, stru
 {
 	u64 delta, scaled_delta, periods;
 	u32 contrib;
-	int delta_w, scaled_delta_w, decayed = 0;
+	unsigned int delta_w, scaled_delta_w, decayed = 0;
 	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
 	unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
 

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define
  2015-09-04  7:26       ` Vincent Guittot
@ 2015-09-07 13:25         ` Dietmar Eggemann
  2015-09-11 13:21         ` Dietmar Eggemann
  1 sibling, 0 replies; 97+ messages in thread
From: Dietmar Eggemann @ 2015-09-07 13:25 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, Peter Zijlstra, mingo, Daniel Lezcano,
	Yuyang Du, Michael Turquette, rjw, Juri Lelli,
	Sai Charan Gurrappadi, pang.xunlei, linux-kernel

On 04/09/15 08:26, Vincent Guittot wrote:
> On 3 September 2015 at 21:58, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

[...]

>>> So you change the way to declare arch_scale_cpu_capacity but i don't
>>> see the update of the arm arch which declare a
>>> arch_scale_cpu_capacity to reflect this change in your series.
>>
>> We were reluctant to do this because this functionality makes only sense
>> for ARCH=arm big.Little systems w/ cortex-a{15|7} cores and only if the
>> clock-frequency property is set in the dts file.
> 
> IMO, we should maintain the compatibility of current implementation
> instead of breaking the link and creating a dead code.
> Your proposal below fits the requirement

The only problem with this solution is that now we get a call to
arch_scale_cpu_capacity() in the hot path, whereas before it was only
called in update_cpu_capacity(). An implementation of
scale_cpu_capacity() in arch/arm/kernel/topology.c leads to a real
function call in __update_load_avg(). I'm in the middle of doing some
performance tests on TC2 w/ and w/o the cpu invariant implementation.
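
To make the concern concrete, here is a rough side-by-side, condensed from
the snippets already quoted in this thread (a sketch rather than the exact
code): with the default, the call in __update_load_avg() can be inlined down
to a constant, while the arm override turns it into an out-of-line call plus
a per-cpu read on every invocation:

/* default (kernel/sched/sched.h): can be fully inlined */
static __always_inline
unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
		return sd->smt_gain / sd->span_weight;

	return SCHED_CAPACITY_SCALE;
}

/* arm override (asm/topology.h + arch/arm/kernel/topology.c): a real
 * function call + per-cpu load in the __update_load_avg() hot path */
#define arch_scale_cpu_capacity scale_cpu_capacity
unsigned long scale_cpu_capacity(struct sched_domain *sd, int cpu)
{
	return per_cpu(cpu_scale, cpu);
}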

> 
>>
>> Are you planning to push for a 'struct cpu_efficiency/clock-frequency
>> property' solution for ARCH=arm64 as well?
> 
> I know that there has been some discussions aorund that but i didn't
> follow the thread in details
> 
>>
>> I'm asking because for ARCH=arm64 systems today (JUNO, Hi6220) we use the
>> capacity value of the last entry of the capacity_state vector for the cores
>> (e.g. cortex-a{57|53).
> 
> This is a struct of the eas feature ? Not sure that we should link the
> definition of  the cpu capacity to an internal struct of a feature; DT
> seems a better way to define it.

Yeah, the cpu invariant functionality should not be based on EAS. We
just use the short-cut in EAS RFCv5 to get it working on ARM64.

> So if you want to revisit the way, we set the capacity of CPU for arm
> and/or arm64, I'm fully open to the discussion but this should happen
> in another thread than this one which has for only purpose the
> alignment of the arch_scale_cpu_capacity interface declaration with
> arch_scale_freq_capacity one.

Agreed.

> 
> So, with the patch below that updates the arm definition of
> arch_scale_cpu_capacity, you can add my Acked-by: Vincent Guittot
> <vincent.guittot@linaro.org> on this patch and the additional one
> below
> 
> Regards,
> Vincent

[...]


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking
  2015-09-07 12:42   ` Peter Zijlstra
  2015-09-07 13:21     ` Peter Zijlstra
  2015-09-07 13:23     ` Peter Zijlstra
@ 2015-09-07 14:44     ` Dietmar Eggemann
  2015-09-13 11:06       ` [tip:sched/core] sched/fair: Defer calling scaling functions tip-bot for Dietmar Eggemann
  2 siblings, 1 reply; 97+ messages in thread
From: Dietmar Eggemann @ 2015-09-07 14:44 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: mingo, vincent.guittot, daniel.lezcano, yuyang.du, mturquette,
	rjw, Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

On 07/09/15 13:42, Peter Zijlstra wrote:
> On Mon, Aug 31, 2015 at 11:24:49AM +0200, Peter Zijlstra wrote:
> 
>> A quick run here gives:
>>
>> IVB-EP (2*20*2):
> 
> As noted by someone; that should be 2*10*2, for a total of 40 cpus in
> this machine.
> 
>>
>> perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 5000
>>
>> Before:				After:
>> 5.484170711 ( +-  0.74% )	5.590001145 ( +-  0.45% )
>>
>> Which is an almost 2% slowdown :/
>>
>> I've yet to look at what happens.
> 
> OK, so it appears this is link order nonsense. When I compared profiles
> between the series, the one function that had significant change was
> skb_release_data(), which doesn't make much sense.
> 
> If I do a 'make clean' in front of each build, I get a repeatable
> improvement with this patch set (although how much of that is due to the
> patches itself or just because of code movement is as yet undetermined).
> 
> I'm of a mind to apply these patches; with two patches on top, which
> I'll post shortly.
> 

-- >8 --

From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Mon, 7 Sep 2015 14:57:22 +0100
Subject: [PATCH] sched/fair: Defer calling scaling functions

Do not call the scaling functions in case time goes backwards or the
last update of the sched_avg structure has happened less than 1024ns
ago.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d6ca8d987a63..3445d2fb38f4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2552,8 +2552,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	u64 delta, scaled_delta, periods;
 	u32 contrib;
 	unsigned int delta_w, scaled_delta_w, decayed = 0;
-	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+	unsigned long scale_freq, scale_cpu;
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2574,6 +2573,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
+	scale_freq = arch_scale_freq_capacity(NULL, cpu);
+	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
 	/* delta_w is the amount already accumulated against our next period */
 	delta_w = sa->period_contrib;
 	if (delta + delta_w >= 1024) {
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-03 23:51   ` Steve Muckle
@ 2015-09-07 15:37     ` Dietmar Eggemann
  2015-09-07 16:21       ` Vincent Guittot
  2015-09-13 11:04       ` [tip:sched/core] " tip-bot for Dietmar Eggemann
  0 siblings, 2 replies; 97+ messages in thread
From: Dietmar Eggemann @ 2015-09-07 15:37 UTC (permalink / raw)
  To: Steve Muckle, Morten Rasmussen
  Cc: peterz, mingo, vincent.guittot, daniel.lezcano, yuyang.du,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On 04/09/15 00:51, Steve Muckle wrote:
> Hi Morten, Dietmar,
> 
> On 08/14/2015 09:23 AM, Morten Rasmussen wrote:
> ...
>> + * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
>> + * recent utilization of currently non-runnable tasks on a CPU. It represents
>> + * the amount of utilization of a CPU in the range [0..capacity_orig] where
> 
> I see util_sum is scaled by SCHED_LOAD_SHIFT at the end of
> __update_load_avg(). If there is now an assumption that util_avg may be
> used directly as a capacity value, should it be changed to
> SCHED_CAPACITY_SHIFT? These are equal right now, not sure if they will
> always be or if they can be combined.

You're referring to the code line

2647   sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;

in __update_load_avg()?

Here we actually scale by 'SCHED_LOAD_SCALE/LOAD_AVG_MAX' so both values are
load related.

LOAD (UTIL) and CAPACITY have the same SCALE and SHIFT values because
SCHED_LOAD_RESOLUTION is always defined to 0. scale_load() and
scale_load_down() are also NOPs so this area is probably 
worth a separate clean-up.
Beyond that, I'm not sure if the current functionality is
broken if we use different SCALE and SHIFT values for LOAD and CAPACITY?
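
For context, the relevant definitions as they stand today look roughly like
this (reconstructed from memory of include/linux/sched.h, so treat the exact
form as an approximation); with SCHED_LOAD_RESOLUTION fixed at 0 the two
SHIFT/SCALE pairs happen to coincide:

# define SCHED_LOAD_RESOLUTION	0
# define scale_load(w)		(w)
# define scale_load_down(w)	(w)

#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
#define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)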

> 
>> + * capacity_orig is the cpu_capacity available at * the highest frequency
> 
> spurious *
> 
> thanks,
> Steve
> 

Fixed.

Thanks, 

-- Dietmar

-- >8 --

From: Dietmar Eggemann <dietmar.eggemann@arm.com>
Date: Fri, 14 Aug 2015 17:23:13 +0100
Subject: [PATCH] sched/fair: Get rid of scaling utilization by capacity_orig

Utilization is currently scaled by capacity_orig, but since we now have
frequency and cpu invariant cfs_rq.avg.util_avg, frequency and cpu scaling
now happens as part of the utilization tracking itself.
So cfs_rq.avg.util_avg should no longer be scaled in cpu_util().

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2074d45a67c2..a73ece2372f5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4824,33 +4824,39 @@ static int select_idle_sibling(struct task_struct *p, int target)
 done:
 	return target;
 }
+
 /*
  * cpu_util returns the amount of capacity of a CPU that is used by CFS
  * tasks. The unit of the return value must be the one of capacity so we can
  * compare the utilization with the capacity of the CPU that is available for
  * CFS task (ie cpu_capacity).
- * cfs.avg.util_avg is the sum of running time of runnable tasks on a
- * CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE]. The utilization of a CPU can't be higher than the
- * full capacity of the CPU because it's about the running time on this CPU.
- * Nevertheless, cfs.avg.util_avg can be higher than SCHED_LOAD_SCALE
- * because of unfortunate rounding in util_avg or just
- * after migrating tasks until the average stabilizes with the new running
- * time. So we need to check that the utilization stays into the range
- * [0..cpu_capacity_orig] and cap if necessary.
- * Without capping the utilization, a group could be seen as overloaded (CPU0
- * utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
- * available capacity.
+ *
+ * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
+ * recent utilization of currently non-runnable tasks on a CPU. It represents
+ * the amount of utilization of a CPU in the range [0..capacity_orig] where
+ * capacity_orig is the cpu_capacity available at the highest frequency
+ * (arch_scale_freq_capacity()).
+ * The utilization of a CPU converges towards a sum equal to or less than the
+ * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
+ * the running time on this CPU scaled by capacity_curr.
+ *
+ * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
+ * higher than capacity_orig because of unfortunate rounding in
+ * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
+ * the average stabilizes with the new running time. We need to check that the
+ * utilization stays within the range of [0..capacity_orig] and cap it if
+ * necessary. Without utilization capping, a group could be seen as overloaded
+ * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
+ * available capacity. We allow utilization to overshoot capacity_curr (but not
+ * capacity_orig) as it useful for predicting the capacity required after task
+ * migrations (scheduler-driven DVFS).
  */
 static int cpu_util(int cpu)
 {
 	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
 	unsigned long capacity = capacity_orig_of(cpu);
 
-	if (util >= SCHED_LOAD_SCALE)
-		return capacity;
-
-	return (util * capacity) >> SCHED_LOAD_SHIFT;
+	return (util >= capacity) ? capacity : util;
 }
 
 /*
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-07 15:37     ` Dietmar Eggemann
@ 2015-09-07 16:21       ` Vincent Guittot
  2015-09-07 18:54         ` Dietmar Eggemann
  2015-09-13 11:04       ` [tip:sched/core] " tip-bot for Dietmar Eggemann
  1 sibling, 1 reply; 97+ messages in thread
From: Vincent Guittot @ 2015-09-07 16:21 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Steve Muckle, Morten Rasmussen, peterz, mingo, daniel.lezcano,
	yuyang.du, mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On 7 September 2015 at 17:37, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 04/09/15 00:51, Steve Muckle wrote:
>> Hi Morten, Dietmar,
>>
>> On 08/14/2015 09:23 AM, Morten Rasmussen wrote:
>> ...
>>> + * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
>>> + * recent utilization of currently non-runnable tasks on a CPU. It represents
>>> + * the amount of utilization of a CPU in the range [0..capacity_orig] where
>>
>> I see util_sum is scaled by SCHED_LOAD_SHIFT at the end of
>> __update_load_avg(). If there is now an assumption that util_avg may be
>> used directly as a capacity value, should it be changed to
>> SCHED_CAPACITY_SHIFT? These are equal right now, not sure if they will
>> always be or if they can be combined.
>
> You're referring to the code line
>
> 2647   sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
>
> in __update_load_avg()?
>
> Here we actually scale by 'SCHED_LOAD_SCALE/LOAD_AVG_MAX' so both values are
> load related.

I agree with Steve that there is an issue from a unit point of view.

sa->util_sum and LOAD_AVG_MAX have the same unit, so sa->util_avg is a
load because of the << SCHED_LOAD_SHIFT.

Before this patch, the translation from the load unit to the capacity
unit was done in get_cpu_usage() with "* capacity) >> SCHED_LOAD_SHIFT".

So you still have to change the unit from load to capacity with a
"/ SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE" somewhere:

sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / SCHED_LOAD_SCALE
               * SCHED_CAPACITY_SCALE / LOAD_AVG_MAX
             = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;


Regards,
Vincent


>
> LOAD (UTIL) and CAPACITY have the same SCALE and SHIFT values because
> SCHED_LOAD_RESOLUTION is always defined to 0. scale_load() and
> scale_load_down() are also NOPs so this area is probably
> worth a separate clean-up.
> Beyond that, I'm not sure if the current functionality is
> broken if we use different SCALE and SHIFT values for LOAD and CAPACITY?
>
>>
>>> + * capacity_orig is the cpu_capacity available at * the highest frequency
>>
>> spurious *
>>
>> thanks,
>> Steve
>>
>
> Fixed.
>
> Thanks,
>
> -- Dietmar
>
> -- >8 --
>
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Date: Fri, 14 Aug 2015 17:23:13 +0100
> Subject: [PATCH] sched/fair: Get rid of scaling utilization by capacity_orig
>
> Utilization is currently scaled by capacity_orig, but since we now have
> frequency and cpu invariant cfs_rq.avg.util_avg, frequency and cpu scaling
> now happens as part of the utilization tracking itself.
> So cfs_rq.avg.util_avg should no longer be scaled in cpu_util().
>
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 38 ++++++++++++++++++++++----------------
>  1 file changed, 22 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2074d45a67c2..a73ece2372f5 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4824,33 +4824,39 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  done:
>         return target;
>  }
> +
>  /*
>   * cpu_util returns the amount of capacity of a CPU that is used by CFS
>   * tasks. The unit of the return value must be the one of capacity so we can
>   * compare the utilization with the capacity of the CPU that is available for
>   * CFS task (ie cpu_capacity).
> - * cfs.avg.util_avg is the sum of running time of runnable tasks on a
> - * CPU. It represents the amount of utilization of a CPU in the range
> - * [0..SCHED_LOAD_SCALE]. The utilization of a CPU can't be higher than the
> - * full capacity of the CPU because it's about the running time on this CPU.
> - * Nevertheless, cfs.avg.util_avg can be higher than SCHED_LOAD_SCALE
> - * because of unfortunate rounding in util_avg or just
> - * after migrating tasks until the average stabilizes with the new running
> - * time. So we need to check that the utilization stays into the range
> - * [0..cpu_capacity_orig] and cap if necessary.
> - * Without capping the utilization, a group could be seen as overloaded (CPU0
> - * utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
> - * available capacity.
> + *
> + * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
> + * recent utilization of currently non-runnable tasks on a CPU. It represents
> + * the amount of utilization of a CPU in the range [0..capacity_orig] where
> + * capacity_orig is the cpu_capacity available at the highest frequency
> + * (arch_scale_freq_capacity()).
> + * The utilization of a CPU converges towards a sum equal to or less than the
> + * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
> + * the running time on this CPU scaled by capacity_curr.
> + *
> + * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
> + * higher than capacity_orig because of unfortunate rounding in
> + * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
> + * the average stabilizes with the new running time. We need to check that the
> + * utilization stays within the range of [0..capacity_orig] and cap it if
> + * necessary. Without utilization capping, a group could be seen as overloaded
> + * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
> + * available capacity. We allow utilization to overshoot capacity_curr (but not
> + * capacity_orig) as it useful for predicting the capacity required after task
> + * migrations (scheduler-driven DVFS).
>   */
>  static int cpu_util(int cpu)
>  {
>         unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
>         unsigned long capacity = capacity_orig_of(cpu);
>
> -       if (util >= SCHED_LOAD_SCALE)
> -               return capacity;
> -
> -       return (util * capacity) >> SCHED_LOAD_SHIFT;
> +       return (util >= capacity) ? capacity : util;
>  }
>
>  /*
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-07 16:21       ` Vincent Guittot
@ 2015-09-07 18:54         ` Dietmar Eggemann
  2015-09-07 19:47           ` Peter Zijlstra
                             ` (2 more replies)
  0 siblings, 3 replies; 97+ messages in thread
From: Dietmar Eggemann @ 2015-09-07 18:54 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Steve Muckle, Morten Rasmussen, peterz, mingo, daniel.lezcano,
	yuyang.du, mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On 07/09/15 17:21, Vincent Guittot wrote:
> On 7 September 2015 at 17:37, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 04/09/15 00:51, Steve Muckle wrote:
>>> Hi Morten, Dietmar,
>>>
>>> On 08/14/2015 09:23 AM, Morten Rasmussen wrote:
>>> ...
>>>> + * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
>>>> + * recent utilization of currently non-runnable tasks on a CPU. It represents
>>>> + * the amount of utilization of a CPU in the range [0..capacity_orig] where
>>>
>>> I see util_sum is scaled by SCHED_LOAD_SHIFT at the end of
>>> __update_load_avg(). If there is now an assumption that util_avg may be
>>> used directly as a capacity value, should it be changed to
>>> SCHED_CAPACITY_SHIFT? These are equal right now, not sure if they will
>>> always be or if they can be combined.
>>
>> You're referring to the code line
>>
>> 2647   sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
>>
>> in __update_load_avg()?
>>
>> Here we actually scale by 'SCHED_LOAD_SCALE/LOAD_AVG_MAX' so both values are
>> load related.
> 
> I agree with Steve that there is an issue from a unit point of view
> 
> sa->util_sum and LOAD_AVG_MAX have the same unit so sa->util_avg is a
> load because of << SCHED_LOAD_SHIFT)
> 
> Before this patch , the translation from load to capacity unit was
> done in get_cpu_usage with "* capacity) >> SCHED_LOAD_SHIFT"
> 
> So you still have to change the unit from load to capacity with a "/
> SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE" somewhere.
> 
> sa->util_avg = ((sa->util_sum << SCHED_LOAD_SHIFT) /SCHED_LOAD_SCALE *
> SCHED_CAPACITY_SCALE / LOAD_AVG_MAX = (sa->util_sum <<
> SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;

I see the point, but IMHO this will only be necessary if the
SCHED_LOAD_RESOLUTION stuff gets re-enabled.

It's not really about utilization or capacity units but rather about using the same
SCALE/SHIFT values for both sides, right?

I always thought that scale_load_down() takes care of that.

So shouldn't:

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3445d2fb38f4..b80f799aface 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2644,7 +2644,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
                        cfs_rq->runnable_load_avg =
                                div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
                }
-               sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
+               sa->util_avg = (sa->util_sum * scale_load_down(SCHED_LOAD_SCALE)) / LOAD_AVG_MAX;
        }
 
        return decayed;

fix that issue in case SCHED_LOAD_RESOLUTION != 0?

I would vote for removing this SCHED_LOAD_RESOLUTION thing completely so that we can
assume that load/util and capacity are always using 1024/10.

Cheers,

-- Dietmar

> 
> 
> Regards,
> Vincent
> 
> 
>>
>> LOAD (UTIL) and CAPACITY have the same SCALE and SHIFT values because
>> SCHED_LOAD_RESOLUTION is always defined to 0. scale_load() and
>> scale_load_down() are also NOPs so this area is probably
>> worth a separate clean-up.
>> Beyond that, I'm not sure if the current functionality is
>> broken if we use different SCALE and SHIFT values for LOAD and CAPACITY?
>>

[...]


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-07 18:54         ` Dietmar Eggemann
@ 2015-09-07 19:47           ` Peter Zijlstra
  2015-09-08 12:47             ` Dietmar Eggemann
  2015-09-08  7:22           ` Vincent Guittot
  2015-09-08 11:44           ` Peter Zijlstra
  2 siblings, 1 reply; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-07 19:47 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Steve Muckle, Morten Rasmussen, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Mon, Sep 07, 2015 at 07:54:18PM +0100, Dietmar Eggemann wrote:
> I would vote for removing this SCHED_LOAD_RESOLUTION thing completely so that we can
> assume that load/util and capacity are always using 1024/10.

Ha!, I just requested Google look into moving it to 20 again ;-)

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-07 18:54         ` Dietmar Eggemann
  2015-09-07 19:47           ` Peter Zijlstra
@ 2015-09-08  7:22           ` Vincent Guittot
  2015-09-08 12:26             ` Peter Zijlstra
  2015-09-08 12:50             ` Dietmar Eggemann
  2015-09-08 11:44           ` Peter Zijlstra
  2 siblings, 2 replies; 97+ messages in thread
From: Vincent Guittot @ 2015-09-08  7:22 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Steve Muckle, Morten Rasmussen, peterz, mingo, daniel.lezcano,
	yuyang.du, mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On 7 September 2015 at 20:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 07/09/15 17:21, Vincent Guittot wrote:
>> On 7 September 2015 at 17:37, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>> On 04/09/15 00:51, Steve Muckle wrote:
>>>> Hi Morten, Dietmar,
>>>>
>>>> On 08/14/2015 09:23 AM, Morten Rasmussen wrote:
>>>> ...
>>>>> + * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
>>>>> + * recent utilization of currently non-runnable tasks on a CPU. It represents
>>>>> + * the amount of utilization of a CPU in the range [0..capacity_orig] where
>>>>
>>>> I see util_sum is scaled by SCHED_LOAD_SHIFT at the end of
>>>> __update_load_avg(). If there is now an assumption that util_avg may be
>>>> used directly as a capacity value, should it be changed to
>>>> SCHED_CAPACITY_SHIFT? These are equal right now, not sure if they will
>>>> always be or if they can be combined.
>>>
>>> You're referring to the code line
>>>
>>> 2647   sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
>>>
>>> in __update_load_avg()?
>>>
>>> Here we actually scale by 'SCHED_LOAD_SCALE/LOAD_AVG_MAX' so both values are
>>> load related.
>>
>> I agree with Steve that there is an issue from a unit point of view
>>
>> sa->util_sum and LOAD_AVG_MAX have the same unit so sa->util_avg is a
>> load because of << SCHED_LOAD_SHIFT)
>>
>> Before this patch , the translation from load to capacity unit was
>> done in get_cpu_usage with "* capacity) >> SCHED_LOAD_SHIFT"
>>
>> So you still have to change the unit from load to capacity with a "/
>> SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE" somewhere.
>>
>> sa->util_avg = ((sa->util_sum << SCHED_LOAD_SHIFT) /SCHED_LOAD_SCALE *
>> SCHED_CAPACITY_SCALE / LOAD_AVG_MAX = (sa->util_sum <<
>> SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
>
> I see the point but IMHO this will only be necessary if the SCHED_LOAD_RESOLUTION
> stuff gets re-enabled again.
>
> It's not really about utilization or capacity units but rather about using the same
> SCALE/SHIFT values for both sides, right?

It's both a unit and a SCALE/SHIFT problem: SCHED_LOAD_SHIFT and
SCHED_CAPACITY_SHIFT are defined separately, so we must be sure to
scale the value into the right range. In the case of cpu_usage, which
returns sa->util_avg, it's the capacity range, not the load range.

>
> I always thought that scale_load_down() takes care of that.

AFAIU, scale_load_down() is a way to increase the resolution of the
load, not to move from load to capacity.

>
> So shouldn't:
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 3445d2fb38f4..b80f799aface 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2644,7 +2644,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>                         cfs_rq->runnable_load_avg =
>                                 div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
>                 }
> -               sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
> +               sa->util_avg = (sa->util_sum * scale_load_down(SCHED_LOAD_SCALE)) / LOAD_AVG_MAX;
>         }
>
>         return decayed;
>
> fix that issue in case SCHED_LOAD_RESOLUTION != 0 ?


No, but

sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;

will fix the unit issue.
I agree that it doesn't change the result, because both SCHED_LOAD_SHIFT
and SCHED_CAPACITY_SHIFT are set to 10, but as mentioned above they are
set separately, so it can make a difference if someone changes one SHIFT
value.

Regards,
Vincent

>
> I would vote for removing this SCHED_LOAD_RESOLUTION thing completely so that we can
> assume that load/util and capacity are always using 1024/10.
>
> Cheers,
>
> -- Dietmar
>
>>
>>
>> Regards,
>> Vincent
>>
>>
>>>
>>> LOAD (UTIL) and CAPACITY have the same SCALE and SHIFT values because
>>> SCHED_LOAD_RESOLUTION is always defined to 0. scale_load() and
>>> scale_load_down() are also NOPs so this area is probably
>>> worth a separate clean-up.
>>> Beyond that, I'm not sure if the current functionality is
>>> broken if we use different SCALE and SHIFT values for LOAD and CAPACITY?
>>>
>
> [...]
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-07 18:54         ` Dietmar Eggemann
  2015-09-07 19:47           ` Peter Zijlstra
  2015-09-08  7:22           ` Vincent Guittot
@ 2015-09-08 11:44           ` Peter Zijlstra
  2 siblings, 0 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-08 11:44 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Steve Muckle, Morten Rasmussen, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel, Paul Turner, Ben Segall

On Mon, Sep 07, 2015 at 07:54:18PM +0100, Dietmar Eggemann wrote:
> I see the point but IMHO this will only be necessary if the SCHED_LOAD_RESOLUTION
> stuff gets re-enabled again.

Paul, Ben, gentle reminder to look at re-enabling this.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08  7:22           ` Vincent Guittot
@ 2015-09-08 12:26             ` Peter Zijlstra
  2015-09-08 12:52               ` Peter Zijlstra
  2015-09-08 13:39               ` Vincent Guittot
  2015-09-08 12:50             ` Dietmar Eggemann
  1 sibling, 2 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-08 12:26 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Steve Muckle, Morten Rasmussen, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Tue, Sep 08, 2015 at 09:22:05AM +0200, Vincent Guittot wrote:
> No, but
> sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
> will fix the unit issue.

Tricky that, LOAD_AVG_MAX very much relies on the unit being 1<<10.

And where load_sum already gets a factor 1024 from the weight
multiplication, util_sum does not get such a factor, and all the scaling
we do on it loose bits.

So at the moment we go compute the util_avg value, we need to inflate
util_sum with an extra factor 1024 in order to make it work.

And seeing that we do the shift up on sa->util_sum without consideration
of overflow, would it not make sense to add that factor before the
scaling and into the addition?
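
In rough numbers (back-of-the-envelope only, not the exact kernel
arithmetic):

  load_sum ~ weight * time ~ 1024 * LOAD_AVG_MAX -> load_avg ~ 1024
  util_sum ~ time          ~        LOAD_AVG_MAX -> util_avg ~ 1

so util_sum needs that extra factor of 1024 somewhere for util_avg to
end up in the expected [0..1024] range.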

Now, given all that, units are a complete mess here, and I'd not mind
something like:

#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != SCHED_CAPACITY_SHIFT
#error "something usefull"
#endif

somewhere near here.



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-07 19:47           ` Peter Zijlstra
@ 2015-09-08 12:47             ` Dietmar Eggemann
  0 siblings, 0 replies; 97+ messages in thread
From: Dietmar Eggemann @ 2015-09-08 12:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Steve Muckle, Morten Rasmussen, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On 07/09/15 20:47, Peter Zijlstra wrote:
> On Mon, Sep 07, 2015 at 07:54:18PM +0100, Dietmar Eggemann wrote:
>> I would vote for removing this SCHED_LOAD_RESOLUTION thing completely so that we can
>> assume that load/util and capacity are always using 1024/10.
> 
> Ha!, I just requested Google look into moving it to 20 again ;-)
> 

In this case Steve and Vincent have a point here.


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08  7:22           ` Vincent Guittot
  2015-09-08 12:26             ` Peter Zijlstra
@ 2015-09-08 12:50             ` Dietmar Eggemann
  2015-09-08 14:01               ` Vincent Guittot
  2015-09-09 20:15               ` Yuyang Du
  1 sibling, 2 replies; 97+ messages in thread
From: Dietmar Eggemann @ 2015-09-08 12:50 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Steve Muckle, Morten Rasmussen, peterz, mingo, daniel.lezcano,
	yuyang.du, mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On 08/09/15 08:22, Vincent Guittot wrote:
> On 7 September 2015 at 20:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 07/09/15 17:21, Vincent Guittot wrote:
>>> On 7 September 2015 at 17:37, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>> On 04/09/15 00:51, Steve Muckle wrote:
>>>>> Hi Morten, Dietmar,
>>>>>
>>>>> On 08/14/2015 09:23 AM, Morten Rasmussen wrote:
>>>>> ...
>>>>>> + * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
>>>>>> + * recent utilization of currently non-runnable tasks on a CPU. It represents
>>>>>> + * the amount of utilization of a CPU in the range [0..capacity_orig] where
>>>>>
>>>>> I see util_sum is scaled by SCHED_LOAD_SHIFT at the end of
>>>>> __update_load_avg(). If there is now an assumption that util_avg may be
>>>>> used directly as a capacity value, should it be changed to
>>>>> SCHED_CAPACITY_SHIFT? These are equal right now, not sure if they will
>>>>> always be or if they can be combined.
>>>>
>>>> You're referring to the code line
>>>>
>>>> 2647   sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
>>>>
>>>> in __update_load_avg()?
>>>>
>>>> Here we actually scale by 'SCHED_LOAD_SCALE/LOAD_AVG_MAX' so both values are
>>>> load related.
>>>
>>> I agree with Steve that there is an issue from a unit point of view
>>>
>>> sa->util_sum and LOAD_AVG_MAX have the same unit so sa->util_avg is a
>>> load because of << SCHED_LOAD_SHIFT)
>>>
>>> Before this patch , the translation from load to capacity unit was
>>> done in get_cpu_usage with "* capacity) >> SCHED_LOAD_SHIFT"
>>>
>>> So you still have to change the unit from load to capacity with a "/
>>> SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE" somewhere.
>>>
>>> sa->util_avg = ((sa->util_sum << SCHED_LOAD_SHIFT) /SCHED_LOAD_SCALE *
>>> SCHED_CAPACITY_SCALE / LOAD_AVG_MAX = (sa->util_sum <<
>>> SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
>>
>> I see the point but IMHO this will only be necessary if the SCHED_LOAD_RESOLUTION
>> stuff gets re-enabled again.
>>
>> It's not really about utilization or capacity units but rather about using the same
>> SCALE/SHIFT values for both sides, right?
> 
> It's both a unit and a SCALE/SHIFT problem, SCHED_LOAD_SHIFT and
> SCHED_CAPACITY_SHIFT are defined separately so we must be sure to
> scale the value in the right range. In the case of cpu_usage which
> returns sa->util_avg , it's the capacity range not the load range.

Still don't understand why it's a unit problem. IMHO LOAD/UTIL and
CAPACITY have no unit.

I agree that with the current patch-set we have a SHIFT/SCALE problem
once SCHED_LOAD_RESOLUTION is set to != 0.

> 
>>
>> I always thought that scale_load_down() takes care of that.
> 
AFAIU, scale_load_down is a way to increase the resolution of the
load, not to move from load to capacity.

IMHO, increasing the resolution of the load is done by re-enabling this
define SCHED_LOAD_RESOLUTION  10 thing (or by setting
SCHED_LOAD_RESOLUTION to something else than 0).

I tried to figure out why we have this issue when comparing UTIL w/
CAPACITY and not LOAD w/ CAPACITY:

Both are initialized like that:

 sa->load_avg = scale_load_down(se->load.weight);
 sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
 sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
 sa->util_sum = LOAD_AVG_MAX;

and we use 'se->on_rq * scale_load_down(se->load.weight)' as 'unsigned
long weight' argument to call __update_load_avg() making sure the
scaling differences between LOAD and CAPACITY are respected while
updating sa->load_sum (and sa->load_avg).

OTOH, we don't apply a scale_load_down for sa->util_[sum/avg], only a '<<
SCHED_LOAD_SHIFT) / LOAD_AVG_MAX' on sa->util_avg.
So changing '<< SCHED_LOAD_SHIFT' to '*
scale_load_down(SCHED_LOAD_SCALE)' would be the logical thing to do.

I agree that '<< SCHED_CAPACITY_SHIFT' would have the same effect, but
why use a CAPACITY related thing on the LOAD/UTIL side? The only
reason would be the unit problem, which I don't understand.
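
(With SCHED_LOAD_RESOLUTION = 10 for example, SCHED_LOAD_SCALE = 1 << 20
but scale_load_down(SCHED_LOAD_SCALE) = 1 << 10, so the multiplication
keeps util_avg in [0..1024], just as '<< SCHED_CAPACITY_SHIFT' would.)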

> 
>>
>> So shouldn't:
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 3445d2fb38f4..b80f799aface 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -2644,7 +2644,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>                         cfs_rq->runnable_load_avg =
>>                                 div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
>>                 }
>> -               sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
>> +               sa->util_avg = (sa->util_sum * scale_load_down(SCHED_LOAD_SCALE)) / LOAD_AVG_MAX;
>>         }
>>
>>         return decayed;
>>
>> fix that issue in case SCHED_LOAD_RESOLUTION != 0 ?
> 
> 
> No, but
> sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
> will fix the unit issue.
> I agree that i don't change the result because both SCHED_LOAD_SHIFT
> and SCHED_CAPACITY_SHIFT are set to 10 but as mentioned above, they
> are set separately so it can make the difference if someone change one
> SHIFT value.

SCHED_LOAD_SHIFT and SCHED_CAPACITY_SHIFT can be set separately but the
way to change SCHED_LOAD_SHIFT is by re-enabling the define
SCHED_LOAD_RESOLUTION 10 in kernel/sched/sched.h. I guess nobody wants
to change SCHED_CAPACITY_[SHIFT/SCALE].

Cheers,

-- Dietmar

[...]


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 12:26             ` Peter Zijlstra
@ 2015-09-08 12:52               ` Peter Zijlstra
  2015-09-08 14:06                 ` Vincent Guittot
                                   ` (2 more replies)
  2015-09-08 13:39               ` Vincent Guittot
  1 sibling, 3 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-08 12:52 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Steve Muckle, Morten Rasmussen, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Tue, Sep 08, 2015 at 02:26:06PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 08, 2015 at 09:22:05AM +0200, Vincent Guittot wrote:
> > No, but
> > sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
> > will fix the unit issue.
> 
> Tricky that, LOAD_AVG_MAX very much relies on the unit being 1<<10.
> 
> And where load_sum already gets a factor 1024 from the weight
> multiplication, util_sum does not get such a factor, and all the scaling
> we do on it loose bits.
> 
> So at the moment we go compute the util_avg value, we need to inflate
> util_sum with an extra factor 1024 in order to make it work.
> 
> And seeing that we do the shift up on sa->util_sum without consideration
> of overflow, would it not make sense to add that factor before the
> scaling and into the addition?
> 
> Now, given all that, units are a complete mess here, and I'd not mind
> something like:
> 
> #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != SCHED_CAPACITY_SHIFT
> #error "something usefull"
> #endif
> 
> somewhere near here.

Something like the below..

Another thing to ponder; the downside of scaled_delta_w is that it's
fairly likely delta is small and you lose all bits, whereas the weight
is likely to be large and could lose a few bits without issue.

That is, in fixed point scaling like this, you want to start with the
biggest numbers, not the smallest, otherwise you lose too much.

The flip side is of course that now you can share a multiplication.
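
To put some made-up numbers on the precision point (delta = 3, weight =
1024, frequency scale factor 512, i.e. 50%):

  scale the delta first:  (3 * 512) >> 10 = 1;      1 * 1024 = 1024
  scale the weight first: (1024 * 512) >> 10 = 512; 512 * 3   = 1536

The exact contribution is 1536, so rounding the small operand first
throws away a third of it here, while rounding the large one first
loses nothing.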

--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -682,7 +682,7 @@ void init_entity_runnable_average(struct
 	sa->load_avg = scale_load_down(se->load.weight);
 	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
 	sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
-	sa->util_sum = LOAD_AVG_MAX;
+	sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
 	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
 }
 
@@ -2515,6 +2515,10 @@ static u32 __compute_runnable_contrib(u6
 	return contrib + runnable_avg_yN_sum[n];
 }
 
+#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
+#error "load tracking assumes 2^10 as unit"
+#endif
+
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 /*
@@ -2599,7 +2603,7 @@ __update_load_avg(u64 now, int cpu, stru
 			}
 		}
 		if (running)
-			sa->util_sum += cap_scale(scaled_delta_w, scale_cpu);
+			sa->util_sum += scaled_delta_w * scale_cpu;
 
 		delta -= delta_w;
 
@@ -2623,7 +2627,7 @@ __update_load_avg(u64 now, int cpu, stru
 				cfs_rq->runnable_load_sum += weight * contrib;
 		}
 		if (running)
-			sa->util_sum += cap_scale(contrib, scale_cpu);
+			sa->util_sum += contrib * scale_cpu;
 	}
 
 	/* Remainder of delta accrued against u_0` */
@@ -2634,7 +2638,7 @@ __update_load_avg(u64 now, int cpu, stru
 			cfs_rq->runnable_load_sum += weight * scaled_delta;
 	}
 	if (running)
-		sa->util_sum += cap_scale(scaled_delta, scale_cpu);
+		sa->util_sum += scaled_delta * scale_cpu;
 
 	sa->period_contrib += delta;
 
@@ -2644,7 +2648,7 @@ __update_load_avg(u64 now, int cpu, stru
 			cfs_rq->runnable_load_avg =
 				div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
 		}
-		sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
+		sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
 	}
 
 	return decayed;
@@ -2686,8 +2690,7 @@ static inline int update_cfs_rq_load_avg
 	if (atomic_long_read(&cfs_rq->removed_util_avg)) {
 		long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
 		sa->util_avg = max_t(long, sa->util_avg - r, 0);
-		sa->util_sum = max_t(s32, sa->util_sum -
-			((r * LOAD_AVG_MAX) >> SCHED_LOAD_SHIFT), 0);
+		sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);
 	}
 
 	decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 12:26             ` Peter Zijlstra
  2015-09-08 12:52               ` Peter Zijlstra
@ 2015-09-08 13:39               ` Vincent Guittot
  2015-09-08 14:10                 ` Peter Zijlstra
  1 sibling, 1 reply; 97+ messages in thread
From: Vincent Guittot @ 2015-09-08 13:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Steve Muckle, Morten Rasmussen, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On 8 September 2015 at 14:26, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Sep 08, 2015 at 09:22:05AM +0200, Vincent Guittot wrote:
>> No, but
>> sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
>> will fix the unit issue.
>
> Tricky that, LOAD_AVG_MAX very much relies on the unit being 1<<10.
>
> And where load_sum already gets a factor 1024 from the weight
> multiplication, util_sum does not get such a factor, and all the scaling
> we do on it loose bits.

fair point

>
> So at the moment we go compute the util_avg value, we need to inflate
> util_sum with an extra factor 1024 in order to make it work.
>
> And seeing that we do the shift up on sa->util_sum without consideration
> of overflow, would it not make sense to add that factor before the
> scaling and into the addition?

Yes, this should save 1 left shift and 1 right shift.


>
> Now, given all that, units are a complete mess here, and I'd not mind
> something like:
>
> #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != SCHED_CAPACITY_SHIFT
> #error "something usefull"
> #endif

In this case why not simply do
#define SCHED_CAPACITY_SHIFT SCHED_LOAD_SHIFT
or the opposite?

>
> somewhere near here.
>
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 12:50             ` Dietmar Eggemann
@ 2015-09-08 14:01               ` Vincent Guittot
  2015-09-08 14:27                 ` Dietmar Eggemann
  2015-09-09 20:15               ` Yuyang Du
  1 sibling, 1 reply; 97+ messages in thread
From: Vincent Guittot @ 2015-09-08 14:01 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Steve Muckle, Morten Rasmussen, peterz, mingo, daniel.lezcano,
	yuyang.du, mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On 8 September 2015 at 14:50, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> On 08/09/15 08:22, Vincent Guittot wrote:
>> On 7 September 2015 at 20:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>> On 07/09/15 17:21, Vincent Guittot wrote:
>>>> On 7 September 2015 at 17:37, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>> On 04/09/15 00:51, Steve Muckle wrote:
>>>>>> Hi Morten, Dietmar,
>>>>>>
>>>>>> On 08/14/2015 09:23 AM, Morten Rasmussen wrote:
>>>>>> ...
>>>>>>> + * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
>>>>>>> + * recent utilization of currently non-runnable tasks on a CPU. It represents
>>>>>>> + * the amount of utilization of a CPU in the range [0..capacity_orig] where
>>>>>>
>>>>>> I see util_sum is scaled by SCHED_LOAD_SHIFT at the end of
>>>>>> __update_load_avg(). If there is now an assumption that util_avg may be
>>>>>> used directly as a capacity value, should it be changed to
>>>>>> SCHED_CAPACITY_SHIFT? These are equal right now, not sure if they will
>>>>>> always be or if they can be combined.
>>>>>
>>>>> You're referring to the code line
>>>>>
>>>>> 2647   sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
>>>>>
>>>>> in __update_load_avg()?
>>>>>
>>>>> Here we actually scale by 'SCHED_LOAD_SCALE/LOAD_AVG_MAX' so both values are
>>>>> load related.
>>>>
>>>> I agree with Steve that there is an issue from a unit point of view
>>>>
>>>> sa->util_sum and LOAD_AVG_MAX have the same unit so sa->util_avg is a
>>>> load because of << SCHED_LOAD_SHIFT)
>>>>
>>>> Before this patch , the translation from load to capacity unit was
>>>> done in get_cpu_usage with "* capacity) >> SCHED_LOAD_SHIFT"
>>>>
>>>> So you still have to change the unit from load to capacity with a "/
>>>> SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE" somewhere.
>>>>
>>>> sa->util_avg = ((sa->util_sum << SCHED_LOAD_SHIFT) /SCHED_LOAD_SCALE *
>>>> SCHED_CAPACITY_SCALE / LOAD_AVG_MAX = (sa->util_sum <<
>>>> SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
>>>
>>> I see the point but IMHO this will only be necessary if the SCHED_LOAD_RESOLUTION
>>> stuff gets re-enabled again.
>>>
>>> It's not really about utilization or capacity units but rather about using the same
>>> SCALE/SHIFT values for both sides, right?
>>
>> It's both a unit and a SCALE/SHIFT problem, SCHED_LOAD_SHIFT and
>> SCHED_CAPACITY_SHIFT are defined separately so we must be sure to
>> scale the value in the right range. In the case of cpu_usage which
>> returns sa->util_avg , it's the capacity range not the load range.
>
> Still don't understand why it's a unit problem. IMHO LOAD/UTIL and
> CAPACITY have no unit.

If you set 2 different values for SCHED_LOAD_SHIFT and
SCHED_CAPACITY_SHIFT for test purposes, you will see that util_avg will
not use the right range of values.

If we don't take freq and cpu invariance into account as a 1st step:

sa->util_sum is a load in the range [0..LOAD_AVG_MAX]. I say load
because of the max value.

the current implementation of util_avg is
sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX

so sa->util_avg is a load in the range [0..SCHED_LOAD_SCALE]

the current implementation of get_cpu_usage is
return (sa->util_avg * capacity_orig_of(cpu)) >> SCHED_LOAD_SHIFT

so the usage has the same unit and range as the capacity of the cpu and
can be compared with another capacity value.

Your patchset returns sa->util_avg directly, which is a load, to compare
it with a capacity value.

So you have to convert sa->util_avg from load to capacity, so if you have
sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX

sa->util_avg is now a capacity with the same range as your cpu, thanks
to the cpu invariance factor that patch 3 has added.

the << SCHED_CAPACITY_SHIFT above can be optimized with the >>
SCHED_CAPACITY_SHIFT included in
sa->util_sum += scale(contrib, scale_cpu);
as mentioned by Peter

As of now, SCHED_CAPACITY_SHIFT is set to 10 as well as SCHED_LOAD_SHIFT,
so using one instead of the other doesn't change the result, but if
that's no longer the case, we need to take care of the range/unit that we
use.
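
A small sketch of the mismatch, with hypothetical values
(SCHED_LOAD_RESOLUTION = 10, so SCHED_LOAD_SHIFT = 20, while
SCHED_CAPACITY_SHIFT stays 10):

  (util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX -> range [0..1<<20]
  capacity_orig_of(cpu)                         -> range [0..1<<10]

A half-utilized cpu would then report util_avg around 524288 against a
capacity_orig around 1024, whereas shifting by SCHED_CAPACITY_SHIFT
keeps util_avg in [0..1024] and directly comparable to capacity.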

Regards,
Vincent


>
> I agree that with the current patch-set we have a SHIFT/SCALE problem
> once SCHED_LOAD_RESOLUTION is set to != 0.
>
>>
>>>
>>> I always thought that scale_load_down() takes care of that.
>>
>> AFAIU, scale_load_down is a way to increase the resolution  of the
>> load not to move from load to capacity
>
> IMHO, increasing the resolution of the load is done by re-enabling this
> define SCHED_LOAD_RESOLUTION  10 thing (or by setting
> SCHED_LOAD_RESOLUTION to something else than 0).
>
> I tried to figure out why we have this issue when comparing UTIL w/
> CAPACITY and not LOAD w/ CAPACITY:
>
> Both are initialized like that:
>
>  sa->load_avg = scale_load_down(se->load.weight);
>  sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
>  sa->util_sum = LOAD_AVG_MAX;
>
> and we use 'se->on_rq * scale_load_down(se->load.weight)' as 'unsigned
> long weight' argument to call __update_load_avg() making sure the
> scaling differences between LOAD and CAPACITY are respected while
> updating sa->load_sum (and sa->load_avg).
>
> OTAH, we don't apply a scale_load_down for sa->util_[sum/avg] only a '<<
> SCHED_LOAD_SHIFT) / LOAD_AVG_MAX' on sa->util_avg.
> So changing '<< SCHED_LOAD_SHIFT' to '*
> scale_load_down(SCHED_LOAD_SCALE)' would be the logical thing to do.
>
> I agree that '<< SCHED_CAPACITY_SHIFT' would have the same effect but
> why using a CAPACITY related thing on the LOAD/UTIL side? The only
> reason would be the unit problem which I don't understand.
>
>>
>>>
>>> So shouldn't:
>>>
>>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> index 3445d2fb38f4..b80f799aface 100644
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -2644,7 +2644,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>>>                         cfs_rq->runnable_load_avg =
>>>                                 div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
>>>                 }
>>> -               sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
>>> +               sa->util_avg = (sa->util_sum * scale_load_down(SCHED_LOAD_SCALE)) / LOAD_AVG_MAX;
>>>         }
>>>
>>>         return decayed;
>>>
>>> fix that issue in case SCHED_LOAD_RESOLUTION != 0 ?
>>
>>
>> No, but
>> sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
>> will fix the unit issue.
>> I agree that i don't change the result because both SCHED_LOAD_SHIFT
>> and SCHED_CAPACITY_SHIFT are set to 10 but as mentioned above, they
>> are set separately so it can make the difference if someone change one
>> SHIFT value.
>
> SCHED_LOAD_SHIFT and SCHED_CAPACITY_SHIFT can be set separately but the
> way to change SCHED_LOAD_SHIFT is by re-enabling the define
> SCHED_LOAD_RESOLUTION 10 in kernel/sched/sched.h. I guess nobody wants
> to change SCHED_CAPACITY_[SHIFT/SCALE].
>
> Cheers,
>
> -- Dietmar
>
> [...]
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 12:52               ` Peter Zijlstra
@ 2015-09-08 14:06                 ` Vincent Guittot
  2015-09-08 14:35                   ` Morten Rasmussen
  2015-09-08 14:31                 ` Morten Rasmussen
  2015-09-09 19:07                 ` Yuyang Du
  2 siblings, 1 reply; 97+ messages in thread
From: Vincent Guittot @ 2015-09-08 14:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Steve Muckle, Morten Rasmussen, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On 8 September 2015 at 14:52, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Sep 08, 2015 at 02:26:06PM +0200, Peter Zijlstra wrote:
>> On Tue, Sep 08, 2015 at 09:22:05AM +0200, Vincent Guittot wrote:
>> > No, but
>> > sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
>> > will fix the unit issue.
>>
>> Tricky that, LOAD_AVG_MAX very much relies on the unit being 1<<10.
>>
>> And where load_sum already gets a factor 1024 from the weight
>> multiplication, util_sum does not get such a factor, and all the scaling
>> we do on it loose bits.
>>
>> So at the moment we go compute the util_avg value, we need to inflate
>> util_sum with an extra factor 1024 in order to make it work.
>>
>> And seeing that we do the shift up on sa->util_sum without consideration
>> of overflow, would it not make sense to add that factor before the
>> scaling and into the addition?
>>
>> Now, given all that, units are a complete mess here, and I'd not mind
>> something like:
>>
>> #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != SCHED_CAPACITY_SHIFT
>> #error "something usefull"
>> #endif
>>
>> somewhere near here.
>
> Something like teh below..
>
> Another thing to ponder; the downside of scaled_delta_w is that its
> fairly likely delta is small and you loose all bits, whereas the weight
> is likely to be large can could loose a fwe bits without issue.
>
> That is, in fixed point scaling like this, you want to start with the
> biggest numbers, not the smallest, otherwise you loose too much.
>
> The flip side is of course that now you can share a multiplcation.
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -682,7 +682,7 @@ void init_entity_runnable_average(struct
>         sa->load_avg = scale_load_down(se->load.weight);
>         sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>         sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
> -       sa->util_sum = LOAD_AVG_MAX;
> +       sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>         /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
>  }
>
> @@ -2515,6 +2515,10 @@ static u32 __compute_runnable_contrib(u6
>         return contrib + runnable_avg_yN_sum[n];
>  }
>
> +#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> +#error "load tracking assumes 2^10 as unit"
> +#endif

so why don't we set SCHED_CAPACITY_SHIFT to SCHED_LOAD_SHIFT?

> +
>  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>
>  /*
> @@ -2599,7 +2603,7 @@ __update_load_avg(u64 now, int cpu, stru
>                         }
>                 }
>                 if (running)
> -                       sa->util_sum += cap_scale(scaled_delta_w, scale_cpu);
> +                       sa->util_sum += scaled_delta_w * scale_cpu;
>
>                 delta -= delta_w;
>
> @@ -2623,7 +2627,7 @@ __update_load_avg(u64 now, int cpu, stru
>                                 cfs_rq->runnable_load_sum += weight * contrib;
>                 }
>                 if (running)
> -                       sa->util_sum += cap_scale(contrib, scale_cpu);
> +                       sa->util_sum += contrib * scale_cpu;
>         }
>
>         /* Remainder of delta accrued against u_0` */
> @@ -2634,7 +2638,7 @@ __update_load_avg(u64 now, int cpu, stru
>                         cfs_rq->runnable_load_sum += weight * scaled_delta;
>         }
>         if (running)
> -               sa->util_sum += cap_scale(scaled_delta, scale_cpu);
> +               sa->util_sum += scaled_delta * scale_cpu;
>
>         sa->period_contrib += delta;
>
> @@ -2644,7 +2648,7 @@ __update_load_avg(u64 now, int cpu, stru
>                         cfs_rq->runnable_load_avg =
>                                 div_u64(cfs_rq->runnable_load_sum, LOAD_AVG_MAX);
>                 }
> -               sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
> +               sa->util_avg = sa->util_sum / LOAD_AVG_MAX;
>         }
>
>         return decayed;
> @@ -2686,8 +2690,7 @@ static inline int update_cfs_rq_load_avg
>         if (atomic_long_read(&cfs_rq->removed_util_avg)) {
>                 long r = atomic_long_xchg(&cfs_rq->removed_util_avg, 0);
>                 sa->util_avg = max_t(long, sa->util_avg - r, 0);
> -               sa->util_sum = max_t(s32, sa->util_sum -
> -                       ((r * LOAD_AVG_MAX) >> SCHED_LOAD_SHIFT), 0);
> +               sa->util_sum = max_t(s32, sa->util_sum - r * LOAD_AVG_MAX, 0);

looks good to me

>         }
>
>         decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 13:39               ` Vincent Guittot
@ 2015-09-08 14:10                 ` Peter Zijlstra
  2015-09-08 15:17                   ` Vincent Guittot
  0 siblings, 1 reply; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-08 14:10 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Dietmar Eggemann, Steve Muckle, Morten Rasmussen, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Tue, Sep 08, 2015 at 03:39:37PM +0200, Vincent Guittot wrote:
> > Now, given all that, units are a complete mess here, and I'd not mind
> > something like:
> >
> > #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != SCHED_CAPACITY_SHIFT
> > #error "something usefull"
> > #endif
> 
> In this case why not simply doing
> #define SCHED_CAPACITY_SHIFT SCHED_LOAD_SHIFT
> or the opposite ?

Sadly not enough; aside from the fact that we really should do !0
LOAD_RESOLUTION on 64bit, the whole magic tables (runnable_avg_yN_*[])
and LOAD_AVG_MAX* values rely on the unit being 1<<10.

So regardless of defining one in terms of the other, we should check
both are in fact 10 and error out otherwise.

Changing them must involve recomputing these numbers or otherwise
mucking about with shifts to ensure it's back to 10 when we do this load
muck.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 14:01               ` Vincent Guittot
@ 2015-09-08 14:27                 ` Dietmar Eggemann
  0 siblings, 0 replies; 97+ messages in thread
From: Dietmar Eggemann @ 2015-09-08 14:27 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Steve Muckle, Morten Rasmussen, peterz, mingo, daniel.lezcano,
	yuyang.du, mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel



On 08/09/15 15:01, Vincent Guittot wrote:
> On 8 September 2015 at 14:50, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>> On 08/09/15 08:22, Vincent Guittot wrote:
>>> On 7 September 2015 at 20:54, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>> On 07/09/15 17:21, Vincent Guittot wrote:
>>>>> On 7 September 2015 at 17:37, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>>> On 04/09/15 00:51, Steve Muckle wrote:
>>>>>>> Hi Morten, Dietmar,
>>>>>>>
>>>>>>> On 08/14/2015 09:23 AM, Morten Rasmussen wrote:
>>>>>>> ...
>>>>>>>> + * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
>>>>>>>> + * recent utilization of currently non-runnable tasks on a CPU. It represents
>>>>>>>> + * the amount of utilization of a CPU in the range [0..capacity_orig] where
>>>>>>>
>>>>>>> I see util_sum is scaled by SCHED_LOAD_SHIFT at the end of
>>>>>>> __update_load_avg(). If there is now an assumption that util_avg may be
>>>>>>> used directly as a capacity value, should it be changed to
>>>>>>> SCHED_CAPACITY_SHIFT? These are equal right now, not sure if they will
>>>>>>> always be or if they can be combined.
>>>>>>
>>>>>> You're referring to the code line
>>>>>>
>>>>>> 2647   sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
>>>>>>
>>>>>> in __update_load_avg()?
>>>>>>
>>>>>> Here we actually scale by 'SCHED_LOAD_SCALE/LOAD_AVG_MAX' so both values are
>>>>>> load related.
>>>>>
>>>>> I agree with Steve that there is an issue from a unit point of view
>>>>>
>>>>> sa->util_sum and LOAD_AVG_MAX have the same unit so sa->util_avg is a
>>>>> load because of << SCHED_LOAD_SHIFT)
>>>>>
>>>>> Before this patch , the translation from load to capacity unit was
>>>>> done in get_cpu_usage with "* capacity) >> SCHED_LOAD_SHIFT"
>>>>>
>>>>> So you still have to change the unit from load to capacity with a "/
>>>>> SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE" somewhere.
>>>>>
>>>>> sa->util_avg = ((sa->util_sum << SCHED_LOAD_SHIFT) /SCHED_LOAD_SCALE *
>>>>> SCHED_CAPACITY_SCALE / LOAD_AVG_MAX = (sa->util_sum <<
>>>>> SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
>>>>
>>>> I see the point but IMHO this will only be necessary if the SCHED_LOAD_RESOLUTION
>>>> stuff gets re-enabled again.
>>>>
>>>> It's not really about utilization or capacity units but rather about using the same
>>>> SCALE/SHIFT values for both sides, right?
>>>
>>> It's both a unit and a SCALE/SHIFT problem, SCHED_LOAD_SHIFT and
>>> SCHED_CAPACITY_SHIFT are defined separately so we must be sure to
>>> scale the value in the right range. In the case of cpu_usage which
>>> returns sa->util_avg , it's the capacity range not the load range.
>>
>> Still don't understand why it's a unit problem. IMHO LOAD/UTIL and
>> CAPACITY have no unit.
> 
> If you set 2 different values to SCHED_LOAD_SHIFT and
> SCHED_CAPACITY_SHIFT for test purpose, you will see that util_avg will
> not use the right range of value
> 
> If we don't take into account freq and cpu invariance in a 1st step
> 
> sa->util_sum is a load in the range [0..LOAD_AVG_MAX]. I say load
> because of the max value
> 
> the current implementation of util_avg is
> sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX
> 
> so sa->util_avg is a load in the range [0..SCHED_LOAD_SCALE]
> 
> the current implementation of get_cpu_usage is
> return (sa->util_avg * capacity_orig_of(cpu)) >> SCHED_LOAD_SHIFT
> 
> so the usage has the same unit and range as capacity of the cpu and
> can be compared with another capacity value
> 
> Your patchset returns directly sa->util_avg which is a load to compare
> it with capacity value
> 
> So you have to convert sa->util_avg from load to capacity so if you have
> sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX
> 
> sa->util_avg is now a capacity with the same range as you cpu thanks
> to the cpu invariance factor that the patch 3 has added.
> 
> the << SCHED_CAPACITY_SHIFT above can be optimized with the >>
> SCHED_CAPACITY_SHIFT included in
> sa->util_sum += scale(contrib, scale_cpu);
> as mentioned by Peter
> 
> At now, SCHED_CAPACITY_SHIFT is set to 10 as well as SCHED_LOAD_SHIFT
> so using one instead of the other doesn't change the result but if
> it's no more the case, we need to take care of the range/unit that we
> use

No arguing here, I just called this a SHIFT/SCALE problem.

[...]


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 12:52               ` Peter Zijlstra
  2015-09-08 14:06                 ` Vincent Guittot
@ 2015-09-08 14:31                 ` Morten Rasmussen
  2015-09-08 15:33                   ` Peter Zijlstra
  2015-09-08 16:53                   ` Morten Rasmussen
  2015-09-09 19:07                 ` Yuyang Du
  2 siblings, 2 replies; 97+ messages in thread
From: Morten Rasmussen @ 2015-09-08 14:31 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Dietmar Eggemann, Steve Muckle, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> On Tue, Sep 08, 2015 at 02:26:06PM +0200, Peter Zijlstra wrote:
> > On Tue, Sep 08, 2015 at 09:22:05AM +0200, Vincent Guittot wrote:
> > > No, but
> > > sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
> > > will fix the unit issue.
> > 
> > Tricky that, LOAD_AVG_MAX very much relies on the unit being 1<<10.

I don't get why LOAD_AVG_MAX relies on the util_avg shifting being
1<<10; isn't it just the sum of the geometric series and the upper bound of
util_sum?

> > And where load_sum already gets a factor 1024 from the weight
> > multiplication, util_sum does not get such a factor, and all the scaling
> > we do on it loose bits.
> > 
> > So at the moment we go compute the util_avg value, we need to inflate
> > util_sum with an extra factor 1024 in order to make it work.

Agreed. Inflating the util_sum instead of util_avg like you do below
makes more sense. The load_sum/util_sum asymmetry is somewhat confusing.

> > And seeing that we do the shift up on sa->util_sum without consideration
> > of overflow, would it not make sense to add that factor before the
> > scaling and into the addition?

I don't think util_sum can overflow as it is bounded by LOAD_AVG_MAX,
unless you shift it a lot, like << 20. The << SCHED_LOAD_SHIFT in the
existing code is wrong, I think. Looking at the initialization of
util_avg = scale_load_down(SCHED_LOAD_SCALE), it is not using high
resolution load.
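
Rough numbers for the overflow point: LOAD_AVG_MAX is 47742, a bit under
2^16, so util_sum << 10 stays below 2^26 and fits comfortably in 32 bits,
while << 20 approaches 2^36 and would overflow a u32/s32.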

> > Now, given all that, units are a complete mess here, and I'd not mind
> > something like:
> > 
> > #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != SCHED_CAPACITY_SHIFT
> > #error "something usefull"
> > #endif
> > 
> > somewhere near here.

Yes. As I see it, it all falls apart completely if that isn't true.

> 
> Something like teh below..
> 
> Another thing to ponder; the downside of scaled_delta_w is that its
> fairly likely delta is small and you loose all bits, whereas the weight
> is likely to be large can could loose a fwe bits without issue.

That issue applies to both load and util.

> 
> That is, in fixed point scaling like this, you want to start with the
> biggest numbers, not the smallest, otherwise you loose too much.
> 
> The flip side is of course that now you can share a multiplcation.

But if we apply the scaling to the weight instead of time, we would only
have to apply it once and not three times like it is now? So maybe we
can end up with almost the same number of multiplications.

We might be losing bits for low priority tasks running on cpus at a low
frequency though.

> 
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -682,7 +682,7 @@ void init_entity_runnable_average(struct
>  	sa->load_avg = scale_load_down(se->load.weight);
>  	sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  	sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
> -	sa->util_sum = LOAD_AVG_MAX;
> +	sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>  	/* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
>  }
>  
> @@ -2515,6 +2515,10 @@ static u32 __compute_runnable_contrib(u6
>  	return contrib + runnable_avg_yN_sum[n];
>  }
>  
> +#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> +#error "load tracking assumes 2^10 as unit"
> +#endif

As mentioned above. Does it have to be 10?

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 14:06                 ` Vincent Guittot
@ 2015-09-08 14:35                   ` Morten Rasmussen
  2015-09-08 14:40                     ` Vincent Guittot
  0 siblings, 1 reply; 97+ messages in thread
From: Morten Rasmussen @ 2015-09-08 14:35 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Dietmar Eggemann, Steve Muckle, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Tue, Sep 08, 2015 at 04:06:36PM +0200, Vincent Guittot wrote:
> On 8 September 2015 at 14:52, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Tue, Sep 08, 2015 at 02:26:06PM +0200, Peter Zijlstra wrote:
> >> On Tue, Sep 08, 2015 at 09:22:05AM +0200, Vincent Guittot wrote:
> >> > No, but
> >> > sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
> >> > will fix the unit issue.
> >>
> >> Tricky that, LOAD_AVG_MAX very much relies on the unit being 1<<10.
> >>
> >> And where load_sum already gets a factor 1024 from the weight
> >> multiplication, util_sum does not get such a factor, and all the scaling
> >> we do on it loose bits.
> >>
> >> So at the moment we go compute the util_avg value, we need to inflate
> >> util_sum with an extra factor 1024 in order to make it work.
> >>
> >> And seeing that we do the shift up on sa->util_sum without consideration
> >> of overflow, would it not make sense to add that factor before the
> >> scaling and into the addition?
> >>
> >> Now, given all that, units are a complete mess here, and I'd not mind
> >> something like:
> >>
> >> #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != SCHED_CAPACITY_SHIFT
> >> #error "something usefull"
> >> #endif
> >>
> >> somewhere near here.
> >
> > Something like teh below..
> >
> > Another thing to ponder; the downside of scaled_delta_w is that its
> > fairly likely delta is small and you loose all bits, whereas the weight
> > is likely to be large can could loose a fwe bits without issue.
> >
> > That is, in fixed point scaling like this, you want to start with the
> > biggest numbers, not the smallest, otherwise you loose too much.
> >
> > The flip side is of course that now you can share a multiplcation.
> >
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -682,7 +682,7 @@ void init_entity_runnable_average(struct
> >         sa->load_avg = scale_load_down(se->load.weight);
> >         sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
> >         sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
> > -       sa->util_sum = LOAD_AVG_MAX;
> > +       sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
> >         /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
> >  }
> >
> > @@ -2515,6 +2515,10 @@ static u32 __compute_runnable_contrib(u6
> >         return contrib + runnable_avg_yN_sum[n];
> >  }
> >
> > +#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> > +#error "load tracking assumes 2^10 as unit"
> > +#endif
> 
> so why don't we set SCHED_CAPACITY_SHIFT to SCHED_LOAD_SHIFT ?

Don't you mean:

#define SCHED_LOAD_SHIFT (SCHED_CAPACITY_SHIFT + SCHED_LOAD_RESOLUTION)

?

Or do you want to increase the capacity resolution as well if you
increase the load resolution?

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 14:35                   ` Morten Rasmussen
@ 2015-09-08 14:40                     ` Vincent Guittot
  0 siblings, 0 replies; 97+ messages in thread
From: Vincent Guittot @ 2015-09-08 14:40 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Dietmar Eggemann, Steve Muckle, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On 8 September 2015 at 16:35, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Tue, Sep 08, 2015 at 04:06:36PM +0200, Vincent Guittot wrote:
>> On 8 September 2015 at 14:52, Peter Zijlstra <peterz@infradead.org> wrote:
>> > On Tue, Sep 08, 2015 at 02:26:06PM +0200, Peter Zijlstra wrote:
>> >> On Tue, Sep 08, 2015 at 09:22:05AM +0200, Vincent Guittot wrote:
>> >> > No, but
>> >> > sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
>> >> > will fix the unit issue.
>> >>
>> >> Tricky that, LOAD_AVG_MAX very much relies on the unit being 1<<10.
>> >>
>> >> And where load_sum already gets a factor 1024 from the weight
>> >> multiplication, util_sum does not get such a factor, and all the scaling
>> >> we do on it loose bits.
>> >>
>> >> So at the moment we go compute the util_avg value, we need to inflate
>> >> util_sum with an extra factor 1024 in order to make it work.
>> >>
>> >> And seeing that we do the shift up on sa->util_sum without consideration
>> >> of overflow, would it not make sense to add that factor before the
>> >> scaling and into the addition?
>> >>
>> >> Now, given all that, units are a complete mess here, and I'd not mind
>> >> something like:
>> >>
>> >> #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != SCHED_CAPACITY_SHIFT
>> >> #error "something usefull"
>> >> #endif
>> >>
>> >> somewhere near here.
>> >
>> > Something like teh below..
>> >
>> > Another thing to ponder; the downside of scaled_delta_w is that its
>> > fairly likely delta is small and you loose all bits, whereas the weight
>> > is likely to be large can could loose a fwe bits without issue.
>> >
>> > That is, in fixed point scaling like this, you want to start with the
>> > biggest numbers, not the smallest, otherwise you loose too much.
>> >
>> > The flip side is of course that now you can share a multiplcation.
>> >
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -682,7 +682,7 @@ void init_entity_runnable_average(struct
>> >         sa->load_avg = scale_load_down(se->load.weight);
>> >         sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>> >         sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
>> > -       sa->util_sum = LOAD_AVG_MAX;
>> > +       sa->util_sum = sa->util_avg * LOAD_AVG_MAX;
>> >         /* when this task enqueue'ed, it will contribute to its cfs_rq's load_avg */
>> >  }
>> >
>> > @@ -2515,6 +2515,10 @@ static u32 __compute_runnable_contrib(u6
>> >         return contrib + runnable_avg_yN_sum[n];
>> >  }
>> >
>> > +#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
>> > +#error "load tracking assumes 2^10 as unit"
>> > +#endif
>>
>> so why don't we set SCHED_CAPACITY_SHIFT to SCHED_LOAD_SHIFT ?
>
> Don't you mean:
>
> #define SCHED_LOAD_SHIFT (SCHED_CAPACITY_SHIFT + SCHED_LOAD_RESOLUTION)

Yes, you're right.

>
> ?
>
> Or do you want to increase the capacity resolution as well if you
> increase the load resolution?

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 14:10                 ` Peter Zijlstra
@ 2015-09-08 15:17                   ` Vincent Guittot
  0 siblings, 0 replies; 97+ messages in thread
From: Vincent Guittot @ 2015-09-08 15:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Steve Muckle, Morten Rasmussen, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On 8 September 2015 at 16:10, Peter Zijlstra <peterz@infradead.org> wrote:
> On Tue, Sep 08, 2015 at 03:39:37PM +0200, Vincent Guittot wrote:
>> > Now, given all that, units are a complete mess here, and I'd not mind
>> > something like:
>> >
>> > #if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != SCHED_CAPACITY_SHIFT
>> > #error "something usefull"
>> > #endif
>>
>> In this case why not simply doing
>> #define SCHED_CAPACITY_SHIFT SCHED_LOAD_SHIFT
>> or the opposite ?
>
> Sadly not enough; aside from the fact that we really should do !0
> LOAD_RESOLUTION on 64bit, the whole magic tables (runnable_avg_yN_*[])
> and LOAD_AVG_MAX* values rely on the unit being 1<<10.

Ah yes, I forgot to take the LOAD_RESOLUTION into account.

So after some more thinking, I finally don't see where in the code
we will have an issue if SCHED_CAPACITY_SHIFT is not equal to
(SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) or not equal to 10, with
respect to using a value that doesn't overflow the variables.

Regards,
Vincent

>
> So regardless of defining one in terms of the other, we should check
> both are in fact 10 and error out otherwise.
>
> Changing them must involve recomputing these numbers or otherwise
> mucking about with shifts to ensure its back to 10 when we do this load
> muck.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 14:31                 ` Morten Rasmussen
@ 2015-09-08 15:33                   ` Peter Zijlstra
  2015-09-09 22:23                     ` bsegall
  2015-09-08 16:53                   ` Morten Rasmussen
  1 sibling, 1 reply; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-08 15:33 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Dietmar Eggemann, Steve Muckle, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
> On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> > > Tricky that, LOAD_AVG_MAX very much relies on the unit being 1<<10.
> 
> I don't get why LOAD_AVG_MAX relies on the util_avg shifting being
> 1<<10, it is just the sum of the geometric series and the upper bound of
> util_sum?

It needs a 1024; it might just have been the 1024 ns we use as a period
instead of the scale unit though.

The LOAD_AVG_MAX is the number where adding the next element to the series
doesn't change the result anymore, so scaling it up will allow more
significant elements in the series before we bottom out, which is the _N
thing.
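
As a standalone sanity check of that number (a floating point
approximation, not the kernel's integer series):

/* approximate LOAD_AVG_MAX; compile with -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
	double y = pow(0.5, 1.0 / 32.0);  /* decay factor: y^32 == 1/2 */
	double sum = 1024.0 / (1.0 - y);  /* 1024 * (1 + y + y^2 + ...) */

	printf("y = %.6f, sum = %.0f\n", y, sum);  /* ~47788 */
	return 0;
}

The kernel constant LOAD_AVG_MAX = 47742 is slightly lower because the
series is evaluated with integer truncation and capped after
LOAD_AVG_MAX_N full periods.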



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 14:31                 ` Morten Rasmussen
  2015-09-08 15:33                   ` Peter Zijlstra
@ 2015-09-08 16:53                   ` Morten Rasmussen
  2015-09-09  9:43                     ` Peter Zijlstra
  2015-09-11  7:46                     ` Leo Yan
  1 sibling, 2 replies; 97+ messages in thread
From: Morten Rasmussen @ 2015-09-08 16:53 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Dietmar Eggemann, Steve Muckle, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
> On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> > 
> > Something like teh below..
> > 
> > Another thing to ponder; the downside of scaled_delta_w is that its
> > fairly likely delta is small and you loose all bits, whereas the weight
> > is likely to be large can could loose a fwe bits without issue.
> 
> That issue applies both to load and util.
> 
> > 
> > That is, in fixed point scaling like this, you want to start with the
> > biggest numbers, not the smallest, otherwise you loose too much.
> > 
> > The flip side is of course that now you can share a multiplcation.
> 
> But if we apply the scaling to the weight instead of time, we would only
> have to apply it once and not three times like it is now? So maybe we
> can end up with almost the same number of multiplications.
> 
> We might be loosing bits for low priority task running on cpus at a low
> frequency though.

Something like the below. We should be saving one multiplication.

--- 8< ---

From: Morten Rasmussen <morten.rasmussen@arm.com>
Date: Tue, 8 Sep 2015 17:15:40 +0100
Subject: [PATCH] sched/fair: Scale load/util contribution rather than time

When updating load/util tracking the time delta might be very small (1)
in many cases, and scaling it further down with frequency and cpu invariance
scaling might cause us to lose precision. Instead of scaling time we
can scale the weight of the task for load and the capacity for
utilization. Both weight (>=15) and capacity should be significantly
bigger in most cases. Low priority tasks might still suffer a bit, but
the worst case should be improved, as weight is at least 15 before invariance
scaling.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 38 +++++++++++++++++++-------------------
 1 file changed, 19 insertions(+), 19 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9301291..d5ee72a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2519,8 +2519,6 @@ static u32 __compute_runnable_contrib(u64 n)
 #error "load tracking assumes 2^10 as unit"
 #endif
 
-#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
-
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -2553,10 +2551,10 @@ static __always_inline int
 __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, scaled_delta, periods;
+	u64 delta, periods;
 	u32 contrib;
-	unsigned int delta_w, scaled_delta_w, decayed = 0;
-	unsigned long scale_freq, scale_cpu;
+	unsigned int delta_w, decayed = 0;
+	unsigned long scaled_weight = 0, scale_freq, scale_freq_cpu = 0;
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
-	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+	if (weight || running)
+		scale_freq = arch_scale_freq_capacity(NULL, cpu);
+	if (weight)
+		scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
+	if (running)
+		scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
+							>> SCHED_CAPACITY_SHIFT;
 
 	/* delta_w is the amount already accumulated against our next period */
 	delta_w = sa->period_contrib;
@@ -2594,16 +2597,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
-		scaled_delta_w = cap_scale(delta_w, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * scaled_delta_w;
+			sa->load_sum += scaled_weight * delta_w;
 			if (cfs_rq) {
 				cfs_rq->runnable_load_sum +=
-						weight * scaled_delta_w;
+						scaled_weight * delta_w;
 			}
 		}
 		if (running)
-			sa->util_sum += scaled_delta_w * scale_cpu;
+			sa->util_sum += delta_w * scale_freq_cpu;
 
 		delta -= delta_w;
 
@@ -2620,25 +2622,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		contrib = __compute_runnable_contrib(periods);
-		contrib = cap_scale(contrib, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * contrib;
+			sa->load_sum += scaled_weight * contrib;
 			if (cfs_rq)
-				cfs_rq->runnable_load_sum += weight * contrib;
+				cfs_rq->runnable_load_sum += scaled_weight * contrib;
 		}
 		if (running)
-			sa->util_sum += contrib * scale_cpu;
+			sa->util_sum += contrib * scale_freq_cpu;
 	}
 
 	/* Remainder of delta accrued against u_0` */
-	scaled_delta = cap_scale(delta, scale_freq);
 	if (weight) {
-		sa->load_sum += weight * scaled_delta;
+		sa->load_sum += scaled_weight * delta;
 		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * scaled_delta;
+			cfs_rq->runnable_load_sum += scaled_weight * delta;
 	}
 	if (running)
-		sa->util_sum += scaled_delta * scale_cpu;
+		sa->util_sum += delta * scale_freq_cpu;
 
 	sa->period_contrib += delta;
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 16:53                   ` Morten Rasmussen
@ 2015-09-09  9:43                     ` Peter Zijlstra
  2015-09-09  9:45                       ` Peter Zijlstra
  2015-09-09 11:13                       ` Morten Rasmussen
  2015-09-11  7:46                     ` Leo Yan
  1 sibling, 2 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-09  9:43 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Dietmar Eggemann, Steve Muckle, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote:
> On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:

> > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> > But if we apply the scaling to the weight instead of time, we would only
> > have to apply it once and not three times like it is now? So maybe we
> > can end up with almost the same number of multiplications.
> > 
> > We might be loosing bits for low priority task running on cpus at a low
> > frequency though.
> 
> Something like the below. We should be saving one multiplication.

> @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		return 0;
>  	sa->last_update_time = now;
>  
> -	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> +	if (weight || running)
> +		scale_freq = arch_scale_freq_capacity(NULL, cpu);
> +	if (weight)
> +		scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
> +	if (running)
> +		scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
> +							>> SCHED_CAPACITY_SHIFT;
>  
>  	/* delta_w is the amount already accumulated against our next period */
>  	delta_w = sa->period_contrib;
> @@ -2594,16 +2597,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		 * period and accrue it.
>  		 */
>  		delta_w = 1024 - delta_w;
> -		scaled_delta_w = cap_scale(delta_w, scale_freq);
>  		if (weight) {
> -			sa->load_sum += weight * scaled_delta_w;
> +			sa->load_sum += scaled_weight * delta_w;
>  			if (cfs_rq) {
>  				cfs_rq->runnable_load_sum +=
> -						weight * scaled_delta_w;
> +						scaled_weight * delta_w;
>  			}
>  		}
>  		if (running)
> -			sa->util_sum += scaled_delta_w * scale_cpu;
> +			sa->util_sum += delta_w * scale_freq_cpu;
>  
>  		delta -= delta_w;
>  

Sadly that makes the code worse; I get 14 mul instructions where
previously I had 11.

What happens is that GCC gets confused and cannot constant-propagate the
new variables, so what used to be shifts now end up being actual
multiplications.

With this, I get back to 11. Can you see what happens on ARM, where you
have both functions defined to non-constants?

---
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2551,10 +2551,10 @@ static __always_inline int
 __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
+	unsigned long scaled_weight, scale_freq, scale_freq_cpu;
+	unsigned int delta_w, decayed = 0;
 	u64 delta, periods;
 	u32 contrib;
-	unsigned int delta_w, decayed = 0;
-	unsigned long scaled_weight = 0, scale_freq, scale_freq_cpu = 0;
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2575,13 +2575,10 @@ __update_load_avg(u64 now, int cpu, stru
 		return 0;
 	sa->last_update_time = now;
 
-	if (weight || running)
-		scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	if (weight)
-		scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
-	if (running)
-		scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
-							>> SCHED_CAPACITY_SHIFT;
+	scale_freq = arch_scale_freq_capacity(NULL, cpu);
+
+	scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
+	scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu) >> SCHED_CAPACITY_SHIFT;
 
 	/* delta_w is the amount already accumulated against our next period */
 	delta_w = sa->period_contrib;

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-09  9:43                     ` Peter Zijlstra
@ 2015-09-09  9:45                       ` Peter Zijlstra
  2015-09-09 11:13                       ` Morten Rasmussen
  1 sibling, 0 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-09  9:45 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Dietmar Eggemann, Steve Muckle, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Wed, Sep 09, 2015 at 11:43:05AM +0200, Peter Zijlstra wrote:
> Sadly that makes the code worse; I get 14 mul instructions where
> previously I had 11.

FWIW I count like:

objdump -d defconfig-build/kernel/sched/fair.o |
  awk '/<[^>]*>:/ { p=0 }
       /<update_blocked_averages>:/ { p=1 }
       { if (p) print $0 }' |
  cut -d\: -f2- | grep mul | wc -l

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-09  9:43                     ` Peter Zijlstra
  2015-09-09  9:45                       ` Peter Zijlstra
@ 2015-09-09 11:13                       ` Morten Rasmussen
  2015-09-11 17:22                         ` Morten Rasmussen
  1 sibling, 1 reply; 97+ messages in thread
From: Morten Rasmussen @ 2015-09-09 11:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Dietmar Eggemann, Steve Muckle, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Wed, Sep 09, 2015 at 11:43:05AM +0200, Peter Zijlstra wrote:
> On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote:
> > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
> 
> > > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> > > But if we apply the scaling to the weight instead of time, we would only
> > > have to apply it once and not three times like it is now? So maybe we
> > > can end up with almost the same number of multiplications.
> > > 
> > > We might be loosing bits for low priority task running on cpus at a low
> > > frequency though.
> > 
> > Something like the below. We should be saving one multiplication.
> 
> > @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  		return 0;
> >  	sa->last_update_time = now;
> >  
> > -	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +	if (weight || running)
> > +		scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > +	if (weight)
> > +		scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
> > +	if (running)
> > +		scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
> > +							>> SCHED_CAPACITY_SHIFT;
> >  
> >  	/* delta_w is the amount already accumulated against our next period */
> >  	delta_w = sa->period_contrib;
> > @@ -2594,16 +2597,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  		 * period and accrue it.
> >  		 */
> >  		delta_w = 1024 - delta_w;
> > -		scaled_delta_w = cap_scale(delta_w, scale_freq);
> >  		if (weight) {
> > -			sa->load_sum += weight * scaled_delta_w;
> > +			sa->load_sum += scaled_weight * delta_w;
> >  			if (cfs_rq) {
> >  				cfs_rq->runnable_load_sum +=
> > -						weight * scaled_delta_w;
> > +						scaled_weight * delta_w;
> >  			}
> >  		}
> >  		if (running)
> > -			sa->util_sum += scaled_delta_w * scale_cpu;
> > +			sa->util_sum += delta_w * scale_freq_cpu;
> >  
> >  		delta -= delta_w;
> >  
> 
> Sadly that makes the code worse; I get 14 mul instructions where
> previously I had 11.
> 
> What happens is that GCC gets confused and cannot constant propagate the
> new variables, so what used to be shifts now end up being actual
> multiplications.
> 
> With this, I get back to 11. Can you see what happens on ARM where you
> have both functions defined to non constants?

We repeated the experiment on arm and arm64, but still with the functions
defined as constants to compare with your results. The mul instruction
count seems to be somewhat compiler-version dependent, but consistently
shows no effect from the patch:

arm	before	after
gcc4.9	12	12
gcc4.8	10	10

arm64	before	after
gcc4.9	11	11

I will get numbers with the arch-functions implemented as well and do
hackbench runs to see what happens in terms of performance.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 12:52               ` Peter Zijlstra
  2015-09-08 14:06                 ` Vincent Guittot
  2015-09-08 14:31                 ` Morten Rasmussen
@ 2015-09-09 19:07                 ` Yuyang Du
  2015-09-10 10:06                   ` Peter Zijlstra
  2 siblings, 1 reply; 97+ messages in thread
From: Yuyang Du @ 2015-09-09 19:07 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Dietmar Eggemann, Steve Muckle,
	Morten Rasmussen, mingo, daniel.lezcano, mturquette, rjw,
	Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
>  
> +#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> +#error "load tracking assumes 2^10 as unit"
> +#endif
> +

Sorry for the late response. I might have already missed something.

But I got a bit lost here, with:

#define SCHED_LOAD_SHIFT        (10 + SCHED_LOAD_RESOLUTION)
#define SCHED_CAPACITY_SHIFT    10

the #if is certainly false.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 12:50             ` Dietmar Eggemann
  2015-09-08 14:01               ` Vincent Guittot
@ 2015-09-09 20:15               ` Yuyang Du
  2015-09-10 10:07                 ` Peter Zijlstra
  1 sibling, 1 reply; 97+ messages in thread
From: Yuyang Du @ 2015-09-09 20:15 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Vincent Guittot, Steve Muckle, Morten Rasmussen, peterz, mingo,
	daniel.lezcano, mturquette, rjw, Juri Lelli, sgurrappadi,
	pang.xunlei, linux-kernel

On Tue, Sep 08, 2015 at 01:50:38PM +0100, Dietmar Eggemann wrote:
> > It's both a unit and a SCALE/SHIFT problem, SCHED_LOAD_SHIFT and
> > SCHED_CAPACITY_SHIFT are defined separately so we must be sure to
> > scale the value in the right range. In the case of cpu_usage which
> > returns sa->util_avg , it's the capacity range not the load range.
> 
> Still don't understand why it's a unit problem. IMHO LOAD/UTIL and
> CAPACITY have no unit.

To be more accurate, probably, LOAD can be thought of as having a unit,
but UTIL has no unit.

Anyway, these are my definitions:

1) unit: only for LOAD, and SCHED_LOAD_X is the unit (but
   SCHED_LOAD_RESOLUTION means it can take one of two values, see below)
2) range: aka resolution, or fixed-point percentage (as Ben said)
3) timing ratio: LOAD_AVG_MAX etc., unrelated to SCHED_LOAD_X

> >> I always thought that scale_load_down() takes care of that.
> > 
> > AFAIU, scale_load_down is a way to increase the resolution  of the
> > load not to move from load to capacity
> 
> I tried to figure out why we have this issue when comparing UTIL w/
> CAPACITY and not LOAD w/ CAPACITY:
> 
> Both are initialized like that:
> 
>  sa->load_avg = scale_load_down(se->load.weight);
>  sa->load_sum = sa->load_avg * LOAD_AVG_MAX;
>  sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
>  sa->util_sum = LOAD_AVG_MAX;
> 
> and we use 'se->on_rq * scale_load_down(se->load.weight)' as 'unsigned
> long weight' argument to call __update_load_avg() making sure the
> scaling differences between LOAD and CAPACITY are respected while
> updating sa->load_sum (and sa->load_avg).

Yes, because we used SCHED_LOAD_X as both unit and range for LOAD.
 
> OTAH, we don't apply a scale_load_down for sa->util_[sum/avg] only a '<<
> SCHED_LOAD_SHIFT) / LOAD_AVG_MAX' on sa->util_avg.
> So changing '<< SCHED_LOAD_SHIFT' to '*
> scale_load_down(SCHED_LOAD_SCALE)' would be the logical thing to do.

Actually, for UTIL we only need the range, so let's not conflate it with LOAD.
What about clarifying all of this by redefining SCHED_LOAD_RESOLUTION
as the generic resolution/range macro for LOAD, UTIL, and CAPACITY:

#define SCHED_RESOLUTION_SHIFT  10
#define SCHED_RESOLUTION_SCALE  (1L << SCHED_RESOLUTION_SHIFT)

#if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
# define scale_load(w)          ((w) << SCHED_RESOLUTION_SHIFT)
# define scale_load_down(w)     ((w) >> SCHED_RESOLUTION_SHIFT)
# define SCHED_LOAD_SHIFT       (10 + SCHED_RESOLUTION_SHIFT)
#else
# define scale_load(w)          (w)
# define scale_load_down(w)     (w)
# define SCHED_LOAD_SHIFT       (10)
#endif

#define SCHED_LOAD_SCALE        (1L << SCHED_LOAD_SHIFT)

For UTIL, e.g., it will be initialized as:
sa->util_avg = SCHED_RESOLUTION_SCALE;

And for capacity, we just use SCHED_RESOLUTION_SHIFT
(so SCHED_CAPACITY_SHIFT is not needed).

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 15:33                   ` Peter Zijlstra
@ 2015-09-09 22:23                     ` bsegall
  2015-09-10 11:06                       ` Morten Rasmussen
  0 siblings, 1 reply; 97+ messages in thread
From: bsegall @ 2015-09-09 22:23 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, Vincent Guittot, Dietmar Eggemann,
	Steve Muckle, mingo, daniel.lezcano, yuyang.du, mturquette, rjw,
	Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

Peter Zijlstra <peterz@infradead.org> writes:

> On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
>> On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
>> > > Tricky that, LOAD_AVG_MAX very much relies on the unit being 1<<10.
>> 
>> I don't get why LOAD_AVG_MAX relies on the util_avg shifting being
>> 1<<10, it is just the sum of the geometric series and the upper bound of
>> util_sum?
>
> It needs a 1024, it might just have been the 1024 ns we use a period
> instead of the scale unit though.
>
> The LOAD_AVG_MAX is the number where adding a next element to the series
> doesn't change the result anymore, so scaling it up will allow more
> significant elements to the series before we bottom out, which is the _N
> thing.
>

Yes, as the comments say, the 1024ns unit is arbitrary (and is an
average of not-quite-microseconds instead of just nanoseconds to allow
more bits to load.weight when we multiply load.weight by this number).
In fact there are two arbitrary 1024 units here, which are technically
unrelated and are both unrelated to SCHED_LOAD_RESOLUTION/etc - we
operate on units of almost-microseconds and we also do decays every
almost-millisecond.

There appears to be a bunch of confusion in the current code around
util_sum/util_avg, which appears to be using SCHED_LOAD_SCALE
for a fixed-point percentage or something, which is at least reasonable,
but is initializing it as scale_load_down(SCHED_LOAD_SCALE), which
results in either initializing as 100% or .1% depending on RESOLUTION.
This'll get clobbered on first update, but if it needs to be
initialized, it should either get initialized to something sane or at
least consistent.
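
A worked example of that inconsistency (my reading of the current
definitions, where SCHED_LOAD_SHIFT = 10 + SCHED_LOAD_RESOLUTION and
util_avg is later put on the SCHED_LOAD_SCALE scale):

  SCHED_LOAD_RESOLUTION == 0:
    SCHED_LOAD_SCALE                  = 1 << 10 = 1024
    scale_load_down(SCHED_LOAD_SCALE) = 1024      /* initial util_avg */
    full scale of util_avg            = 1 << 10   -> init is 100%

  SCHED_LOAD_RESOLUTION == 10:
    SCHED_LOAD_SCALE                  = 1 << 20
    scale_load_down(SCHED_LOAD_SCALE) = 1024      /* initial util_avg */
    full scale of util_avg            = 1 << 20   -> init is ~0.1%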

load_sum/load_avg appear to be scale_load_down()ed properly, and appear
to be used as such at a quick glance.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-09 19:07                 ` Yuyang Du
@ 2015-09-10 10:06                   ` Peter Zijlstra
  0 siblings, 0 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-10 10:06 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Vincent Guittot, Dietmar Eggemann, Steve Muckle,
	Morten Rasmussen, mingo, daniel.lezcano, mturquette, rjw,
	Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

On Thu, Sep 10, 2015 at 03:07:48AM +0800, Yuyang Du wrote:
> On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> >  
> > +#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> > +#error "load tracking assumes 2^10 as unit"
> > +#endif
> > +
> 
> Sorry for late response. I might already missed somthing.
> 
> But I got a bit lost here, with:
> 
> #define SCHED_LOAD_SHIFT        (10 + SCHED_LOAD_RESOLUTION)
> #define SCHED_CAPACITY_SHIFT    10
> 
> the #if is certainly false.

That is intended, triggering #error would be 'bad'.

The reason for this bit is to raise a stink if someone 'accidentally'
changes one of these values and expects things to just work.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-09 20:15               ` Yuyang Du
@ 2015-09-10 10:07                 ` Peter Zijlstra
  2015-09-11  0:28                   ` Yuyang Du
  0 siblings, 1 reply; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-10 10:07 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Dietmar Eggemann, Vincent Guittot, Steve Muckle,
	Morten Rasmussen, mingo, daniel.lezcano, mturquette, rjw,
	Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

On Thu, Sep 10, 2015 at 04:15:20AM +0800, Yuyang Du wrote:
> On Tue, Sep 08, 2015 at 01:50:38PM +0100, Dietmar Eggemann wrote:
> > > It's both a unit and a SCALE/SHIFT problem, SCHED_LOAD_SHIFT and
> > > SCHED_CAPACITY_SHIFT are defined separately so we must be sure to
> > > scale the value in the right range. In the case of cpu_usage which
> > > returns sa->util_avg , it's the capacity range not the load range.
> > 
> > Still don't understand why it's a unit problem. IMHO LOAD/UTIL and
> > CAPACITY have no unit.
> 
> To be more accurate, probably, LOAD can be thought of as having unit,
> but UTIL has no unit.

But I'm thinking that is wrong; it should have one, esp. if we go scale
the thing. Giving it the same fixed point unit as load simplifies the
code.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-09 22:23                     ` bsegall
@ 2015-09-10 11:06                       ` Morten Rasmussen
  2015-09-10 11:11                         ` Vincent Guittot
  2015-09-10 17:23                         ` bsegall
  0 siblings, 2 replies; 97+ messages in thread
From: Morten Rasmussen @ 2015-09-10 11:06 UTC (permalink / raw)
  To: bsegall
  Cc: Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steve Muckle,
	mingo, daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Wed, Sep 09, 2015 at 03:23:43PM -0700, bsegall@google.com wrote:
> Peter Zijlstra <peterz@infradead.org> writes:
> 
> > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
> >> On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> >> > > Tricky that, LOAD_AVG_MAX very much relies on the unit being 1<<10.
> >> 
> >> I don't get why LOAD_AVG_MAX relies on the util_avg shifting being
> >> 1<<10, it is just the sum of the geometric series and the upper bound of
> >> util_sum?
> >
> > It needs a 1024, it might just have been the 1024 ns we use a period
> > instead of the scale unit though.
> >
> > The LOAD_AVG_MAX is the number where adding a next element to the series
> > doesn't change the result anymore, so scaling it up will allow more
> > significant elements to the series before we bottom out, which is the _N
> > thing.
> >
> 
> Yes, as the comments say, the 1024ns unit is arbitrary (and is an
> average of not-quite-microseconds instead of just nanoseconds to allow
> more bits to load.weight when we multiply load.weight by this number).
> In fact there are two arbitrary 1024 units here, which are technically
> unrelated and are both unrelated to SCHED_LOAD_RESOLUTION/etc - we
> operate on units of almost-microseconds and we also do decays every
> almost-millisecond.
> 
> There appears to be a bunch of confusion in the current code around
> util_sum/util_avg which appears to using SCHED_LOAD_SCALE
> for a fixed-point percentage or something, which is at least reasonable,
> but is initializing it as scale_load_down(SCHED_LOAD_SCALE), which
> results in either initializing as 100% or .1% depending on RESOLUTION.
> This'll get clobbered on first update, but if it needs to be
> initialized, it should either get initialized to something sane or at
> least consistent.

This is what I thought too. The whole geometric series math is completely
independent of the scale used for priority in load_avg and the fixed
point shifting used for util_avg.

> load_sum/load_avg appear to be scale_load_down()ed properly, and appear
> to be used as such at a quick glance.

I don't think shifting by SCHED_LOAD_SHIFT in __update_load_avg() is
right:

	sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;

util_avg is initialized to low resolution (>> SCHED_LOAD_RESOLUTION):

        sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);

so it appears to be intended to use low resolution like load_avg
(weight is scaled down before it is passed into __update_load_avg()),
but util_avg is shifted up to high resolution. It should be:

	sa->util_avg = (sa->util_sum << (SCHED_LOAD_SHIFT -
					SCHED_LOAD_SHIFT)) / LOAD_AVG_MAX;

to be consistent.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-10 11:06                       ` Morten Rasmussen
@ 2015-09-10 11:11                         ` Vincent Guittot
  2015-09-10 12:10                           ` Morten Rasmussen
  2015-09-10 17:23                         ` bsegall
  1 sibling, 1 reply; 97+ messages in thread
From: Vincent Guittot @ 2015-09-10 11:11 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Benjamin Segall, Peter Zijlstra, Dietmar Eggemann, Steve Muckle,
	mingo, daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On 10 September 2015 at 13:06, Morten Rasmussen
<morten.rasmussen@arm.com> wrote:
> On Wed, Sep 09, 2015 at 03:23:43PM -0700, bsegall@google.com wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>>
>> > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
>> >> On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
>> >> > > Tricky that, LOAD_AVG_MAX very much relies on the unit being 1<<10.
>> >>
>> >> I don't get why LOAD_AVG_MAX relies on the util_avg shifting being
>> >> 1<<10, it is just the sum of the geometric series and the upper bound of
>> >> util_sum?
>> >
>> > It needs a 1024, it might just have been the 1024 ns we use a period
>> > instead of the scale unit though.
>> >
>> > The LOAD_AVG_MAX is the number where adding a next element to the series
>> > doesn't change the result anymore, so scaling it up will allow more
>> > significant elements to the series before we bottom out, which is the _N
>> > thing.
>> >
>>
>> Yes, as the comments say, the 1024ns unit is arbitrary (and is an
>> average of not-quite-microseconds instead of just nanoseconds to allow
>> more bits to load.weight when we multiply load.weight by this number).
>> In fact there are two arbitrary 1024 units here, which are technically
>> unrelated and are both unrelated to SCHED_LOAD_RESOLUTION/etc - we
>> operate on units of almost-microseconds and we also do decays every
>> almost-millisecond.
>>
>> There appears to be a bunch of confusion in the current code around
>> util_sum/util_avg which appears to using SCHED_LOAD_SCALE
>> for a fixed-point percentage or something, which is at least reasonable,
>> but is initializing it as scale_load_down(SCHED_LOAD_SCALE), which
>> results in either initializing as 100% or .1% depending on RESOLUTION.
>> This'll get clobbered on first update, but if it needs to be
>> initialized, it should either get initialized to something sane or at
>> least consistent.
>
> This is what I thought too. The whole geometric series math is completely
> independent of the scale used for priority in load_avg and the fixed
> point shifting used for util_avg.
>
>> load_sum/load_avg appear to be scale_load_down()ed properly, and appear
>> to be used as such at a quick glance.
>
> I don't think shifting by SCHED_LOAD_SHIFT in __update_load_avg() is
> right:
>
>         sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
>
> util_avg is initialized to low resolution (>> SCHED_LOAD_RESOLUTION):
>
>         sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
>
> so it appear to be intended to be using low resolution like load_avg
> (weight is scaled down before it is passed into __update_load_avg()),
> but util_avg is shifted up to high resolution. It should be:
>
>         sa->util_avg = (sa->util_sum << (SCHED_LOAD_SHIFT -
>                                         SCHED_LOAD_SHIFT)) / LOAD_AVG_MAX;

you probably mean (SCHED_LOAD_SHIFT -  SCHED_LOAD_RESOLUTION)

The goal of this patchset is to be able to scale util_avg in the range
of cpu capacity so why don't we directly initialize it with
sa->util_avg = SCHED_CAPACITY_SCALE;

and then use

 sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;

so we don't have to take care of high and low load resolution.

Regards,

>
> to be consistent.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-10 11:11                         ` Vincent Guittot
@ 2015-09-10 12:10                           ` Morten Rasmussen
  2015-09-11  0:50                             ` Yuyang Du
  0 siblings, 1 reply; 97+ messages in thread
From: Morten Rasmussen @ 2015-09-10 12:10 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Benjamin Segall, Peter Zijlstra, Dietmar Eggemann, Steve Muckle,
	mingo, daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Thu, Sep 10, 2015 at 01:11:01PM +0200, Vincent Guittot wrote:
> On 10 September 2015 at 13:06, Morten Rasmussen
> <morten.rasmussen@arm.com> wrote:
> > On Wed, Sep 09, 2015 at 03:23:43PM -0700, bsegall@google.com wrote:
> >> Peter Zijlstra <peterz@infradead.org> writes:
> >>
> >> > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
> >> >> On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> >> >> > > Tricky that, LOAD_AVG_MAX very much relies on the unit being 1<<10.
> >> >>
> >> >> I don't get why LOAD_AVG_MAX relies on the util_avg shifting being
> >> >> 1<<10, it is just the sum of the geometric series and the upper bound of
> >> >> util_sum?
> >> >
> >> > It needs a 1024, it might just have been the 1024 ns we use a period
> >> > instead of the scale unit though.
> >> >
> >> > The LOAD_AVG_MAX is the number where adding a next element to the series
> >> > doesn't change the result anymore, so scaling it up will allow more
> >> > significant elements to the series before we bottom out, which is the _N
> >> > thing.
> >> >
> >>
> >> Yes, as the comments say, the 1024ns unit is arbitrary (and is an
> >> average of not-quite-microseconds instead of just nanoseconds to allow
> >> more bits to load.weight when we multiply load.weight by this number).
> >> In fact there are two arbitrary 1024 units here, which are technically
> >> unrelated and are both unrelated to SCHED_LOAD_RESOLUTION/etc - we
> >> operate on units of almost-microseconds and we also do decays every
> >> almost-millisecond.
> >>
> >> There appears to be a bunch of confusion in the current code around
> >> util_sum/util_avg which appears to using SCHED_LOAD_SCALE
> >> for a fixed-point percentage or something, which is at least reasonable,
> >> but is initializing it as scale_load_down(SCHED_LOAD_SCALE), which
> >> results in either initializing as 100% or .1% depending on RESOLUTION.
> >> This'll get clobbered on first update, but if it needs to be
> >> initialized, it should either get initialized to something sane or at
> >> least consistent.
> >
> > This is what I thought too. The whole geometric series math is completely
> > independent of the scale used for priority in load_avg and the fixed
> > point shifting used for util_avg.
> >
> >> load_sum/load_avg appear to be scale_load_down()ed properly, and appear
> >> to be used as such at a quick glance.
> >
> > I don't think shifting by SCHED_LOAD_SHIFT in __update_load_avg() is
> > right:
> >
> >         sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
> >
> > util_avg is initialized to low resolution (>> SCHED_LOAD_RESOLUTION):
> >
> >         sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
> >
> > so it appear to be intended to be using low resolution like load_avg
> > (weight is scaled down before it is passed into __update_load_avg()),
> > but util_avg is shifted up to high resolution. It should be:
> >
> >         sa->util_avg = (sa->util_sum << (SCHED_LOAD_SHIFT -
> >                                         SCHED_LOAD_SHIFT)) / LOAD_AVG_MAX;
> 
> you probably mean (SCHED_LOAD_SHIFT -  SCHED_LOAD_RESOLUTION)

Yes. Thanks for providing the right expression. There seems to be enough
confusion in this thread already :)

> The goal of this patchset is to be able to scale util_avg in the range
> of cpu capacity so why don't we directly initialize it with
> sa->util_avg = SCHED_CAPACITY_SCALE;
> 
> and then use
> 
>  sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
> 
> so we don't have to take care of high and low load resolution

That works for me, except that the left-shift is gone with PeterZ's
optimization patch posted earlier in this thread. It changes
util_sum to be scaled by capacity instead of being the pure geometric
series, which required the left shift at the end when we divide by
LOAD_AVG_MAX. So it should be equivalent to what you are proposing if we
change the initialization to your proposal too.
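
I.e., roughly (a sketch of before/after, assuming the optimization patch
accumulates util_sum already scaled to the capacity range):

	/* before: util_sum is the pure geometric series, so shift up at the end */
	sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;

	/* after: util_sum already carries the capacity scaling, no shift needed */
	sa->util_avg = sa->util_sum / LOAD_AVG_MAX;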

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-10 11:06                       ` Morten Rasmussen
  2015-09-10 11:11                         ` Vincent Guittot
@ 2015-09-10 17:23                         ` bsegall
  1 sibling, 0 replies; 97+ messages in thread
From: bsegall @ 2015-09-10 17:23 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steve Muckle,
	mingo, daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

Morten Rasmussen <morten.rasmussen@arm.com> writes:

> On Wed, Sep 09, 2015 at 03:23:43PM -0700, bsegall@google.com wrote:
>> Peter Zijlstra <peterz@infradead.org> writes:
>> 
>> > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
>> >> On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
>> >> > > Tricky that, LOAD_AVG_MAX very much relies on the unit being 1<<10.
>> >> 
>> >> I don't get why LOAD_AVG_MAX relies on the util_avg shifting being
>> >> 1<<10, it is just the sum of the geometric series and the upper bound of
>> >> util_sum?
>> >
>> > It needs a 1024, it might just have been the 1024 ns we use a period
>> > instead of the scale unit though.
>> >
>> > The LOAD_AVG_MAX is the number where adding a next element to the series
>> > doesn't change the result anymore, so scaling it up will allow more
>> > significant elements to the series before we bottom out, which is the _N
>> > thing.
>> >
>> 
>> Yes, as the comments say, the 1024ns unit is arbitrary (and is an
>> average of not-quite-microseconds instead of just nanoseconds to allow
>> more bits to load.weight when we multiply load.weight by this number).
>> In fact there are two arbitrary 1024 units here, which are technically
>> unrelated and are both unrelated to SCHED_LOAD_RESOLUTION/etc - we
>> operate on units of almost-microseconds and we also do decays every
>> almost-millisecond.
>> 
>> There appears to be a bunch of confusion in the current code around
>> util_sum/util_avg which appears to using SCHED_LOAD_SCALE
>> for a fixed-point percentage or something, which is at least reasonable,
>> but is initializing it as scale_load_down(SCHED_LOAD_SCALE), which
>> results in either initializing as 100% or .1% depending on RESOLUTION.
>> This'll get clobbered on first update, but if it needs to be
>> initialized, it should either get initialized to something sane or at
>> least consistent.
>
> This is what I thought too. The whole geometric series math is completely
> independent of the scale used for priority in load_avg and the fixed
> point shifting used for util_avg.
>
>> load_sum/load_avg appear to be scale_load_down()ed properly, and appear
>> to be used as such at a quick glance.
>
> I don't think shifting by SCHED_LOAD_SHIFT in __update_load_avg() is
> right:
>
> 	sa->util_avg = (sa->util_sum << SCHED_LOAD_SHIFT) / LOAD_AVG_MAX;
>
> util_avg is initialized to low resolution (>> SCHED_LOAD_RESOLUTION):
>
>         sa->util_avg = scale_load_down(SCHED_LOAD_SCALE);
>
> so it appear to be intended to be using low resolution like load_avg
> (weight is scaled down before it is passed into __update_load_avg()),
> but util_avg is shifted up to high resolution. It should be:
>
> 	sa->util_avg = (sa->util_sum << (SCHED_LOAD_SHIFT -
> 					SCHED_LOAD_SHIFT)) / LOAD_AVG_MAX;
>
> to be consistent.

Yeah, util_avg was/is screwed up in terms of either the initialization
or which shift to use there. The load ones however appear to be fine.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-10 10:07                 ` Peter Zijlstra
@ 2015-09-11  0:28                   ` Yuyang Du
  2015-09-11 10:31                     ` Morten Rasmussen
  0 siblings, 1 reply; 97+ messages in thread
From: Yuyang Du @ 2015-09-11  0:28 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Vincent Guittot, Steve Muckle,
	Morten Rasmussen, mingo, daniel.lezcano, mturquette, rjw,
	Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

On Thu, Sep 10, 2015 at 12:07:27PM +0200, Peter Zijlstra wrote:
> > > Still don't understand why it's a unit problem. IMHO LOAD/UTIL and
> > > CAPACITY have no unit.
> > 
> > To be more accurate, probably, LOAD can be thought of as having unit,
> > but UTIL has no unit.
> 
> But I'm thinking that is wrong; it should have one, esp. if we go scale
> the thing. Giving it the same fixed point unit as load simplifies the
> code.

I think we probably are saying the same thing with different terms. Anyway,
let me reiterate what I said and make it a little more formalized.

UTIL has no unit because it is a pure ratio, the cpu_running%, which is in the
range of [0, 100%]. We increase the resolution, because we don't want
to lose detail (due to integer rounding), by multiplying by a number (say 1024),
and then the range becomes [0, 1024].

CAPACITY is also a ratio, ACTUAL_PERF/MAX_PERF, from (0, 1]. Even LOAD
is the same, a ratio of NICE_X/NICE_0, from [15/1024=0.015, 88761/1024=86.68],
as it only has relative meaning (i.e., when comparing tasks to each other).
I said it has a unit in the sense that it looks like a currency (for instance,
Yuan) that you can use to buy a CPU fair share. But that is just one way to
look at it and there are certainly many others.

So, I still propose to generalize all of these with the following patch, in the
belief that this makes things simple, clear, and less error-prone.

--

Subject: [PATCH] sched/fair: Generalize the load/util averages resolution
 definition

An integer metric needs a certain resolution to determine how much detail we
can look into (without losing detail to integer rounding), which also
determines the range of the metric.

For instance, to increase the resolution of [0, 1] (two levels), one
can multiply 1024 and get [0, 1024] (1025 levels).

In sched/fair, a few metrics depend on the resolution: load/load_avg,
util_avg, and capacity (frequency adjustment). In order to reduce the
risks of making mistakes relating to resolution/range, we therefore
generalize the resolution by defining a basic resolution constant
number, and then formalize all metrics to depend on the basic
resolution. The basic resolution is 1024 or (1 << 10). Further, one
can recursively apply another basic resolution to increase the final
resolution (e.g., 1048576 = 1 << 20).

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |  2 +-
 kernel/sched/sched.h  | 12 +++++++-----
 2 files changed, 8 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 119823d..55a7b93 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -912,7 +912,7 @@ enum cpu_idle_type {
 /*
  * Increase resolution of cpu_capacity calculations
  */
-#define SCHED_CAPACITY_SHIFT	10
+#define SCHED_CAPACITY_SHIFT	SCHED_RESOLUTION_SHIFT
 #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 68cda11..d27cdd8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -40,6 +40,9 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
  */
 #define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
 
+# define SCHED_RESOLUTION_SHIFT	10
+# define SCHED_RESOLUTION_SCALE	(1L << SCHED_RESOLUTION_SHIFT)
+
 /*
  * Increase resolution of nice-level calculations for 64-bit architectures.
  * The extra resolution improves shares distribution and load balancing of
@@ -53,16 +56,15 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
  * increased costs.
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define SCHED_LOAD_RESOLUTION	10
-# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
-# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
+# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
+# define scale_load(w)		((w) << SCHED_RESOLUTION_SHIFT)
+# define scale_load_down(w)	((w) >> SCHED_RESOLUTION_SHIFT)
 #else
-# define SCHED_LOAD_RESOLUTION	0
+# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
 #endif
 
-#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
 #define NICE_0_LOAD		SCHED_LOAD_SCALE
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-10 12:10                           ` Morten Rasmussen
@ 2015-09-11  0:50                             ` Yuyang Du
  0 siblings, 0 replies; 97+ messages in thread
From: Yuyang Du @ 2015-09-11  0:50 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Benjamin Segall, Peter Zijlstra,
	Dietmar Eggemann, Steve Muckle, mingo, daniel.lezcano,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On Thu, Sep 10, 2015 at 01:10:19PM +0100, Morten Rasmussen wrote:
> > > so it appear to be intended to be using low resolution like load_avg
> > > (weight is scaled down before it is passed into __update_load_avg()),
> > > but util_avg is shifted up to high resolution. It should be:
> > >
> > >         sa->util_avg = (sa->util_sum << (SCHED_LOAD_SHIFT -
> > >                                         SCHED_LOAD_SHIFT)) / LOAD_AVG_MAX;
> > 
> > you probably mean (SCHED_LOAD_SHIFT -  SCHED_LOAD_RESOLUTION)
> 
> Yes. Thanks for providing the right expression. There seems to be enough
> confusion in this thread already :)
 
And yes, it is my bad in the first place, sorry, I did not think it through :)

> > The goal of this patchset is to be able to scale util_avg in the range
> > of cpu capacity so why don't we directly initialize it with
> > sa->util_avg = SCHED_CAPACITY_SCALE;

Yes, we should, and specifically, it is because we can combine the
resolution factor for util% * capacity%, so we only need to apply the
resolution once.

> > and then use
> > 
> >  sa->util_avg = (sa->util_sum << SCHED_CAPACITY_SHIFT) / LOAD_AVG_MAX;
> > 
> > so we don't have to take care of high and low load resolution
> 
> That works for me, except that the left-shift has gone be PeterZ's
> optimization patch posted earlier in this thread. It is changing
> util_sum to scaled by capacity instead of being the pure geometric
> series which requires the left shift at the end when we divide by
> LOAD_AVG_MAX. So it should be equivalent to what you are proposing if we
> change the initialization to your proposal too.

I previously initialized the util_sum as:

sa->util_sum = LOAD_AVG_MAX;

This is because, without capacity adjustment, it can save some multiplications
in __update_load_avg(), but if we do capacity adjustment we must
multiply anyway, so it is better to initialize it as:

sa->util_sum = sa->util_avg * LOAD_AVG_MAX;

Anyway, with the patch I posted in the other email in this thread, we
can hopefully fix all of this very cleanly. I did not post a fix patch
because the solutions are already there; it is just a matter of making it
look better, and you can include it in your new version.

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-08 16:53                   ` Morten Rasmussen
  2015-09-09  9:43                     ` Peter Zijlstra
@ 2015-09-11  7:46                     ` Leo Yan
  2015-09-11 10:02                       ` Morten Rasmussen
  1 sibling, 1 reply; 97+ messages in thread
From: Leo Yan @ 2015-09-11  7:46 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steve Muckle,
	mingo, daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

Hi Morten,

On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote:
> On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
> > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> > > 
> > > Something like teh below..
> > > 
> > > Another thing to ponder; the downside of scaled_delta_w is that its
> > > fairly likely delta is small and you loose all bits, whereas the weight
> > > is likely to be large can could loose a fwe bits without issue.
> > 
> > That issue applies both to load and util.
> > 
> > > 
> > > That is, in fixed point scaling like this, you want to start with the
> > > biggest numbers, not the smallest, otherwise you loose too much.
> > > 
> > > The flip side is of course that now you can share a multiplcation.
> > 
> > But if we apply the scaling to the weight instead of time, we would only
> > have to apply it once and not three times like it is now? So maybe we
> > can end up with almost the same number of multiplications.
> > 
> > We might be loosing bits for low priority task running on cpus at a low
> > frequency though.
> 
> Something like the below. We should be saving one multiplication.
> 
> --- 8< ---
> 
> From: Morten Rasmussen <morten.rasmussen@arm.com>
> Date: Tue, 8 Sep 2015 17:15:40 +0100
> Subject: [PATCH] sched/fair: Scale load/util contribution rather than time
> 
> When updating load/util tracking the time delta might be very small (1)
> in many cases, scaling it futher down with frequency and cpu invariance
> scaling might cause us to loose precision. Instead of scaling time we
> can scale the weight of the task for load and the capacity for
> utilization. Both weight (>=15) and capacity should be significantly
> bigger in most cases. Low priority tasks might still suffer a bit but
> worst should be improved, as weight is at least 15 before invariance
> scaling.
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 38 +++++++++++++++++++-------------------
>  1 file changed, 19 insertions(+), 19 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 9301291..d5ee72a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2519,8 +2519,6 @@ static u32 __compute_runnable_contrib(u64 n)
>  #error "load tracking assumes 2^10 as unit"
>  #endif
>  
> -#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
> -
>  /*
>   * We can represent the historical contribution to runnable average as the
>   * coefficients of a geometric series.  To do this we sub-divide our runnable
> @@ -2553,10 +2551,10 @@ static __always_inline int
>  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
>  {
> -	u64 delta, scaled_delta, periods;
> +	u64 delta, periods;
>  	u32 contrib;
> -	unsigned int delta_w, scaled_delta_w, decayed = 0;
> -	unsigned long scale_freq, scale_cpu;
> +	unsigned int delta_w, decayed = 0;
> +	unsigned long scaled_weight = 0, scale_freq, scale_freq_cpu = 0;
>  
>  	delta = now - sa->last_update_time;
>  	/*
> @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		return 0;
>  	sa->last_update_time = now;
>  
> -	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> +	if (weight || running)
> +		scale_freq = arch_scale_freq_capacity(NULL, cpu);
> +	if (weight)
> +		scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
> +	if (running)
> +		scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
> +							>> SCHED_CAPACITY_SHIFT;

Maybe the question below is stupid :)

Why not calculate scaled_weight depending on the cpu's capacity as well?
Something like: scaled_weight = weight * scale_freq_cpu.

>  	/* delta_w is the amount already accumulated against our next period */
>  	delta_w = sa->period_contrib;
> @@ -2594,16 +2597,15 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  		 * period and accrue it.
>  		 */
>  		delta_w = 1024 - delta_w;
> -		scaled_delta_w = cap_scale(delta_w, scale_freq);
>  		if (weight) {
> -			sa->load_sum += weight * scaled_delta_w;
> +			sa->load_sum += scaled_weight * delta_w;
>  			if (cfs_rq) {
>  				cfs_rq->runnable_load_sum +=
> -						weight * scaled_delta_w;
> +						scaled_weight * delta_w;
>  			}
>  		}
>  		if (running)
> -			sa->util_sum += scaled_delta_w * scale_cpu;
> +			sa->util_sum += delta_w * scale_freq_cpu;
>  
>  		delta -= delta_w;
>  
> @@ -2620,25 +2622,23 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
>  
>  		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
>  		contrib = __compute_runnable_contrib(periods);
> -		contrib = cap_scale(contrib, scale_freq);
>  		if (weight) {
> -			sa->load_sum += weight * contrib;
> +			sa->load_sum += scaled_weight * contrib;
>  			if (cfs_rq)
> -				cfs_rq->runnable_load_sum += weight * contrib;
> +				cfs_rq->runnable_load_sum += scaled_weight * contrib;
>  		}
>  		if (running)
> -			sa->util_sum += contrib * scale_cpu;
> +			sa->util_sum += contrib * scale_freq_cpu;
>  	}
>  
>  	/* Remainder of delta accrued against u_0` */
> -	scaled_delta = cap_scale(delta, scale_freq);
>  	if (weight) {
> -		sa->load_sum += weight * scaled_delta;
> +		sa->load_sum += scaled_weight * delta;
>  		if (cfs_rq)
> -			cfs_rq->runnable_load_sum += weight * scaled_delta;
> +			cfs_rq->runnable_load_sum += scaled_weight * delta;
>  	}
>  	if (running)
> -		sa->util_sum += scaled_delta * scale_cpu;
> +		sa->util_sum += delta * scale_freq_cpu;
>  
>  	sa->period_contrib += delta;
>  
> -- 
> 1.9.1
> 
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-11  7:46                     ` Leo Yan
@ 2015-09-11 10:02                       ` Morten Rasmussen
  2015-09-11 14:11                         ` Leo Yan
  0 siblings, 1 reply; 97+ messages in thread
From: Morten Rasmussen @ 2015-09-11 10:02 UTC (permalink / raw)
  To: Leo Yan
  Cc: Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steve Muckle,
	mingo, daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Fri, Sep 11, 2015 at 03:46:51PM +0800, Leo Yan wrote:
> Hi Morten,
> 
> On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote:
> > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
> > > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> > > > 
> > > > Something like teh below..
> > > > 
> > > > Another thing to ponder; the downside of scaled_delta_w is that its
> > > > fairly likely delta is small and you loose all bits, whereas the weight
> > > > is likely to be large can could loose a fwe bits without issue.
> > > 
> > > That issue applies both to load and util.
> > > 
> > > > 
> > > > That is, in fixed point scaling like this, you want to start with the
> > > > biggest numbers, not the smallest, otherwise you loose too much.
> > > > 
> > > > The flip side is of course that now you can share a multiplcation.
> > > 
> > > But if we apply the scaling to the weight instead of time, we would only
> > > have to apply it once and not three times like it is now? So maybe we
> > > can end up with almost the same number of multiplications.
> > > 
> > > We might be loosing bits for low priority task running on cpus at a low
> > > frequency though.
> > 
> > Something like the below. We should be saving one multiplication.
> > 
> > --- 8< ---
> > 
> > From: Morten Rasmussen <morten.rasmussen@arm.com>
> > Date: Tue, 8 Sep 2015 17:15:40 +0100
> > Subject: [PATCH] sched/fair: Scale load/util contribution rather than time
> > 
> > When updating load/util tracking the time delta might be very small (1)
> > in many cases, scaling it futher down with frequency and cpu invariance
> > scaling might cause us to loose precision. Instead of scaling time we
> > can scale the weight of the task for load and the capacity for
> > utilization. Both weight (>=15) and capacity should be significantly
> > bigger in most cases. Low priority tasks might still suffer a bit but
> > worst should be improved, as weight is at least 15 before invariance
> > scaling.
> > 
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 38 +++++++++++++++++++-------------------
> >  1 file changed, 19 insertions(+), 19 deletions(-)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 9301291..d5ee72a 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2519,8 +2519,6 @@ static u32 __compute_runnable_contrib(u64 n)
> >  #error "load tracking assumes 2^10 as unit"
> >  #endif
> >  
> > -#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
> > -
> >  /*
> >   * We can represent the historical contribution to runnable average as the
> >   * coefficients of a geometric series.  To do this we sub-divide our runnable
> > @@ -2553,10 +2551,10 @@ static __always_inline int
> >  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
> >  {
> > -	u64 delta, scaled_delta, periods;
> > +	u64 delta, periods;
> >  	u32 contrib;
> > -	unsigned int delta_w, scaled_delta_w, decayed = 0;
> > -	unsigned long scale_freq, scale_cpu;
> > +	unsigned int delta_w, decayed = 0;
> > +	unsigned long scaled_weight = 0, scale_freq, scale_freq_cpu = 0;
> >  
> >  	delta = now - sa->last_update_time;
> >  	/*
> > @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> >  		return 0;
> >  	sa->last_update_time = now;
> >  
> > -	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > +	if (weight || running)
> > +		scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > +	if (weight)
> > +		scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
> > +	if (running)
> > +		scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
> > +							>> SCHED_CAPACITY_SHIFT;
> 
> maybe below question is stupid :)
> 
> Why not calculate the scaled_weight depend on cpu's capacity as well?
> So like: scaled_weight = weight * scale_freq_cpu.

IMHO, we should not scale load by cpu capacity since load isn't really
comparable to capacity. It is runnable-time based (not running-time based
like utilization) and the idea is to use it for balancing when the
system is fully utilized. When the system is fully utilized we can't say
anything about the true compute demands of a task; it may get exactly
the cpu time it needs or it may need much more. Hence it doesn't really
make sense to scale the demand by the capacity of the cpu. Two busy
loops on cpus with different cpu capacities should have the same load as
they have the same compute demands.
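
In terms of the patch earlier in this thread, the distinction boils down to
something like this (a sketch using the names from that patch, not a verbatim
excerpt):

	/* load: frequency-invariant only */
	scaled_weight  = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
	sa->load_sum  += scaled_weight * delta;

	/* utilization: frequency- and cpu-invariant */
	scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
						>> SCHED_CAPACITY_SHIFT;
	sa->util_sum  += delta * scale_freq_cpu;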

I mentioned this briefly in the commit message of patch 3 in this
series.

Makes sense?

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-11  0:28                   ` Yuyang Du
@ 2015-09-11 10:31                     ` Morten Rasmussen
  2015-09-11 17:05                       ` bsegall
  0 siblings, 1 reply; 97+ messages in thread
From: Morten Rasmussen @ 2015-09-11 10:31 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Dietmar Eggemann, Vincent Guittot, Steve Muckle,
	mingo, daniel.lezcano, mturquette, rjw, Juri Lelli, sgurrappadi,
	pang.xunlei, linux-kernel

On Fri, Sep 11, 2015 at 08:28:25AM +0800, Yuyang Du wrote:
> On Thu, Sep 10, 2015 at 12:07:27PM +0200, Peter Zijlstra wrote:
> > > > Still don't understand why it's a unit problem. IMHO LOAD/UTIL and
> > > > CAPACITY have no unit.
> > > 
> > > To be more accurate, probably, LOAD can be thought of as having unit,
> > > but UTIL has no unit.
> > 
> > But I'm thinking that is wrong; it should have one, esp. if we go scale
> > the thing. Giving it the same fixed point unit as load simplifies the
> > code.
> 
> I think we probably are saying the same thing with different terms. Anyway,
> let me reiterate what I said and make it a little more formalized.
> 
> UTIL has no unit because it is pure ratio, the cpu_running%, which is in the
> range of [0, 100%], and we increase the resolution, because we don't want
> to lose many (due to integer rounding) by multiplying a number (say 1024), then
> the range becomes [0, 1024].

Fully agree, and with frequency invariance we basically scale running
time to take into account that the cpu might be running slower than it
is capable of at the highest frequency. With cpu invariance we also scale
by any difference there might be in max frequency and/or cpu
micro-architecture so that utilization becomes comparable between cpus. One
can also see it as slowing down or speeding up time depending on the current
compute capacity of the cpu relative to the max capacity.
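
Concretely, the two scaling steps amount to something like this (a sketch in
terms of the existing helpers, where both factors are on a 1024 scale):

	scale_freq = arch_scale_freq_capacity(NULL, cpu);	/* curr_freq / max_freq */
	scale_cpu  = arch_scale_cpu_capacity(NULL, cpu);	/* this cpu's max capacity
								   relative to the biggest cpu */

	/* running time delta made frequency- and cpu-invariant */
	contrib = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
	contrib = (contrib * scale_cpu) >> SCHED_CAPACITY_SHIFT;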

> CAPACITY is also a ratio of ACTUAL_PERF/MAX_PERF, from (0, 1]. Even LOAD
> is the same, a ratio of NICE_X/NICE_0, from [15/1024=0.015, 88761/1024=86.68],
> as it only has relativity meaning (i.e., when comparing to each other).

Fully agree, though 'LOAD' is a somewhat overloaded term in the
scheduler. Just to be clear, you refer to load.weight; load_avg is the
product of load.weight and the task's runnable time ratio.

> I said it has unit, it is in the sense that it looks like currency (for instance,
> Yuan), you can use to buy CPU fair share. But it is just how you look at it and
> there are certainly many other ways.
> 
> So, I still propose to generalize all these with the following patch, in the
> belief that this makes it simple and clear, and error-reducing. 
> 
> --
> 
> Subject: [PATCH] sched/fair: Generalize the load/util averages resolution
>  definition
> 
> A integer metric needs certain resolution to allow how much detail we
> can look into (not losing detail by integer rounding), which also
> determines the range of the metrics.
> 
> For instance, to increase the resolution of [0, 1] (two levels), one
> can multiply 1024 and get [0, 1024] (1025 levels).
> 
> In sched/fair, a few metrics depend on the resolution: load/load_avg,
> util_avg, and capacity (frequency adjustment). In order to reduce the
> risks of making mistakes relating to resolution/range, we therefore
> generalize the resolution by defining a basic resolution constant
> number, and then formalize all metrics to depend on the basic
> resolution. The basic resolution is 1024 or (1 << 10). Further, one
> can recursively apply another basic resolution to increase the final
> resolution (e.g., 1048576 = 1 << 20).
> 
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
>  include/linux/sched.h |  2 +-
>  kernel/sched/sched.h  | 12 +++++++-----
>  2 files changed, 8 insertions(+), 6 deletions(-)
> 
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index 119823d..55a7b93 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -912,7 +912,7 @@ enum cpu_idle_type {
>  /*
>   * Increase resolution of cpu_capacity calculations
>   */
> -#define SCHED_CAPACITY_SHIFT	10
> +#define SCHED_CAPACITY_SHIFT	SCHED_RESOLUTION_SHIFT
>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
>  
>  /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 68cda11..d27cdd8 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -40,6 +40,9 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
>   */
>  #define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
>  
> +# define SCHED_RESOLUTION_SHIFT	10
> +# define SCHED_RESOLUTION_SCALE	(1L << SCHED_RESOLUTION_SHIFT)
> +
>  /*
>   * Increase resolution of nice-level calculations for 64-bit architectures.
>   * The extra resolution improves shares distribution and load balancing of
> @@ -53,16 +56,15 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
>   * increased costs.
>   */
>  #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
> -# define SCHED_LOAD_RESOLUTION	10
> -# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
> -# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
> +# define scale_load(w)		((w) << SCHED_RESOLUTION_SHIFT)
> +# define scale_load_down(w)	((w) >> SCHED_RESOLUTION_SHIFT)
>  #else
> -# define SCHED_LOAD_RESOLUTION	0
> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT)
>  # define scale_load(w)		(w)
>  # define scale_load_down(w)	(w)
>  #endif
>  
> -#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
>  #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
>  
>  #define NICE_0_LOAD		SCHED_LOAD_SCALE

I think this is pretty much the required relationship between all the
SHIFTs and SCALEs that Peter checked for in his #if-#error thing
earlier, so no disagreements from my side :-)

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define
  2015-09-04  7:26       ` Vincent Guittot
  2015-09-07 13:25         ` Dietmar Eggemann
@ 2015-09-11 13:21         ` Dietmar Eggemann
  2015-09-11 14:45           ` Vincent Guittot
  1 sibling, 1 reply; 97+ messages in thread
From: Dietmar Eggemann @ 2015-09-11 13:21 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, Peter Zijlstra, mingo, Daniel Lezcano,
	Yuyang Du, Michael Turquette, rjw, Juri Lelli,
	Sai Charan Gurrappadi, pang.xunlei, linux-kernel

On 04/09/15 08:26, Vincent Guittot wrote:
> On 3 September 2015 at 21:58, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:

[...]

> So, with the patch below that updates the arm definition of
> arch_scale_cpu_capacity, you can add my Acked-by: Vincent Guittot
> <vincent.guittot@linaro.org> on this patch and the additional one
> below

My tests on ARM TC2 (only the 2 A15's) show that the non-default
arch_scale_cpu_capacity function in arch/arm/kernel/topology.c, and with
it the extra function call from __update_load_avg(), has no measurable
influence on performance:

perf stat --null --repeat 10 -- perf bench sched messaging --g 50 -l 200

- default arch_scale_cpu_capacity function [kernel/sched/sched.h]

16.006976251 seconds time elapsed ( +- 0.29% )
16.063814914 seconds time elapsed ( +- 0.37% )
16.088199252 seconds time elapsed ( +- 0.39% )

- arch_scale_cpu_capacity function [arch/arm/kernel/topology.c]

15.945975308 seconds time elapsed ( +- 0.10% )
16.131203074 seconds time elapsed ( +- 0.21% )
16.108302562 seconds time elapsed ( +- 0.41% )

If I force the function to be inline, the result is slightly worse:

- arch_scale_cpu_capacity function [arch/arm/include/asm/topology.h]

16.122545216 seconds time elapsed ( +- 0.09% )
16.285819258 seconds time elapsed ( +- 0.35% )
16.157454024 seconds time elapsed ( +- 0.15% )

So I think we can connect the arch_scale_cpu_capacity function
[arch/arm/kernel/topology.c] to the CFS scheduler for ARCH=arm so that
people get a cpu scale different from 1024 on arm big.little machines w/
A15/A7 in case they specify clock-frequency properties in their dtb file.

Can we still have your 'Acked-by' for this patch and 3/6 even though we
now scale weight (by frequency) and scale_freq (by cpu) instead of the
time related values (delta_w, contrib, delta)?
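
(For illustration — a standalone sketch with made-up numbers, not kernel code,
of why scaling the weight rather than the often tiny time delta loses less
precision, per the argument made elsewhere in this thread:)

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10

int main(void)
{
	unsigned long weight = 15;	/* lowest priority (nice 19) weight */
	unsigned long delta = 3;	/* very small time delta */
	unsigned long scale_freq = 600;	/* ~59% of max frequency */

	/* scale the time first: the small delta collapses to 1 */
	unsigned long scaled_delta = delta * scale_freq >> SCHED_CAPACITY_SHIFT;
	/* scale the weight first: more bits survive the shift */
	unsigned long scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;

	printf("time-scaled contribution:   %lu\n", weight * scaled_delta);	/* 15 */
	printf("weight-scaled contribution: %lu\n", scaled_weight * delta);	/* 24 */
	/* the exact value would be ~26, so scaling the weight gets closer */
	return 0;
}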

Thanks,

-- Dietmar

> 
> Regards,
> Vincent
> 
>>
>> To connect the cpu invariant engine (scale_cpu_capacity()
>> [arch/arm/kernel/topology.c]) with the scheduler, something like this is
>> missing:
>>
>> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
>> index 370f7a732900..17c6b3243196 100644
>> --- a/arch/arm/include/asm/topology.h
>> +++ b/arch/arm/include/asm/topology.h
>> @@ -24,6 +24,10 @@ void init_cpu_topology(void);
>>  void store_cpu_topology(unsigned int cpuid);
>>  const struct cpumask *cpu_coregroup_mask(int cpu);
>>
>> +#define arch_scale_cpu_capacity scale_cpu_capacity
>> +struct sched_domain;
>> +extern unsigned long scale_cpu_capacity(struct sched_domain *sd, int cpu);
>> +
>>  #else
>>
>>  static inline void init_cpu_topology(void) { }
>> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
>> index 08b7847bf912..907e0d2d9b82 100644
>> --- a/arch/arm/kernel/topology.c
>> +++ b/arch/arm/kernel/topology.c
>> @@ -42,7 +42,7 @@
>>   */
>>  static DEFINE_PER_CPU(unsigned long, cpu_scale);
>>
>> -unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>> +unsigned long scale_cpu_capacity(struct sched_domain *sd, int cpu)
>>  {
>>         return per_cpu(cpu_scale, cpu);
>>  }
>> @@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
>>         set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
>>
>>         pr_info("CPU%u: update cpu_capacity %lu\n",
>> -               cpu, arch_scale_cpu_capacity(NULL, cpu));
>> +               cpu, scale_cpu_capacity(NULL, cpu));
>>  }
>>
>> -- Dietmar
>>
>> [...]


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-11 10:02                       ` Morten Rasmussen
@ 2015-09-11 14:11                         ` Leo Yan
  0 siblings, 0 replies; 97+ messages in thread
From: Leo Yan @ 2015-09-11 14:11 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Vincent Guittot, Dietmar Eggemann, Steve Muckle,
	mingo, daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Fri, Sep 11, 2015 at 11:02:33AM +0100, Morten Rasmussen wrote:
> On Fri, Sep 11, 2015 at 03:46:51PM +0800, Leo Yan wrote:
> > Hi Morten,
> > 
> > On Tue, Sep 08, 2015 at 05:53:31PM +0100, Morten Rasmussen wrote:
> > > On Tue, Sep 08, 2015 at 03:31:58PM +0100, Morten Rasmussen wrote:
> > > > On Tue, Sep 08, 2015 at 02:52:05PM +0200, Peter Zijlstra wrote:
> > > > > 
> > > > > Something like teh below..
> > > > > 
> > > > > Another thing to ponder; the downside of scaled_delta_w is that it's
> > > > > fairly likely delta is small and you lose all bits, whereas the weight
> > > > > is likely to be large and could lose a few bits without issue.
> > > > 
> > > > That issue applies both to load and util.
> > > > 
> > > > > 
> > > > > That is, in fixed point scaling like this, you want to start with the
> > > > > biggest numbers, not the smallest, otherwise you lose too much.
> > > > > 
> > > > > The flip side is of course that now you can share a multiplication.
> > > > 
> > > > But if we apply the scaling to the weight instead of time, we would only
> > > > have to apply it once and not three times like it is now? So maybe we
> > > > can end up with almost the same number of multiplications.
> > > > 
> > > > We might be losing bits for low priority tasks running on cpus at a low
> > > > frequency though.
> > > 
> > > Something like the below. We should be saving one multiplication.
> > > 
> > > --- 8< ---
> > > 
> > > From: Morten Rasmussen <morten.rasmussen@arm.com>
> > > Date: Tue, 8 Sep 2015 17:15:40 +0100
> > > Subject: [PATCH] sched/fair: Scale load/util contribution rather than time
> > > 
> > > When updating load/util tracking the time delta might be very small (1)
> > > in many cases, scaling it further down with frequency and cpu invariance
> > > scaling might cause us to lose precision. Instead of scaling time we
> > > can scale the weight of the task for load and the capacity for
> > > utilization. Both weight (>=15) and capacity should be significantly
> > > bigger in most cases. Low priority tasks might still suffer a bit but
> > > the worst case should be improved, as weight is at least 15 before
> > > invariance scaling.
> > > 
> > > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > > ---
> > >  kernel/sched/fair.c | 38 +++++++++++++++++++-------------------
> > >  1 file changed, 19 insertions(+), 19 deletions(-)
> > > 
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 9301291..d5ee72a 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -2519,8 +2519,6 @@ static u32 __compute_runnable_contrib(u64 n)
> > >  #error "load tracking assumes 2^10 as unit"
> > >  #endif
> > >  
> > > -#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
> > > -
> > >  /*
> > >   * We can represent the historical contribution to runnable average as the
> > >   * coefficients of a geometric series.  To do this we sub-divide our runnable
> > > @@ -2553,10 +2551,10 @@ static __always_inline int
> > >  __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > >  		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
> > >  {
> > > -	u64 delta, scaled_delta, periods;
> > > +	u64 delta, periods;
> > >  	u32 contrib;
> > > -	unsigned int delta_w, scaled_delta_w, decayed = 0;
> > > -	unsigned long scale_freq, scale_cpu;
> > > +	unsigned int delta_w, decayed = 0;
> > > +	unsigned long scaled_weight = 0, scale_freq, scale_freq_cpu = 0;
> > >  
> > >  	delta = now - sa->last_update_time;
> > >  	/*
> > > @@ -2577,8 +2575,13 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
> > >  		return 0;
> > >  	sa->last_update_time = now;
> > >  
> > > -	scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > > -	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
> > > +	if (weight || running)
> > > +		scale_freq = arch_scale_freq_capacity(NULL, cpu);
> > > +	if (weight)
> > > +		scaled_weight = weight * scale_freq >> SCHED_CAPACITY_SHIFT;
> > > +	if (running)
> > > +		scale_freq_cpu = scale_freq * arch_scale_cpu_capacity(NULL, cpu)
> > > +							>> SCHED_CAPACITY_SHIFT;
> > 
> > maybe below question is stupid :)
> > 
> > Why not calculate scaled_weight depending on the cpu's capacity as well?
> > So like: scaled_weight = weight * scale_freq_cpu.
> 
> IMHO, we should not scale load by cpu capacity since load isn't really
> comparable to capacity. It is runnable time based (not running time like
> utilization) and the idea is to use it for balancing when the
> system is fully utilized. When the system is fully utilized we can't say
> anything about the true compute demands of a task, it may get exactly
> the cpu time it needs or it may need much more. Hence it doesn't really
> make sense to scale the demand by the capacity of the cpu. Two busy
> loops on cpus with different cpu capacities should have the same load as
> they have the same compute demands.
> 
> I mentioned this briefly in the commit message of patch 3 in this
> series.
> 
> Makes sense?

Yeah, after your reminder, I recognise that load only includes runnable
time on the rq, not running time.
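
(As a concrete illustration with made-up but representative numbers: an
always-running nice-0 busy loop on a big cpu of capacity 1024 ends up with
util_avg ~1024 and load_avg ~1024; the same busy loop on a little cpu of
capacity 512 running at max frequency ends up with util_avg ~512, since
running time is scaled by cpu capacity, but still load_avg ~1024, because its
compute demand is unchanged and load is deliberately not cpu-scaled.)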

Thanks,
Leo Yan

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define
  2015-09-11 13:21         ` Dietmar Eggemann
@ 2015-09-11 14:45           ` Vincent Guittot
  0 siblings, 0 replies; 97+ messages in thread
From: Vincent Guittot @ 2015-09-11 14:45 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Morten Rasmussen, Peter Zijlstra, mingo, Daniel Lezcano,
	Yuyang Du, Michael Turquette, rjw, Juri Lelli,
	Sai Charan Gurrappadi, pang.xunlei, linux-kernel

On 11 September 2015 at 15:21, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
> On 04/09/15 08:26, Vincent Guittot wrote:
>> On 3 September 2015 at 21:58, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> [...]
>
>> So, with the patch below that updates the arm definition of
>> arch_scale_cpu_capacity, you can add my Acked-by: Vincent Guittot
>> <vincent.guittot@linaro.org> on this patch and the additional one
>> below
>
> My tests on ARM TC2 (only the 2 A15's) show that the non-default
> arch_scale_cpu_capacity function in arch/arm/kernel/topology.c, and
> with it the extra function call from __update_load_avg(), has no
> measurable influence on performance:
>
> perf stat --null --repeat 10 -- perf bench sched messaging --g 50 -l 200
>
> - default arch_scale_cpu_capacity function [kernel/sched/sched.h]
>
> 16.006976251 seconds time elapsed ( +- 0.29% )
> 16.063814914 seconds time elapsed ( +- 0.37% )
> 16.088199252 seconds time elapsed ( +- 0.39% )
>
> - arch_scale_cpu_capacity function [arch/arm/kernel/topology.c]
>
> 15.945975308 seconds time elapsed ( +- 0.10% )
> 16.131203074 seconds time elapsed ( +- 0.21% )
> 16.108302562 seconds time elapsed ( +- 0.41% )
>
> If I force the function to be inline, the result is slightly worse:
>
> - arch_scale_cpu_capacity function [arch/arm/include/asm/topology.h]
>
> 16.122545216 seconds time elapsed ( +- 0.09% )
> 16.285819258 seconds time elapsed ( +- 0.35% )
> 16.157454024 seconds time elapsed ( +- 0.15% )
>
> So I think we can connect the arch_scale_cpu_capacity function
> [arch/arm/kernel/topology.c] to the CFS scheduler for ARCH=arm so that
> people get a cpu scale different from 1024 on arm big.little machines w/
> A15/A7 in case they specify clock-frequency properties in their dtb file.
>
> Can we still have your 'Acked-by' for this patch and 3/6 even though we
> now scale weight (by frequency) and scale_freq (by cpu) instead of the
> time related values (delta_w, contrib, delta)?

Yes, Please add my Acked-by

Vincent

>
> Thanks,
>
> -- Dietmar
>
>>
>> Regards,
>> Vincent
>>
>>>
>>> To connect the cpu invariant engine (scale_cpu_capacity()
>>> [arch/arm/kernel/topology.c]) with the scheduler, something like this is
>>> missing:
>>>
>>> diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
>>> index 370f7a732900..17c6b3243196 100644
>>> --- a/arch/arm/include/asm/topology.h
>>> +++ b/arch/arm/include/asm/topology.h
>>> @@ -24,6 +24,10 @@ void init_cpu_topology(void);
>>>  void store_cpu_topology(unsigned int cpuid);
>>>  const struct cpumask *cpu_coregroup_mask(int cpu);
>>>
>>> +#define arch_scale_cpu_capacity scale_cpu_capacity
>>> +struct sched_domain;
>>> +extern unsigned long scale_cpu_capacity(struct sched_domain *sd, int cpu);
>>> +
>>>  #else
>>>
>>>  static inline void init_cpu_topology(void) { }
>>> diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
>>> index 08b7847bf912..907e0d2d9b82 100644
>>> --- a/arch/arm/kernel/topology.c
>>> +++ b/arch/arm/kernel/topology.c
>>> @@ -42,7 +42,7 @@
>>>   */
>>>  static DEFINE_PER_CPU(unsigned long, cpu_scale);
>>>
>>> -unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>>> +unsigned long scale_cpu_capacity(struct sched_domain *sd, int cpu)
>>>  {
>>>         return per_cpu(cpu_scale, cpu);
>>>  }
>>> @@ -166,7 +166,7 @@ static void update_cpu_capacity(unsigned int cpu)
>>>         set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
>>>
>>>         pr_info("CPU%u: update cpu_capacity %lu\n",
>>> -               cpu, arch_scale_cpu_capacity(NULL, cpu));
>>> +               cpu, scale_cpu_capacity(NULL, cpu));
>>>  }
>>>
>>> -- Dietmar
>>>
>>> [...]
>

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 4/6] sched/fair: Name utilization related data and functions consistently
  2015-09-04  9:08   ` Vincent Guittot
@ 2015-09-11 16:35     ` Dietmar Eggemann
  0 siblings, 0 replies; 97+ messages in thread
From: Dietmar Eggemann @ 2015-09-11 16:35 UTC (permalink / raw)
  To: Vincent Guittot, Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Daniel Lezcano, Yuyang Du,
	Michael Turquette, rjw, Juri Lelli, Sai Charan Gurrappadi,
	pang.xunlei, linux-kernel

On 04/09/15 10:08, Vincent Guittot wrote:
> On 14 August 2015 at 18:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>
>> Use the advent of the per-entity load tracking rewrite to streamline the
>> naming of utilization related data and functions by using
>> {prefix_}util{_suffix} consistently. Moreover call both signals
>> ({se,cfs}.avg.util_avg) utilization.
> 
> I don't have a strong opinion about the naming of this variable but I
> remember a discussion about this topic:
> https://lkml.org/lkml/2014/9/11/474 : "Call the pure running number
> 'utilization' and this scaled with capacity 'usage' "
> 
> The utilization has been shorten to util with the rewrite of the pelt,
> so the current use of usage in get_cpu_usage still follows this rule.

But since we now do the frequency and cpu capacity scaling in
__update_load_avg()

util_sum += t * scale_freq/SCHED_CAPACITY_SCALE * scale_cpu/SCHED_CAPACITY_SCALE

util_avg = util_sum / LOAD_AVG_MAX;

(where scale_freq = arch_scale_freq_capacity() and
scale_cpu = arch_scale_cpu_capacity()),

we could either name everything 'util' or everything 'usage' (including
the utilization sum and avg in struct sched_avg).
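
(In code terms — a standalone sketch, not kernel code, mirroring the scale()
helper used by the patches further down in this thread; the numbers are made
up for illustration:)

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10
#define scale(v, s)		((v) * (s) >> SCHED_CAPACITY_SHIFT)

int main(void)
{
	long delta = 1024;	/* one full time segment */
	long scale_freq = 512;	/* cpu running at 50% of its max frequency */
	long scale_cpu = 430;	/* little cpu, ~42% of the biggest cpu */

	/* each segment is scaled by frequency, then by cpu capacity */
	long util_contrib = scale(scale(delta, scale_freq), scale_cpu);

	printf("util contribution: %ld of %ld\n", util_contrib, delta); /* 215 of 1024 */
	return 0;
}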

> 
> So why do you want to change that now ?
> Furthermore, cfs.avg.util_avg is a load whereas sgs->group_util is a
> capacity. The two don't use the same unit or range, which can be
> confusing when you read the code.

[...]


^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-11 10:31                     ` Morten Rasmussen
@ 2015-09-11 17:05                       ` bsegall
  2015-09-11 18:24                         ` Yuyang Du
  2015-09-14 12:56                         ` Morten Rasmussen
  0 siblings, 2 replies; 97+ messages in thread
From: bsegall @ 2015-09-11 17:05 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Yuyang Du, Peter Zijlstra, Dietmar Eggemann, Vincent Guittot,
	Steve Muckle, mingo, daniel.lezcano, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

Morten Rasmussen <morten.rasmussen@arm.com> writes:

> On Fri, Sep 11, 2015 at 08:28:25AM +0800, Yuyang Du wrote:
>> On Thu, Sep 10, 2015 at 12:07:27PM +0200, Peter Zijlstra wrote:
>> > > > Still don't understand why it's a unit problem. IMHO LOAD/UTIL and
>> > > > CAPACITY have no unit.
>> > > 
>> > > To be more accurate, probably, LOAD can be thought of as having unit,
>> > > but UTIL has no unit.
>> > 
>> > But I'm thinking that is wrong; it should have one, esp. if we go scale
>> > the thing. Giving it the same fixed point unit as load simplifies the
>> > code.
>> 
>> I think we probably are saying the same thing with different terms. Anyway,
>> let me reiterate what I said and make it a little more formalized.
>> 
>> UTIL has no unit because it is a pure ratio, the cpu_running%, which is in the
>> range [0, 100%]. We increase the resolution, because we don't want to lose
>> much (due to integer rounding), by multiplying by a number (say 1024), so
>> the range becomes [0, 1024].
>
> Fully agree, and with frequency invariance we basically scale running
> time to take into account that the cpu might be running slower than it
> is capable of at the highest frequency. With cpu invariance we also scale
> by any difference there might be in max frequency and/or cpu
> micro-architecture so utilization becomes comparable between cpus. One
> can also see it as slowing down or speeding up time depending on the
> current compute capacity of the cpu relative to the max capacity.
>
>> CAPACITY is also a ratio of ACTUAL_PERF/MAX_PERF, from (0, 1]. Even LOAD
>> is the same, a ratio of NICE_X/NICE_0, from [15/1024=0.015, 88761/1024=86.68],
>> as it only has relative meaning (i.e., when comparing to each other).
>
> Fully agree. Though 'LOAD' is a somewhat overloaded term in the
> scheduler. Just to be clear, you refer to load.weight; load_avg is the
> product of load.weight and the task's runnable time ratio.
>
>> When I said it has a unit, it was in the sense that it looks like currency (for
>> instance, Yuan) that you can use to buy a fair share of the CPU. But that is
>> just one way to look at it and there are certainly many others.
>> 
>> So, I still propose to generalize all these with the following patch, in the
>> belief that this makes it simple and clear, and error-reducing. 
>> 
>> --
>> 
>> Subject: [PATCH] sched/fair: Generalize the load/util averages resolution
>>  definition
>> 
>> An integer metric needs a certain resolution to determine how much detail we
>> can look into (not losing detail to integer rounding), which also
>> determines the range of the metric.
>> 
>> For instance, to increase the resolution of [0, 1] (two levels), one
>> can multiply by 1024 and get [0, 1024] (1025 levels).
>> 
>> In sched/fair, a few metrics depend on the resolution: load/load_avg,
>> util_avg, and capacity (frequency adjustment). In order to reduce the
>> risks of making mistakes relating to resolution/range, we therefore
>> generalize the resolution by defining a basic resolution constant
>> number, and then formalize all metrics to depend on the basic
>> resolution. The basic resolution is 1024 or (1 << 10). Further, one
>> can recursively apply another basic resolution to increase the final
>> resolution (e.g., 1048576 = 1 << 20).
>> 
>> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
>> ---
>>  include/linux/sched.h |  2 +-
>>  kernel/sched/sched.h  | 12 +++++++-----
>>  2 files changed, 8 insertions(+), 6 deletions(-)
>> 
>> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> index 119823d..55a7b93 100644
>> --- a/include/linux/sched.h
>> +++ b/include/linux/sched.h
>> @@ -912,7 +912,7 @@ enum cpu_idle_type {
>>  /*
>>   * Increase resolution of cpu_capacity calculations
>>   */
>> -#define SCHED_CAPACITY_SHIFT	10
>> +#define SCHED_CAPACITY_SHIFT	SCHED_RESOLUTION_SHIFT
>>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
>>  
>>  /*
>> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> index 68cda11..d27cdd8 100644
>> --- a/kernel/sched/sched.h
>> +++ b/kernel/sched/sched.h
>> @@ -40,6 +40,9 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
>>   */
>>  #define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
>>  
>> +# define SCHED_RESOLUTION_SHIFT	10
>> +# define SCHED_RESOLUTION_SCALE	(1L << SCHED_RESOLUTION_SHIFT)
>> +
>>  /*
>>   * Increase resolution of nice-level calculations for 64-bit architectures.
>>   * The extra resolution improves shares distribution and load balancing of
>> @@ -53,16 +56,15 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
>>   * increased costs.
>>   */
>>  #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
>> -# define SCHED_LOAD_RESOLUTION	10
>> -# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
>> -# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
>> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
>> +# define scale_load(w)		((w) << SCHED_RESOLUTION_SHIFT)
>> +# define scale_load_down(w)	((w) >> SCHED_RESOLUTION_SHIFT)
>>  #else
>> -# define SCHED_LOAD_RESOLUTION	0
>> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT)
>>  # define scale_load(w)		(w)
>>  # define scale_load_down(w)	(w)
>>  #endif
>>  
>> -#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
>>  #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
>>  
>>  #define NICE_0_LOAD		SCHED_LOAD_SCALE
>
> I think this is pretty much the required relationship between all the
> SHIFTs and SCALEs that Peter checked for in his #if-#error thing
> earlier, so no disagreements from my side :-)
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to majordomo@vger.kernel.org
> More majordomo info at  http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at  http://www.tux.org/lkml/

SCHED_LOAD_RESOLUTION and the non-SLR part of SCHED_LOAD_SHIFT are not
required to be the same value and should not be conflated.

In particular, since cgroups are on the same timeline as tasks and their
shares are not scaled by SCHED_LOAD_SHIFT in any way (but are scaled so
that SCHED_LOAD_RESOLUTION is invisible), changing that part of
SCHED_LOAD_SHIFT would cause issues, since things can assume that nice-0
= 1024. However changing SCHED_LOAD_RESOLUTION would be fine, as that is
an internal value to the kernel.

In addition, changing the non-SLR part of SCHED_LOAD_SHIFT would require
recomputing all of prio_to_weight/wmult for the new NICE_0_LOAD.
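
(To make the "invisible to user space" part concrete — a standalone sketch,
not kernel code, using the 64-bit variant of the macros from the patch quoted
above:)

#include <stdio.h>

#define SCHED_RESOLUTION_SHIFT	10
#define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
#define NICE_0_LOAD		(1L << SCHED_LOAD_SHIFT)
#define scale_load(w)		((w) << SCHED_RESOLUTION_SHIFT)
#define scale_load_down(w)	((w) >> SCHED_RESOLUTION_SHIFT)

int main(void)
{
	long shares = 1024;	/* nice-0 sized cgroup share set from user space */
	long internal = scale_load(shares);	/* carries the extra resolution */

	/* the internal value equals NICE_0_LOAD, user space still sees 1024 */
	printf("internal=%ld (NICE_0_LOAD=%ld) user-visible=%ld\n",
	       internal, NICE_0_LOAD, scale_load_down(internal));
	return 0;
}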

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-09 11:13                       ` Morten Rasmussen
@ 2015-09-11 17:22                         ` Morten Rasmussen
  2015-09-17  9:51                           ` Peter Zijlstra
  2015-09-17 10:38                           ` Peter Zijlstra
  0 siblings, 2 replies; 97+ messages in thread
From: Morten Rasmussen @ 2015-09-11 17:22 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, Dietmar Eggemann, Steve Muckle, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Wed, Sep 09, 2015 at 12:13:10PM +0100, Morten Rasmussen wrote:
> On Wed, Sep 09, 2015 at 11:43:05AM +0200, Peter Zijlstra wrote:
> > Sadly that makes the code worse; I get 14 mul instructions where
> > previously I had 11.
> > 
> > What happens is that GCC gets confused and cannot constant propagate the
> > new variables, so what used to be shifts now end up being actual
> > multiplications.
> > 
> > With this, I get back to 11. Can you see what happens on ARM where you
> > have both functions defined to non constants?
> 
> We repeated the experiment on arm and arm64 but still with the functions
> defined to constants to compare with your results. The mul instruction
> count seems to be somewhat compiler version dependent, but consistently
> shows no effect of the patch:
> 
> arm	before	after
> gcc4.9	12	12
> gcc4.8	10	10
> 
> arm64	before	after
> gcc4.9	11	11
> 
> I will get numbers with the arch-functions implemented as well and do
> hackbench runs to see what happens in terms of performance.

I have done some runs with the proposed fixes added:

1. PeterZ's util_sum shift fix (change util_sum).
2. Morten's scaling of weight instead of time (reduce bit loss).
3. PeterZ's unconditional calls to arch*() functions (compiler opt).

To be clear: 2 includes 1, and 3 includes 1 and 2.

Runs were done with the default (#define) implementation of the
arch-functions and with an arch-specific implementation for ARM.

I realized that just looking for 'mul' instructions in
update_blocked_averages() is probably not a fair comparison on ARM as it
turned out that it has quite a few multiply-accumulate instructions. So
I have included the total count including those too.


Test platforms:

ARM TC2 (A7x3 only)
perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 200
#mul: grep -e mul (in update_blocked_averages())
#mul_all: grep -e mul -e mla -e mls -e mia (in update_blocked_averages())
gcc: 4.9.3

Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
perf stat --null --repeat 10 -- perf bench sched messaging -g 50 -l 15000
#mul: grep -e mul (in update_blocked_averages())
gcc: 4.9.2


Results:

perf numbers are average of three (x10) runs. Raw data is available
further down.

ARM TC2		#mul		#mul_all	perf bench
arch*()		default	arm	default	arm	default	arm

1 shift_fix	10	16	22	36	13.401	13.288
2 scaled_weight	12	14	30	32	13.282	13.238
3 unconditional	12	14	26	32	13.296	13.427

Intel E5-2690	#mul		#mul_all	perf bench
arch*()		default		default		default

1 shift_fix	13				14.786
2 scaled_weight	18				15.078
3 unconditional	14				15.195


Overall it appears that fewer 'mul' instructions don't necessarily
mean a better perf bench score. For ARM, 2 seems the best choice overall,
while 1 is better for Intel. If we want to try to avoid the bit loss by
scaling weight instead of time, 2 is best for both. However, all that
said, looking at the raw numbers there is a significant difference
between runs of perf --repeat, so we can't really draw any strong
conclusions. It all appears to be in the noise.

I suggest that I spin a v2 of this series and go with scaled_weight to
reduce bit loss. Any objections?

While at it, should I include Yuyang's patch redefining the SCALE/SHIFT
mess?


Raw numbers:

ARM TC2

shift_fix	default_arch
gcc4.9.3
#mul 10
#mul+mla+mls+mia 22
13.384416727 seconds time elapsed ( +-  0.17% )
13.431014702 seconds time elapsed ( +-  0.18% )
13.387434890 seconds time elapsed ( +-  0.15% )

shift_fix	arm_arch
gcc4.9.3
#mul 16
#mul+mla+mls+mia 36
13.271044081 seconds time elapsed ( +-  0.11% )
13.310189123 seconds time elapsed ( +-  0.19% )
13.283594740 seconds time elapsed ( +-  0.12% )

scaled_weight	default_arch
gcc4.9.3
#mul 12
#mul+mla+mls+mia 30
13.295649553 seconds time elapsed ( +-  0.20% )
13.271634654 seconds time elapsed ( +-  0.19% )
13.280081329 seconds time elapsed ( +-  0.14% )

scaled_weight	arm_arch
gcc4.9.3
#mul 14
#mul+mla+mls+mia 32
13.230659223 seconds time elapsed ( +-  0.15% )
13.222276527 seconds time elapsed ( +-  0.15% )
13.260275081 seconds time elapsed ( +-  0.21% )

unconditional	default_arch
gcc4.9.3
#mul 12
#mul+mla+mls+mia 26
13.274904460 seconds time elapsed ( +-  0.13% )
13.307853511 seconds time elapsed ( +-  0.15% )
13.304084844 seconds time elapsed ( +-  0.22% )

unconditional	arm_arch
gcc4.9.3
#mul 14
#mul+mla+mls+mia 32
13.432878577 seconds time elapsed ( +-  0.13% )
13.417950552 seconds time elapsed ( +-  0.12% )
13.431682719 seconds time elapsed ( +-  0.18% )


Intel

shift_fix	default_arch
gcc4.9.2
#mul 13
14.905815416 seconds time elapsed ( +-  0.61% )
14.811113694 seconds time elapsed ( +-  0.84% )
14.639739309 seconds time elapsed ( +-  0.76% )

scaled_weight	default_arch
gcc4.9.2
#mul 18
15.113275474 seconds time elapsed ( +-  0.64% )
15.056777680 seconds time elapsed ( +-  0.44% )
15.064074416 seconds time elapsed ( +-  0.71% )

unconditional	default_arch
gcc4.9.2
#mul 14
15.105152500 seconds time elapsed ( +-  0.71% )
15.346405473 seconds time elapsed ( +-  0.81% )
15.132933523 seconds time elapsed ( +-  0.82% )

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-11 17:05                       ` bsegall
@ 2015-09-11 18:24                         ` Yuyang Du
  2015-09-14 17:36                           ` bsegall
  2015-09-14 12:56                         ` Morten Rasmussen
  1 sibling, 1 reply; 97+ messages in thread
From: Yuyang Du @ 2015-09-11 18:24 UTC (permalink / raw)
  To: bsegall
  Cc: Morten Rasmussen, Peter Zijlstra, Dietmar Eggemann,
	Vincent Guittot, Steve Muckle, mingo, daniel.lezcano, mturquette,
	rjw, Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

On Fri, Sep 11, 2015 at 10:05:53AM -0700, bsegall@google.com wrote:
> 
> SCHED_LOAD_RESOLUTION and the non-SLR part of SCHED_LOAD_SHIFT are not
> required to be the same value and should not be conflated.
 
> In particular, since cgroups are on the same timeline as tasks and their
> shares are not scaled by SCHED_LOAD_SHIFT in any way (but are scaled so
> that SCHED_LOAD_RESOLUTION is invisible), changing that part of
> SCHED_LOAD_SHIFT would cause issues, since things can assume that nice-0
> = 1024. However changing SCHED_LOAD_RESOLUTION would be fine, as that is
> an internal value to the kernel.
>
> In addition, changing the non-SLR part of SCHED_LOAD_SHIFT would require
> recomputing all of prio_to_weight/wmult for the new NICE_0_LOAD.

I have not fully looked into the concerns, but the new SCHED_RESOLUTION_SHIFT
is intended to formalize all the integer metrics that need better resolution.
It is not special to any metric, so it actually de-conflates whatever is
currently conflated.
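
(A standalone illustration, not kernel code, of the resolution idea described
in the patch: representing a ratio from [0, 1] as an integer in [0, 1024] so
that integer rounding loses far less detail:)

#include <stdio.h>

#define SCHED_RESOLUTION_SHIFT	10
#define SCHED_RESOLUTION_SCALE	(1L << SCHED_RESOLUTION_SHIFT)

int main(void)
{
	long running = 3, period = 7;	/* made-up running/period ratio */

	/* plain integer division throws all the detail away */
	printf("unscaled: %ld\n", running / period);				/* 0 */
	/* scaling by the resolution keeps ~10 bits of it */
	printf("scaled:   %ld\n", running * SCHED_RESOLUTION_SCALE / period);	/* 438 */
	return 0;
}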

^ permalink raw reply	[flat|nested] 97+ messages in thread

* [tip:sched/core] sched/fair: Make load tracking frequency scale-invariant
  2015-08-14 16:23 ` [PATCH 1/6] sched/fair: Make load tracking frequency scale-invariant Morten Rasmussen
@ 2015-09-13 11:03   ` tip-bot for Dietmar Eggemann
  0 siblings, 0 replies; 97+ messages in thread
From: tip-bot for Dietmar Eggemann @ 2015-09-13 11:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, Dietmar.Eggemann, tglx, linux-kernel, hpa, peterz, efault,
	torvalds, vincent.guittot, dietmar.eggemann, Juri.Lelli,
	morten.rasmussen

Commit-ID:  e0f5f3afd2cffa96291cd852056d83ff4e2e99c7
Gitweb:     http://git.kernel.org/tip/e0f5f3afd2cffa96291cd852056d83ff4e2e99c7
Author:     Dietmar Eggemann <dietmar.eggemann@arm.com>
AuthorDate: Fri, 14 Aug 2015 17:23:09 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 13 Sep 2015 09:52:55 +0200

sched/fair: Make load tracking frequency scale-invariant

Apply frequency scaling correction factor to per-entity load tracking to
make it frequency invariant. Currently, load appears bigger when the CPU
is running slower which affects load-balancing decisions.

Each segment of the sched_avg.load_sum geometric series is now scaled by
the current frequency so that the sched_avg.load_avg of each sched entity
will be invariant from frequency scaling.

Moreover, cfs_rq.runnable_load_sum is scaled by the current frequency as
well.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-2-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h |  6 +++---
 kernel/sched/fair.c   | 27 +++++++++++++++++----------
 2 files changed, 20 insertions(+), 13 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index a4ab9da..c8d923b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1177,9 +1177,9 @@ struct load_weight {
 
 /*
  * The load_avg/util_avg accumulates an infinite geometric series.
- * 1) load_avg factors the amount of time that a sched_entity is
- * runnable on a rq into its weight. For cfs_rq, it is the aggregated
- * such weights of all runnable and blocked sched_entities.
+ * 1) load_avg factors frequency scaling into the amount of time that a
+ * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
+ * aggregated such weights of all runnable and blocked sched_entities.
  * 2) util_avg factors frequency scaling into the amount of time
  * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
  * For cfs_rq, it is the aggregated such times of all runnable and
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 47ece22..86cb27c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2515,6 +2515,8 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
+#define scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -2547,9 +2549,9 @@ static __always_inline int
 __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, periods;
+	u64 delta, scaled_delta, periods;
 	u32 contrib;
-	int delta_w, decayed = 0;
+	int delta_w, scaled_delta_w, decayed = 0;
 	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
 
 	delta = now - sa->last_update_time;
@@ -2585,13 +2587,16 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
+		scaled_delta_w = scale(delta_w, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * delta_w;
-			if (cfs_rq)
-				cfs_rq->runnable_load_sum += weight * delta_w;
+			sa->load_sum += weight * scaled_delta_w;
+			if (cfs_rq) {
+				cfs_rq->runnable_load_sum +=
+						weight * scaled_delta_w;
+			}
 		}
 		if (running)
-			sa->util_sum += delta_w * scale_freq >> SCHED_CAPACITY_SHIFT;
+			sa->util_sum += scaled_delta_w;
 
 		delta -= delta_w;
 
@@ -2608,23 +2613,25 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		contrib = __compute_runnable_contrib(periods);
+		contrib = scale(contrib, scale_freq);
 		if (weight) {
 			sa->load_sum += weight * contrib;
 			if (cfs_rq)
 				cfs_rq->runnable_load_sum += weight * contrib;
 		}
 		if (running)
-			sa->util_sum += contrib * scale_freq >> SCHED_CAPACITY_SHIFT;
+			sa->util_sum += contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
+	scaled_delta = scale(delta, scale_freq);
 	if (weight) {
-		sa->load_sum += weight * delta;
+		sa->load_sum += weight * scaled_delta;
 		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * delta;
+			cfs_rq->runnable_load_sum += weight * scaled_delta;
 	}
 	if (running)
-		sa->util_sum += delta * scale_freq >> SCHED_CAPACITY_SHIFT;
+		sa->util_sum += scaled_delta;
 
 	sa->period_contrib += delta;
 

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [tip:sched/core] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define
  2015-08-14 16:23 ` [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define Morten Rasmussen
  2015-09-02  9:31   ` Vincent Guittot
@ 2015-09-13 11:03   ` tip-bot for Morten Rasmussen
  1 sibling, 0 replies; 97+ messages in thread
From: tip-bot for Morten Rasmussen @ 2015-09-13 11:03 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: mingo, peterz, Dietmar.Eggemann, efault, tglx, Juri.Lelli,
	morten.rasmussen, linux-kernel, hpa, torvalds

Commit-ID:  8cd5601c50603caa195ce86cc465cb04079ed488
Gitweb:     http://git.kernel.org/tip/8cd5601c50603caa195ce86cc465cb04079ed488
Author:     Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate: Fri, 14 Aug 2015 17:23:10 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 13 Sep 2015 09:52:55 +0200

sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define

Bring arch_scale_cpu_capacity() in line with the recent change of its
arch_scale_freq_capacity() sibling in commit dfbca41f3479 ("sched:
Optimize freq invariant accounting") from weak function to #define to
allow inlining of the function.

While at it, remove the ARCH_CAPACITY sched_feature as well. With the
change to #define there isn't a straightforward way to allow runtime
switch between an arch implementation and the default implementation of
arch_scale_cpu_capacity() using sched_feature. The default was to use
the arch-specific implementation, but only the arm architecture provides
one and that is essentially equivalent to the default implementation.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-3-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c     | 22 +---------------------
 kernel/sched/features.h |  5 -----
 kernel/sched/sched.h    | 11 +++++++++++
 3 files changed, 12 insertions(+), 26 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 86cb27c..102cdf1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6054,19 +6054,6 @@ static inline int get_sd_load_idx(struct sched_domain *sd,
 	return load_idx;
 }
 
-static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
-	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
-		return sd->smt_gain / sd->span_weight;
-
-	return SCHED_CAPACITY_SCALE;
-}
-
-unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
-	return default_scale_cpu_capacity(sd, cpu);
-}
-
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -6096,16 +6083,9 @@ static unsigned long scale_rt_capacity(int cpu)
 
 static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	unsigned long capacity = SCHED_CAPACITY_SCALE;
+	unsigned long capacity = arch_scale_cpu_capacity(sd, cpu);
 	struct sched_group *sdg = sd->groups;
 
-	if (sched_feat(ARCH_CAPACITY))
-		capacity *= arch_scale_cpu_capacity(sd, cpu);
-	else
-		capacity *= default_scale_cpu_capacity(sd, cpu);
-
-	capacity >>= SCHED_CAPACITY_SHIFT;
-
 	cpu_rq(cpu)->cpu_capacity_orig = capacity;
 
 	capacity *= scale_rt_capacity(cpu);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index edf5902..69631fa 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -36,11 +36,6 @@ SCHED_FEAT(CACHE_HOT_BUDDY, true)
  */
 SCHED_FEAT(WAKEUP_PREEMPTION, true)
 
-/*
- * Use arch dependent cpu capacity functions
- */
-SCHED_FEAT(ARCH_CAPACITY, true)
-
 SCHED_FEAT(HRTICK, false)
 SCHED_FEAT(DOUBLE_TICK, false)
 SCHED_FEAT(LB_BIAS, true)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2e8530d0..c0726d5 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1394,6 +1394,17 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 }
 #endif
 
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+		return sd->smt_gain / sd->span_weight;
+
+	return SCHED_CAPACITY_SCALE;
+}
+#endif
+
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
 	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [tip:sched/core] sched/fair: Make utilization tracking CPU scale-invariant
  2015-08-14 23:04   ` Dietmar Eggemann
  2015-09-04  7:52     ` Vincent Guittot
@ 2015-09-13 11:04     ` tip-bot for Dietmar Eggemann
  1 sibling, 0 replies; 97+ messages in thread
From: tip-bot for Dietmar Eggemann @ 2015-09-13 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: morten.rasmussen, efault, peterz, hpa, mingo, tglx, linux-kernel,
	torvalds, dietmar.eggemann

Commit-ID:  e3279a2e6d697e00e74f905851ee7cf532f72b2d
Gitweb:     http://git.kernel.org/tip/e3279a2e6d697e00e74f905851ee7cf532f72b2d
Author:     Dietmar Eggemann <dietmar.eggemann@arm.com>
AuthorDate: Sat, 15 Aug 2015 00:04:41 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 13 Sep 2015 09:52:56 +0200

sched/fair: Make utilization tracking CPU scale-invariant

Besides the existing frequency scale-invariance correction factor, apply
CPU scale-invariance correction factor to utilization tracking to
compensate for any differences in compute capacity. This could be due to
micro-architectural differences (i.e. instructions per second) between
cpus in HMP systems (e.g. big.LITTLE), and/or differences in the current
maximum frequency supported by individual cpus in SMP systems. In the
existing implementation utilization isn't comparable between cpus as it
is relative to the capacity of each individual CPU.

Each segment of the sched_avg.util_sum geometric series is now scaled
by the CPU performance factor too so the sched_avg.util_avg of each
sched entity will be invariant from the particular CPU of the HMP/SMP
system on which the sched entity is scheduled.

With this patch, the utilization of a CPU stays relative to the max CPU
performance of the fastest CPU in the system.

In contrast to utilization (sched_avg.util_sum), load
(sched_avg.load_sum) should not be scaled by compute capacity. The
utilization metric is based on running time which only makes sense when
cpus are _not_ fully utilized (utilization cannot go beyond 100% even if
more tasks are added), where load is runnable time which isn't limited
by the capacity of the CPU and therefore is a better metric for
overloaded scenarios. If we run two nice-0 busy loops on two cpus with
different compute capacity their load should be similar since their
compute demands are the same. We have to assume that the compute demand
of any task running on a fully utilized CPU (no spare cycles = 100%
utilization) is high and the same no matter of the compute capacity of
its current CPU, hence we shouldn't scale load by CPU capacity.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Link: http://lkml.kernel.org/r/55CE7409.1000700@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 include/linux/sched.h | 2 +-
 kernel/sched/fair.c   | 7 ++++---
 kernel/sched/sched.h  | 2 +-
 3 files changed, 6 insertions(+), 5 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index c8d923b..bd38b3e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1180,7 +1180,7 @@ struct load_weight {
  * 1) load_avg factors frequency scaling into the amount of time that a
  * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
  * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency scaling into the amount of time
+ * 2) util_avg factors frequency and cpu scaling into the amount of time
  * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
  * For cfs_rq, it is the aggregated such times of all runnable and
  * blocked sched_entities.
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 102cdf1..573dc98 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2553,6 +2553,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	u32 contrib;
 	int delta_w, scaled_delta_w, decayed = 0;
 	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
+	unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2596,7 +2597,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 			}
 		}
 		if (running)
-			sa->util_sum += scaled_delta_w;
+			sa->util_sum += scale(scaled_delta_w, scale_cpu);
 
 		delta -= delta_w;
 
@@ -2620,7 +2621,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 				cfs_rq->runnable_load_sum += weight * contrib;
 		}
 		if (running)
-			sa->util_sum += contrib;
+			sa->util_sum += scale(contrib, scale_cpu);
 	}
 
 	/* Remainder of delta accrued against u_0` */
@@ -2631,7 +2632,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 			cfs_rq->runnable_load_sum += weight * scaled_delta;
 	}
 	if (running)
-		sa->util_sum += scaled_delta;
+		sa->util_sum += scale(scaled_delta, scale_cpu);
 
 	sa->period_contrib += delta;
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c0726d5..167ab48 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1398,7 +1398,7 @@ unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 static __always_inline
 unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
 		return sd->smt_gain / sd->span_weight;
 
 	return SCHED_CAPACITY_SCALE;

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [tip:sched/core] sched/fair: Name utilization related data and functions consistently
  2015-08-14 16:23 ` [PATCH 4/6] sched/fair: Name utilization related data and functions consistently Morten Rasmussen
  2015-09-04  9:08   ` Vincent Guittot
@ 2015-09-13 11:04   ` tip-bot for Dietmar Eggemann
  1 sibling, 0 replies; 97+ messages in thread
From: tip-bot for Dietmar Eggemann @ 2015-09-13 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: efault, dietmar.eggemann, mingo, peterz, Dietmar.Eggemann,
	morten.rasmussen, tglx, hpa, Juri.Lelli, linux-kernel, torvalds

Commit-ID:  9e91d61d9b0ca8d865dbd59af8d0d5c5b68003e9
Gitweb:     http://git.kernel.org/tip/9e91d61d9b0ca8d865dbd59af8d0d5c5b68003e9
Author:     Dietmar Eggemann <dietmar.eggemann@arm.com>
AuthorDate: Fri, 14 Aug 2015 17:23:12 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 13 Sep 2015 09:52:57 +0200

sched/fair: Name utilization related data and functions consistently

Use the advent of the per-entity load tracking rewrite to streamline the
naming of utilization related data and functions by using
{prefix_}util{_suffix} consistently. Moreover call both signals
({se,cfs}.avg.util_avg) utilization.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-5-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 37 +++++++++++++++++++------------------
 1 file changed, 19 insertions(+), 18 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 573dc98..1b56d63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4863,31 +4863,32 @@ done:
 	return target;
 }
 /*
- * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * cpu_util returns the amount of capacity of a CPU that is used by CFS
  * tasks. The unit of the return value must be the one of capacity so we can
- * compare the usage with the capacity of the CPU that is available for CFS
- * task (ie cpu_capacity).
+ * compare the utilization with the capacity of the CPU that is available for
+ * CFS task (ie cpu_capacity).
  * cfs.avg.util_avg is the sum of running time of runnable tasks on a
  * CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE].  The usage of a CPU can't be higher than the full
- * capacity of the CPU because it's about the running time on this CPU.
+ * [0..SCHED_LOAD_SCALE]. The utilization of a CPU can't be higher than the
+ * full capacity of the CPU because it's about the running time on this CPU.
  * Nevertheless, cfs.avg.util_avg can be higher than SCHED_LOAD_SCALE
  * because of unfortunate rounding in util_avg or just
  * after migrating tasks until the average stabilizes with the new running
- * time. So we need to check that the usage stays into the range
+ * time. So we need to check that the utilization stays into the range
  * [0..cpu_capacity_orig] and cap if necessary.
- * Without capping the usage, a group could be seen as overloaded (CPU0 usage
- * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity
+ * Without capping the utilization, a group could be seen as overloaded (CPU0
+ * utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
+ * available capacity.
  */
-static int get_cpu_usage(int cpu)
+static int cpu_util(int cpu)
 {
-	unsigned long usage = cpu_rq(cpu)->cfs.avg.util_avg;
+	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
 	unsigned long capacity = capacity_orig_of(cpu);
 
-	if (usage >= SCHED_LOAD_SCALE)
+	if (util >= SCHED_LOAD_SCALE)
 		return capacity;
 
-	return (usage * capacity) >> SCHED_LOAD_SHIFT;
+	return (util * capacity) >> SCHED_LOAD_SHIFT;
 }
 
 /*
@@ -5979,7 +5980,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
-	unsigned long group_usage; /* Total usage of the group */
+	unsigned long group_util; /* Total utilization of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int idle_cpus;
 	unsigned int group_weight;
@@ -6212,8 +6213,8 @@ static inline int sg_imbalanced(struct sched_group *group)
  * group_has_capacity returns true if the group has spare capacity that could
  * be used by some tasks.
  * We consider that a group has spare capacity if the  * number of task is
- * smaller than the number of CPUs or if the usage is lower than the available
- * capacity for CFS tasks.
+ * smaller than the number of CPUs or if the utilization is lower than the
+ * available capacity for CFS tasks.
  * For the latter, we use a threshold to stabilize the state, to take into
  * account the variance of the tasks' load and to return true if the available
  * capacity in meaningful for the load balancer.
@@ -6227,7 +6228,7 @@ group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 		return true;
 
 	if ((sgs->group_capacity * 100) >
-			(sgs->group_usage * env->sd->imbalance_pct))
+			(sgs->group_util * env->sd->imbalance_pct))
 		return true;
 
 	return false;
@@ -6248,7 +6249,7 @@ group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
 		return false;
 
 	if ((sgs->group_capacity * 100) <
-			(sgs->group_usage * env->sd->imbalance_pct))
+			(sgs->group_util * env->sd->imbalance_pct))
 		return true;
 
 	return false;
@@ -6296,7 +6297,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
-		sgs->group_usage += get_cpu_usage(i);
+		sgs->group_util += cpu_util(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		if (rq->nr_running > 1)

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [tip:sched/core] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-07 15:37     ` Dietmar Eggemann
  2015-09-07 16:21       ` Vincent Guittot
@ 2015-09-13 11:04       ` tip-bot for Dietmar Eggemann
  1 sibling, 0 replies; 97+ messages in thread
From: tip-bot for Dietmar Eggemann @ 2015-09-13 11:04 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: yuyang.du, vincent.guittot, morten.rasmussen, mingo,
	daniel.lezcano, linux-kernel, sgurrappadi, mturquette, tglx,
	dietmar.eggemann, rjw, torvalds, hpa, pang.xunlei, peterz,
	Juri.Lelli, steve.muckle, efault

Commit-ID:  231678b768da07d19ab5683a39eeb0c250631d02
Gitweb:     http://git.kernel.org/tip/231678b768da07d19ab5683a39eeb0c250631d02
Author:     Dietmar Eggemann <dietmar.eggemann@arm.com>
AuthorDate: Fri, 14 Aug 2015 17:23:13 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 13 Sep 2015 09:52:57 +0200

sched/fair: Get rid of scaling utilization by capacity_orig

Utilization is currently scaled by capacity_orig, but since we now have
frequency and cpu invariant cfs_rq.avg.util_avg, frequency and cpu scaling
now happens as part of the utilization tracking itself.
So cfs_rq.avg.util_avg should no longer be scaled in cpu_util().

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Steve Muckle <steve.muckle@linaro.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org <daniel.lezcano@linaro.org>
Cc: mturquette@baylibre.com <mturquette@baylibre.com>
Cc: pang.xunlei@zte.com.cn <pang.xunlei@zte.com.cn>
Cc: rjw@rjwysocki.net <rjw@rjwysocki.net>
Cc: sgurrappadi@nvidia.com <sgurrappadi@nvidia.com>
Cc: vincent.guittot@linaro.org <vincent.guittot@linaro.org>
Cc: yuyang.du@intel.com <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/55EDAF43.30500@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 38 ++++++++++++++++++++++----------------
 1 file changed, 22 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1b56d63..047fd1c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4862,33 +4862,39 @@ next:
 done:
 	return target;
 }
+
 /*
  * cpu_util returns the amount of capacity of a CPU that is used by CFS
  * tasks. The unit of the return value must be the one of capacity so we can
  * compare the utilization with the capacity of the CPU that is available for
  * CFS task (ie cpu_capacity).
- * cfs.avg.util_avg is the sum of running time of runnable tasks on a
- * CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE]. The utilization of a CPU can't be higher than the
- * full capacity of the CPU because it's about the running time on this CPU.
- * Nevertheless, cfs.avg.util_avg can be higher than SCHED_LOAD_SCALE
- * because of unfortunate rounding in util_avg or just
- * after migrating tasks until the average stabilizes with the new running
- * time. So we need to check that the utilization stays into the range
- * [0..cpu_capacity_orig] and cap if necessary.
- * Without capping the utilization, a group could be seen as overloaded (CPU0
- * utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
- * available capacity.
+ *
+ * cfs_rq.avg.util_avg is the sum of running time of runnable tasks plus the
+ * recent utilization of currently non-runnable tasks on a CPU. It represents
+ * the amount of utilization of a CPU in the range [0..capacity_orig] where
+ * capacity_orig is the cpu_capacity available at the highest frequency
+ * (arch_scale_freq_capacity()).
+ * The utilization of a CPU converges towards a sum equal to or less than the
+ * current capacity (capacity_curr <= capacity_orig) of the CPU because it is
+ * the running time on this CPU scaled by capacity_curr.
+ *
+ * Nevertheless, cfs_rq.avg.util_avg can be higher than capacity_curr or even
+ * higher than capacity_orig because of unfortunate rounding in
+ * cfs.avg.util_avg or just after migrating tasks and new task wakeups until
+ * the average stabilizes with the new running time. We need to check that the
+ * utilization stays within the range of [0..capacity_orig] and cap it if
+ * necessary. Without utilization capping, a group could be seen as overloaded
+ * (CPU0 utilization at 121% + CPU1 utilization at 80%) whereas CPU1 has 20% of
+ * available capacity. We allow utilization to overshoot capacity_curr (but not
+ * capacity_orig) as it useful for predicting the capacity required after task
+ * migrations (scheduler-driven DVFS).
  */
 static int cpu_util(int cpu)
 {
 	unsigned long util = cpu_rq(cpu)->cfs.avg.util_avg;
 	unsigned long capacity = capacity_orig_of(cpu);
 
-	if (util >= SCHED_LOAD_SCALE)
-		return capacity;
-
-	return (util * capacity) >> SCHED_LOAD_SHIFT;
+	return (util >= capacity) ? capacity : util;
 }
 
 /*

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [tip:sched/core] sched/fair: Initialize task load and utilization before placing task on rq
  2015-08-14 16:23 ` [PATCH 6/6] sched/fair: Initialize task load and utilization before placing task on rq Morten Rasmussen
@ 2015-09-13 11:05   ` tip-bot for Morten Rasmussen
  0 siblings, 0 replies; 97+ messages in thread
From: tip-bot for Morten Rasmussen @ 2015-09-13 11:05 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: morten.rasmussen, linux-kernel, peterz, tglx, Juri.Lelli,
	Dietmar.Eggemann, mingo, hpa, efault, mingo, torvalds

Commit-ID:  98d8fd8126676f7ba6e133e65b2ca4b17989d32c
Gitweb:     http://git.kernel.org/tip/98d8fd8126676f7ba6e133e65b2ca4b17989d32c
Author:     Morten Rasmussen <morten.rasmussen@arm.com>
AuthorDate: Fri, 14 Aug 2015 17:23:14 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 13 Sep 2015 09:52:58 +0200

sched/fair: Initialize task load and utilization before placing task on rq

Task load or utilization is not currently considered in
select_task_rq_fair(), but if we want that in the future we should make
sure it is not zero for new tasks.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Dietmar Eggemann <Dietmar.Eggemann@arm.com>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org
Cc: mturquette@baylibre.com
Cc: pang.xunlei@zte.com.cn
Cc: rjw@rjwysocki.net
Cc: sgurrappadi@nvidia.com
Cc: vincent.guittot@linaro.org
Cc: yuyang.du@intel.com
Link: http://lkml.kernel.org/r/1439569394-11974-7-git-send-email-morten.rasmussen@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/core.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index b621271..6ab415a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -2343,6 +2343,8 @@ void wake_up_new_task(struct task_struct *p)
 	struct rq *rq;
 
 	raw_spin_lock_irqsave(&p->pi_lock, flags);
+	/* Initialize new task's runnable average */
+	init_entity_runnable_average(&p->se);
 #ifdef CONFIG_SMP
 	/*
 	 * Fork balancing, do it here and not earlier because:
@@ -2352,8 +2354,6 @@ void wake_up_new_task(struct task_struct *p)
 	set_task_cpu(p, select_task_rq(p, task_cpu(p), SD_BALANCE_FORK, 0));
 #endif
 
-	/* Initialize new task's runnable average */
-	init_entity_runnable_average(&p->se);
 	rq = __task_rq_lock(p);
 	activate_task(rq, p, 0);
 	p->on_rq = TASK_ON_RQ_QUEUED;

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* [tip:sched/core] sched/fair: Defer calling scaling functions
  2015-09-07 14:44     ` Dietmar Eggemann
@ 2015-09-13 11:06       ` tip-bot for Dietmar Eggemann
  0 siblings, 0 replies; 97+ messages in thread
From: tip-bot for Dietmar Eggemann @ 2015-09-13 11:06 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: daniel.lezcano, linux-kernel, mturquette, mingo, rjw,
	sgurrappadi, vincent.guittot, peterz, Juri.Lelli, pang.xunlei,
	efault, dietmar.eggemann, tglx, yuyang.du, torvalds, hpa

Commit-ID:  6f2b04524f0b38bfbb8413f98d2d6af234508309
Gitweb:     http://git.kernel.org/tip/6f2b04524f0b38bfbb8413f98d2d6af234508309
Author:     Dietmar Eggemann <dietmar.eggemann@arm.com>
AuthorDate: Mon, 7 Sep 2015 14:57:22 +0100
Committer:  Ingo Molnar <mingo@kernel.org>
CommitDate: Sun, 13 Sep 2015 09:53:01 +0200

sched/fair: Defer calling scaling functions

Do not call the scaling functions in case time goes backwards or the
last update of the sched_avg structure has happened less than 1024ns
ago.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Cc: Juri Lelli <Juri.Lelli@arm.com>
Cc: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Mike Galbraith <efault@gmx.de>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: daniel.lezcano@linaro.org <daniel.lezcano@linaro.org>
Cc: mturquette@baylibre.com <mturquette@baylibre.com>
Cc: pang.xunlei@zte.com.cn <pang.xunlei@zte.com.cn>
Cc: rjw@rjwysocki.net <rjw@rjwysocki.net>
Cc: sgurrappadi@nvidia.com <sgurrappadi@nvidia.com>
Cc: vincent.guittot@linaro.org <vincent.guittot@linaro.org>
Cc: yuyang.du@intel.com <yuyang.du@intel.com>
Link: http://lkml.kernel.org/r/55EDA2E9.8040900@arm.com
Signed-off-by: Ingo Molnar <mingo@kernel.org>
---
 kernel/sched/fair.c | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c3c5585..fc835fa 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2552,8 +2552,7 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 	u64 delta, scaled_delta, periods;
 	u32 contrib;
 	unsigned int delta_w, scaled_delta_w, decayed = 0;
-	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+	unsigned long scale_freq, scale_cpu;
 
 	delta = now - sa->last_update_time;
 	/*
@@ -2574,6 +2573,9 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
+	scale_freq = arch_scale_freq_capacity(NULL, cpu);
+	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+
 	/* delta_w is the amount already accumulated against our next period */
 	delta_w = sa->period_contrib;
 	if (delta + delta_w >= 1024) {

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-11 17:05                       ` bsegall
  2015-09-11 18:24                         ` Yuyang Du
@ 2015-09-14 12:56                         ` Morten Rasmussen
  2015-09-14 17:34                           ` bsegall
  1 sibling, 1 reply; 97+ messages in thread
From: Morten Rasmussen @ 2015-09-14 12:56 UTC (permalink / raw)
  To: bsegall
  Cc: Yuyang Du, Peter Zijlstra, Dietmar Eggemann, Vincent Guittot,
	Steve Muckle, mingo, daniel.lezcano, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Fri, Sep 11, 2015 at 10:05:53AM -0700, bsegall@google.com wrote:
> Morten Rasmussen <morten.rasmussen@arm.com> writes:
> 
> > On Fri, Sep 11, 2015 at 08:28:25AM +0800, Yuyang Du wrote:
> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> index 119823d..55a7b93 100644
> >> --- a/include/linux/sched.h
> >> +++ b/include/linux/sched.h
> >> @@ -912,7 +912,7 @@ enum cpu_idle_type {
> >>  /*
> >>   * Increase resolution of cpu_capacity calculations
> >>   */
> >> -#define SCHED_CAPACITY_SHIFT	10
> >> +#define SCHED_CAPACITY_SHIFT	SCHED_RESOLUTION_SHIFT
> >>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
> >>  
> >>  /*
> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> index 68cda11..d27cdd8 100644
> >> --- a/kernel/sched/sched.h
> >> +++ b/kernel/sched/sched.h
> >> @@ -40,6 +40,9 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
> >>   */
> >>  #define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
> >>  
> >> +# define SCHED_RESOLUTION_SHIFT	10
> >> +# define SCHED_RESOLUTION_SCALE	(1L << SCHED_RESOLUTION_SHIFT)
> >> +
> >>  /*
> >>   * Increase resolution of nice-level calculations for 64-bit architectures.
> >>   * The extra resolution improves shares distribution and load balancing of
> >> @@ -53,16 +56,15 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
> >>   * increased costs.
> >>   */
> >>  #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
> >> -# define SCHED_LOAD_RESOLUTION	10
> >> -# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
> >> -# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
> >> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
> >> +# define scale_load(w)		((w) << SCHED_RESOLUTION_SHIFT)
> >> +# define scale_load_down(w)	((w) >> SCHED_RESOLUTION_SHIFT)
> >>  #else
> >> -# define SCHED_LOAD_RESOLUTION	0
> >> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT)
> >>  # define scale_load(w)		(w)
> >>  # define scale_load_down(w)	(w)
> >>  #endif
> >>  
> >> -#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
> >>  #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
> >>  
> >>  #define NICE_0_LOAD		SCHED_LOAD_SCALE
> >
> > I think this is pretty much the required relationship between all the
> > SHIFTs and SCALEs that Peter checked for in his #if-#error thing
> > earlier, so no disagreements from my side :-)
> 
> SCHED_LOAD_RESOLUTION and the non-SLR part of SCHED_LOAD_SHIFT are not
> required to be the same value and should not be conflated.
> 
> In particular, since cgroups are on the same timeline as tasks and their
> shares are not scaled by SCHED_LOAD_SHIFT in any way (but are scaled so
> that SCHED_LOAD_RESOLUTION is invisible), changing that part of
> SCHED_LOAD_SHIFT would cause issues, since things can assume that nice-0
> = 1024. However changing SCHED_LOAD_RESOLUTION would be fine, as that is
> an internal value to the kernel.
> 
> In addition, changing the non-SLR part of SCHED_LOAD_SHIFT would require
> recomputing all of prio_to_weight/wmult for the new NICE_0_LOAD.

I think I follow, but doesn't that mean that the current code is broken
too? NICE_0_LOAD changes if you change SCHED_LOAD_RESOLUTION:

#define SCHED_LOAD_SHIFT        (10 + SCHED_LOAD_RESOLUTION)
#define SCHED_LOAD_SCALE        (1L << SCHED_LOAD_SHIFT)

#define NICE_0_LOAD             SCHED_LOAD_SCALE
#define NICE_0_SHIFT            SCHED_LOAD_SHIFT

To me it sounds like we need to define it the other way around:

#define NICE_0_SHIFT            10
#define NICE_0_LOAD             (1L << NICE_0_SHIFT)

and then add any additional resolution bits from there to ensure that
NICE_0_LOAD and the prio_to_weight/wmult tables are unchanged.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-14 12:56                         ` Morten Rasmussen
@ 2015-09-14 17:34                           ` bsegall
  2015-09-14 22:56                             ` Yuyang Du
                                               ` (2 more replies)
  0 siblings, 3 replies; 97+ messages in thread
From: bsegall @ 2015-09-14 17:34 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Yuyang Du, Peter Zijlstra, Dietmar Eggemann, Vincent Guittot,
	Steve Muckle, mingo, daniel.lezcano, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

Morten Rasmussen <morten.rasmussen@arm.com> writes:

> On Fri, Sep 11, 2015 at 10:05:53AM -0700, bsegall@google.com wrote:
>> Morten Rasmussen <morten.rasmussen@arm.com> writes:
>> 
>> > On Fri, Sep 11, 2015 at 08:28:25AM +0800, Yuyang Du wrote:
>> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
>> >> index 119823d..55a7b93 100644
>> >> --- a/include/linux/sched.h
>> >> +++ b/include/linux/sched.h
>> >> @@ -912,7 +912,7 @@ enum cpu_idle_type {
>> >>  /*
>> >>   * Increase resolution of cpu_capacity calculations
>> >>   */
>> >> -#define SCHED_CAPACITY_SHIFT	10
>> >> +#define SCHED_CAPACITY_SHIFT	SCHED_RESOLUTION_SHIFT
>> >>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
>> >>  
>> >>  /*
>> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
>> >> index 68cda11..d27cdd8 100644
>> >> --- a/kernel/sched/sched.h
>> >> +++ b/kernel/sched/sched.h
>> >> @@ -40,6 +40,9 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
>> >>   */
>> >>  #define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
>> >>  
>> >> +# define SCHED_RESOLUTION_SHIFT	10
>> >> +# define SCHED_RESOLUTION_SCALE	(1L << SCHED_RESOLUTION_SHIFT)
>> >> +
>> >>  /*
>> >>   * Increase resolution of nice-level calculations for 64-bit architectures.
>> >>   * The extra resolution improves shares distribution and load balancing of
>> >> @@ -53,16 +56,15 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
>> >>   * increased costs.
>> >>   */
>> >>  #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
>> >> -# define SCHED_LOAD_RESOLUTION	10
>> >> -# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
>> >> -# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
>> >> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
>> >> +# define scale_load(w)		((w) << SCHED_RESOLUTION_SHIFT)
>> >> +# define scale_load_down(w)	((w) >> SCHED_RESOLUTION_SHIFT)
>> >>  #else
>> >> -# define SCHED_LOAD_RESOLUTION	0
>> >> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT)
>> >>  # define scale_load(w)		(w)
>> >>  # define scale_load_down(w)	(w)
>> >>  #endif
>> >>  
>> >> -#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
>> >>  #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
>> >>  
>> >>  #define NICE_0_LOAD		SCHED_LOAD_SCALE
>> >
>> > I think this is pretty much the required relationship between all the
>> > SHIFTs and SCALEs that Peter checked for in his #if-#error thing
>> > earlier, so no disagreements from my side :-)
>> > --
>> > To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
>> > the body of a message to majordomo@vger.kernel.org
>> > More majordomo info at  http://vger.kernel.org/majordomo-info.html
>> > Please read the FAQ at  http://www.tux.org/lkml/
>> 
>> SCHED_LOAD_RESOLUTION and the non-SLR part of SCHED_LOAD_SHIFT are not
>> required to be the same value and should not be conflated.
>> 
>> In particular, since cgroups are on the same timeline as tasks and their
>> shares are not scaled by SCHED_LOAD_SHIFT in any way (but are scaled so
>> that SCHED_LOAD_RESOLUTION is invisible), changing that part of
>> SCHED_LOAD_SHIFT would cause issues, since things can assume that nice-0
>> = 1024. However changing SCHED_LOAD_RESOLUTION would be fine, as that is
>> an internal value to the kernel.
>> 
>> In addition, changing the non-SLR part of SCHED_LOAD_SHIFT would require
>> recomputing all of prio_to_weight/wmult for the new NICE_0_LOAD.
>
> I think I follow, but doesn't that mean that the current code is broken
> too? NICE_0_LOAD changes if you change SCHED_LOAD_RESOLUTION:
>
> #define SCHED_LOAD_SHIFT        (10 + SCHED_LOAD_RESOLUTION)
> #define SCHED_LOAD_SCALE        (1L << SCHED_LOAD_SHIFT)
>
> #define NICE_0_LOAD             SCHED_LOAD_SCALE
> #define NICE_0_SHIFT            SCHED_LOAD_SHIFT
>
> To me it sounds like we need to define it the other way around:
>
> #define NICE_0_SHIFT            10
> #define NICE_0_LOAD             (1L << NICE_0_SHIFT)
>
> and then add any additional resolution bits from there to ensure that
> NICE_0_LOAD and the prio_to_weight/wmult tables are unchanged.

No, NICE_0_LOAD is supposed to be scale_load(prio_to_weight[nice_0]),
ie including SLR. It has never been clear to me what
SCHED_LOAD_SCALE/SCHED_LOAD_SHIFT were for as opposed to NICE_0_LOAD,
and the new utilization uses of it are entirely unlinked to 1024 == NICE_0
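
(For reference, a quick restatement of that relationship using the pre-patch
definitions quoted earlier in the thread; just a sketch, nothing new here:)

#define SCHED_LOAD_RESOLUTION	10	/* 0 when the 64-bit hi-res path is disabled */
#define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
#define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)

#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
#define NICE_0_LOAD		(1L << SCHED_LOAD_SHIFT)

/*
 * prio_to_weight[20] == 1024 (the nice-0 entry), so for either value of
 * SCHED_LOAD_RESOLUTION:
 *
 *	NICE_0_LOAD == scale_load(prio_to_weight[20]) == 1024 << SLR
 *
 * i.e. NICE_0_LOAD always carries the SLR bits, while the user-visible
 * scale_load_down(NICE_0_LOAD) stays 1024.
 */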

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-11 18:24                         ` Yuyang Du
@ 2015-09-14 17:36                           ` bsegall
  0 siblings, 0 replies; 97+ messages in thread
From: bsegall @ 2015-09-14 17:36 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Morten Rasmussen, Peter Zijlstra, Dietmar Eggemann,
	Vincent Guittot, Steve Muckle, mingo, daniel.lezcano, mturquette,
	rjw, Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

Yuyang Du <yuyang.du@intel.com> writes:

> On Fri, Sep 11, 2015 at 10:05:53AM -0700, bsegall@google.com wrote:
>> 
>> SCHED_LOAD_RESOLUTION and the non-SLR part of SCHED_LOAD_SHIFT are not
>> required to be the same value and should not be conflated.
>  
>> In particular, since cgroups are on the same timeline as tasks and their
>> shares are not scaled by SCHED_LOAD_SHIFT in any way (but are scaled so
>> that SCHED_LOAD_RESOLUTION is invisible), changing that part of
>> SCHED_LOAD_SHIFT would cause issues, since things can assume that nice-0
>> = 1024. However changing SCHED_LOAD_RESOLUTION would be fine, as that is
>> an internal value to the kernel.
>>
>> In addition, changing the non-SLR part of SCHED_LOAD_SHIFT would require
>> recomputing all of prio_to_weight/wmult for the new NICE_0_LOAD.
>
> I have not fully looked into the concerns, but the new SCHED_RESOLUTION_SHIFT
> is intended to formalize all the integer metrics that need better resolution.
> It is not special to any metric; rather, it is meant to de-conflate whatever is
> conflated.

It conflates the userspace-invisible SCHED_LOAD_RESOLUTION with the
userspace-visible value of scale_load_down(NICE_0_LOAD). Increasing
SCHED_LOAD_RESOLUTION must not change scale_load_down(NICE_0_LOAD).

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-14 17:34                           ` bsegall
@ 2015-09-14 22:56                             ` Yuyang Du
  2015-09-15 17:11                               ` bsegall
  2015-09-15  8:43                             ` Morten Rasmussen
  2015-09-16 15:36                             ` Peter Zijlstra
  2 siblings, 1 reply; 97+ messages in thread
From: Yuyang Du @ 2015-09-14 22:56 UTC (permalink / raw)
  To: bsegall
  Cc: Morten Rasmussen, Peter Zijlstra, Dietmar Eggemann,
	Vincent Guittot, Steve Muckle, mingo, daniel.lezcano, mturquette,
	rjw, Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

On Mon, Sep 14, 2015 at 10:34:00AM -0700, bsegall@google.com wrote:
> >> SCHED_LOAD_RESOLUTION and the non-SLR part of SCHED_LOAD_SHIFT are not
> >> required to be the same value and should not be conflated.
> >> 
> >> In particular, since cgroups are on the same timeline as tasks and their
> >> shares are not scaled by SCHED_LOAD_SHIFT in any way (but are scaled so
> >> that SCHED_LOAD_RESOLUTION is invisible), changing that part of
> >> SCHED_LOAD_SHIFT would cause issues, since things can assume that nice-0
> >> = 1024. However changing SCHED_LOAD_RESOLUTION would be fine, as that is
> >> an internal value to the kernel.
> >> 
> >> In addition, changing the non-SLR part of SCHED_LOAD_SHIFT would require
> >> recomputing all of prio_to_weight/wmult for the new NICE_0_LOAD.
> >
> > I think I follow, but doesn't that mean that the current code is broken
> > too? NICE_0_LOAD changes if you change SCHED_LOAD_RESOLUTION:
> >
> > #define SCHED_LOAD_SHIFT        (10 + SCHED_LOAD_RESOLUTION)
> > #define SCHED_LOAD_SCALE        (1L << SCHED_LOAD_SHIFT)
> >
> > #define NICE_0_LOAD             SCHED_LOAD_SCALE
> > #define NICE_0_SHIFT            SCHED_LOAD_SHIFT
> >
> > To me it sounds like we need to define it the other way around:
> >
> > #define NICE_0_SHIFT            10
> > #define NICE_0_LOAD             (1L << NICE_0_SHIFT)
> >
> > and then add any additional resolution bits from there to ensure that
> > NICE_0_LOAD and the prio_to_weight/wmult tables are unchanged.
> 
> No, NICE_0_LOAD is supposed to be scale_load(prio_to_weight[nice_0]),
> ie including SLR. It has never been clear to me what
> SCHED_LOAD_SCALE/SCHED_LOAD_SHIFT were for as opposed to NICE_0_LOAD,
> and the new utilization uses of it are entirely unlinked to 1024 == NICE_0

Presume your SLR means SCHED_LOAD_RESOLUTION:

1) The introduction of (not redefinition of) SCHED_RESOLUTION_SHIFT does not
change anything after macro expansion.

2) The constants in prio_to_weight[] and prio_to_wmult[] are tied to a
resolution of 10-bit NICE_0, i.e., 1024. I guess that is the user-visible
part you mentioned, as is the cgroup share.

To me, it is all ok. With the SCHED_RESOLUTION_SHIFT, the basic resolution
unit, it is just for us to state clearly, the NICE_0's weight has a fixed
resolution of SCHED_RESOLUTION_SHIFT, or even add this:

#if prio_to_weight[20] != 1 << SCHED_RESOLUTION_SHIFT
error "NICE_0 weight not calibrated"
#endif
/* I can learn, Peter */

I guess you are saying we are conflating NICE_0 with NICE_0_LOAD. But to me,
they are just integer metrics, needing a resolution respectively. That is it.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-14 17:34                           ` bsegall
  2015-09-14 22:56                             ` Yuyang Du
@ 2015-09-15  8:43                             ` Morten Rasmussen
  2015-09-16 15:36                             ` Peter Zijlstra
  2 siblings, 0 replies; 97+ messages in thread
From: Morten Rasmussen @ 2015-09-15  8:43 UTC (permalink / raw)
  To: bsegall
  Cc: Yuyang Du, Peter Zijlstra, Dietmar Eggemann, Vincent Guittot,
	Steve Muckle, mingo, daniel.lezcano, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Mon, Sep 14, 2015 at 10:34:00AM -0700, bsegall@google.com wrote:
> Morten Rasmussen <morten.rasmussen@arm.com> writes:
> 
> > On Fri, Sep 11, 2015 at 10:05:53AM -0700, bsegall@google.com wrote:
> >> Morten Rasmussen <morten.rasmussen@arm.com> writes:
> >> 
> >> > On Fri, Sep 11, 2015 at 08:28:25AM +0800, Yuyang Du wrote:
> >> >> diff --git a/include/linux/sched.h b/include/linux/sched.h
> >> >> index 119823d..55a7b93 100644
> >> >> --- a/include/linux/sched.h
> >> >> +++ b/include/linux/sched.h
> >> >> @@ -912,7 +912,7 @@ enum cpu_idle_type {
> >> >>  /*
> >> >>   * Increase resolution of cpu_capacity calculations
> >> >>   */
> >> >> -#define SCHED_CAPACITY_SHIFT	10
> >> >> +#define SCHED_CAPACITY_SHIFT	SCHED_RESOLUTION_SHIFT
> >> >>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
> >> >>  
> >> >>  /*
> >> >> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> >> >> index 68cda11..d27cdd8 100644
> >> >> --- a/kernel/sched/sched.h
> >> >> +++ b/kernel/sched/sched.h
> >> >> @@ -40,6 +40,9 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
> >> >>   */
> >> >>  #define NS_TO_JIFFIES(TIME)	((unsigned long)(TIME) / (NSEC_PER_SEC / HZ))
> >> >>  
> >> >> +# define SCHED_RESOLUTION_SHIFT	10
> >> >> +# define SCHED_RESOLUTION_SCALE	(1L << SCHED_RESOLUTION_SHIFT)
> >> >> +
> >> >>  /*
> >> >>   * Increase resolution of nice-level calculations for 64-bit architectures.
> >> >>   * The extra resolution improves shares distribution and load balancing of
> >> >> @@ -53,16 +56,15 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
> >> >>   * increased costs.
> >> >>   */
> >> >>  #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
> >> >> -# define SCHED_LOAD_RESOLUTION	10
> >> >> -# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
> >> >> -# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
> >> >> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
> >> >> +# define scale_load(w)		((w) << SCHED_RESOLUTION_SHIFT)
> >> >> +# define scale_load_down(w)	((w) >> SCHED_RESOLUTION_SHIFT)
> >> >>  #else
> >> >> -# define SCHED_LOAD_RESOLUTION	0
> >> >> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT)
> >> >>  # define scale_load(w)		(w)
> >> >>  # define scale_load_down(w)	(w)
> >> >>  #endif
> >> >>  
> >> >> -#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
> >> >>  #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
> >> >>  
> >> >>  #define NICE_0_LOAD		SCHED_LOAD_SCALE
> >> >
> >> > I think this is pretty much the required relationship between all the
> >> > SHIFTs and SCALEs that Peter checked for in his #if-#error thing
> >> > earlier, so no disagreements from my side :-)
> >> 
> >> SCHED_LOAD_RESOLUTION and the non-SLR part of SCHED_LOAD_SHIFT are not
> >> required to be the same value and should not be conflated.
> >> 
> >> In particular, since cgroups are on the same timeline as tasks and their
> >> shares are not scaled by SCHED_LOAD_SHIFT in any way (but are scaled so
> >> that SCHED_LOAD_RESOLUTION is invisible), changing that part of
> >> SCHED_LOAD_SHIFT would cause issues, since things can assume that nice-0
> >> = 1024. However changing SCHED_LOAD_RESOLUTION would be fine, as that is
> >> an internal value to the kernel.
> >> 
> >> In addition, changing the non-SLR part of SCHED_LOAD_SHIFT would require
> >> recomputing all of prio_to_weight/wmult for the new NICE_0_LOAD.
> >
> > I think I follow, but doesn't that mean that the current code is broken
> > too? NICE_0_LOAD changes if you change SCHED_LOAD_RESOLUTION:
> >
> > #define SCHED_LOAD_SHIFT        (10 + SCHED_LOAD_RESOLUTION)
> > #define SCHED_LOAD_SCALE        (1L << SCHED_LOAD_SHIFT)
> >
> > #define NICE_0_LOAD             SCHED_LOAD_SCALE
> > #define NICE_0_SHIFT            SCHED_LOAD_SHIFT
> >
> > To me it sounds like we need to define it the other way around:
> >
> > #define NICE_0_SHIFT            10
> > #define NICE_0_LOAD             (1L << NICE_0_SHIFT)
> >
> > and then add any additional resolution bits from there to ensure that
> > NICE_0_LOAD and the prio_to_weight/wmult tables are unchanged.
> 
> No, NICE_0_LOAD is supposed to be scale_load(prio_to_weight[nice_0]),
> ie including SLR. It has never been clear to me what
> SCHED_LOAD_SCALE/SCHED_LOAD_SHIFT were for as opposed to NICE_0_LOAD,

I see, I wasn't sure if NICE_0_LOAD is being used in the code somewhere
with the assumption that NICE_0_LOAD = load.weight = 1024. The
scale_(down_)_load() conversion between base load (nice_0 = 1024) and
hi-res load makes sense.

> and the new utilization uses of it are entirely unlinked to 1024 == NICE_0

Yes, agreed. For utilization we just need to define some fixed point
resolution (as Yuyang said). That resolution is independent of the hi-res
load additional bits and should remain so. The same fixed point
resolution has to be used for capacity as well unless we want to
introduce scale_(down_)_capacity() functions to allow utilization to be
compared to capacity.
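
(A minimal sketch of what such hypothetical helpers could look like; the names
scale_capacity()/scale_capacity_down() are made up for illustration and do not
exist in the kernel:)

#define SCHED_RESOLUTION_SHIFT	10

/* Only needed if util_avg ever moved to the 20-bit load unit while
 * capacity stayed at the 10-bit SCHED_CAPACITY_SHIFT unit.
 */
static inline unsigned long scale_capacity(unsigned long cap)
{
	return cap << SCHED_RESOLUTION_SHIFT;	/* 10-bit -> 20-bit unit */
}

static inline unsigned long scale_capacity_down(unsigned long cap)
{
	return cap >> SCHED_RESOLUTION_SHIFT;	/* 20-bit -> 10-bit unit */
}

/* ...and every util vs. capacity comparison would need a conversion, e.g.: */
static inline int cpu_is_overutilized(unsigned long util_avg, unsigned long capacity)
{
	return util_avg > scale_capacity(capacity);
}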

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-14 22:56                             ` Yuyang Du
@ 2015-09-15 17:11                               ` bsegall
  2015-09-15 18:39                                 ` Yuyang Du
  0 siblings, 1 reply; 97+ messages in thread
From: bsegall @ 2015-09-15 17:11 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Morten Rasmussen, Peter Zijlstra, Dietmar Eggemann,
	Vincent Guittot, Steve Muckle, mingo, daniel.lezcano, mturquette,
	rjw, Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

Yuyang Du <yuyang.du@intel.com> writes:

> On Mon, Sep 14, 2015 at 10:34:00AM -0700, bsegall@google.com wrote:
>> >> SCHED_LOAD_RESOLUTION and the non-SLR part of SCHED_LOAD_SHIFT are not
>> >> required to be the same value and should not be conflated.
>> >> 
>> >> In particular, since cgroups are on the same timeline as tasks and their
>> >> shares are not scaled by SCHED_LOAD_SHIFT in any way (but are scaled so
>> >> that SCHED_LOAD_RESOLUTION is invisible), changing that part of
>> >> SCHED_LOAD_SHIFT would cause issues, since things can assume that nice-0
>> >> = 1024. However changing SCHED_LOAD_RESOLUTION would be fine, as that is
>> >> an internal value to the kernel.
>> >> 
>> >> In addition, changing the non-SLR part of SCHED_LOAD_SHIFT would require
>> >> recomputing all of prio_to_weight/wmult for the new NICE_0_LOAD.
>> >
>> > I think I follow, but doesn't that mean that the current code is broken
>> > too? NICE_0_LOAD changes if you change SCHED_LOAD_RESOLUTION:
>> >
>> > #define SCHED_LOAD_SHIFT        (10 + SCHED_LOAD_RESOLUTION)
>> > #define SCHED_LOAD_SCALE        (1L << SCHED_LOAD_SHIFT)
>> >
>> > #define NICE_0_LOAD             SCHED_LOAD_SCALE
>> > #define NICE_0_SHIFT            SCHED_LOAD_SHIFT
>> >
>> > To me it sounds like we need to define it the other way around:
>> >
>> > #define NICE_0_SHIFT            10
>> > #define NICE_0_LOAD             (1L << NICE_0_SHIFT)
>> >
>> > and then add any additional resolution bits from there to ensure that
>> > NICE_0_LOAD and the prio_to_weight/wmult tables are unchanged.
>> 
>> No, NICE_0_LOAD is supposed to be scale_load(prio_to_weight[nice_0]),
>> ie including SLR. It has never been clear to me what
>> SCHED_LOAD_SCALE/SCHED_LOAD_SHIFT were for as opposed to NICE_0_LOAD,
>> and the new utilization uses of it are entirely unlinked to 1024 == NICE_0
>
> Presume your SLR means SCHED_LOAD_RESOLUTION:
>
> 1) The introduction of (not redefinition of) SCHED_RESOLUTION_SHIFT does not
> change anything after macro expansion.
>
> 2) The constants in prio_to_weight[] and prio_to_wmult[] are tied to a
> resolution of 10-bit NICE_0, i.e., 1024. I guess that is the user-visible
> part you mentioned, as is the cgroup share.
>
> To me, it is all ok. With the SCHED_RESOLUTION_SHIFT, the basic resolution
> unit, it is just for us to state clearly, the NICE_0's weight has a fixed
> resolution of SCHED_RESOLUTION_SHIFT, or even add this:
>
> #if prio_to_weight[20] != 1 << SCHED_RESOLUTION_SHIFT
> error "NICE_0 weight not calibrated"
> #endif
> /* I can learn, Peter */
>
> I guess you are saying we are conflating NICE_0 with NICE_0_LOAD. But to me,
> they are just integer metrics, needing a resolution respectively. That is it.

Yes this would change nothing at the moment post-expansion, that's not
the point. SLR being 10 bits and the nice-0 being 1024 are completely
and utterly unrelated and the headers should not pretend they need to be
the same value, any more than there should be a #define that is shared
with every other use of 1024 in the kernel.
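
(To make that concrete, a purely hypothetical configuration where the two
values differ; illustrative only:)

/* Nothing forces the internal extra resolution to be 10 bits. */
#define SCHED_LOAD_RESOLUTION	16
#define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
#define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)

/*
 * The user-visible nice-0 weight is still prio_to_weight[20] == 1024, and
 * scale_load_down(scale_load(1024)) == 1024, so userspace sees no change;
 * only the kernel-internal NICE_0_LOAD would become 1024 << 16.
 */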

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-15 17:11                               ` bsegall
@ 2015-09-15 18:39                                 ` Yuyang Du
  2015-09-16 17:06                                   ` bsegall
  0 siblings, 1 reply; 97+ messages in thread
From: Yuyang Du @ 2015-09-15 18:39 UTC (permalink / raw)
  To: bsegall
  Cc: Morten Rasmussen, Peter Zijlstra, Dietmar Eggemann,
	Vincent Guittot, Steve Muckle, mingo, daniel.lezcano, mturquette,
	rjw, Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

On Tue, Sep 15, 2015 at 10:11:41AM -0700, bsegall@google.com wrote:
> >
> > I guess you are saying we are conflating NICE_0 with NICE_0_LOAD. But to me,
> > they are just integer metrics, needing a resolution respectively. That is it.
> 
> Yes this would change nothing at the moment post-expansion, that's not
> the point. SLR being 10 bits and the nice-0 being 1024 are completely
> and utterly unrelated and the headers should not pretend they need to be
> the same value,

I never said they are related, why should they be related. And they need or
need not to be the same value, fine.

However, the SLR has to be a value. It is because it might be 10 or 20 (LOAD),
therefore I make SCHED_RESOLUTION_SHIFT 10 (kind of a denominator). Not the
other way around.

We can define SCHED_RESOLUTION_SHIFT 1, and then define SLR = x * SCHED_RESOLUTION_SHIFT
with x being a random number, if you must.

And by the way, with SCHED_RESOLUTION_SHIFT, there will not be SLR anymore, we only
need SCHED_LOAD_SHIFT, which has a low resolution 1*SCHED_RESOLUTION_SHIFT or a high
one 2*SCHED_RESOLUTION_SHIFT. The scale_load*() macros convert between the
resolutions of NICE_0 and NICE_0_LOAD.

> any more than there should be a #define that is shared
> with every other use of 1024 in the kernel.

The point really is, metrics (if not many ) need resolution, not just NICE_0_LOAD does. 
You can choose to either hardcode a number, like SCHED_CAPACITY_SHIFT now,
or you can use SCHED_RESOLUTION_SHIFT, which is even as simple as a sign to say what
the defined is (the scaled one with a better resolution vs. the original one).
I guess this is to say we now have a (no-big-deal) resolution system.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-14 17:34                           ` bsegall
  2015-09-14 22:56                             ` Yuyang Du
  2015-09-15  8:43                             ` Morten Rasmussen
@ 2015-09-16 15:36                             ` Peter Zijlstra
  2 siblings, 0 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-16 15:36 UTC (permalink / raw)
  To: bsegall
  Cc: Morten Rasmussen, Yuyang Du, Dietmar Eggemann, Vincent Guittot,
	Steve Muckle, mingo, daniel.lezcano, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Mon, Sep 14, 2015 at 10:34:00AM -0700, bsegall@google.com wrote:

> It has never been clear to me what
> SCHED_LOAD_SCALE/SCHED_LOAD_SHIFT were for as opposed to NICE_0_LOAD,

SCHED_LOAD_SCALE/SHIFT are the fixed point mult/shift, and NICE_0_LOAD
is the load of a nice-0 task. They happen to be the same by the choice
that nice-0 has a load of 1 (the only natural choice given proportional
weight and hyperboles etc..).

But for the fixed point math we use SCHED_LOAD_* and only when we want
to explicitly use the unit load we use NICE_0_LOAD (it's only used 4-5
times or so).
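
(A small sketch of the two roles side by side, assuming the usual 1024 scale;
not a copy of any particular kernel site:)

#define SCHED_LOAD_SHIFT	10
#define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
#define NICE_0_LOAD		SCHED_LOAD_SCALE

/* Fixed-point role: scale a value by a fraction expressed on the
 * SCHED_LOAD_SCALE scale, e.g. "80% of delta" with frac = 819.
 */
static inline unsigned long fp_scale(unsigned long v, unsigned long frac)
{
	return (v * frac) >> SCHED_LOAD_SHIFT;
}

/* Unit-load role: NICE_0_LOAD is the weight of one nice-0 task, i.e. the
 * "1" of the proportional weight scale, e.g. a hypothetical
 * "assume one nice-0 task's worth of load" default:
 *
 *	se->avg.load_avg = NICE_0_LOAD;
 */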



^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-15 18:39                                 ` Yuyang Du
@ 2015-09-16 17:06                                   ` bsegall
  2015-09-17  2:31                                     ` Yuyang Du
  0 siblings, 1 reply; 97+ messages in thread
From: bsegall @ 2015-09-16 17:06 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Morten Rasmussen, Peter Zijlstra, Dietmar Eggemann,
	Vincent Guittot, Steve Muckle, mingo, daniel.lezcano, mturquette,
	rjw, Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

Yuyang Du <yuyang.du@intel.com> writes:

> On Tue, Sep 15, 2015 at 10:11:41AM -0700, bsegall@google.com wrote:
>> >
>> > I guess you are saying we are conflating NICE_0 with NICE_0_LOAD. But to me,
>> > they are just integer metrics, needing a resolution respectively. That is it.
>> 
>> Yes this would change nothing at the moment post-expansion, that's not
>> the point. SLR being 10 bits and the nice-0 being 1024 are completely
>> and utterly unrelated and the headers should not pretend they need to be
>> the same value,
>
> I never said they are related, why should they be related. And they need or
> need not to be the same value, fine.
>
> However, the SLR has to be a value. It is because it might be 10 or 20 (LOAD),
> therefore I make SCHED_RESOLUTION_SHIFT 10 (kind of a denominator). Not the
> other way around.
>
> We can define SCHED_RESOLUTION_SHIFT 1, and then define SLR = x * SCHED_RESOLUTION_SHIFT
> with x being a random number, if you must.

That's sorta the point - you could do this and it would be just as (non-)sensical.

>
>> any more than there should be a #define that is shared
>> with every other use of 1024 in the kernel.
>
> The point really is, metrics (if not many ) need resolution, not just NICE_0_LOAD does. 
> You can choose to either hardcode a number, like SCHED_CAPACITY_SHIFT now,
> or you can use SCHED_RESOLUTION_SHIFT, which is even as simple as a sign to say what
> the defined is (the scaled one with a better resolution vs. the original one).
> I guess this is to say we now have a (no-big-deal) resolution system.

Yes they were chosen for similar reasons, but they are not conceptually
related, and you couldn't decide to just bump up all the resolutions by
changing SCHED_RESOLUTION_SHIFT, so doing this would just be misleading.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-16 17:06                                   ` bsegall
@ 2015-09-17  2:31                                     ` Yuyang Du
  0 siblings, 0 replies; 97+ messages in thread
From: Yuyang Du @ 2015-09-17  2:31 UTC (permalink / raw)
  To: bsegall
  Cc: Morten Rasmussen, Peter Zijlstra, Dietmar Eggemann,
	Vincent Guittot, Steve Muckle, mingo, daniel.lezcano, mturquette,
	rjw, Juri Lelli, sgurrappadi, pang.xunlei, linux-kernel

On Wed, Sep 16, 2015 at 10:06:24AM -0700, bsegall@google.com wrote:
> > The point really is, metrics (if not many ) need resolution, not just NICE_0_LOAD does. 
> > You can choose to either hardcode a number, like SCHED_CAPACITY_SHIFT now,
> > or you can use SCHED_RESOLUTION_SHIFT, which is even as simple as a sign to say what
> > the defined is (the scaled one with a better resolution vs. the original one).
> > I guess this is to say we now have a (no-big-deal) resolution system.
> 
> Yes they were chosen for similar reasons, but they are not conceptually
> related, and you couldn't decide to just bump up all the resolutions by
> changing SCHED_RESOLUTION_SHIFT, so doing this would just be misleading.

Yes, it does make them appear conceptually related. But that probably
isn't worth worrying about, as long as one knows it is just a scaled integer metric.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-11 17:22                         ` Morten Rasmussen
@ 2015-09-17  9:51                           ` Peter Zijlstra
  2015-09-17 10:38                           ` Peter Zijlstra
  1 sibling, 0 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-17  9:51 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Dietmar Eggemann, Steve Muckle, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Fri, Sep 11, 2015 at 06:22:47PM +0100, Morten Rasmussen wrote:
> I have done some runs with the proposed fixes added:
> 
> 1. PeterZ's util_sum shift fix (change util_sum).
> 2. Morten's scaling of weight instead of time (reduce bit loss).
> 3. PeterZ's unconditional calls to arch*() functions (compiler opt).
> 
> To be clear: 2 includes 1, and 3 includes 1 and 2.
> 
> Runs where done with the default (#define) implementation of the
> arch-functions and with arch specific implementation for ARM.


> Results:
> 
> perf numbers are average of three (x10) runs. Raw data is available
> further down.
> 
> ARM TC2		#mul		#mul_all	perf bench
> arch*()		default	arm	default	arm	default	arm
> 
> 1 shift_fix		10	16	22	36	13.401	13.288
> 2 scaled_weight	12	14	30	32	13.282	13.238
> 3 unconditional	12	14	26	32	13.296	13.427
> 
> Intel E5-2690		#mul		#mul_all	perf bench
> arch*()		default		default		default
> 
> 1 shift_fix		13				14.786
> 2 scaled_weight	18				15.078
> 3 unconditional	14				15.195
> 
> 
> Overall it appears that fewer 'mul' instructions doesn't necessarily
> mean better perf bench score. For ARM, 2 seems the best choice overall.

I suspect you're paying for having to do an actual load which can miss
there. So that makes sense.

> While 1 is better for Intel.

Right, because GCC shits itself with those conditionals. Weirdly though,
the below version does not seem so affected.

> I suggest that I spin a v2 of this series and go with scaled_weight to
> reduce bit loss. Any objections?

Just playing devil's advocate with myself: how about cgroups? Won't a
per-cpu share of the cgroup weight often be very small?


So I had a little play, and I'm not at all convinced we want to do this
(I've not actually run any numbers on it, but I can well imagine the
extra condition hurting on a branch mispredict), but it does show GCC
need not always get confused.

---
 kernel/sched/fair.c | 58 +++++++++++++++++++++++++++++++++++------------------
 1 file changed, 38 insertions(+), 20 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 9176f7c588a8..1b60fbe3b86c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2519,7 +2519,25 @@ static u32 __compute_runnable_contrib(u64 n)
 #error "load tracking assumes 2^10 as unit"
 #endif
 
-#define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
+static __always_inline unsigned long fp_mult2(unsigned long x, unsigned long y)
+{
+	y *= x;
+	y >>= 10;
+
+	return y;
+}
+
+static __always_inline unsigned long fp_mult3(unsigned long x, unsigned long y, unsigned long z)
+{
+	if (x > y)
+		swap(x,y);
+
+	z *= y;
+	z >>= 10;
+	z *= x;
+
+	return z;
+}
 
 /*
  * We can represent the historical contribution to runnable average as the
@@ -2553,9 +2571,9 @@ static __always_inline int
 __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		  unsigned long weight, int running, struct cfs_rq *cfs_rq)
 {
-	u64 delta, scaled_delta, periods;
+	u64 delta, periods;
 	u32 contrib;
-	unsigned int delta_w, scaled_delta_w, decayed = 0;
+	unsigned int delta_w, decayed = 0;
 	unsigned long scale_freq, scale_cpu;
 
 	delta = now - sa->last_update_time;
@@ -2577,8 +2595,10 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		return 0;
 	sa->last_update_time = now;
 
-	scale_freq = arch_scale_freq_capacity(NULL, cpu);
-	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+	if (weight)
+		scale_freq = arch_scale_freq_capacity(NULL, cpu);
+	if (running)
+		scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
 
 	/* delta_w is the amount already accumulated against our next period */
 	delta_w = sa->period_contrib;
@@ -2594,16 +2614,14 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
-		scaled_delta_w = cap_scale(delta_w, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * scaled_delta_w;
-			if (cfs_rq) {
-				cfs_rq->runnable_load_sum +=
-						weight * scaled_delta_w;
-			}
+			unsigned long t = fp_mult3(delta_w, weight, scale_freq);
+			sa->load_sum += t;
+			if (cfs_rq)
+				cfs_rq->runnable_load_sum += t;
 		}
 		if (running)
-			sa->util_sum += scaled_delta_w * scale_cpu;
+			sa->util_sum += delta_w * fp_mult2(scale_cpu, scale_freq);
 
 		delta -= delta_w;
 
@@ -2620,25 +2638,25 @@ __update_load_avg(u64 now, int cpu, struct sched_avg *sa,
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		contrib = __compute_runnable_contrib(periods);
-		contrib = cap_scale(contrib, scale_freq);
 		if (weight) {
-			sa->load_sum += weight * contrib;
+			unsigned long t = fp_mult3(contrib, weight, scale_freq);
+			sa->load_sum += t;
 			if (cfs_rq)
-				cfs_rq->runnable_load_sum += weight * contrib;
+				cfs_rq->runnable_load_sum += t;
 		}
 		if (running)
-			sa->util_sum += contrib * scale_cpu;
+			sa->util_sum += contrib * fp_mult2(scale_cpu, scale_freq);
 	}
 
 	/* Remainder of delta accrued against u_0` */
-	scaled_delta = cap_scale(delta, scale_freq);
 	if (weight) {
-		sa->load_sum += weight * scaled_delta;
+		unsigned long t = fp_mult3(delta, weight, scale_freq);
+		sa->load_sum += t;
 		if (cfs_rq)
-			cfs_rq->runnable_load_sum += weight * scaled_delta;
+			cfs_rq->runnable_load_sum += t;
 	}
 	if (running)
-		sa->util_sum += scaled_delta * scale_cpu;
+		sa->util_sum += delta * fp_mult2(scale_cpu, scale_freq);
 
 	sa->period_contrib += delta;
 

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-11 17:22                         ` Morten Rasmussen
  2015-09-17  9:51                           ` Peter Zijlstra
@ 2015-09-17 10:38                           ` Peter Zijlstra
  2015-09-21  1:16                             ` Yuyang Du
  1 sibling, 1 reply; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-17 10:38 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, Dietmar Eggemann, Steve Muckle, mingo,
	daniel.lezcano, yuyang.du, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Fri, Sep 11, 2015 at 06:22:47PM +0100, Morten Rasmussen wrote:

> While at it, should I include Yuyang's patch redefining the SCALE/SHIFT
> mess?

I suspect his patch will fail to compile on ARM which uses
SCHED_CAPACITY_* outside of kernel/sched/*.

But if you all (Ben, Yuyang, you) can agree on a patch simplifying these
things I'm not opposed to it.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-17 10:38                           ` Peter Zijlstra
@ 2015-09-21  1:16                             ` Yuyang Du
  2015-09-21 17:30                               ` bsegall
  0 siblings, 1 reply; 97+ messages in thread
From: Yuyang Du @ 2015-09-21  1:16 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, Vincent Guittot, Dietmar Eggemann,
	Steve Muckle, mingo, daniel.lezcano, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Thu, Sep 17, 2015 at 12:38:25PM +0200, Peter Zijlstra wrote:
> On Fri, Sep 11, 2015 at 06:22:47PM +0100, Morten Rasmussen wrote:
> 
> > While at it, should I include Yuyang's patch redefining the SCALE/SHIFT
> > mess?
> 
> I suspect his patch will fail to compile on ARM which uses
> SCHED_CAPACITY_* outside of kernel/sched/*.
> 
> But if you all (Ben, Yuyang, you) can agree on a patch simplifying these
> things I'm not opposed to it.

Yes, indeed. So SCHED_RESOLUTION_SHIFT has to be defined in include/linux/sched.h.

With this, I think the code still needs some cleanup and, importantly,
documentation.

But first, I think load_sum and load_avg can afford NICE_0_LOAD with either high
or low resolution. So we have no reason to have a low-resolution (10-bit) load_avg
when NICE_0_LOAD has a high resolution (20 bits), because load_avg = runnable% * load,
as opposed to the current load_avg = runnable% * scale_load_down(load).

We get rid of all scale_load_down() for runnable load average?

--

Subject: [PATCH] sched/fair: Generalize the load/util averages resolution
 definition

A metric needs a certain resolution to determine how much detail we
can capture (i.e., how much detail is lost to integer rounding); the
resolution also determines the range of the metric.

For instance, to increase the resolution of [0, 1] (two levels), one
can multiply by 1024 and get [0, 1024] (1025 levels).

In sched/fair, a few metrics depend on the resolution: load/load_avg,
util_avg, and capacity (frequency adjustment). In order to reduce the
risk of making mistakes related to resolution/range, we therefore
generalize the resolution by defining a basic resolution constant
number, and then formalize all metrics by depending on the basic
resolution. The basic resolution is 1024 or (1 << 10). Further, one
can recursively apply the basic resolution to increase the final
resolution.

As pointed out by Ben Segall, NICE_0's weight (visible to the user) and
its load have independent resolutions, but they must be well calibrated.

Signed-off-by: Yuyang Du <yuyang.du@intel.com>
---
 include/linux/sched.h |  9 ++++++---
 kernel/sched/fair.c   |  4 ----
 kernel/sched/sched.h  | 15 ++++++++++-----
 3 files changed, 16 insertions(+), 12 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index bd38b3e..9b86f79 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -909,10 +909,13 @@ enum cpu_idle_type {
 	CPU_MAX_IDLE_TYPES
 };
 
+# define SCHED_RESOLUTION_SHIFT	10
+# define SCHED_RESOLUTION_SCALE	(1L << SCHED_RESOLUTION_SHIFT)
+
 /*
  * Increase resolution of cpu_capacity calculations
  */
-#define SCHED_CAPACITY_SHIFT	10
+#define SCHED_CAPACITY_SHIFT	SCHED_RESOLUTION_SHIFT
 #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
 
 /*
@@ -1180,8 +1183,8 @@ struct load_weight {
  * 1) load_avg factors frequency scaling into the amount of time that a
  * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
  * aggregated such weights of all runnable and blocked sched_entities.
- * 2) util_avg factors frequency and cpu scaling into the amount of time
- * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
+ * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
+ * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
  * For cfs_rq, it is the aggregated such times of all runnable and
  * blocked sched_entities.
  * The 64 bit load_sum can:
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4df37a4..c61fd8e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2522,10 +2522,6 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
-#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
-#error "load tracking assumes 2^10 as unit"
-#endif
-
 #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 3845a71..31b4022 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -53,18 +53,23 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
  * increased costs.
  */
 #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
-# define SCHED_LOAD_RESOLUTION	10
-# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
-# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
+# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
+# define scale_load(w)		((w) << SCHED_RESOLUTION_SHIFT)
+# define scale_load_down(w)	((w) >> SCHED_RESOLUTION_SHIFT)
 #else
-# define SCHED_LOAD_RESOLUTION	0
+# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT)
 # define scale_load(w)		(w)
 # define scale_load_down(w)	(w)
 #endif
 
-#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
 #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
 
+/*
+ * NICE_0's weight (visible to user) and its load (invisible to user) have
+ * independent resolution, but they should be well calibrated. We use scale_load()
+ * and scale_load_down(w) to convert between them, the following must be true:
+ * scale_load(prio_to_weight[20]) == NICE_0_LOAD
+ */
 #define NICE_0_LOAD		SCHED_LOAD_SCALE
 #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
 
-- 

^ permalink raw reply related	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-21  1:16                             ` Yuyang Du
@ 2015-09-21 17:30                               ` bsegall
  2015-09-21 23:39                                 ` Yuyang Du
  0 siblings, 1 reply; 97+ messages in thread
From: bsegall @ 2015-09-21 17:30 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Morten Rasmussen, Vincent Guittot,
	Dietmar Eggemann, Steve Muckle, mingo, daniel.lezcano,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

Yuyang Du <yuyang.du@intel.com> writes:

> On Thu, Sep 17, 2015 at 12:38:25PM +0200, Peter Zijlstra wrote:
>> On Fri, Sep 11, 2015 at 06:22:47PM +0100, Morten Rasmussen wrote:
>> 
>> > While at it, should I include Yuyang's patch redefining the SCALE/SHIFT
>> > mess?
>> 
>> I suspect his patch will fail to compile on ARM which uses
>> SCHED_CAPACITY_* outside of kernel/sched/*.
>> 
>> But if you all (Ben, Yuyang, you) can agree on a patch simplifying these
>> things I'm not opposed to it.
>
> Yes, indeed. So SCHED_RESOLUTION_SHIFT has to be defined in include/linux/sched.h.
>
> With this, I think the code still needs some cleanup and, importantly,
> documentation.
>
> But first, I think load_sum and load_avg can afford NICE_0_LOAD with either high
> or low resolution. So we have no reason to have a low-resolution (10-bit) load_avg
> when NICE_0_LOAD has a high resolution (20 bits), because load_avg = runnable% * load,
> as opposed to the current load_avg = runnable% * scale_load_down(load).
>
> We get rid of all scale_load_down() for runnable load average?

Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a
32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already
going to give errors on 32-bit (even with the old code in fact). This
should probably be fixed... somehow (dividing by 4 for load_sum on
32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on
32-bit might have made sense but would be a weird difference between 32
and 64, and could break userspace anyway, so it's presumably too late
for that).

64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on
32-bit.
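
(For reference, the arithmetic behind those numbers; LOAD_AVG_MAX == 47742 and
prio_to_weight[0] == 88761 follow from the figure above, MAX_SHARES == 1 << 18
is assumed from the 2015-era sources:)

#include <stdio.h>

int main(void)
{
	unsigned long long load_avg_max	= 47742;	/* LOAD_AVG_MAX		*/
	unsigned long long nice_m20	= 88761;	/* prio_to_weight[0]	*/
	unsigned long long max_shares	= 1ULL << 18;	/* MAX_SHARES (assumed)	*/

	/* 4237627662: barely fits a 32-bit unsigned long (2^32 = 4294967296) */
	printf("%llu\n", load_avg_max * nice_m20);

	/* 12515278848: already past 32 bits, hence the group-shares problem */
	printf("%llu\n", load_avg_max * max_shares);

	return 0;
}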

>
> --
>
> Subject: [PATCH] sched/fair: Generalize the load/util averages resolution
>  definition
>
> A metric needs a certain resolution to determine how much detail we
> can capture (i.e., how much detail is lost to integer rounding); the
> resolution also determines the range of the metric.
>
> For instance, to increase the resolution of [0, 1] (two levels), one
> can multiply by 1024 and get [0, 1024] (1025 levels).
>
> In sched/fair, a few metrics depend on the resolution: load/load_avg,
> util_avg, and capacity (frequency adjustment). In order to reduce the
> risk of making mistakes related to resolution/range, we therefore
> generalize the resolution by defining a basic resolution constant
> number, and then formalize all metrics by depending on the basic
> resolution. The basic resolution is 1024 or (1 << 10). Further, one
> can recursively apply the basic resolution to increase the final
> resolution.
>
> As pointed out by Ben Segall, NICE_0's weight (visible to the user) and
> its load have independent resolutions, but they must be well calibrated.
>
> Signed-off-by: Yuyang Du <yuyang.du@intel.com>
> ---
>  include/linux/sched.h |  9 ++++++---
>  kernel/sched/fair.c   |  4 ----
>  kernel/sched/sched.h  | 15 ++++++++++-----
>  3 files changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/include/linux/sched.h b/include/linux/sched.h
> index bd38b3e..9b86f79 100644
> --- a/include/linux/sched.h
> +++ b/include/linux/sched.h
> @@ -909,10 +909,13 @@ enum cpu_idle_type {
>  	CPU_MAX_IDLE_TYPES
>  };
>  
> +# define SCHED_RESOLUTION_SHIFT	10
> +# define SCHED_RESOLUTION_SCALE	(1L << SCHED_RESOLUTION_SHIFT)
> +
>  /*
>   * Increase resolution of cpu_capacity calculations
>   */
> -#define SCHED_CAPACITY_SHIFT	10
> +#define SCHED_CAPACITY_SHIFT	SCHED_RESOLUTION_SHIFT
>  #define SCHED_CAPACITY_SCALE	(1L << SCHED_CAPACITY_SHIFT)
>  
>  /*
> @@ -1180,8 +1183,8 @@ struct load_weight {
>   * 1) load_avg factors frequency scaling into the amount of time that a
>   * sched_entity is runnable on a rq into its weight. For cfs_rq, it is the
>   * aggregated such weights of all runnable and blocked sched_entities.
> - * 2) util_avg factors frequency and cpu scaling into the amount of time
> - * that a sched_entity is running on a CPU, in the range [0..SCHED_LOAD_SCALE].
> + * 2) util_avg factors frequency and cpu capacity scaling into the amount of time
> + * that a sched_entity is running on a CPU, in the range [0..SCHED_CAPACITY_SCALE].
>   * For cfs_rq, it is the aggregated such times of all runnable and
>   * blocked sched_entities.
>   * The 64 bit load_sum can:
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 4df37a4..c61fd8e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2522,10 +2522,6 @@ static u32 __compute_runnable_contrib(u64 n)
>  	return contrib + runnable_avg_yN_sum[n];
>  }
>  
> -#if (SCHED_LOAD_SHIFT - SCHED_LOAD_RESOLUTION) != 10 || SCHED_CAPACITY_SHIFT != 10
> -#error "load tracking assumes 2^10 as unit"
> -#endif
> -
>  #define cap_scale(v, s) ((v)*(s) >> SCHED_CAPACITY_SHIFT)
>  
>  /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 3845a71..31b4022 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -53,18 +53,23 @@ static inline void update_cpu_load_active(struct rq *this_rq) { }
>   * increased costs.
>   */
>  #if 0 /* BITS_PER_LONG > 32 -- currently broken: it increases power usage under light load  */
> -# define SCHED_LOAD_RESOLUTION	10
> -# define scale_load(w)		((w) << SCHED_LOAD_RESOLUTION)
> -# define scale_load_down(w)	((w) >> SCHED_LOAD_RESOLUTION)
> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT + SCHED_RESOLUTION_SHIFT)
> +# define scale_load(w)		((w) << SCHED_RESOLUTION_SHIFT)
> +# define scale_load_down(w)	((w) >> SCHED_RESOLUTION_SHIFT)
>  #else
> -# define SCHED_LOAD_RESOLUTION	0
> +# define SCHED_LOAD_SHIFT	(SCHED_RESOLUTION_SHIFT)
>  # define scale_load(w)		(w)
>  # define scale_load_down(w)	(w)
>  #endif
>  
> -#define SCHED_LOAD_SHIFT	(10 + SCHED_LOAD_RESOLUTION)
>  #define SCHED_LOAD_SCALE	(1L << SCHED_LOAD_SHIFT)
>  
> +/*
> + * NICE_0's weight (visible to user) and its load (invisible to user) have
> + * independent resolution, but they should be well calibrated. We use scale_load()
> + * and scale_load_down(w) to convert between them, the following must be true:
> + * scale_load(prio_to_weight[20]) == NICE_0_LOAD
> + */
>  #define NICE_0_LOAD		SCHED_LOAD_SCALE
>  #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
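
For reference, the calibration rule in that comment can be checked with a small
standalone program (illustration only, not kernel code; the constants mirror the
hunk above, and prio_to_weight[20] == 1024 is the nice-0 weight):

#include <assert.h>
#include <stdio.h>

#define SCHED_RESOLUTION_SHIFT	10

int main(void)
{
	unsigned long nice_0_weight = 1024;		/* prio_to_weight[20] */

	/* low resolution: scale_load() is a no-op, SCHED_LOAD_SHIFT == 10 */
	unsigned long nice_0_load_lo = 1UL << SCHED_RESOLUTION_SHIFT;
	assert(nice_0_weight == nice_0_load_lo);

	/* high resolution: scale_load() shifts up, SCHED_LOAD_SHIFT == 20 */
	unsigned long nice_0_load_hi = 1UL << (2 * SCHED_RESOLUTION_SHIFT);
	assert((nice_0_weight << SCHED_RESOLUTION_SHIFT) == nice_0_load_hi);

	printf("%lu %lu\n", nice_0_load_lo, nice_0_load_hi);	/* 1024 1048576 */
	return 0;
}

Either way the invariant scale_load(prio_to_weight[20]) == NICE_0_LOAD holds,
which is what ties NICE_0_SHIFT to SCHED_LOAD_SHIFT in the hunk above.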

I still think tying the scale_load shift to be the same as the
SCHED_CAPACITY/etc shift is silly, and tying the NICE_0_LOAD/SHIFT in is
worse. Honestly if I was going to change anything it would be to define
NICE_0_LOAD/SHIFT entirely separately from SCHED_LOAD_SCALE/SHIFT.

However I'm not sure if calculate_imbalance's use of SCHED_LOAD_SCALE is
actually a separate use of 1024*SLR-as-percentage or is basically
assuming most tasks are nice-0 or what. It sure /looks/ like it's
comparing values with different units - it's doing (nr_running * CONST -
group_capacity) and comparing to load, so it looks like both (ie
increasing load.weight of everything on your system by X% would change
load balancer behavior here).

Given that it might make sense to make it clear that capacity units and
nice-0-task units have to be the same thing due to load balancer
approximations (though they are still entirely separate from the
SCHED_LOAD_RESOLUTION multiplier).
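
A caricature of that comparison (a standalone sketch under the assumption that
the code effectively computes nr_running * SCHED_LOAD_SCALE - group_capacity and
weighs it against summed load; not the actual calculate_imbalance()):

#include <stdio.h>

#define SCHED_LOAD_SCALE	1024UL	/* one nice-0 task's load, low res */
#define SCHED_CAPACITY_SCALE	1024UL	/* one fully available cpu */

int main(void)
{
	unsigned long nr_running = 3;
	unsigned long group_capacity = 2 * SCHED_CAPACITY_SCALE; /* two cpus */
	unsigned long group_load = 3 * SCHED_LOAD_SCALE;	  /* three nice-0 tasks */

	/* task-count units minus capacity units ... */
	unsigned long load_above_capacity =
		nr_running * SCHED_LOAD_SCALE - group_capacity;

	/* ... then weighed against load.weight units */
	unsigned long imbalance = load_above_capacity < group_load ?
				  load_above_capacity : group_load;

	/*
	 * Doubling every task's load.weight doubles group_load but leaves
	 * load_above_capacity untouched, so the relative sizes change unless
	 * a nice-0 task's load and one cpu of capacity share the same unit.
	 */
	printf("load_above=%lu load=%lu imbalance=%lu\n",
	       load_above_capacity, group_load, imbalance);
	return 0;
}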

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-21 17:30                               ` bsegall
@ 2015-09-21 23:39                                 ` Yuyang Du
  2015-09-22 17:18                                   ` bsegall
  0 siblings, 1 reply; 97+ messages in thread
From: Yuyang Du @ 2015-09-21 23:39 UTC (permalink / raw)
  To: bsegall
  Cc: Peter Zijlstra, Morten Rasmussen, Vincent Guittot,
	Dietmar Eggemann, Steve Muckle, mingo, daniel.lezcano,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@google.com wrote:
> > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high
> > or low resolution. So we have no reason to have low resolution (10bits) load_avg
> > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load,
> > as opposed to now we have load_avg = runnable% * scale_load_down(load).
> >
> > We get rid of all scale_load_down() for runnable load average?
> 
> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a
> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already
> going to give errors on 32-bit (even with the old code in fact). This
> should probably be fixed... somehow (dividing by 4 for load_sum on
> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on
> 32-bit might have made sense but would be a weird difference between 32
> and 64, and could break userspace anyway, so it's presumably too late
> for that).
> 
> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on
> 32-bit.
> 

load_avg has no LOAD_AVG_MAX term in it, it is runnable% * load, IOW, load_avg <= load.
So, on 32bit, cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47, with 20bits
load resolution. This is ok, because struct load_weight's load is also unsigned
long. If overflown, cfs_rq->load.weight will be overflown in the first place.

However, after a second thought, this is not quite right. Because load_avg is not
necessarily no greater than load, since load_avg has blocked load in it. Although,
load_avg is still at the same level as load (converging to be <= load), we may not
want the risk to overflow on 32bit.
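
The bounds being juggled here can be sanity-checked quickly (standalone
arithmetic, not kernel code; prio_to_weight[0] == 88761 and
LOAD_AVG_MAX == 47742 at this point in time):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	uint64_t max_weight = 88761;	/* prio_to_weight[0], nice -20 */
	uint64_t load_avg_max = 47742;	/* LOAD_AVG_MAX */

	/* 4237627662: just inside a 32-bit unsigned long (max 4294967295) */
	printf("%llu\n", (unsigned long long)(max_weight * load_avg_max));

	/* nice -20 entities a 32-bit sum can hold with 20-bit load resolution */
	printf("%llu\n",
	       (unsigned long long)((UINT64_C(1) << 32) / max_weight / 1024));
	return 0;
}

which prints 4237627662 and 47, the two figures quoted above.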

> > +/*
> > + * NICE_0's weight (visible to user) and its load (invisible to user) have
> > + * independent resolution, but they should be well calibrated. We use scale_load()
> > + * and scale_load_down(w) to convert between them, the following must be true:
> > + * scale_load(prio_to_weight[20]) == NICE_0_LOAD
> > + */
> >  #define NICE_0_LOAD		SCHED_LOAD_SCALE
> >  #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
> 
> I still think tying the scale_load shift to be the same as the
> SCHED_CAPACITY/etc shift is silly, and tying the NICE_0_LOAD/SHIFT in is
> worse. Honestly if I was going to change anything it would be to define
> NICE_0_LOAD/SHIFT entirely separately from SCHED_LOAD_SCALE/SHIFT.

If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get 
nice-0's load, I don't understand why you want to separate them.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-21 23:39                                 ` Yuyang Du
@ 2015-09-22 17:18                                   ` bsegall
  2015-09-22 23:22                                     ` Yuyang Du
  2015-09-30 12:52                                     ` Peter Zijlstra
  0 siblings, 2 replies; 97+ messages in thread
From: bsegall @ 2015-09-22 17:18 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Morten Rasmussen, Vincent Guittot,
	Dietmar Eggemann, Steve Muckle, mingo, daniel.lezcano,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

Yuyang Du <yuyang.du@intel.com> writes:

> On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@google.com wrote:
>> > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high
>> > or low resolution. So we have no reason to have low resolution (10bits) load_avg
>> > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load,
>> > as opposed to now we have load_avg = runnable% * scale_load_down(load).
>> >
>> > We get rid of all scale_load_down() for runnable load average?
>> 
>> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a
>> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already
>> going to give errors on 32-bit (even with the old code in fact). This
>> should probably be fixed... somehow (dividing by 4 for load_sum on
>> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on
>> 32-bit might have made sense but would be a weird difference between 32
>> and 64, and could break userspace anyway, so it's presumably too late
>> for that).
>> 
>> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on
>> 32-bit.
>> 
>
> load_avg has no LOAD_AVG_MAX term in it, it is runnable% * load, IOW, load_avg <= load.
> So, on 32bit, cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47, with 20bits
> load resolution. This is ok, because struct load_weight's load is also unsigned
> long. If overflown, cfs_rq->load.weight will be overflown in the first place.
>
> However, after a second thought, this is not quite right. Because load_avg is not
> necessarily no greater than load, since load_avg has blocked load in it. Although,
> load_avg is still at the same level as load (converging to be <= load), we may not
> want the risk to overflow on 32bit.

Yeah, I missed that load_sum was u64 and only load_avg was long. This
means we're fine on 32-bit with no SLR (or more precisely, cfs_rq
runnable_load_avg can overflow, but only when cfs_rq load.weight does,
so whatever). On 64-bit you can currently have 2^36 cgroups or 2^37
tasks before load.weight overflows, and ~2^31 tasks before
runnable_load_avg does, which is obviously fine (and in fact you'd hit
PID_MAX_LIMIT first even if you had the cpu/ram/etc to not fall over).

Now, applying SLR to runnable_load_avg would cut this down to ~2^21
tasks running at once or 2^20 with cgroups, which is technically
allowed, though it seems utterly implausible (especially since this
would have to all be on one cpu). If SLR was increased as peterz asked
about, this could be an issue though.

All that said, using SLR on load_sum/load_avg as opposed to cfs_rq
runnable_load_avg would be fine, as they're limited to only one
task/cgroup's weight. Having it SLRed and cfs_rq not would be a
little odd, but not impossible.
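
Those order-of-magnitude figures can be reproduced as follows (a
back-of-envelope sketch; it assumes the limiting term is a u64 load_sum bounded
by weight * LOAD_AVG_MAX per entity, with prio_to_weight[0] == 88761,
MAX_SHARES == 1 << 18 and LOAD_AVG_MAX == 47742):

/* compile with -lm */
#include <math.h>
#include <stdio.h>

int main(void)
{
	double task_weight = 88761.0;		/* prio_to_weight[0] */
	double group_weight = 1 << 18;		/* MAX_SHARES */
	double load_avg_max = 47742.0;		/* LOAD_AVG_MAX */
	double slr = 1 << 10;			/* extra load resolution */

	printf("tasks,  no SLR:  2^%.1f\n",
	       64 - log2(task_weight * load_avg_max));		/* ~2^32 */
	printf("tasks,  w/ SLR:  2^%.1f\n",
	       64 - log2(task_weight * slr * load_avg_max));	/* ~2^22 */
	printf("groups, w/ SLR:  2^%.1f\n",
	       64 - log2(group_weight * slr * load_avg_max));	/* ~2^20 */
	return 0;
}

which lands within a bit of the ~2^31, ~2^21 and 2^20 estimates above.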

>
>> > +/*
>> > + * NICE_0's weight (visible to user) and its load (invisible to user) have
>> > + * independent resolution, but they should be well calibrated. We use scale_load()
>> > + * and scale_load_down(w) to convert between them, the following must be true:
>> > + * scale_load(prio_to_weight[20]) == NICE_0_LOAD
>> > + */
>> >  #define NICE_0_LOAD		SCHED_LOAD_SCALE
>> >  #define NICE_0_SHIFT		SCHED_LOAD_SHIFT
>> 
>> I still think tying the scale_load shift to be the same as the
>> SCHED_CAPACITY/etc shift is silly, and tying the NICE_0_LOAD/SHIFT in is
>> worse. Honestly if I was going to change anything it would be to define
>> NICE_0_LOAD/SHIFT entirely separately from SCHED_LOAD_SCALE/SHIFT.
>
> If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get 
> nice-0's load, I don't understand why you want to separate them.

SCHED_LOAD_SHIFT is not how to get nice-0's load, it just happens to
have the same value as NICE_0_SHIFT. (I think anyway, SCHED_LOAD_* is
used in precisely one place other than the newish util_avg, and as I
mentioned it's not remotely clear what calculate_imbalance is doing there)

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-22 17:18                                   ` bsegall
@ 2015-09-22 23:22                                     ` Yuyang Du
  2015-09-23 16:54                                       ` bsegall
  2015-09-30 12:52                                     ` Peter Zijlstra
  1 sibling, 1 reply; 97+ messages in thread
From: Yuyang Du @ 2015-09-22 23:22 UTC (permalink / raw)
  To: bsegall
  Cc: Peter Zijlstra, Morten Rasmussen, Vincent Guittot,
	Dietmar Eggemann, Steve Muckle, mingo, daniel.lezcano,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On Tue, Sep 22, 2015 at 10:18:30AM -0700, bsegall@google.com wrote:
> Yuyang Du <yuyang.du@intel.com> writes:
> 
> > On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@google.com wrote:
> >> > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high
> >> > or low resolution. So we have no reason to have low resolution (10bits) load_avg
> >> > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load,
> >> > as opposed to now we have load_avg = runnable% * scale_load_down(load).
> >> >
> >> > We get rid of all scale_load_down() for runnable load average?
> >> 
> >> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a
> >> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already
> >> going to give errors on 32-bit (even with the old code in fact). This
> >> should probably be fixed... somehow (dividing by 4 for load_sum on
> >> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on
> >> 32-bit might have made sense but would be a weird difference between 32
> >> and 64, and could break userspace anyway, so it's presumably too late
> >> for that).
> >> 
> >> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on
> >> 32-bit.
> >> 
> >
> > load_avg has no LOAD_AVG_MAX term in it, it is runnable% * load, IOW, load_avg <= load.
> > So, on 32bit, cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47, with 20bits
> > load resolution. This is ok, because struct load_weight's load is also unsigned
> > long. If overflown, cfs_rq->load.weight will be overflown in the first place.
> >
> > However, after a second thought, this is not quite right. Because load_avg is not
> > necessarily no greater than load, since load_avg has blocked load in it. Although,
> > load_avg is still at the same level as load (converging to be <= load), we may not
> > want the risk to overflow on 32bit.
 
This second thought made a mistake (what was wrong with me). load_avg is for sure
no greater than load with or without blocked load.

With that said, it really does not matter what the following numbers are, 32bit or
64bit machine. What matters is that cfs_rq->load.weight is one that needs to worry
whether overflow or not, not the load_avg. It is as simple as that.

With that, I think we can and should get rid of the scale_load_down() for load_avg.

> Yeah, I missed that load_sum was u64 and only load_avg was long. This
> means we're fine on 32-bit with no SLR (or more precisely, cfs_rq
> runnable_load_avg can overflow, but only when cfs_rq load.weight does,
> so whatever). On 64-bit you can currently have 2^36 cgroups or 2^37
> tasks before load.weight overflows, and ~2^31 tasks before
> runnable_load_avg does, which is obviously fine (and in fact you'd hit
> PID_MAX_LIMIT first even if you had the cpu/ram/etc to not fall over).
> 
> Now, applying SLR to runnable_load_avg would cut this down to ~2^21
> tasks running at once or 2^20 with cgroups, which is technically
> allowed, though it seems utterly implausible (especially since this
> would have to all be on one cpu). If SLR was increased as peterz asked
> about, this could be an issue though.
> 
> All that said, using SLR on load_sum/load_avg as opposed to cfs_rq
> runnable_load_avg would be fine, as they're limited to only one
> task/cgroup's weight. Having it SLRed and cfs_rq not would be a
> little odd, but not impossible.
 

> > If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get 
> > nice-0's load, I don't understand why you want to separate them.
> 
> SCHED_LOAD_SHIFT is not how to get nice-0's load, it just happens to
> have the same value as NICE_0_SHIFT. (I think anyway, SCHED_LOAD_* is
> used in precisely one place other than the newish util_avg, and as I
> mentioned it's not remotely clear what calculate_imbalance is doing there)

Yes, it is not clear to me either.

With the above proposal to get rid of scale_load_down() for load_avg, I think
we can now remove SCHED_LOAD_*, rename scale_load() to user_to_kernel_load(),
and rename scale_load_down() to kernel_to_user_load().

Hmm?
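
A hypothetical sketch of what that renaming would look like (illustration only,
mirroring the sched.h hunk quoted earlier in the thread, not an actual patch):

#define SCHED_RESOLUTION_SHIFT	10

#if 0 /* BITS_PER_LONG > 32 -- as in the sched.h hunk above */
# define user_to_kernel_load(w)	((w) << SCHED_RESOLUTION_SHIFT)
# define kernel_to_user_load(w)	((w) >> SCHED_RESOLUTION_SHIFT)
#else
# define user_to_kernel_load(w)	(w)
# define kernel_to_user_load(w)	(w)
#endif

/*
 * The conversions would then only appear where weights cross the user/kernel
 * boundary, e.g. reading and writing the cpu.shares value.
 */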

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-22 23:22                                     ` Yuyang Du
@ 2015-09-23 16:54                                       ` bsegall
  2015-09-24  0:22                                         ` Yuyang Du
  0 siblings, 1 reply; 97+ messages in thread
From: bsegall @ 2015-09-23 16:54 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Morten Rasmussen, Vincent Guittot,
	Dietmar Eggemann, Steve Muckle, mingo, daniel.lezcano,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

Yuyang Du <yuyang.du@intel.com> writes:

> On Tue, Sep 22, 2015 at 10:18:30AM -0700, bsegall@google.com wrote:
>> Yuyang Du <yuyang.du@intel.com> writes:
>> 
>> > On Mon, Sep 21, 2015 at 10:30:04AM -0700, bsegall@google.com wrote:
>> >> > But first, I think as load_sum and load_avg can afford NICE_0_LOAD with either high
>> >> > or low resolution. So we have no reason to have low resolution (10bits) load_avg
>> >> > when NICE_0_LOAD has high resolution (20bits), because load_avg = runnable% * load,
>> >> > as opposed to now we have load_avg = runnable% * scale_load_down(load).
>> >> >
>> >> > We get rid of all scale_load_down() for runnable load average?
>> >> 
>> >> Hmm, LOAD_AVG_MAX * prio_to_weight[0] is 4237627662, ie barely within a
>> >> 32-bit unsigned long, but in fact LOAD_AVG_MAX * MAX_SHARES is already
>> >> going to give errors on 32-bit (even with the old code in fact). This
>> >> should probably be fixed... somehow (dividing by 4 for load_sum on
>> >> 32-bit would work, though be ugly. Reducing MAX_SHARES by 2 bits on
>> >> 32-bit might have made sense but would be a weird difference between 32
>> >> and 64, and could break userspace anyway, so it's presumably too late
>> >> for that).
>> >> 
>> >> 64-bit has ~30 bits free, so this would be fine so long as SLR is 0 on
>> >> 32-bit.
>> >> 
>> >
>> > load_avg has no LOAD_AVG_MAX term in it, it is runnable% * load, IOW, load_avg <= load.
>> > So, on 32bit, cfs_rq's load_avg can host 2^32/prio_to_weight[0]/1024 = 47, with 20bits
>> > load resolution. This is ok, because struct load_weight's load is also unsigned
>> > long. If overflown, cfs_rq->load.weight will be overflown in the first place.
>> >
>> > However, after a second thought, this is not quite right. Because load_avg is not
>> > necessarily no greater than load, since load_avg has blocked load in it. Although,
>> > load_avg is still at the same level as load (converging to be <= load), we may not
>> > want the risk to overflow on 32bit.
>  
> This second thought made a mistake (what was wrong with me). load_avg is for sure
> no greater than load with or without blocked load.
>
> With that said, it really does not matter what the following numbers are, 32bit or
> 64bit machine. What matters is that cfs_rq->load.weight is one that needs to worry
> whether overflow or not, not the load_avg. It is as simple as that.
>
> With that, I think we can and should get rid of the scale_load_down()
> for load_avg.

load_avg yes is bounded by load.weight, but on 64-bit load_sum is only
bounded by load.weight * LOAD_AVG_MAX and is the same size as
load.weight (as I said below). There's still space for anything
reasonable though with 10 bits of SLR.

>
>> Yeah, I missed that load_sum was u64 and only load_avg was long. This
>> means we're fine on 32-bit with no SLR (or more precisely, cfs_rq
>> runnable_load_avg can overflow, but only when cfs_rq load.weight does,
>> so whatever). On 64-bit you can currently have 2^36 cgroups or 2^37
>> tasks before load.weight overflows, and ~2^31 tasks before
>> runnable_load_avg does, which is obviously fine (and in fact you'd hit
>> PID_MAX_LIMIT first even if you had the cpu/ram/etc to not fall over).
>> 
>> Now, applying SLR to runnable_load_avg would cut this down to ~2^21
>> tasks running at once or 2^20 with cgroups, which is technically
>> allowed, though it seems utterly implausible (especially since this
>> would have to all be on one cpu). If SLR was increased as peterz asked
>> about, this could be an issue though.
>> 
>> All that said, using SLR on load_sum/load_avg as opposed to cfs_rq
>> runnable_load_avg would be fine, as they're limited to only one
>> task/cgroup's weight. Having it SLRed and cfs_rq not would be a
>> little odd, but not impossible.
>  
>
>> > If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get 
>> > nice-0's load, I don't understand why you want to separate them.
>> 
>> SCHED_LOAD_SHIFT is not how to get nice-0's load, it just happens to
>> have the same value as NICE_0_SHIFT. (I think anyway, SCHED_LOAD_* is
>> used in precisely one place other than the newish util_avg, and as I
>> mentioned it's not remotely clear what calculate_imbalance is doing there)
>
> Yes, it is not clear to me either.
>
> With the above proposal to get rid of scale_load_down() for load_avg, I think
> we can now remove SCHED_LOAD_*, rename scale_load() to user_to_kernel_load(),
> and rename scale_load_down() to kernel_to_user_load().
>
> Hmm?

I have no opinion on renaming the scale_load functions, it's certainly
reasonable, but the scale_load names seem fine too.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-23 16:54                                       ` bsegall
@ 2015-09-24  0:22                                         ` Yuyang Du
  0 siblings, 0 replies; 97+ messages in thread
From: Yuyang Du @ 2015-09-24  0:22 UTC (permalink / raw)
  To: bsegall
  Cc: Peter Zijlstra, Morten Rasmussen, Vincent Guittot,
	Dietmar Eggemann, Steve Muckle, mingo, daniel.lezcano,
	mturquette, rjw, Juri Lelli, sgurrappadi, pang.xunlei,
	linux-kernel

On Wed, Sep 23, 2015 at 09:54:08AM -0700, bsegall@google.com wrote:
> > This second thought made a mistake (what was wrong with me). load_avg is for sure
> > no greater than load with or without blocked load.
> >
> > With that said, it really does not matter what the following numbers are, 32bit or
> > 64bit machine. What matters is that cfs_rq->load.weight is one that needs to worry
> > whether overflow or not, not the load_avg. It is as simple as that.
> >
> > With that, I think we can and should get rid of the scale_load_down()
> > for load_avg.
> 
> load_avg yes is bounded by load.weight, but on 64-bit load_sum is only
> bounded by load.weight * LOAD_AVG_MAX and is the same size as
> load.weight (as I said below). There's still space for anything
> reasonable though with 10 bits of SLR.
 
You are absolutely right.

> >> > If NICE_0_LOAD is nice-0's load, and if SCHED_LOAD_SHIFT is to say how to get 
> >> > nice-0's load, I don't understand why you want to separate them.
> >> 
> >> SCHED_LOAD_SHIFT is not how to get nice-0's load, it just happens to
> >> have the same value as NICE_0_SHIFT. (I think anyway, SCHED_LOAD_* is
> >> used in precisely one place other than the newish util_avg, and as I
> >> mentioned it's not remotely clear what calculate_imbalance is doing there)
> >
> > Yes, it is not clear to me either.
> >
> > With the above proposal to get rid of scale_load_down() for load_avg, I think
> > we can now remove SCHED_LOAD_*, rename scale_load() to user_to_kernel_load(),
> > and rename scale_load_down() to kernel_to_user_load().
> >
> > Hmm?
> 
> I have no opinion on renaming the scale_load functions, it's certainly
> reasonable, but the scale_load names seem fine too.

Without scale_load_down() in load_avg, it seems they are only used when
reading/writing load between user and kernel. I will ponder more, but
lets see whether others have opinion.

^ permalink raw reply	[flat|nested] 97+ messages in thread

* Re: [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig
  2015-09-22 17:18                                   ` bsegall
  2015-09-22 23:22                                     ` Yuyang Du
@ 2015-09-30 12:52                                     ` Peter Zijlstra
  1 sibling, 0 replies; 97+ messages in thread
From: Peter Zijlstra @ 2015-09-30 12:52 UTC (permalink / raw)
  To: bsegall
  Cc: Yuyang Du, Morten Rasmussen, Vincent Guittot, Dietmar Eggemann,
	Steve Muckle, mingo, daniel.lezcano, mturquette, rjw, Juri Lelli,
	sgurrappadi, pang.xunlei, linux-kernel

On Tue, Sep 22, 2015 at 10:18:30AM -0700, bsegall@google.com wrote:
> If SLR was increased as peterz asked
> about

Right, so I was under the impression that you (Google) run with it
increased and in mainline it's currently dead code.

So if its valuable to you guys we should fix in mainline.

^ permalink raw reply	[flat|nested] 97+ messages in thread

end of thread, other threads:[~2015-09-30 12:54 UTC | newest]

Thread overview: 97+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-08-14 16:23 [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking Morten Rasmussen
2015-08-14 16:23 ` [PATCH 1/6] sched/fair: Make load tracking frequency scale-invariant Morten Rasmussen
2015-09-13 11:03   ` [tip:sched/core] " tip-bot for Dietmar Eggemann
2015-08-14 16:23 ` [PATCH 2/6] sched/fair: Convert arch_scale_cpu_capacity() from weak function to #define Morten Rasmussen
2015-09-02  9:31   ` Vincent Guittot
2015-09-02 12:41     ` Vincent Guittot
2015-09-03 19:58     ` Dietmar Eggemann
2015-09-04  7:26       ` Vincent Guittot
2015-09-07 13:25         ` Dietmar Eggemann
2015-09-11 13:21         ` Dietmar Eggemann
2015-09-11 14:45           ` Vincent Guittot
2015-09-13 11:03   ` [tip:sched/core] " tip-bot for Morten Rasmussen
2015-08-14 16:23 ` [PATCH 3/6] sched/fair: Make utilization tracking cpu scale-invariant Morten Rasmussen
2015-08-14 23:04   ` Dietmar Eggemann
2015-09-04  7:52     ` Vincent Guittot
2015-09-13 11:04     ` [tip:sched/core] sched/fair: Make utilization tracking CPU scale-invariant tip-bot for Dietmar Eggemann
2015-08-14 16:23 ` [PATCH 4/6] sched/fair: Name utilization related data and functions consistently Morten Rasmussen
2015-09-04  9:08   ` Vincent Guittot
2015-09-11 16:35     ` Dietmar Eggemann
2015-09-13 11:04   ` [tip:sched/core] " tip-bot for Dietmar Eggemann
2015-08-14 16:23 ` [PATCH 5/6] sched/fair: Get rid of scaling utilization by capacity_orig Morten Rasmussen
2015-09-03 23:51   ` Steve Muckle
2015-09-07 15:37     ` Dietmar Eggemann
2015-09-07 16:21       ` Vincent Guittot
2015-09-07 18:54         ` Dietmar Eggemann
2015-09-07 19:47           ` Peter Zijlstra
2015-09-08 12:47             ` Dietmar Eggemann
2015-09-08  7:22           ` Vincent Guittot
2015-09-08 12:26             ` Peter Zijlstra
2015-09-08 12:52               ` Peter Zijlstra
2015-09-08 14:06                 ` Vincent Guittot
2015-09-08 14:35                   ` Morten Rasmussen
2015-09-08 14:40                     ` Vincent Guittot
2015-09-08 14:31                 ` Morten Rasmussen
2015-09-08 15:33                   ` Peter Zijlstra
2015-09-09 22:23                     ` bsegall
2015-09-10 11:06                       ` Morten Rasmussen
2015-09-10 11:11                         ` Vincent Guittot
2015-09-10 12:10                           ` Morten Rasmussen
2015-09-11  0:50                             ` Yuyang Du
2015-09-10 17:23                         ` bsegall
2015-09-08 16:53                   ` Morten Rasmussen
2015-09-09  9:43                     ` Peter Zijlstra
2015-09-09  9:45                       ` Peter Zijlstra
2015-09-09 11:13                       ` Morten Rasmussen
2015-09-11 17:22                         ` Morten Rasmussen
2015-09-17  9:51                           ` Peter Zijlstra
2015-09-17 10:38                           ` Peter Zijlstra
2015-09-21  1:16                             ` Yuyang Du
2015-09-21 17:30                               ` bsegall
2015-09-21 23:39                                 ` Yuyang Du
2015-09-22 17:18                                   ` bsegall
2015-09-22 23:22                                     ` Yuyang Du
2015-09-23 16:54                                       ` bsegall
2015-09-24  0:22                                         ` Yuyang Du
2015-09-30 12:52                                     ` Peter Zijlstra
2015-09-11  7:46                     ` Leo Yan
2015-09-11 10:02                       ` Morten Rasmussen
2015-09-11 14:11                         ` Leo Yan
2015-09-09 19:07                 ` Yuyang Du
2015-09-10 10:06                   ` Peter Zijlstra
2015-09-08 13:39               ` Vincent Guittot
2015-09-08 14:10                 ` Peter Zijlstra
2015-09-08 15:17                   ` Vincent Guittot
2015-09-08 12:50             ` Dietmar Eggemann
2015-09-08 14:01               ` Vincent Guittot
2015-09-08 14:27                 ` Dietmar Eggemann
2015-09-09 20:15               ` Yuyang Du
2015-09-10 10:07                 ` Peter Zijlstra
2015-09-11  0:28                   ` Yuyang Du
2015-09-11 10:31                     ` Morten Rasmussen
2015-09-11 17:05                       ` bsegall
2015-09-11 18:24                         ` Yuyang Du
2015-09-14 17:36                           ` bsegall
2015-09-14 12:56                         ` Morten Rasmussen
2015-09-14 17:34                           ` bsegall
2015-09-14 22:56                             ` Yuyang Du
2015-09-15 17:11                               ` bsegall
2015-09-15 18:39                                 ` Yuyang Du
2015-09-16 17:06                                   ` bsegall
2015-09-17  2:31                                     ` Yuyang Du
2015-09-15  8:43                             ` Morten Rasmussen
2015-09-16 15:36                             ` Peter Zijlstra
2015-09-08 11:44           ` Peter Zijlstra
2015-09-13 11:04       ` [tip:sched/core] " tip-bot for Dietmar Eggemann
2015-08-14 16:23 ` [PATCH 6/6] sched/fair: Initialize task load and utilization before placing task on rq Morten Rasmussen
2015-09-13 11:05   ` [tip:sched/core] " tip-bot for Morten Rasmussen
2015-08-16 20:46 ` [PATCH 0/6] sched/fair: Compute capacity invariant load/utilization tracking Peter Zijlstra
2015-08-17 11:29   ` Morten Rasmussen
2015-08-17 11:48     ` Peter Zijlstra
2015-08-31  9:24 ` Peter Zijlstra
2015-09-02  9:51   ` Dietmar Eggemann
2015-09-07 12:42   ` Peter Zijlstra
2015-09-07 13:21     ` Peter Zijlstra
2015-09-07 13:23     ` Peter Zijlstra
2015-09-07 14:44     ` Dietmar Eggemann
2015-09-13 11:06       ` [tip:sched/core] sched/fair: Defer calling scaling functions tip-bot for Dietmar Eggemann
