* [PATCH 0/7] sched: Scale-invariant per-entity load-tracking
@ 2014-09-22 16:24 Morten Rasmussen
  2014-09-22 16:24 ` [PATCH 1/7] sched: Introduce scale-invariant load tracking Morten Rasmussen
                   ` (6 more replies)
  0 siblings, 7 replies; 40+ messages in thread
From: Morten Rasmussen @ 2014-09-22 16:24 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, pjt, bsegall, vincent.guittot, nicolas.pitre,
	mturquette, rjw, linux-kernel, Morten Rasmussen

Per-entity load-tracking is currently unaware of frequency scaling and of
micro-architectural differences between cpus (e.g. big.LITTLE). To enable
better estimates of cpu utilization, load-tracking needs to be
scale-invariant: it must report the same load for a task regardless of
which cpu the task runs on and of the cpu's current frequency.

This patch-set introduces an architecture provided scale-invariance correction
factor for per-entity load-tracking which is applied when task load-tracking is
updated. The factor is in the range 0-1024, where 1024 is the scaling factor
when a task runs at the highest frequency on the fastest cpu in the system.
This means that a task can only reach 'full' load contribution when running
on the fastest cpu at the highest frequency. Without scale-invariance, a
task having a load contribution of 50% on a cpu running at 50% frequency
would change its load contribution to 25% if the frequency increased to
100%. With scale-invariance, the load contribution is 25% regardless of the
frequency. Hence, if we have two 25% tasks on different cpus, we know that
their combined load can be accommodated on any cpu running at a frequency
of at least 50% (assuming the micro-architecture is the same).
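
In code terms the correction is simple: the time delta accrued by a task is
scaled by the correction factor before it is added to the tracked sum.
A simplified sketch of the core operation from patch 1/7 below:

	u32 scale_cap = arch_scale_load_capacity(cpu);	/* 0..1024 */

	if (runnable)
		sa->runnable_avg_sum += (delta * scale_cap)
				>> SCHED_CAPACITY_SHIFT;
	sa->runnable_avg_period += delta;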

Scaling the per-entity load-tracking is only part of providing
scale-invariance. The cpu capacity used for load-balancing decisions needs to
be scaled as well to allow comparison between scale-invariant load and compute
capacity. This is not addressed yet. Vincent's recent patches [1] provide
micro-architectural scaling of compute capacity, but frequency scaling is still
missing and will have to be aligned with these patches.

This patch-set also includes a scale-invariant utilization metric based on
Paul Turner's original usage tracking, which Vincent's patches [1] also
reintroduce. Hence, this patch-set will need to be rebased on top of those.

Morten

[1] https://lkml.org/lkml/2014/8/26/288

Dietmar Eggemann (1):
  sched: Introduce scale-invariant load tracking

Morten Rasmussen (6):
  cpufreq: Architecture specific callback for frequency changes
  arm: Frequency invariant scheduler load-tracking support
  arm: Micro-architecture invariant load tracking support
  sched: Implement usage tracking
  sched: Make sched entity usage tracking scale-invariant
  sched: Track sched_entity usage contributions

 arch/arm/kernel/topology.c |   44 +++++++++++++++++++++++
 drivers/cpufreq/cpufreq.c  |   10 +++++-
 include/linux/sched.h      |    3 +-
 kernel/sched/debug.c       |    5 +++
 kernel/sched/fair.c        |   86 +++++++++++++++++++++++++++++++++++---------
 kernel/sched/sched.h       |    2 +-
 6 files changed, 131 insertions(+), 19 deletions(-)

-- 
1.7.9.5




* [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-09-22 16:24 [PATCH 0/7] sched: Scale-invariant per-entity load-tracking Morten Rasmussen
@ 2014-09-22 16:24 ` Morten Rasmussen
  2014-09-25 13:48   ` Vincent Guittot
  2014-10-08  0:50   ` Yuyang Du
  2014-09-22 16:24 ` [PATCH 2/7] cpufreq: Architecture specific callback for frequency changes Morten Rasmussen
                   ` (5 subsequent siblings)
  6 siblings, 2 replies; 40+ messages in thread
From: Morten Rasmussen @ 2014-09-22 16:24 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, pjt, bsegall, vincent.guittot, nicolas.pitre,
	mturquette, rjw, linux-kernel, Morten Rasmussen

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

The per-entity load-tracking currently neither accounts for frequency
changes due to frequency scaling (cpufreq) nor for micro-architectural
differences between cpus (ARM big.LITTLE). Comparing tracked loads
between different cpus might therefore be quite misleading.

This patch introduces a scale-invariance scaling factor to the
load-tracking computation that can be used to compensate for compute
capacity variations. The scaling factor is to be provided by the
architecture through an arch specific function. It may be as simple as:

	current_freq(cpu) * SCHED_CAPACITY_SCALE / max_freq(cpu)

If the architecture has more sophisticated ways of tracking compute
capacity, it can do so in its implementation. By default, no scaling is
applied.
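
For illustration only, an architecture implementation of the simple formula
above could look roughly like the sketch below. The per-cpu frequency
helpers are hypothetical placeholders; patch 3/7 adds a real arm
implementation:

	unsigned long arch_scale_load_capacity(int cpu)
	{
		/* hypothetical per-cpu frequency bookkeeping */
		unsigned long curr = cpu_curr_freq(cpu);
		unsigned long max = cpu_max_freq(cpu);

		if (!max)
			return SCHED_CAPACITY_SCALE;

		return curr * SCHED_CAPACITY_SCALE / max;
	}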

The patch is loosely based on a patch by Chris Redpath
<Chris.Redpath@arm.com>.

cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c |   32 ++++++++++++++++++++++++++------
 1 file changed, 26 insertions(+), 6 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2a1e6ac..52abb3e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2267,6 +2267,8 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
+unsigned long arch_scale_load_capacity(int cpu);
+
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -2295,13 +2297,14 @@ static u32 __compute_runnable_contrib(u64 n)
  *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
  *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
-static __always_inline int __update_entity_runnable_avg(u64 now,
+static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 							struct sched_avg *sa,
 							int runnable)
 {
 	u64 delta, periods;
 	u32 runnable_contrib;
 	int delta_w, decayed = 0;
+	u32 scale_cap = arch_scale_load_capacity(cpu);
 
 	delta = now - sa->last_runnable_update;
 	/*
@@ -2334,8 +2337,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
+
 		if (runnable)
-			sa->runnable_avg_sum += delta_w;
+			sa->runnable_avg_sum += (delta_w * scale_cap)
+					>> SCHED_CAPACITY_SHIFT;
 		sa->runnable_avg_period += delta_w;
 
 		delta -= delta_w;
@@ -2351,14 +2356,17 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		runnable_contrib = __compute_runnable_contrib(periods);
+
 		if (runnable)
-			sa->runnable_avg_sum += runnable_contrib;
+			sa->runnable_avg_sum += (runnable_contrib * scale_cap)
+						>> SCHED_CAPACITY_SHIFT;
 		sa->runnable_avg_period += runnable_contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
 	if (runnable)
-		sa->runnable_avg_sum += delta;
+		sa->runnable_avg_sum += (delta * scale_cap)
+				>> SCHED_CAPACITY_SHIFT;
 	sa->runnable_avg_period += delta;
 
 	return decayed;
@@ -2464,7 +2472,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+	__update_entity_runnable_avg(rq_clock_task(rq), rq->cpu, &rq->avg,
+					runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2518,6 +2527,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta;
+	int cpu = rq_of(cfs_rq)->cpu;
 	u64 now;
 
 	/*
@@ -2529,7 +2539,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
+	if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq))
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
@@ -5719,6 +5729,16 @@ unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 	return default_scale_cpu_capacity(sd, cpu);
 }
 
+static unsigned long default_scale_load_capacity(int cpu)
+{
+	return SCHED_CAPACITY_SCALE;
+}
+
+unsigned long __weak arch_scale_load_capacity(int cpu)
+{
+	return default_scale_load_capacity(cpu);
+}
+
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-- 
1.7.9.5




* [PATCH 2/7] cpufreq: Architecture specific callback for frequency changes
  2014-09-22 16:24 [PATCH 0/7] sched: Scale-invariant per-entity load-tracking Morten Rasmussen
  2014-09-22 16:24 ` [PATCH 1/7] sched: Introduce scale-invariant load tracking Morten Rasmussen
@ 2014-09-22 16:24 ` Morten Rasmussen
  2014-10-08  6:07   ` Mike Turquette
  2014-09-22 16:24 ` [PATCH 3/7] arm: Frequency invariant scheduler load-tracking support Morten Rasmussen
                   ` (4 subsequent siblings)
  6 siblings, 1 reply; 40+ messages in thread
From: Morten Rasmussen @ 2014-09-22 16:24 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, pjt, bsegall, vincent.guittot, nicolas.pitre,
	mturquette, rjw, linux-kernel, Morten Rasmussen

Architectures that don't have any other means for tracking cpu frequency
changes need a callback from cpufreq to implement a scaling factor to
enable scale-invariant per-entity load-tracking in the scheduler.

To compute the scale-invariance correction factor, the architecture needs
to know both the max frequency and the current frequency. This patch
defines weak functions through which cpufreq can set both.

Related architecture-specific functions already use weak function
definitions; the same approach is followed here.

These callbacks can be used to implement frequency scaling of cpu
capacity later.
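
On the architecture side these callbacks would typically just store the
frequencies in per-cpu variables and derive the scaling factor from them
later. A sketch along the lines of the arm implementation in patch 3/7:

	static DEFINE_PER_CPU(atomic_long_t, cpu_curr_freq);

	/* cpufreq callback function setting current cpu frequency */
	void arch_scale_set_curr_freq(int cpu, unsigned long freq)
	{
		atomic_long_set(&per_cpu(cpu_curr_freq, cpu), freq);
	}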

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 drivers/cpufreq/cpufreq.c |   10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index d9fdedd..e911f6b 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -278,6 +278,10 @@ static inline void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
 }
 #endif
 
+void __weak arch_scale_set_curr_freq(int cpu, unsigned long freq) {}
+
+void __weak arch_scale_set_max_freq(int cpu, unsigned long freq) {}
+
 static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
 		struct cpufreq_freqs *freqs, unsigned int state)
 {
@@ -315,6 +319,7 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
 		pr_debug("FREQ: %lu - CPU: %lu\n",
 			 (unsigned long)freqs->new, (unsigned long)freqs->cpu);
 		trace_cpu_frequency(freqs->new, freqs->cpu);
+		arch_scale_set_curr_freq(freqs->cpu, freqs->new);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
@@ -2135,7 +2140,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
 				struct cpufreq_policy *new_policy)
 {
 	struct cpufreq_governor *old_gov;
-	int ret;
+	int ret, cpu;
 
 	pr_debug("setting new policy for CPU %u: %u - %u kHz\n",
 		 new_policy->cpu, new_policy->min, new_policy->max);
@@ -2173,6 +2178,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
 	policy->min = new_policy->min;
 	policy->max = new_policy->max;
 
+	for_each_cpu(cpu, policy->cpus)
+		arch_scale_set_max_freq(cpu, policy->max);
+
 	pr_debug("new min and max freqs are %u - %u kHz\n",
 		 policy->min, policy->max);
 
-- 
1.7.9.5




* [PATCH 3/7] arm: Frequency invariant scheduler load-tracking support
  2014-09-22 16:24 [PATCH 0/7] sched: Scale-invariant per-entity load-tracking Morten Rasmussen
  2014-09-22 16:24 ` [PATCH 1/7] sched: Introduce scale-invariant load tracking Morten Rasmussen
  2014-09-22 16:24 ` [PATCH 2/7] cpufreq: Architecture specific callback for frequency changes Morten Rasmussen
@ 2014-09-22 16:24 ` Morten Rasmussen
  2014-09-22 16:24 ` [PATCH 4/7] arm: Micro-architecture invariant load tracking support Morten Rasmussen
                   ` (3 subsequent siblings)
  6 siblings, 0 replies; 40+ messages in thread
From: Morten Rasmussen @ 2014-09-22 16:24 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, pjt, bsegall, vincent.guittot, nicolas.pitre,
	mturquette, rjw, linux-kernel, Morten Rasmussen

Implements the arch-specific function that provides the scheduler with a
frequency-scaling correction factor for more accurate load-tracking. The
factor is:

	current_freq(cpu) * SCHED_CAPACITY_SCALE / max_freq(cpu)

This implementation only provides frequency invariance. No
micro-architecture invariance yet.
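
As an example, a cpu currently running at 600 MHz whose max frequency is
1200 MHz would get a correction factor of 600 * 1024 / 1200 = 512, i.e.
load would accrue at half the rate it does at max frequency.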

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/arm/kernel/topology.c |   33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 89cfdd6..3e9a979 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -169,6 +169,39 @@ static void update_cpu_capacity(unsigned int cpu)
 		cpu, arch_scale_cpu_capacity(NULL, cpu));
 }
 
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling.
+ */
+
+static DEFINE_PER_CPU(atomic_long_t, cpu_curr_freq);
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+
+/* cpufreq callback function setting current cpu frequency */
+void arch_scale_set_curr_freq(int cpu, unsigned long freq)
+{
+	atomic_long_set(&per_cpu(cpu_curr_freq, cpu), freq);
+}
+
+/* cpufreq callback function setting max cpu frequency */
+void arch_scale_set_max_freq(int cpu, unsigned long freq)
+{
+	atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
+}
+
+unsigned long arch_scale_load_capacity(int cpu)
+{
+	unsigned long curr = atomic_long_read(&per_cpu(cpu_curr_freq, cpu));
+	unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
+
+	if (!max)
+		return SCHED_CAPACITY_SCALE;
+
+	return (curr * SCHED_CAPACITY_SCALE) / max;
+}
+
 #else
 static inline void parse_dt_topology(void) {}
 static inline void update_cpu_capacity(unsigned int cpuid) {}
-- 
1.7.9.5




* [PATCH 4/7] arm: Micro-architecture invariant load tracking support
  2014-09-22 16:24 [PATCH 0/7] sched: Scale-invariant per-entity load-tracking Morten Rasmussen
                   ` (2 preceding siblings ...)
  2014-09-22 16:24 ` [PATCH 3/7] arm: Frequency invariant scheduler load-tracking support Morten Rasmussen
@ 2014-09-22 16:24 ` Morten Rasmussen
  2014-09-22 16:24 ` [PATCH 5/7] sched: Implement usage tracking Morten Rasmussen
                   ` (2 subsequent siblings)
  6 siblings, 0 replies; 40+ messages in thread
From: Morten Rasmussen @ 2014-09-22 16:24 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, pjt, bsegall, vincent.guittot, nicolas.pitre,
	mturquette, rjw, linux-kernel, Morten Rasmussen

Adds micro-architectural differences to the scheduler load-tracking
scale-invariance correction factor for big.LITTLE systems. The factor is
now:
	(current_freq(cpu) * uarch_factor(cpu) * SCHED_CAPACITY_SCALE) /
		(max_freq(cpu) * max_uarch_factor)
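
For example (illustrative numbers only), a LITTLE cpu with a uarch_factor
half of max_uarch_factor, running at half of its max frequency, would get
a factor of 1024 * 1/2 * 1/2 = 256, i.e. a quarter of the capacity of the
fastest cpu at its highest frequency.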

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/arm/kernel/topology.c |   17 ++++++++++++++---
 1 file changed, 14 insertions(+), 3 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 3e9a979..9f499e2 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -79,6 +79,9 @@ static unsigned long *__cpu_capacity;
 
 static unsigned long middle_capacity = 1;
 
+static unsigned long max_raw_capacity = 1;
+static DEFINE_PER_CPU(unsigned long, cpu_raw_capacity);
+
 /*
  * Iterate all CPUs' descriptor in DT and compute the efficiency
  * (as per table_efficiency). Also calculate a middle efficiency
@@ -117,6 +120,9 @@ static void __init parse_dt_topology(void)
 		if (cpu_eff->compatible == NULL)
 			continue;
 
+		per_cpu(cpu_raw_capacity, cpu) = cpu_eff->efficiency;
+		max_raw_capacity = max(max_raw_capacity, cpu_eff->efficiency);
+
 		rate = of_get_property(cn, "clock-frequency", &len);
 		if (!rate || len != 4) {
 			pr_err("%s missing clock-frequency property\n",
@@ -173,7 +179,8 @@ static void update_cpu_capacity(unsigned int cpu)
  * Scheduler load-tracking scale-invariance
  *
  * Provides the scheduler with a scale-invariance correction factor that
- * compensates for frequency scaling.
+ * compensates for frequency scaling and micro-architecture differences between
+ * cpus.
  */
 
 static DEFINE_PER_CPU(atomic_long_t, cpu_curr_freq);
@@ -195,11 +202,15 @@ unsigned long arch_scale_load_capacity(int cpu)
 {
 	unsigned long curr = atomic_long_read(&per_cpu(cpu_curr_freq, cpu));
 	unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
+	unsigned long ret;
 
-	if (!max)
+	if (!max || !per_cpu(cpu_raw_capacity, cpu))
 		return SCHED_CAPACITY_SCALE;
 
-	return (curr * SCHED_CAPACITY_SCALE) / max;
+	ret = (curr * SCHED_CAPACITY_SCALE) / max;
+	ret = (ret * per_cpu(cpu_raw_capacity, cpu)) / max_raw_capacity;
+
+	return ret;
 }
 
 #else
-- 
1.7.9.5




* [PATCH 5/7] sched: Implement usage tracking
  2014-09-22 16:24 [PATCH 0/7] sched: Scale-invariant per-entity load-tracking Morten Rasmussen
                   ` (3 preceding siblings ...)
  2014-09-22 16:24 ` [PATCH 4/7] arm: Micro-architecture invariant load tracking support Morten Rasmussen
@ 2014-09-22 16:24 ` Morten Rasmussen
  2014-09-22 16:24 ` [PATCH 6/7] sched: Make sched entity usage tracking scale-invariant Morten Rasmussen
  2014-09-22 16:24 ` [PATCH 7/7] sched: Track sched_entity usage contributions Morten Rasmussen
  6 siblings, 0 replies; 40+ messages in thread
From: Morten Rasmussen @ 2014-09-22 16:24 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, pjt, bsegall, vincent.guittot, nicolas.pitre,
	mturquette, rjw, linux-kernel, Morten Rasmussen

With the framework for runnable tracking now fully in place, per-entity
usage tracking is a simple and low-overhead addition.
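
The difference between the two metrics is when time is accrued. Roughly
(simplified from the patch below, with the scale-invariance correction
from earlier patches omitted for clarity):

	if (runnable)	/* running or waiting on the runqueue */
		sa->runnable_avg_sum += delta;
	if (running)	/* actually executing on the cpu */
		sa->usage_avg_sum += delta;
	sa->runnable_avg_period += delta;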

This is a rebased and significantly cut down version of a patch
originally authored by Paul Turner <pjt@google.com>.

cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h |    1 +
 kernel/sched/debug.c  |    1 +
 kernel/sched/fair.c   |   16 +++++++++++++---
 3 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 18f5262..0bcd8a7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1080,6 +1080,7 @@ struct sched_avg {
 	u64 last_runnable_update;
 	s64 decay_count;
 	unsigned long load_avg_contrib;
+	u32 usage_avg_sum;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index c7fe1ea0..ed5a9ce 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -95,6 +95,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 #ifdef CONFIG_SMP
 	P(se->avg.runnable_avg_sum);
 	P(se->avg.runnable_avg_period);
+	P(se->avg.usage_avg_sum);
 	P(se->avg.load_avg_contrib);
 	P(se->avg.decay_count);
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52abb3e..d8a8c83 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2299,7 +2299,8 @@ unsigned long arch_scale_load_capacity(int cpu);
  */
 static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 							struct sched_avg *sa,
-							int runnable)
+							int runnable,
+							int running)
 {
 	u64 delta, periods;
 	u32 runnable_contrib;
@@ -2341,6 +2342,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 		if (runnable)
 			sa->runnable_avg_sum += (delta_w * scale_cap)
 					>> SCHED_CAPACITY_SHIFT;
+		if (running)
+			sa->usage_avg_sum += delta_w;
 		sa->runnable_avg_period += delta_w;
 
 		delta -= delta_w;
@@ -2353,6 +2356,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 						  periods + 1);
 		sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
 						     periods + 1);
+		sa->usage_avg_sum = decay_load(sa->usage_avg_sum, periods + 1);
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		runnable_contrib = __compute_runnable_contrib(periods);
@@ -2360,6 +2364,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 		if (runnable)
 			sa->runnable_avg_sum += (runnable_contrib * scale_cap)
 						>> SCHED_CAPACITY_SHIFT;
+		if (running)
+			sa->usage_avg_sum += runnable_contrib;
 		sa->runnable_avg_period += runnable_contrib;
 	}
 
@@ -2367,6 +2373,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 	if (runnable)
 		sa->runnable_avg_sum += (delta * scale_cap)
 				>> SCHED_CAPACITY_SHIFT;
+	if (running)
+		sa->usage_avg_sum += delta;
 	sa->runnable_avg_period += delta;
 
 	return decayed;
@@ -2473,7 +2481,7 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
 	__update_entity_runnable_avg(rq_clock_task(rq), rq->cpu, &rq->avg,
-					runnable);
+					runnable, runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2539,7 +2547,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq))
+	if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
+					  cfs_rq->curr == se))
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
@@ -2980,6 +2989,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
+		update_entity_load_avg(se, 1);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
-- 
1.7.9.5




* [PATCH 6/7] sched: Make sched entity usage tracking scale-invariant
  2014-09-22 16:24 [PATCH 0/7] sched: Scale-invariant per-entity load-tracking Morten Rasmussen
                   ` (4 preceding siblings ...)
  2014-09-22 16:24 ` [PATCH 5/7] sched: Implement usage tracking Morten Rasmussen
@ 2014-09-22 16:24 ` Morten Rasmussen
  2014-09-22 17:13   ` bsegall
  2014-09-22 16:24 ` [PATCH 7/7] sched: Track sched_entity usage contributions Morten Rasmussen
  6 siblings, 1 reply; 40+ messages in thread
From: Morten Rasmussen @ 2014-09-22 16:24 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, pjt, bsegall, vincent.guittot, nicolas.pitre,
	mturquette, rjw, linux-kernel, Morten Rasmussen

Apply scale-invariance correction factor to usage tracking as well.

cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c |   28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d8a8c83..c7aa8c1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2302,9 +2302,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 							int runnable,
 							int running)
 {
-	u64 delta, periods;
-	u32 runnable_contrib;
-	int delta_w, decayed = 0;
+	u64 delta, scaled_delta, periods;
+	u32 runnable_contrib, scaled_runnable_contrib;
+	int delta_w, scaled_delta_w, decayed = 0;
 	u32 scale_cap = arch_scale_load_capacity(cpu);
 
 	delta = now - sa->last_runnable_update;
@@ -2339,11 +2339,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 		 */
 		delta_w = 1024 - delta_w;
 
+		scaled_delta_w = (delta_w * scale_cap) >> SCHED_CAPACITY_SHIFT;
+
 		if (runnable)
-			sa->runnable_avg_sum += (delta_w * scale_cap)
-					>> SCHED_CAPACITY_SHIFT;
+			sa->runnable_avg_sum += scaled_delta_w;
 		if (running)
-			sa->usage_avg_sum += delta_w;
+			sa->usage_avg_sum += scaled_delta_w;
 		sa->runnable_avg_period += delta_w;
 
 		delta -= delta_w;
@@ -2361,20 +2362,23 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		runnable_contrib = __compute_runnable_contrib(periods);
 
-		if (runnable)
-			sa->runnable_avg_sum += (runnable_contrib * scale_cap)
+		scaled_runnable_contrib = (runnable_contrib * scale_cap)
 						>> SCHED_CAPACITY_SHIFT;
+
+		if (runnable)
+			sa->runnable_avg_sum += scaled_runnable_contrib;
 		if (running)
-			sa->usage_avg_sum += runnable_contrib;
+			sa->usage_avg_sum += scaled_runnable_contrib;
 		sa->runnable_avg_period += runnable_contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
+	scaled_delta = (delta * scale_cap) >> SCHED_CAPACITY_SHIFT;
+
 	if (runnable)
-		sa->runnable_avg_sum += (delta * scale_cap)
-				>> SCHED_CAPACITY_SHIFT;
+		sa->runnable_avg_sum += scaled_delta;
 	if (running)
-		sa->usage_avg_sum += delta;
+		sa->usage_avg_sum += scaled_delta;
 	sa->runnable_avg_period += delta;
 
 	return decayed;
-- 
1.7.9.5




* [PATCH 7/7] sched: Track sched_entity usage contributions
  2014-09-22 16:24 [PATCH 0/7] sched: Scale-invariant per-entity load-tracking Morten Rasmussen
                   ` (5 preceding siblings ...)
  2014-09-22 16:24 ` [PATCH 6/7] sched: Make sched entity usage tracking scale-invariant Morten Rasmussen
@ 2014-09-22 16:24 ` Morten Rasmussen
  2014-09-22 17:09   ` bsegall
  6 siblings, 1 reply; 40+ messages in thread
From: Morten Rasmussen @ 2014-09-22 16:24 UTC (permalink / raw)
  To: peterz, mingo
  Cc: dietmar.eggemann, pjt, bsegall, vincent.guittot, nicolas.pitre,
	mturquette, rjw, linux-kernel, Morten Rasmussen

Adds usage contribution tracking for both task and group entities.
Maintains a non-priority-scaled se->avg.usage_avg_contrib for each
sched_entity and a cfs_rq.usage_util_avg sum of all entity contributions.
The latter provides a more accurate estimate of the true cpu utilization
than the existing cfs_rq.runnable_load_avg (+blocked_load_avg).

Unlike se->avg.load_avg_contrib, se->avg.usage_avg_contrib for group
entities is the sum of se->avg.usage_avg_contrib for all entities on the
group runqueue. It is _not_ influenced in any way by the task group
h_load. Hence it represents the actual cpu usage of the group, not its
intended load contribution, which may differ significantly from the
usage on lightly utilized systems.

The cpu usage tracking is available as cpu_rq(cpu)->cfs.usage_util_avg.
No tracking of blocked usage has been implemented.
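
For a task entity the usage contribution is computed like the existing
load contribution, but with a fixed NICE_0_LOAD weight instead of the task
priority. Roughly, from __update_task_entity_contrib() in the patch below:

	/* existing: priority-weighted load contribution */
	contrib = sa->runnable_avg_sum * scale_load_down(se->load.weight);
	contrib /= sa->runnable_avg_period + 1;

	/* new: usage contribution, no priority weighting */
	usage_contrib = sa->usage_avg_sum * scale_load_down(NICE_0_LOAD);
	usage_contrib /= sa->runnable_avg_period + 1;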

cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h |    2 +-
 kernel/sched/debug.c  |    4 ++++
 kernel/sched/fair.c   |   32 ++++++++++++++++++++++++++------
 kernel/sched/sched.h  |    2 +-
 4 files changed, 32 insertions(+), 8 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 0bcd8a7..509d5ce 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1079,7 +1079,7 @@ struct sched_avg {
 	u32 runnable_avg_sum, runnable_avg_period;
 	u64 last_runnable_update;
 	s64 decay_count;
-	unsigned long load_avg_contrib;
+	unsigned long load_avg_contrib, usage_avg_contrib;
 	u32 usage_avg_sum;
 };
 
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index ed5a9ce..a655427 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -97,6 +97,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->avg.runnable_avg_period);
 	P(se->avg.usage_avg_sum);
 	P(se->avg.load_avg_contrib);
+	P(se->avg.usage_avg_contrib);
 	P(se->avg.decay_count);
 #endif
 #undef PN
@@ -214,6 +215,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 #ifdef CONFIG_SMP
 	SEQ_printf(m, "  .%-30s: %ld\n", "runnable_load_avg",
 			cfs_rq->runnable_load_avg);
+	SEQ_printf(m, "  .%-30s: %ld\n", "usage_util_avg",
+			cfs_rq->usage_util_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "blocked_load_avg",
 			cfs_rq->blocked_load_avg);
 #ifdef CONFIG_FAIR_GROUP_SCHED
@@ -634,6 +637,7 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.avg.runnable_avg_sum);
 	P(se.avg.runnable_avg_period);
 	P(se.avg.load_avg_contrib);
+	P(se.avg.usage_avg_contrib);
 	P(se.avg.decay_count);
 #endif
 	P(policy);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index c7aa8c1..c374825 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -678,6 +678,7 @@ void init_task_runnable_average(struct task_struct *p)
 	p->se.avg.decay_count = 0;
 	slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
 	p->se.avg.runnable_avg_sum = slice;
+	p->se.avg.usage_avg_sum = slice;
 	p->se.avg.runnable_avg_period = slice;
 	__update_task_entity_contrib(&p->se);
 }
@@ -2395,6 +2396,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 		return 0;
 
 	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+	se->avg.usage_avg_contrib = decay_load(se->avg.usage_avg_contrib,
+						decays);
 	se->avg.decay_count = 0;
 
 	return decays;
@@ -2480,6 +2483,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 		se->avg.load_avg_contrib *= runnable_avg;
 		se->avg.load_avg_contrib >>= NICE_0_SHIFT;
 	}
+
+	se->avg.usage_avg_contrib = cfs_rq->usage_util_avg;
 }
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
@@ -2499,18 +2504,24 @@ static inline void update_rq_runnable_avg(struct rq *rq, int runnable) {}
 
 static inline void __update_task_entity_contrib(struct sched_entity *se)
 {
-	u32 contrib;
+	u32 contrib, usage_contrib;
 
 	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
 	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
 	contrib /= (se->avg.runnable_avg_period + 1);
 	se->avg.load_avg_contrib = scale_load(contrib);
+
+	usage_contrib = se->avg.usage_avg_sum * scale_load_down(NICE_0_LOAD);
+	usage_contrib /= (se->avg.runnable_avg_period + 1);
+	se->avg.usage_avg_contrib = scale_load(usage_contrib);
 }
 
 /* Compute the current contribution to load_avg by se, return any delta */
-static long __update_entity_load_avg_contrib(struct sched_entity *se)
+static long __update_entity_load_avg_contrib(struct sched_entity *se,
+						long *usage_contrib_delta)
 {
 	long old_contrib = se->avg.load_avg_contrib;
+	long old_usage_contrib = se->avg.usage_avg_contrib;
 
 	if (entity_is_task(se)) {
 		__update_task_entity_contrib(se);
@@ -2519,6 +2530,10 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
 		__update_group_entity_contrib(se);
 	}
 
+	if (usage_contrib_delta)
+		*usage_contrib_delta = se->avg.usage_avg_contrib -
+					old_usage_contrib;
+
 	return se->avg.load_avg_contrib - old_contrib;
 }
 
@@ -2538,7 +2553,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	long contrib_delta;
+	long contrib_delta, usage_delta;
 	int cpu = rq_of(cfs_rq)->cpu;
 	u64 now;
 
@@ -2555,14 +2570,15 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 					  cfs_rq->curr == se))
 		return;
 
-	contrib_delta = __update_entity_load_avg_contrib(se);
+	contrib_delta = __update_entity_load_avg_contrib(se, &usage_delta);
 
 	if (!update_cfs_rq)
 		return;
 
-	if (se->on_rq)
+	if (se->on_rq) {
 		cfs_rq->runnable_load_avg += contrib_delta;
-	else
+		cfs_rq->usage_util_avg += usage_delta;
+	} else
 		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
 }
 
@@ -2638,6 +2654,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	}
 
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+	cfs_rq->usage_util_avg += se->avg.usage_avg_contrib;
+
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
 }
@@ -2656,6 +2674,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 	update_cfs_rq_blocked_load(cfs_rq, !sleep);
 
 	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+	cfs_rq->usage_util_avg -= se->avg.usage_avg_contrib;
+
 	if (sleep) {
 		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 1bc6aad..527ae12 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -340,7 +340,7 @@ struct cfs_rq {
 	 * This allows for the description of both thread and group usage (in
 	 * the FAIR_GROUP_SCHED case).
 	 */
-	unsigned long runnable_load_avg, blocked_load_avg;
+	unsigned long runnable_load_avg, blocked_load_avg, usage_util_avg;
 	atomic64_t decay_counter;
 	u64 last_decay;
 	atomic_long_t removed_load;
-- 
1.7.9.5




* Re: [PATCH 7/7] sched: Track sched_entity usage contributions
  2014-09-22 16:24 ` [PATCH 7/7] sched: Track sched_entity usage contributions Morten Rasmussen
@ 2014-09-22 17:09   ` bsegall
  2014-09-23 13:59     ` Morten Rasmussen
  0 siblings, 1 reply; 40+ messages in thread
From: bsegall @ 2014-09-22 17:09 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, dietmar.eggemann, pjt, vincent.guittot,
	nicolas.pitre, mturquette, rjw, linux-kernel

Morten Rasmussen <morten.rasmussen@arm.com> writes:

> Adds usage contribution tracking for both task and group entities.
> Maintains a non-priority-scaled se->avg.usage_avg_contrib for each
> sched_entity and a cfs_rq.usage_util_avg sum of all entity contributions.
> The latter provides a more accurate estimate of the true cpu utilization
> than the existing cfs_rq.runnable_load_avg (+blocked_load_avg).
>
> Unlike se->avg.load_avg_contrib, se->avg.usage_avg_contrib for group
> entities is the sum of se->avg.usage_avg_contrib for all entities on the
> group runqueue. It is _not_ influenced in any way by the task group
> h_load. Hence it represents the actual cpu usage of the group, not its
> intended load contribution, which may differ significantly from the
> usage on lightly utilized systems.
>
> The cpu usage tracking is available as cpu_rq(cpu)->cfs.usage_util_avg.
> No tracking of blocked usage has been implemented.

Isn't cfs_rq->usage_util_avg basically just
se->avg.usage_avg_sum * 1024 / se->avg.runnable_avg_period, where
se->group_cfs_rq == cfs_rq? (and for the rq as a whole, rq->avg)

The fact that usage_util_avg doesn't track blocked usage seems more
likely to be a problem than an advantage, but maybe not?


* Re: [PATCH 6/7] sched: Make sched entity usage tracking scale-invariant
  2014-09-22 16:24 ` [PATCH 6/7] sched: Make sched entity usage tracking scale-invariant Morten Rasmussen
@ 2014-09-22 17:13   ` bsegall
  2014-09-23 13:35     ` Morten Rasmussen
  0 siblings, 1 reply; 40+ messages in thread
From: bsegall @ 2014-09-22 17:13 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, dietmar.eggemann, pjt, vincent.guittot,
	nicolas.pitre, mturquette, rjw, linux-kernel

Morten Rasmussen <morten.rasmussen@arm.com> writes:

> Apply scale-invariance correction factor to usage tracking as well.

It seems like it would make more sense to order the patches as first the
usage tracking and then all of the scale-invariance together, or perhaps
to just fold this into the usage tracking patch.
>
> cc: Paul Turner <pjt@google.com>
> cc: Ben Segall <bsegall@google.com>
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c |   28 ++++++++++++++++------------
>  1 file changed, 16 insertions(+), 12 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index d8a8c83..c7aa8c1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2302,9 +2302,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
>  							int runnable,
>  							int running)
>  {
> -	u64 delta, periods;
> -	u32 runnable_contrib;
> -	int delta_w, decayed = 0;
> +	u64 delta, scaled_delta, periods;
> +	u32 runnable_contrib, scaled_runnable_contrib;
> +	int delta_w, scaled_delta_w, decayed = 0;
>  	u32 scale_cap = arch_scale_load_capacity(cpu);
>  
>  	delta = now - sa->last_runnable_update;
> @@ -2339,11 +2339,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
>  		 */
>  		delta_w = 1024 - delta_w;
>  
> +		scaled_delta_w = (delta_w * scale_cap) >> SCHED_CAPACITY_SHIFT;
> +
>  		if (runnable)
> -			sa->runnable_avg_sum += (delta_w * scale_cap)
> -					>> SCHED_CAPACITY_SHIFT;
> +			sa->runnable_avg_sum += scaled_delta_w;
>  		if (running)
> -			sa->usage_avg_sum += delta_w;
> +			sa->usage_avg_sum += scaled_delta_w;
>  		sa->runnable_avg_period += delta_w;
>  
>  		delta -= delta_w;
> @@ -2361,20 +2362,23 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
>  		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
>  		runnable_contrib = __compute_runnable_contrib(periods);
>  
> -		if (runnable)
> -			sa->runnable_avg_sum += (runnable_contrib * scale_cap)
> +		scaled_runnable_contrib = (runnable_contrib * scale_cap)
>  						>> SCHED_CAPACITY_SHIFT;
> +
> +		if (runnable)
> +			sa->runnable_avg_sum += scaled_runnable_contrib;
>  		if (running)
> -			sa->usage_avg_sum += runnable_contrib;
> +			sa->usage_avg_sum += scaled_runnable_contrib;
>  		sa->runnable_avg_period += runnable_contrib;
>  	}
>  
>  	/* Remainder of delta accrued against u_0` */
> +	scaled_delta = (delta * scale_cap) >> SCHED_CAPACITY_SHIFT;
> +
>  	if (runnable)
> -		sa->runnable_avg_sum += (delta * scale_cap)
> -				>> SCHED_CAPACITY_SHIFT;
> +		sa->runnable_avg_sum += scaled_delta;
>  	if (running)
> -		sa->usage_avg_sum += delta;
> +		sa->usage_avg_sum += scaled_delta;
>  	sa->runnable_avg_period += delta;
>  
>  	return decayed;


* Re: [PATCH 6/7] sched: Make sched entity usage tracking scale-invariant
  2014-09-22 17:13   ` bsegall
@ 2014-09-23 13:35     ` Morten Rasmussen
  2014-10-02 21:04       ` Peter Zijlstra
  0 siblings, 1 reply; 40+ messages in thread
From: Morten Rasmussen @ 2014-09-23 13:35 UTC (permalink / raw)
  To: bsegall
  Cc: peterz, mingo, Dietmar Eggemann, pjt, vincent.guittot,
	nicolas.pitre, mturquette, rjw, linux-kernel

On Mon, Sep 22, 2014 at 06:13:46PM +0100, bsegall@google.com wrote:
> Morten Rasmussen <morten.rasmussen@arm.com> writes:
> 
> > Apply scale-invariance correction factor to usage tracking as well.
> 
> It seems like it would make more sense to order the patches as first the
> usage tracking and then all of the scale-invariance together, or perhaps
> to just fold this into the usage tracking patch.

Makes sense. I don't mind reordering the patches. Vincent has already
got some of the usage bits in his patch set, so I will have to rework
the usage patches anyway if Peter decides to take the rest of Vincent's
patch set.

Morten


* Re: [PATCH 7/7] sched: Track sched_entity usage contributions
  2014-09-22 17:09   ` bsegall
@ 2014-09-23 13:59     ` Morten Rasmussen
  0 siblings, 0 replies; 40+ messages in thread
From: Morten Rasmussen @ 2014-09-23 13:59 UTC (permalink / raw)
  To: bsegall
  Cc: peterz, mingo, Dietmar Eggemann, pjt, vincent.guittot,
	nicolas.pitre, mturquette, rjw, linux-kernel

On Mon, Sep 22, 2014 at 06:09:42PM +0100, bsegall@google.com wrote:
> Morten Rasmussen <morten.rasmussen@arm.com> writes:
> 
> > Adds usage contribution tracking for both task and group entities.
> > Maintains a non-priority-scaled se->avg.usage_avg_contrib for each
> > sched_entity and a cfs_rq.usage_util_avg sum of all entity contributions.
> > The latter provides a more accurate estimate of the true cpu utilization
> > than the existing cfs_rq.runnable_load_avg (+blocked_load_avg).
> >
> > Unlike se->avg.load_avg_contrib, se->avg.usage_avg_contrib for group
> > entities is the sum of se->avg.usage_avg_contrib for all entities on the
> > group runqueue. It is _not_ influenced in any way by the task group
> > h_load. Hence it represents the actual cpu usage of the group, not its
> > intended load contribution, which may differ significantly from the
> > usage on lightly utilized systems.
> >
> > The cpu usage tracking is available as cpu_rq(cpu)->cfs.usage_util_avg.
> > No tracking of blocked usage has been implemented.
> 
> Isn't cfs_rq->usage_util_avg basically just
> se->avg.usage_avg_sum * 1024 / se->avg.runnable_avg_period, where
> se->group_cfs_rq == cfs_rq? (and for the rq as a whole, rq->avg)

Almost, but not quite :)

cfs_rq->usage_util_avg is updated when a sched_entity is
enqueued/dequeued by adding/subtracting se->avg.usage_avg_contrib,
similar to cfs_rq->runnable_load_avg and se->avg.load_avg_contrib. So it
is an instantaneous usage approximation. se->avg.usage_avg_sum * 1024 /
se->avg.runnable_avg_period for the group entity (or rq->avg) has to ramp
up/decay, so the approximation lags a bit behind when tasks are
migrated. On a stable system they should be the same.
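
(For reference, the geometric series used by the load tracking has
y^32 = 0.5, i.e. roughly a 32 ms half-life, which gives an idea of how
far the decayed sums can lag behind right after a migration.)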

> The fact that usage_util_avg doesn't track blocked usage seems more
> likely to be a problem than an advantage, but maybe not?

Yes. I think it was agreed at Ksummit that taking blocked load (and
usage) into account is the right thing to do as long as
{runnable,running}+blocked is used correctly in load-balancing
decisions.

I will look into adding blocked usage for the next version.

Thanks,
Morten


* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-09-22 16:24 ` [PATCH 1/7] sched: Introduce scale-invariant load tracking Morten Rasmussen
@ 2014-09-25 13:48   ` Vincent Guittot
  2014-09-25 17:23     ` Morten Rasmussen
  2014-10-08  0:50   ` Yuyang Du
  1 sibling, 1 reply; 40+ messages in thread
From: Vincent Guittot @ 2014-09-25 13:48 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Nicolas Pitre, Mike Turquette, rjw,
	linux-kernel

On 22 September 2014 18:24, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>
> The per-entity load-tracking currently neither accounts for frequency
> changes due to frequency scaling (cpufreq) nor for micro-architectural
> differences between cpus (ARM big.LITTLE). Comparing tracked loads
> between different cpus might therefore be quite misleading.
>
> This patch introduces a scale-invariance scaling factor to the
> load-tracking computation that can be used to compensate for compute
> capacity variations. The scaling factor is to be provided by the
> architecture through an arch specific function. It may be as simple as:
>
>         current_freq(cpu) * SCHED_CAPACITY_SCALE / max_freq(cpu)
>
> If the architecture has more sophisticated ways of tracking compute
> capacity, it can do so in its implementation. By default, no scaling is
> applied.
>
> The patch is loosely based on a patch by Chris Redpath
> <Chris.Redpath@arm.com>.
>
> cc: Paul Turner <pjt@google.com>
> cc: Ben Segall <bsegall@google.com>
>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c |   32 ++++++++++++++++++++++++++------
>  1 file changed, 26 insertions(+), 6 deletions(-)
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 2a1e6ac..52abb3e 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2267,6 +2267,8 @@ static u32 __compute_runnable_contrib(u64 n)
>         return contrib + runnable_avg_yN_sum[n];
>  }
>
> +unsigned long arch_scale_load_capacity(int cpu);

Why haven't you used arch_scale_freq_capacity, which has a similar
purpose in scaling the CPU capacity, except for the additional
sched_domain pointer argument?

> +
>  /*
>   * We can represent the historical contribution to runnable average as the
>   * coefficients of a geometric series.  To do this we sub-divide our runnable
> @@ -2295,13 +2297,14 @@ static u32 __compute_runnable_contrib(u64 n)
>   *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
>   *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
>   */
> -static __always_inline int __update_entity_runnable_avg(u64 now,
> +static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
>                                                         struct sched_avg *sa,
>                                                         int runnable)
>  {
>         u64 delta, periods;
>         u32 runnable_contrib;
>         int delta_w, decayed = 0;
> +       u32 scale_cap = arch_scale_load_capacity(cpu);
>
>         delta = now - sa->last_runnable_update;
>         /*
> @@ -2334,8 +2337,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>                  * period and accrue it.
>                  */
>                 delta_w = 1024 - delta_w;
> +
>                 if (runnable)
> -                       sa->runnable_avg_sum += delta_w;
> +                       sa->runnable_avg_sum += (delta_w * scale_cap)
> +                                       >> SCHED_CAPACITY_SHIFT;
>                 sa->runnable_avg_period += delta_w;
>
>                 delta -= delta_w;
> @@ -2351,14 +2356,17 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
>
>                 /* Efficiently calculate \sum (1..n_period) 1024*y^i */
>                 runnable_contrib = __compute_runnable_contrib(periods);
> +
>                 if (runnable)
> -                       sa->runnable_avg_sum += runnable_contrib;
> +                       sa->runnable_avg_sum += (runnable_contrib * scale_cap)
> +                                               >> SCHED_CAPACITY_SHIFT;
>                 sa->runnable_avg_period += runnable_contrib;
>         }
>
>         /* Remainder of delta accrued against u_0` */
>         if (runnable)
> -               sa->runnable_avg_sum += delta;
> +               sa->runnable_avg_sum += (delta * scale_cap)
> +                               >> SCHED_CAPACITY_SHIFT;

If we take the example of an always running task, its runnable_avg_sum
should stay at the LOAD_AVG_MAX value whatever the frequency of the
CPU on which it runs. But your change links the max value of
runnable_avg_sum with the current frequency of the CPU, so an always
running task will have a load contribution of 25%.
Your proposed scaling is fine with usage_avg_sum, which reflects the
effective running time on the CPU, but runnable_avg_sum should be
able to reach LOAD_AVG_MAX whatever the current frequency is.

Regards,
Vincent
>         sa->runnable_avg_period += delta;
>
>         return decayed;
> @@ -2464,7 +2472,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
>
>  static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
>  {
> -       __update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
> +       __update_entity_runnable_avg(rq_clock_task(rq), rq->cpu, &rq->avg,
> +                                       runnable);
>         __update_tg_runnable_avg(&rq->avg, &rq->cfs);
>  }
>  #else /* CONFIG_FAIR_GROUP_SCHED */
> @@ -2518,6 +2527,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
>  {
>         struct cfs_rq *cfs_rq = cfs_rq_of(se);
>         long contrib_delta;
> +       int cpu = rq_of(cfs_rq)->cpu;
>         u64 now;
>
>         /*
> @@ -2529,7 +2539,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
>         else
>                 now = cfs_rq_clock_task(group_cfs_rq(se));
>
> -       if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
> +       if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq))
>                 return;
>
>         contrib_delta = __update_entity_load_avg_contrib(se);
> @@ -5719,6 +5729,16 @@ unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
>         return default_scale_cpu_capacity(sd, cpu);
>  }
>
> +static unsigned long default_scale_load_capacity(int cpu)
> +{
> +       return SCHED_CAPACITY_SCALE;
> +}
> +
> +unsigned long __weak arch_scale_load_capacity(int cpu)
> +{
> +       return default_scale_load_capacity(cpu);
> +}
> +
>  static unsigned long scale_rt_capacity(int cpu)
>  {
>         struct rq *rq = cpu_rq(cpu);
> --
> 1.7.9.5
>
>


* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-09-25 13:48   ` Vincent Guittot
@ 2014-09-25 17:23     ` Morten Rasmussen
  2014-09-26  7:36       ` Vincent Guittot
  2014-10-02 20:34       ` Peter Zijlstra
  0 siblings, 2 replies; 40+ messages in thread
From: Morten Rasmussen @ 2014-09-25 17:23 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Nicolas Pitre, Mike Turquette, rjw,
	linux-kernel

On Thu, Sep 25, 2014 at 02:48:47PM +0100, Vincent Guittot wrote:
> On 22 September 2014 18:24, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >
> > The per-entity load-tracking currently neither accounts for frequency
> > changes due to frequency scaling (cpufreq) nor for micro-architectural
> > differences between cpus (ARM big.LITTLE). Comparing tracked loads
> > between different cpus might therefore be quite misleading.
> >
> > This patch introduces a scale-invariance scaling factor to the
> > load-tracking computation that can be used to compensate for compute
> > capacity variations. The scaling factor is to be provided by the
> > architecture through an arch specific function. It may be as simple as:
> >
> >         current_freq(cpu) * SCHED_CAPACITY_SCALE / max_freq(cpu)
> >
> > If the architecture has more sophisticated ways of tracking compute
> > capacity, it can do so in its implementation. By default, no scaling is
> > applied.
> >
> > The patch is loosely based on a patch by Chris Redpath
> > <Chris.Redpath@arm.com>.
> >
> > cc: Paul Turner <pjt@google.com>
> > cc: Ben Segall <bsegall@google.com>
> >
> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c |   32 ++++++++++++++++++++++++++------
> >  1 file changed, 26 insertions(+), 6 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 2a1e6ac..52abb3e 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2267,6 +2267,8 @@ static u32 __compute_runnable_contrib(u64 n)
> >         return contrib + runnable_avg_yN_sum[n];
> >  }
> >
> > +unsigned long arch_scale_load_capacity(int cpu);
> 
> Why haven't you used arch_scale_freq_capacity, which has a similar
> purpose in scaling the CPU capacity, except for the additional
> sched_domain pointer argument?

To be honest, I'm not happy with introducing another arch-function
either and I'm happy to change that. It wasn't really clear to me which
functions would remain after your cpu_capacity rework patches, so I
added this one. Now that we have most of the patches for capacity
scaling and scale-invariant load-tracking on the table I think we have a
better chance of figuring out which ones are needed and exactly how they
are supposed to work.

arch_scale_load_capacity() compensates for both frequency scaling and
micro-architectural differences, while arch_scale_freq_capacity()
compensates only for frequency. As long as we can use
arch_scale_cpu_capacity() to
provide the micro-architecture scaling we can just do the scaling in two
operations rather than one similar to how it is done for capacity in
update_cpu_capacity(). I can fix that in the next version. It will cost
an extra function call and multiplication though.

To make sure that runnable_avg_{sum, period} are still bounded by
LOAD_AVG_MAX, arch_scale_{cpu,freq}_capacity() must both return a factor
in the range 0..SCHED_CAPACITY_SCALE.
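
The combined factor would then be computed in two operations, roughly
like this sketch (assuming a NULL sched_domain pointer is acceptable, as
in the arm topology code):

	unsigned long scale_cap = (arch_scale_freq_capacity(NULL, cpu) *
				   arch_scale_cpu_capacity(NULL, cpu))
						>> SCHED_CAPACITY_SHIFT;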

> 
> > +
> >  /*
> >   * We can represent the historical contribution to runnable average as the
> >   * coefficients of a geometric series.  To do this we sub-divide our runnable
> > @@ -2295,13 +2297,14 @@ static u32 __compute_runnable_contrib(u64 n)
> >   *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
> >   *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
> >   */
> > -static __always_inline int __update_entity_runnable_avg(u64 now,
> > +static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
> >                                                         struct sched_avg *sa,
> >                                                         int runnable)
> >  {
> >         u64 delta, periods;
> >         u32 runnable_contrib;
> >         int delta_w, decayed = 0;
> > +       u32 scale_cap = arch_scale_load_capacity(cpu);
> >
> >         delta = now - sa->last_runnable_update;
> >         /*
> > @@ -2334,8 +2337,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
> >                  * period and accrue it.
> >                  */
> >                 delta_w = 1024 - delta_w;
> > +
> >                 if (runnable)
> > -                       sa->runnable_avg_sum += delta_w;
> > +                       sa->runnable_avg_sum += (delta_w * scale_cap)
> > +                                       >> SCHED_CAPACITY_SHIFT;
> >                 sa->runnable_avg_period += delta_w;
> >
> >                 delta -= delta_w;
> > @@ -2351,14 +2356,17 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
> >
> >                 /* Efficiently calculate \sum (1..n_period) 1024*y^i */
> >                 runnable_contrib = __compute_runnable_contrib(periods);
> > +
> >                 if (runnable)
> > -                       sa->runnable_avg_sum += runnable_contrib;
> > +                       sa->runnable_avg_sum += (runnable_contrib * scale_cap)
> > +                                               >> SCHED_CAPACITY_SHIFT;
> >                 sa->runnable_avg_period += runnable_contrib;
> >         }
> >
> >         /* Remainder of delta accrued against u_0` */
> >         if (runnable)
> > -               sa->runnable_avg_sum += delta;
> > +               sa->runnable_avg_sum += (delta * scale_cap)
> > +                               >> SCHED_CAPACITY_SHIFT;
> 
> If we take the example of an always running task, its runnable_avg_sum
> should stay at the LOAD_AVG_MAX value whatever the frequency of the
> CPU on which it runs. But your change links the max value of
> runnable_avg_sum with the current frequency of the CPU, so an always
> running task will have a load contribution of 25%.
> Your proposed scaling is fine with usage_avg_sum, which reflects the
> effective running time on the CPU, but runnable_avg_sum should be
> able to reach LOAD_AVG_MAX whatever the current frequency is.

I don't think it makes sense to scale one metric and not the other. You
will end up with two very different (potentially opposite) views of the
cpu load/utilization situation in many scenarios. As I see it,
scale-invariance and load-balancing with scale-invariance present can be
done in two ways:

1. Leave runnable_avg_sum unscaled and scale running_avg_sum.
se->avg.load_avg_contrib will remain unscaled and so will
cfs_rq->runnable_load_avg, cfs_rq->blocked_load_avg, and
weighted_cpuload(). Essentially all the existing load-balancing code
will continue to use unscaled load. When we want to improve cpu
utilization and energy-awareness we will have to bypass most of this
code, as it is likely to lead us in the wrong direction since it has a
potentially wrong view of the cpu load due to the lack of
scale-invariance.

2. Scale both runnable_avg_sum and running_avg_sum. All existing load
metrics including weighted_cpuload() are scaled and thus more accurate.
The difference between se->avg.load_avg_contrib and
se->avg.usage_avg_contrib is the priority scaling and whether or not
runqueue waiting time is counted. se->avg.load_avg_contrib can only
reach se->load.weight when running on the fastest cpu at the highest
frequency, but it is now scale-invariant, so we have a much better idea
about how much load we are pulling when load-balancing two cpus running
at different frequencies. The load-balance code-path still has to be
audited to see if anything blows up due to the scaling. I haven't
finished doing that yet. This patch set doesn't include patches to
address such issues (yet). IMHO, by scaling runnable_avg_sum we can more
easily make the existing load-balancing code do the right thing.

For both options we have to go through the existing load-balancing code
to either change it to use the scale-invariant metric (running_avg_sum)
when appropriate or to fix bits that don't work properly with a
scale-invariant runnable_avg_sum and reuse the existing code. I think
the latter is less intrusive, but I might be wrong.

Opinions?

Morten

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-09-25 17:23     ` Morten Rasmussen
@ 2014-09-26  7:36       ` Vincent Guittot
  2014-09-26  9:38         ` Morten Rasmussen
  2014-10-02 20:34       ` Peter Zijlstra
  1 sibling, 1 reply; 40+ messages in thread
From: Vincent Guittot @ 2014-09-26  7:36 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Nicolas Pitre, Mike Turquette, rjw,
	linux-kernel

On 25 September 2014 19:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:

[snip]

>> >         /* Remainder of delta accrued against u_0` */
>> >         if (runnable)
>> > -               sa->runnable_avg_sum += delta;
>> > +               sa->runnable_avg_sum += (delta * scale_cap)
>> > +                               >> SCHED_CAPACITY_SHIFT;
>>
>> If we take the example of an always running task, its runnable_avg_sum
>> should stay at the LOAD_AVG_MAX value whatever the frequency of the
>> CPU on which it runs. But your change links the max value of
>> runnable_avg_sum with the current frequency of the CPU, so an always
>> running task will have a load contribution of 25%.
>> Your proposed scaling is fine with usage_avg_sum, which reflects the
>> effective running time on the CPU, but the runnable_avg_sum should be
>> able to reach LOAD_AVG_MAX whatever the current frequency is
>
> I don't think it makes sense to scale one metric and not the other. You
> will end up with two very different (potentially opposite) views of the

You have missed my point. I fully agree that scale invariance is a
good enhancement, but IIUC your patchset doesn't solve the whole
problem.

Let me try to explain with examples:
- A task with a load of 10% on a CPU at max frequency will keep a load
of 10% if the frequency of the CPU is divided by 2, which is fine
- But an always running task with a load of 100% on a CPU at max
frequency will have a load of 50% if the frequency of the CPU is
divided by 2, which is not what we want; the load of such a task should
stay at 100%
- If we have 2 identical always running tasks on CPUs with different
frequencies, their loads will be different

So your patchset adds scale invariance for small tasks but adds some
scale variance for heavy tasks.
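
To put rough numbers on the heavy-task case (a sketch only, using the
stock constants LOAD_AVG_MAX ~= 47742 and SCHED_CAPACITY_SCALE = 1024):

  always running at max frequency:   runnable_avg_sum -> LOAD_AVG_MAX
                                     (load contribution 100%)
  always running at half frequency:  runnable_avg_sum -> (LOAD_AVG_MAX
                                     * 512) >> SCHED_CAPACITY_SHIFT,
                                     i.e. a load contribution of 50%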

Regards,
Vincent


> cpu load/utilization situation in many scenarios. As I see it,
> scale-invariance and load-balancing with scale-invariance present can be
> done in two ways:
>
> 1. Leave runnable_avg_sum unscaled and scale running_avg_sum.
> se->avg.load_avg_contrib will remain unscaled and so will
> cfs_rq->runnable_load_avg, cfs_rq->blocked_load_avg, and
> weighted_cpuload(). Essentially all the existing load-balancing code
> will continue to use unscaled load. When we want to improve cpu
> utilization and energy-awareness we will have to bypass most of this
> code as it is likely to lead us in the wrong direction since it has a
> potentially wrong view of the cpu load due to the lack of
> scale-invariance.
>
> 2. Scale both runnable_avg_sum and running_avg_sum. All existing load
> metrics including weighted_cpuload() are scaled and thus more accurate.
> The difference between se->avg.load_avg_contrib and
> se->avg.usage_avg_contrib is the priority scaling and whether or not
> runqueue waiting time is counted. se->avg.load_avg_contrib can only
> reach se->load.weight when running on the fastest cpu at the highest
> frequency, but it is now scale-invariant so we have a much better idea
> about how much load we are pulling when load-balancing two cpus running
> at different frequencies. The load-balance code-path still has to be
> audited to see if anything blows up due to the scaling. I haven't
> finished doing that yet. This patch set doesn't include patches to
> address such issues (yet). IMHO, by scaling runnable_avg_sum we can more
> easily make the existing load-balancing code do the right thing.
>
> For both options we have to go through the existing load-balancing code
> to either change it to use the scale-invariant metric (running_avg_sum)
> when appropriate or to fix bits that don't work properly with a
> scale-invariant runnable_avg_sum and reuse the existing code. I think
> the latter is less intrusive, but I might be wrong.
>
> Opinions?
>
> Morten

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-09-26  7:36       ` Vincent Guittot
@ 2014-09-26  9:38         ` Morten Rasmussen
  0 siblings, 0 replies; 40+ messages in thread
From: Morten Rasmussen @ 2014-09-26  9:38 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Nicolas Pitre, Mike Turquette, rjw,
	linux-kernel

On Fri, Sep 26, 2014 at 08:36:53AM +0100, Vincent Guittot wrote:
> On 25 September 2014 19:23, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> 
> [snip]
> 
> >> >         /* Remainder of delta accrued against u_0` */
> >> >         if (runnable)
> >> > -               sa->runnable_avg_sum += delta;
> >> > +               sa->runnable_avg_sum += (delta * scale_cap)
> >> > +                               >> SCHED_CAPACITY_SHIFT;
> >>
> >> If we take the example of an always running task, its runnable_avg_sum
> >> should stay at the LOAD_AVG_MAX value whatever the frequency of the
> >> CPU on which it runs. But your change links the max value of
> >> runnable_avg_sum with the current frequency of the CPU, so an always
> >> running task will have a load contribution of 25%.
> >> Your proposed scaling is fine with usage_avg_sum, which reflects the
> >> effective running time on the CPU, but the runnable_avg_sum should be
> >> able to reach LOAD_AVG_MAX whatever the current frequency is
> >
> > I don't think it makes sense to scale one metric and not the other. You
> > will end up with two very different (potentially opposite) views of the
> 
> You have missed my point. I fully agree that scale invariance is a
> good enhancement, but IIUC your patchset doesn't solve the whole
> problem.
> 
> Let me try to explain with examples:
> - A task with a load of 10% on a CPU at max frequency will keep a load
> of 10% if the frequency of the CPU is divided by 2, which is fine

Yes.

> - But an always running task with a load of 100% on a CPU at max
> frequency will have a load of 50% if the frequency of the CPU is
> divided by 2, which is not what we want; the load of such a task should
> stay at 100%

I think that is fine too and that is intentional. We can't say anything
about the load/utilization of an always running task no matter what cpu
and at what frequency it is running. As soon as the tracked
load/utilization indicates always running, we don't know how much
load/utilization it will cause on a faster cpu. However, if it is 99.9%
we are fine (well, we probably do want some bigger margin). As I see it,
always running tasks must be treated specially. We can easily figure out
which tasks are always running by comparing the scaled load divided by
se->load.weight to the current compute capacity of the cpu the task is
running on. If they are equal (or close), the task is always running. If
we migrate it to a different cpu we should take into account that its
load might increase if it gets more cycles to spend. You could even do
something like:

unsigned long migration_load(struct sched_entity *se)
{
	/* current_capacity() stands in for the cpu's current capacity
	 * factor (0..SCHED_CAPACITY_SCALE, hence the shift); a saturated
	 * task may want more cycles than it gets, so report full weight */
	if (se->avg.load_avg_contrib >=
	    (current_capacity(cpu_of(se)) * se->load.weight)
					>> SCHED_CAPACITY_SHIFT)
		return se->load.weight;
	return se->avg.load_avg_contrib;
}

for use when moving tasks between cpus when the source cpu is fully
loaded at its current capacity. The task's load is actually 100% relative
to the current compute capacity of its own cpu, just not compared to the
fastest cpu in the system.
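
As a sketch only (the hook and its placement are hypothetical and not
part of this patch set; lb_env is the existing load-balance context),
the helper could gate task selection in the pull path like this:

static int should_pull_task(struct lb_env *env, struct sched_entity *se)
{
	/* saturated tasks report their full weight, see migration_load() */
	return migration_load(se) <= env->imbalance;
}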

As I said in my previous reply, this isn't covered yet by this patch
set. It is of course necessary to go through the load-balancing
conditions to see where/if modifications are needed to do the right
thing for scale-invariant load.

> - If we have 2 identical always running tasks on CPUs with different
> frequencies, their loads will be different

Yes, in terms of absolute load, and it is only the case for always
running tasks. However, for both tasks the scaled load divided by
se->load.weight would equal the current capacity of the cpu each runs
on, so we can easily identify them.

> So your patchset adds scale invariance for small tasks but adds some
> scale variance for heavy tasks

For always running tasks, yes, but I don't see how we can avoid treating
them specially anyway as we don't know anything about their true load.
That doesn't change by changing how we scale their load.

Better suggestions are of course welcome :)

Morten

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-09-25 17:23     ` Morten Rasmussen
  2014-09-26  7:36       ` Vincent Guittot
@ 2014-10-02 20:34       ` Peter Zijlstra
  2014-10-08 11:00         ` Morten Rasmussen
  2014-10-08 11:38         ` Vincent Guittot
  1 sibling, 2 replies; 40+ messages in thread
From: Peter Zijlstra @ 2014-10-02 20:34 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Vincent Guittot, mingo, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Nicolas Pitre, Mike Turquette, rjw,
	linux-kernel

On Thu, Sep 25, 2014 at 06:23:43PM +0100, Morten Rasmussen wrote:

> > Why haven't you used arch_scale_freq_capacity which has a similar
> > purpose in scaling the CPU capacity except the additional sched_domain
> > pointer argument?
> 
> To be honest I'm not happy with introducing another arch-function
> either and I'm happy to change that. It wasn't really clear to me which
> functions that would remain after your cpu_capacity rework patches, so I
> added this one. Now that we have most of the patches for capacity
> scaling and scale-invariant load-tracking on the table I think we have a
> better chance of figuring out which ones are needed and exactly how they
> are supposed to work.
> 
> arch_scale_load_capacity() compensates for both frequency scaling and
> micro-architectural differences, while arch_scale_freq_capacity() only
> for frequency. As long as we can use arch_scale_cpu_capacity() to
> provide the micro-architecture scaling we can just do the scaling in two
> operations rather than one similar to how it is done for capacity in
> update_cpu_capacity(). I can fix that in the next version. It will cost
> an extra function call and multiplication though.
> 
> To make sure that runnable_avg_{sum, period} are still bounded by
> LOAD_AVG_MAX, arch_scale_{cpu,freq}_capacity() must both return a factor
> in the range 0..SCHED_CAPACITY_SCALE.

I would certainly like some words in the Changelog on how and that the
math is still free of overflows. Clearly you've thought about it, so
please feel free to elucidate the rest of us :-)

> > If we take the example of an always running task, its runnable_avg_sum
> > should stay at the LOAD_AVG_MAX value whatever the frequency of the
> > CPU on which it runs. But your change links the max value of
> > runnable_avg_sum with the current frequency of the CPU, so an always
> > running task will have a load contribution of 25%.
> > Your proposed scaling is fine with usage_avg_sum, which reflects the
> > effective running time on the CPU, but the runnable_avg_sum should be
> > able to reach LOAD_AVG_MAX whatever the current frequency is
> 
> I don't think it makes sense to scale one metric and not the other. You
> will end up with two very different (potentially opposite) views of the
> cpu load/utilization situation in many scenarios. As I see it,
> scale-invariance and load-balancing with scale-invariance present can be
> done in two ways:
> 
> 1. Leave runnable_avg_sum unscaled and scale running_avg_sum.
> se->avg.load_avg_contrib will remain unscaled and so will
> cfs_rq->runnable_load_avg, cfs_rq->blocked_load_avg, and
> weighted_cpuload(). Essentially all the existing load-balancing code
> will continue to use unscaled load. When we want to improve cpu
> utilization and energy-awareness we will have to bypass most of this
> code as it is likely to lead us in the wrong direction since it has a
> potentially wrong view of the cpu load due to the lack of
> scale-invariance.
> 
> 2. Scale both runnable_avg_sum and running_avg_sum. All existing load
> metrics including weighted_cpuload() are scaled and thus more accurate.
> The difference between se->avg.load_avg_contrib and
> se->avg.usage_avg_contrib is the priority scaling and whether or not
> runqueue waiting time is counted. se->avg.load_avg_contrib can only
> reach se->load.weight when running on the fastest cpu at the highest
> frequency, but it is now scale-invariant so we have a much better idea
> about how much load we are pulling when load-balancing two cpus running
> at different frequencies. The load-balance code-path still has to be
> audited to see if anything blows up due to the scaling. I haven't
> finished doing that yet. This patch set doesn't include patches to
> address such issues (yet). IMHO, by scaling runnable_avg_sum we can more
> easily make the existing load-balancing code do the right thing.
> 
> For both options we have to go through the existing load-balancing code
> to either change it to use the scale-invariant metric (running_avg_sum)
> when appropriate or to fix bits that don't work properly with a
> scale-invariant runnable_avg_sum and reuse the existing code. I think
> the latter is less intrusive, but I might be wrong.
> 
> Opinions?

/me votes #2, I think the example in the reply is a false one, an always
running task will/should ramp up the cpufreq and get us at full speed
(and yes I'm aware of the case where you're memory bound and raising the
cpu freq isn't going to actually improve performance, but I'm not sure
we want to get/be that smart, esp. at this stage).

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 6/7] sched: Make sched entity usage tracking scale-invariant
  2014-09-23 13:35     ` Morten Rasmussen
@ 2014-10-02 21:04       ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2014-10-02 21:04 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: bsegall, mingo, Dietmar Eggemann, pjt, vincent.guittot,
	nicolas.pitre, mturquette, rjw, linux-kernel

On Tue, Sep 23, 2014 at 02:35:03PM +0100, Morten Rasmussen wrote:
> On Mon, Sep 22, 2014 at 06:13:46PM +0100, bsegall@google.com wrote:
> > Morten Rasmussen <morten.rasmussen@arm.com> writes:
> > 
> > > Apply scale-invariance correction factor to usage tracking as well.
> > 
> > It seems like it would make more sense to order the patches as first the
> > usage tracking and then all of the scale-invariance together, or perhaps
> > to just fold this into the usage tracking patch.
> 
> Makes sense. I don't mind reordering the patches. Vincent has already
> got some of the usage bits in his patch set, so I will have to rework
> the usage patches anyway if Peter decides to take the rest of Vincent's
> patch set.

Yes, please reorder. I'll try and get back to wrapping brain around the
rest of Vincent's patches. Had to put that on hold to avoid getting
burried under incoming bits.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-09-22 16:24 ` [PATCH 1/7] sched: Introduce scale-invariant load tracking Morten Rasmussen
  2014-09-25 13:48   ` Vincent Guittot
@ 2014-10-08  0:50   ` Yuyang Du
  2014-10-08 12:54     ` Dietmar Eggemann
  2014-10-10  9:14     ` Peter Zijlstra
  1 sibling, 2 replies; 40+ messages in thread
From: Yuyang Du @ 2014-10-08  0:50 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, dietmar.eggemann, pjt, bsegall, vincent.guittot,
	nicolas.pitre, mturquette, rjw, linux-kernel

Hi Morten,

Sorry for jumping in late.

The problem seems to be self-evident. But for the implementation to be
equally attractive, it needs to account for every frequency change for
every task; anything less than that makes it less attractive.

That should be very hard, though. The Intel architecture has limitations
on capturing all the frequency changes in software, and the intel_pstate
driver also issues no notifications.

Accounting for every task makes updating the entire queue in load
tracking all the more necessary, so once again, ping maintainers for the
rewrite patches, :)

Thanks,
Yuyang

On Mon, Sep 22, 2014 at 05:24:01PM +0100, Morten Rasmussen wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 
> The per-entity load-tracking currently neither accounts for frequency
> changes due to frequency scaling (cpufreq) nor for micro-architectural
> differences between cpus (ARM big.LITTLE). Comparing tracked loads
> between different cpus might therefore be quite misleading.
> 
> This patch introduces a scale-invariance scaling factor to the
> load-tracking computation that can be used to compensate for compute
> capacity variations. The scaling factor is to be provided by the
> architecture through an arch specific function. It may be as simple as:
> 
> 	current_freq(cpu) * SCHED_CAPACITY_SCALE / max_freq(cpu)
> 
> If the architecture has more sophisticated ways of tracking compute
> capacity, it can do so in its implementation. By default, no scaling is
> applied.
> 
> The patch is loosely based on a patch by Chris Redpath
> <Chris.Redpath@arm.com>.
> 
> cc: Paul Turner <pjt@google.com>
> cc: Ben Segall <bsegall@google.com>
> 
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 2/7] cpufreq: Architecture specific callback for frequency changes
  2014-09-22 16:24 ` [PATCH 2/7] cpufreq: Architecture specific callback for frequency changes Morten Rasmussen
@ 2014-10-08  6:07   ` Mike Turquette
  2014-10-08  6:26     ` [PATCH RFC 0/2] introduce capacity_ops to CFS Mike Turquette
  0 siblings, 1 reply; 40+ messages in thread
From: Mike Turquette @ 2014-10-08  6:07 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: dietmar.eggemann, pjt, bsegall, vincent.guittot, nicolas.pitre,
	rjw, linux-kernel, Morten Rasmussen, tuukka.tikkanen

Quoting Morten Rasmussen (2014-09-22 09:24:02)
> Architectures that don't have any other means for tracking cpu frequency
> changes need a callback from cpufreq to implement a scaling factor to
> enable scale-invariant per-entity load-tracking in the scheduler.
> 
> To compute the scale invariance correction factor the architecture would
> need to know both the max frequency and the current frequency. This
> patch defines weak functions for setting both from cpufreq.
> 
> Related architecture specific functions use weak function definitions.
> The same approach is followed here.
> 
> These callbacks can be used to implement frequency scaling of cpu
> capacity later.
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  drivers/cpufreq/cpufreq.c |   10 +++++++++-
>  1 file changed, 9 insertions(+), 1 deletion(-)
> 
> diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
> index d9fdedd..e911f6b 100644
> --- a/drivers/cpufreq/cpufreq.c
> +++ b/drivers/cpufreq/cpufreq.c
> @@ -278,6 +278,10 @@ static inline void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
>  }
>  #endif
>  
> +void __weak arch_scale_set_curr_freq(int cpu, unsigned long freq) {}
> +
> +void __weak arch_scale_set_max_freq(int cpu, unsigned long freq) {}

Hi Morten,

This approach assumes a single implementation for an architecture, which
probably will not scale across the myriad platforms we have merged in
mainline today. For ARM there could be any number of methods for
determining a cpus capacity, including use of CPUfreq or some other
platform-specific method.

I am vaguely aware of Intel-based platforms that do not implement ACPI,
so for that architecture I assume that an arch hook that assumes ACPI is
similarly restrictive.

I'll reply to this thread with a pair of patches that try to generalize
the functions you created in patches #2-4 of this series. I'm currently
hacking on a Chromebook 2 using the arm_big_little CPUfreq driver, so
you'll notice that is where I decided to implement the ops. Of course
those could be implemented in arch code, or some other driver.

Regards,
Mike

> +
>  static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
>                 struct cpufreq_freqs *freqs, unsigned int state)
>  {
> @@ -315,6 +319,7 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
>                 pr_debug("FREQ: %lu - CPU: %lu\n",
>                          (unsigned long)freqs->new, (unsigned long)freqs->cpu);
>                 trace_cpu_frequency(freqs->new, freqs->cpu);
> +               arch_scale_set_curr_freq(freqs->cpu, freqs->new);
>                 srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
>                                 CPUFREQ_POSTCHANGE, freqs);
>                 if (likely(policy) && likely(policy->cpu == freqs->cpu))
> @@ -2135,7 +2140,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
>                                 struct cpufreq_policy *new_policy)
>  {
>         struct cpufreq_governor *old_gov;
> -       int ret;
> +       int ret, cpu;
>  
>         pr_debug("setting new policy for CPU %u: %u - %u kHz\n",
>                  new_policy->cpu, new_policy->min, new_policy->max);
> @@ -2173,6 +2178,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
>         policy->min = new_policy->min;
>         policy->max = new_policy->max;
>  
> +       for_each_cpu(cpu, policy->cpus)
> +               arch_scale_set_max_freq(cpu, policy->max);
> +
>         pr_debug("new min and max freqs are %u - %u kHz\n",
>                  policy->min, policy->max);
>  
> -- 
> 1.7.9.5
> 
> 

^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH RFC 0/2] introduce capacity_ops to CFS
  2014-10-08  6:07   ` Mike Turquette
@ 2014-10-08  6:26     ` Mike Turquette
  2014-10-08  6:26       ` [PATCH RFC 1/2] sched: cfs: introduce capacity_ops Mike Turquette
  2014-10-08  6:26       ` [PATCH RFC 2/2] cpufreq: arm_big_little: provide cpu capacity Mike Turquette
  0 siblings, 2 replies; 40+ messages in thread
From: Mike Turquette @ 2014-10-08  6:26 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: dietmar.eggemann, pjt, bsegall, vincent.guittot, nicolas.pitre,
	rjw, linux-kernel, tuukka.tikkanen, Mike Turquette

The fair scheduler needs a method to retrieve the capacity of a cpu,
which may be derived from several platform-specific factors including
micro-architectural differences (e.g. big.LITTLE cpus), cpus with
different transistor types or process node properties (e.g. Nvidia
Tegra30 LP cpu) and cpu frequency (for cpus that dynamically scale clock
speed).

This info is inherently machine-specific and the first patch in this
series implements a new callback, .get_capacity as part of struct
capacity_ops. The default simply returns SCHED_CAPACITY_SCALE.

The second patch populates that callback with CPUfreq-based method for
machines using the arm_big_little CPUfreq driver. This can likely be
abstracted out a bit more to be generally useful to more CPUfreq drivers
but I wanted to gather feedback on the approach before going any
further.

Mike Turquette (2):
  sched: cfs: introduce capacity_ops
  cpufreq: arm_big_little: provide cpu capacity

 arch/arm/include/asm/topology.h  |  2 ++
 arch/arm/kernel/topology.c       | 42 ++-------------------------------
 drivers/cpufreq/arm_big_little.c | 51 ++++++++++++++++++++++++++++++++++++++++
 include/linux/sched.h            | 28 ++++++++++++++++++++++
 kernel/sched/fair.c              | 41 +++++++++++++++++++++++++++-----
 5 files changed, 118 insertions(+), 46 deletions(-)

-- 
1.8.3.2


^ permalink raw reply	[flat|nested] 40+ messages in thread

* [PATCH RFC 1/2] sched: cfs: introduce capacity_ops
  2014-10-08  6:26     ` [PATCH RFC 0/2] introduce capacity_ops to CFS Mike Turquette
@ 2014-10-08  6:26       ` Mike Turquette
  2014-10-08  8:37         ` Peter Zijlstra
  2014-10-08  6:26       ` [PATCH RFC 2/2] cpufreq: arm_big_little: provide cpu capacity Mike Turquette
  1 sibling, 1 reply; 40+ messages in thread
From: Mike Turquette @ 2014-10-08  6:26 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: dietmar.eggemann, pjt, bsegall, vincent.guittot, nicolas.pitre,
	rjw, linux-kernel, tuukka.tikkanen, Mike Turquette

The scheduler needs to know the current capacity of a cpu taking into
account micro-architectural differences as well as current cpu
frequency. The method for determining this may vary not only from
architecture to architecture, but also across different platforms of the
same architecture.

struct capacity_ops allows for a machine-specific backend to provide
this data. Access to the ops is protected by rcu_read_lock(). This is to
prevent a loadable module that provides capacity_ops callbacks from
pulling the rug out from under us while the scheduler is still using the
function.

A weak arch function used to be responsible for this, but that approach
is too limiting. For example various ARM SoCs may have wildly different
methods for determining the current cpu capacity. capacity_ops allows
many methods to be compiled into the kernel and then selects one at
run-time.

The default ops require no knowledge of hardware and do nothing. This
patch only includes .get_capacity, but future ops for updating and
setting the capacity are in the works.

Signed-off-by: Mike Turquette <mturquette@linaro.org>
---
Note that struct capacity_ops should have other members in it in the
future. I have an additional patch that I plan to post soon which adds
.eval_capacity as a way for the scheduler to initiate a change in cpu
capacity (e.g. scale cpu frequency).

 include/linux/sched.h | 28 ++++++++++++++++++++++++++++
 kernel/sched/fair.c   | 41 +++++++++++++++++++++++++++++++++++------
 2 files changed, 63 insertions(+), 6 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index fa0b121..4b69000 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -846,6 +846,34 @@ enum cpu_idle_type {
 	CPU_MAX_IDLE_TYPES
 };
 
+/**
+ * capacity_ops -  helpers for understanding and changing scalable cpu capacity
+ * @get_capacity:  returns current capacity of cpu, accounting for
+ * 		   micro-architecture and frequency variability
+ *
+ * capacity_ops contains machine-specific callbacks for retrieving
+ * power-adjusted capacity and updating capacity on a set of cpus.
+ *
+ * The default ops do not interact with hardware.
+ *
+ * Using these ops requires holding rcu_read_lock() across the function call to
+ * protect against function pointers disappearing during use. This can happen
+ * if a loadable module provides the callbacks and is unloaded during execution
+ * of the callback.
+ *
+ * Updates to the ops (such as implementations based on a CPUfreq backend)
+ * require acquiring capacity_ops.lock during the change, followed by a call to
+ * synchronize_rcu().
+ */
+struct capacity_ops {
+	unsigned long (*get_capacity)(int cpu);
+	spinlock_t lock;
+};
+
+extern struct capacity_ops cfs_capacity_ops;
+
+unsigned long default_scale_load_capacity(int cpu);
+
 /*
  * Increase resolution of cpu_capacity calculations
  */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 025bf3c..8124c7b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2264,8 +2264,6 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
-unsigned long arch_scale_load_capacity(int cpu);
-
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -2302,7 +2300,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 	u64 delta, periods;
 	u32 runnable_contrib;
 	int delta_w, decayed = 0;
-	u32 scale_cap = arch_scale_load_capacity(cpu);
+	u32 scale_cap = cfs_get_capacity(cpu);
 
 	delta = now - sa->last_runnable_update;
 	/*
@@ -5795,14 +5793,44 @@ unsigned long __weak arch_scale_smt_capacity(struct sched_domain *sd, int cpu)
 	return default_scale_smt_capacity(sd, cpu);
 }
 
-static unsigned long default_scale_load_capacity(int cpu)
+unsigned long default_scale_load_capacity(int cpu)
 {
 	return SCHED_CAPACITY_SCALE;
 }
 
-unsigned long __weak arch_scale_load_capacity(int cpu)
+struct capacity_ops cfs_capacity_ops = {
+	.get_capacity = default_scale_load_capacity,
+};
+
+static unsigned long cfs_get_capacity(int cpu)
 {
-	return default_scale_load_capacity(cpu);
+	unsigned long ret;
+
+	rcu_read_lock();
+	ret = cfs_capacity_ops.get_capacity(cpu);
+	rcu_read_unlock();
+
+	return ret;
+}
+
+/**
+ * set_default_capacity_ops - reset capacity ops to their default
+ * @eops - capacity_ops we are resetting
+ *
+ * Useful for loadable modules that supply custom capacity_ops callbacks.  When
+ * unloading, these modules need to restore the originals before the custom
+ * callbacks disappear.
+ *
+ * FIXME - belongs in kernel/sched/core.c?
+ */
+void set_default_capacity_ops(struct capacity_ops *eops)
+{
+	unsigned long flags;
+
+	spin_lock_irqsave(&eops->lock, flags);
+	eops->get_capacity = default_scale_load_capacity;
+	spin_unlock_irqrestore(&eops->lock, flags);
+	synchronize_rcu();
 }
 
 static unsigned long scale_rt_capacity(int cpu)
@@ -8026,6 +8054,7 @@ void print_cfs_stats(struct seq_file *m, int cpu)
 __init void init_sched_fair_class(void)
 {
 #ifdef CONFIG_SMP
+	spin_lock_init(&cfs_capacity_ops.lock);
 	open_softirq(SCHED_SOFTIRQ, run_rebalance_domains);
 
 #ifdef CONFIG_NO_HZ_COMMON
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* [PATCH RFC 2/2] cpufreq: arm_big_little: provide cpu capacity
  2014-10-08  6:26     ` [PATCH RFC 0/2] introduce capacity_ops to CFS Mike Turquette
  2014-10-08  6:26       ` [PATCH RFC 1/2] sched: cfs: introduce capacity_ops Mike Turquette
@ 2014-10-08  6:26       ` Mike Turquette
  2014-10-08 15:48         ` Morten Rasmussen
  1 sibling, 1 reply; 40+ messages in thread
From: Mike Turquette @ 2014-10-08  6:26 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: dietmar.eggemann, pjt, bsegall, vincent.guittot, nicolas.pitre,
	rjw, linux-kernel, tuukka.tikkanen, Mike Turquette

Move the cpu capacity bits out of arch/arm/ and into the CPUfreq driver.
Not all ARM devices will use CPUfreq and it is unsafe to assume as such
in topology.c.

Instead, use the new capacity_ops introduced into CFS. If this code is
generic enough then it could be factored and shared via a header to make
it easier for other CPUfreq drivers to take advantage of it.

Signed-off-by: Mike Turquette <mturquette@linaro.org>
---
This approach simply builds on top of Morten's series. I am not sure
that the per-cpu method is the best way to go in the future, and if it
is, I imagine that the CPUfreq core could provide everything except for
the cpu_eff part.

In general I think that the overlap between CPUfreq drivers and
arch/arm/kernel/topology.c is something that needs to be addressed soon,
as both pieces of code are re-inventing parts of each other.

 arch/arm/include/asm/topology.h  |  2 ++
 arch/arm/kernel/topology.c       | 42 ++-------------------------------
 drivers/cpufreq/arm_big_little.c | 51 ++++++++++++++++++++++++++++++++++++++++
 3 files changed, 55 insertions(+), 40 deletions(-)

diff --git a/arch/arm/include/asm/topology.h b/arch/arm/include/asm/topology.h
index 2fe85ff..3951232 100644
--- a/arch/arm/include/asm/topology.h
+++ b/arch/arm/include/asm/topology.h
@@ -14,6 +14,8 @@ struct cputopo_arm {
 };
 
 extern struct cputopo_arm cpu_topology[NR_CPUS];
+extern unsigned long max_raw_capacity;
+DECLARE_PER_CPU(unsigned long, cpu_raw_capacity);
 
 #define topology_physical_package_id(cpu)	(cpu_topology[cpu].socket_id)
 #define topology_core_id(cpu)		(cpu_topology[cpu].core_id)
diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 5f049ec..a2c9b5f 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -79,8 +79,8 @@ static unsigned long *__cpu_capacity;
 
 static unsigned long middle_capacity = 1;
 
-static unsigned long max_raw_capacity = 1;
-static DEFINE_PER_CPU(unsigned long, cpu_raw_capacity);
+unsigned long max_raw_capacity = 1;
+DEFINE_PER_CPU(unsigned long, cpu_raw_capacity);
 
 /*
  * Iterate all CPUs' descriptor in DT and compute the efficiency
@@ -175,44 +175,6 @@ static void update_cpu_capacity(unsigned int cpu)
 		cpu, arch_scale_freq_capacity(NULL, cpu));
 }
 
-/*
- * Scheduler load-tracking scale-invariance
- *
- * Provides the scheduler with a scale-invariance correction factor that
- * compensates for frequency scaling and micro-architecture differences between
- * cpus.
- */
-
-static DEFINE_PER_CPU(atomic_long_t, cpu_curr_freq);
-static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
-
-/* cpufreq callback function setting current cpu frequency */
-void arch_scale_set_curr_freq(int cpu, unsigned long freq)
-{
-	atomic_long_set(&per_cpu(cpu_curr_freq, cpu), freq);
-}
-
-/* cpufreq callback function setting max cpu frequency */
-void arch_scale_set_max_freq(int cpu, unsigned long freq)
-{
-	atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
-}
-
-unsigned long arch_scale_load_capacity(int cpu)
-{
-	unsigned long curr = atomic_long_read(&per_cpu(cpu_curr_freq, cpu));
-	unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
-	unsigned long ret;
-
-	if (!max || !per_cpu(cpu_raw_capacity, cpu))
-		return SCHED_CAPACITY_SCALE;
-
-	ret = (curr * SCHED_CAPACITY_SCALE) / max;
-	ret = (ret * per_cpu(cpu_raw_capacity, cpu)) / max_raw_capacity;
-
-	return ret;
-}
-
 #else
 static inline void parse_dt_topology(void) {}
 static inline void update_cpu_capacity(unsigned int cpuid) {}
diff --git a/drivers/cpufreq/arm_big_little.c b/drivers/cpufreq/arm_big_little.c
index a46c223..5baffbd 100644
--- a/drivers/cpufreq/arm_big_little.c
+++ b/drivers/cpufreq/arm_big_little.c
@@ -31,7 +31,10 @@
 #include <linux/slab.h>
 #include <linux/topology.h>
 #include <linux/types.h>
+#include <linux/sched.h>
+#include <linux/rcupdate.h>
 #include <asm/bL_switcher.h>
+#include <asm/topology.h>
 
 #include "arm_big_little.h"
 
@@ -533,9 +536,52 @@ static struct notifier_block bL_switcher_notifier = {
 	.notifier_call = bL_cpufreq_switcher_notifier,
 };
 
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling and micro-architecture differences between
+ * cpus.
+ */
+
+static DEFINE_PER_CPU(atomic_long_t, cpu_curr_freq);
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+
+/* cpufreq callback function setting current cpu frequency */
+void arch_scale_set_curr_freq(int cpu, unsigned long freq)
+{
+	atomic_long_set(&per_cpu(cpu_curr_freq, cpu), freq);
+}
+
+/* cpufreq callback function setting max cpu frequency */
+void arch_scale_set_max_freq(int cpu, unsigned long freq)
+{
+	atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
+}
+
+/*
+ * scale_load_capacity returns the current capacity for a given cpu, adjusted
+ * for micro-architectural differences and taking into account cpu frequency
+ */
+unsigned long scale_load_capacity(int cpu)
+{
+	unsigned long curr = atomic_long_read(&per_cpu(cpu_curr_freq, cpu));
+	unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
+	unsigned long ret;
+
+	if (!max || !per_cpu(cpu_raw_capacity, cpu))
+		return SCHED_CAPACITY_SCALE;
+
+	ret = (curr * SCHED_CAPACITY_SCALE) / max;
+	ret = (ret * per_cpu(cpu_raw_capacity, cpu)) / max_raw_capacity;
+
+	return ret;
+}
+
 int bL_cpufreq_register(struct cpufreq_arm_bL_ops *ops)
 {
 	int ret, i;
+	unsigned long flags;
 
 	if (arm_bL_ops) {
 		pr_debug("%s: Already registered: %s, exiting\n", __func__,
@@ -550,6 +596,11 @@ int bL_cpufreq_register(struct cpufreq_arm_bL_ops *ops)
 
 	arm_bL_ops = ops;
 
+	spin_lock_irqsave(&cfs_capacity_ops.lock, flags);
+	cfs_capacity_ops.get_capacity = scale_load_capacity;
+	spin_unlock_irqrestore(&cfs_capacity_ops.lock, flags);
+	synchronize_rcu();
+
 	ret = bL_switcher_get_enabled();
 	set_switching_enabled(ret);
 
-- 
1.8.3.2


^ permalink raw reply related	[flat|nested] 40+ messages in thread

* Re: [PATCH RFC 1/2] sched: cfs: introduce capacity_ops
  2014-10-08  6:26       ` [PATCH RFC 1/2] sched: cfs: introduce capacity_ops Mike Turquette
@ 2014-10-08  8:37         ` Peter Zijlstra
       [not found]           ` <20141008232836.4379.3339@quantum>
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2014-10-08  8:37 UTC (permalink / raw)
  To: Mike Turquette
  Cc: Morten Rasmussen, mingo, dietmar.eggemann, pjt, bsegall,
	vincent.guittot, nicolas.pitre, rjw, linux-kernel,
	tuukka.tikkanen

On Tue, Oct 07, 2014 at 11:26:11PM -0700, Mike Turquette wrote:
> +struct capacity_ops {
> +	unsigned long (*get_capacity)(int cpu);
> +	spinlock_t lock;
> +};

Yeah, fail there. Ops vectors should not contain serialization, that
simply doesn't work. It means you cannot switch the entire vector out.
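
For contrast, the usual pattern is to keep each ops vector immutable and
switch a single RCU-protected pointer instead; a sketch (not a proposal
from this thread, names invented for illustration):

static const struct capacity_ops default_capacity_ops = {
	.get_capacity = default_scale_load_capacity,
};

/* no lock inside the struct; the whole vector is swapped atomically */
static const struct capacity_ops __rcu *active_capacity_ops =
	RCU_INITIALIZER(&default_capacity_ops);

static unsigned long cfs_get_capacity(int cpu)
{
	const struct capacity_ops *ops;
	unsigned long ret;

	rcu_read_lock();
	ops = rcu_dereference(active_capacity_ops);
	ret = ops->get_capacity(cpu);
	rcu_read_unlock();

	return ret;
}

/* switching: rcu_assign_pointer(active_capacity_ops, &new_ops);
 * followed by synchronize_rcu() before the old ops can be freed. */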

Secondly, I dislike this because indirect function calls are more
expensive than direct calls.

> +static unsigned long cfs_get_capacity(int cpu)
>  {
> -	return default_scale_load_capacity(cpu);
> +	unsigned long ret;
> +
> +	rcu_read_lock();
> +	ret = cfs_capacity_ops.get_capacity(cpu);
> +	rcu_read_unlock();
> +
> +	return ret;
> +}

So ideally we'd never change these things, so rcu or whatnot should not
be required.

> +/**
> + * set_default_capacity_ops - reset capacity ops to their default
> + * @eops - capacity_ops we are reseting
> + *
> + * Useful for loadable modules that supply custom capacity_ops callbacks.  When
> + * unloading these modules need to restore the originals before the custom
> + * callbacks disappear.

Yeah, like hell no. We do not want modules to affect scheduler
behaviour.


^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-10-02 20:34       ` Peter Zijlstra
@ 2014-10-08 11:00         ` Morten Rasmussen
  2014-10-08 11:21           ` Vincent Guittot
  2014-10-08 11:38         ` Vincent Guittot
  1 sibling, 1 reply; 40+ messages in thread
From: Morten Rasmussen @ 2014-10-08 11:00 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Vincent Guittot, mingo, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Nicolas Pitre, Mike Turquette, rjw,
	linux-kernel

On Thu, Oct 02, 2014 at 09:34:28PM +0100, Peter Zijlstra wrote:
> On Thu, Sep 25, 2014 at 06:23:43PM +0100, Morten Rasmussen wrote:
> 
> > > Why haven't you used arch_scale_freq_capacity which has a similar
> > > purpose in scaling the CPU capacity except the additional sched_domain
> > > pointer argument?
> > 
> > To be honest I'm not happy with introducing another arch-function
> > either and I'm happy to change that. It wasn't really clear to me which
> > functions that would remain after your cpu_capacity rework patches, so I
> > added this one. Now that we have most of the patches for capacity
> > scaling and scale-invariant load-tracking on the table I think we have a
> > better chance of figuring out which ones are needed and exactly how they
> > are supposed to work.
> > 
> > arch_scale_load_capacity() compensates for both frequency scaling and
> > micro-architectural differences, while arch_scale_freq_capacity() only
> > for frequency. As long as we can use arch_scale_cpu_capacity() to
> > provide the micro-architecture scaling we can just do the scaling in two
> > operations rather than one similar to how it is done for capacity in
> > update_cpu_capacity(). I can fix that in the next version. It will cost
> > an extra function call and multiplication though.
> > 
> > To make sure that runnable_avg_{sum, period} are still bounded by
> > LOAD_AVG_MAX, arch_scale_{cpu,freq}_capacity() must both return a factor
> > in the range 0..SCHED_CAPACITY_SCALE.
> 
> I would certainly like some words in the Changelog on how and that the
> math is still free of overflows. Clearly you've thought about it, so
> please feel free to elucidate the rest of us :-)

Sure. The easiest way to avoid introducing overflows is to ensure that
we always scale by a factor <= 1.0. That should be true as long as
arch_scale_{cpu,freq}_capacity() never returns anything greater than
SCHED_CAPACITY_SCALE (= 1024 = 1.0).

If we take big.LITTLE as an example, the max cpu capacity of a big cpu
would be 1024 and since we multiply the scaling factors (as in
update_cpu_capacity()) the max frequency scaling capacity factor would
be 1024. The result is a 1.0 (1.0 * 1.0) scaling factor when a task is
running on a big cpu at the highest frequency. At 50% frequency, the
scaling factor is 0.5 (1.0 * 0.5).

For a little cpu arch_scale_cpu_capacity() would return something less
than 1024, 512 for example. The max frequency scaling capacity factor is
1024. A task running on a little cpu at max frequency would have its
load scaled by 0.5 (0.5 * 1.0). At 50% frequency, it would be 0.25 (0.5
* 0.5).

However, as said earlier (below), we have to go through the load-balance
code to ensure that it doesn't blow up when cpu capacities get small
(huge.TINY), but the load-tracking code itself should be fine I think.
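
Spelled out as a sketch (function names as discussed in this thread;
both factors are assumed to be in the range 0..SCHED_CAPACITY_SCALE, so
the combined factor is <= 1.0 and runnable_avg_sum stays bounded by
LOAD_AVG_MAX):

static inline u32 scale_contrib(u32 contrib, struct sched_domain *sd,
				int cpu)
{
	u32 cpu_cap = arch_scale_cpu_capacity(sd, cpu);   /* uarch     */
	u32 freq_cap = arch_scale_freq_capacity(sd, cpu); /* frequency */

	/* e.g. little cpu (512) at 50% frequency (512): 0.5 * 0.5 = 0.25 */
	contrib = (contrib * cpu_cap) >> SCHED_CAPACITY_SHIFT;
	return (contrib * freq_cap) >> SCHED_CAPACITY_SHIFT;
}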

> 
> > > If we take the example of an always running task, its runnable_avg_sum
> > > should stay at the LOAD_AVG_MAX value whatever the frequency of the
> > > CPU on which it runs. But your change links the max value of
> > > runnable_avg_sum with the current frequency of the CPU, so an always
> > > running task will have a load contribution of 25%.
> > > Your proposed scaling is fine with usage_avg_sum, which reflects the
> > > effective running time on the CPU, but the runnable_avg_sum should be
> > > able to reach LOAD_AVG_MAX whatever the current frequency is
> > 
> > I don't think it makes sense to scale one metric and not the other. You
> > will end up with two very different (potentially opposite) views of the
> > cpu load/utilization situation in many scenarios. As I see it,
> > scale-invariance and load-balancing with scale-invariance present can be
> > done in two ways:
> > 
> > 1. Leave runnable_avg_sum unscaled and scale running_avg_sum.
> > se->avg.load_avg_contrib will remain unscaled and so will
> > cfs_rq->runnable_load_avg, cfs_rq->blocked_load_avg, and
> > weighted_cpuload(). Essentially all the existing load-balancing code
> > will continue to use unscaled load. When we want to improve cpu
> > utilization and energy-awareness we will have to bypass most of this
> > code as it is likely to lead us in the wrong direction since it has a
> > potentially wrong view of the cpu load due to the lack of
> > scale-invariance.
> > 
> > 2. Scale both runnable_avg_sum and running_avg_sum. All existing load
> > metrics including weighted_cpuload() are scaled and thus more accurate.
> > The difference between se->avg.load_avg_contrib and
> > se->avg.usage_avg_contrib is the priority scaling and whether or not
> > runqueue waiting time is counted. se->avg.load_avg_contrib can only
> > reach se->load.weight when running on the fastest cpu at the highest
> > frequency, but it is now scale-invariant so we have a much better idea
> > about how much load we are pulling when load-balancing two cpus running
> > at different frequencies. The load-balance code-path still has to be
> > audited to see if anything blows up due to the scaling. I haven't
> > finished doing that yet. This patch set doesn't include patches to
> > address such issues (yet). IMHO, by scaling runnable_avg_sum we can more
> > easily make the existing load-balancing code do the right thing.
> > 
> > For both options we have to go through the existing load-balancing code
> > to either change it to use the scale-invariant metric (running_avg_sum)
> > when appropriate or to fix bits that don't work properly with a
> > scale-invariant runnable_avg_sum and reuse the existing code. I think
> > the latter is less intrusive, but I might be wrong.
> > 
> > Opinions?
> 
> /me votes #2, I think the example in the reply is a false one, an always
> running task will/should ramp up the cpufreq and get us at full speed
> (and yes I'm aware of the case where you're memory bound and raising the
> cpu freq isn't going to actually improve performance, but I'm not sure
> we want to get/be that smart, esp. at this stage).

Okay, and agreed that memory bound task smarts are out of scope for the
time being.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-10-08 11:00         ` Morten Rasmussen
@ 2014-10-08 11:21           ` Vincent Guittot
  2014-10-08 13:53             ` Morten Rasmussen
  0 siblings, 1 reply; 40+ messages in thread
From: Vincent Guittot @ 2014-10-08 11:21 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Nicolas Pitre, Mike Turquette, rjw,
	linux-kernel

On 8 October 2014 13:00, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Thu, Oct 02, 2014 at 09:34:28PM +0100, Peter Zijlstra wrote:
>> On Thu, Sep 25, 2014 at 06:23:43PM +0100, Morten Rasmussen wrote:
>>
>> > > Why haven't you used arch_scale_freq_capacity which has a similar
>> > > purpose in scaling the CPU capacity except the additional sched_domain
>> > > pointer argument?
>> >
>> > To be honest I'm not happy with introducing another arch-function
>> > either and I'm happy to change that. It wasn't really clear to me which
>> > functions that would remain after your cpu_capacity rework patches, so I
>> > added this one. Now that we have most of the patches for capacity
>> > scaling and scale-invariant load-tracking on the table I think we have a
>> > better chance of figuring out which ones are needed and exactly how they
>> > are supposed to work.
>> >
>> > arch_scale_load_capacity() compensates for both frequency scaling and
>> > micro-architectural differences, while arch_scale_freq_capacity() only
>> > for frequency. As long as we can use arch_scale_cpu_capacity() to
>> > provide the micro-architecture scaling we can just do the scaling in two
>> > operations rather than one similar to how it is done for capacity in
>> > update_cpu_capacity(). I can fix that in the next version. It will cost
>> > an extra function call and multiplication though.
>> >
>> > To make sure that runnable_avg_{sum, period} are still bounded by
>> > LOAD_AVG_MAX, arch_scale_{cpu,freq}_capacity() must both return a factor
>> > in the range 0..SCHED_CAPACITY_SCALE.
>>
>> I would certainly like some words in the Changelog on how and that the
>> math is still free of overflows. Clearly you've thought about it, so
>> please feel free to elucidate the rest of us :-)
>
> Sure. The easiest way to avoid introducing overflows is to ensure that
> we always scale by a factor <= 1.0. That should be true as long as
> arch_scale_{cpu,freq}_capacity() never returns anything greater than
> SCHED_CAPACITY_SCALE (= 1024 = 1.0).

the current ARM arch_scale_cpu is in the range [0..1536], which is free
of overflow AFAICT

>
> If we take big.LITTLE as an example, the max cpu capacity of a big cpu
> would be 1024 and since we multiply the scaling factors (as in
> update_cpu_capacity()) the max frequency scaling capacity factor would
> be 1024. The result is a 1.0 (1.0 * 1.0) scaling factor when a task is
> running on a big cpu at the highest frequency. At 50% frequency, the
> scaling factor is 0.5 (1.0 * 0.5).
>
> For a little cpu arch_scale_cpu_capacity() would return something less
> than 1024, 512 for example. The max frequency scaling capacity factor is
> 1024. A task running on a little cpu at max frequency would have its
> load scaled by 0.5 (0.5 * 1.0). At 50% frequency, it would be 0.25 (0.5
> * 0.5).
>
> However, as said earlier (below), we have to go through the load-balance
> code to ensure that it doesn't blow up when cpu capacities get small
> (huge.TINY), but the load-tracking code itself should be fine I think.
>
>>
>> > > If we take the example of an always running task, its runnable_avg_sum
>> > > should stay at the LOAD_AVG_MAX value whatever the frequency of the
>> > > CPU on which it runs. But your change links the max value of
>> > > runnable_avg_sum with the current frequency of the CPU, so an always
>> > > running task will have a load contribution of 25%.
>> > > Your proposed scaling is fine with usage_avg_sum, which reflects the
>> > > effective running time on the CPU, but the runnable_avg_sum should be
>> > > able to reach LOAD_AVG_MAX whatever the current frequency is
>> >
>> > I don't think it makes sense to scale one metric and not the other. You
>> > will end up with two very different (potentially opposite) views of the
>> > cpu load/utilization situation in many scenarios. As I see it,
>> > scale-invariance and load-balancing with scale-invariance present can be
>> > done in two ways:
>> >
>> > 1. Leave runnable_avg_sum unscaled and scale running_avg_sum.
>> > se->avg.load_avg_contrib will remain unscaled and so will
>> > cfs_rq->runnable_load_avg, cfs_rq->blocked_load_avg, and
>> > weighted_cpuload(). Essentially all the existing load-balancing code
>> > will continue to use unscaled load. When we want to improve cpu
>> > utilization and energy-awareness we will have to bypass most of this
>> > code as it is likely to lead us in the wrong direction since it has a
>> > potentially wrong view of the cpu load due to the lack of
>> > scale-invariance.
>> >
>> > 2. Scale both runnable_avg_sum and running_avg_sum. All existing load
>> > metrics including weighted_cpuload() are scaled and thus more accurate.
>> > The difference between se->avg.load_avg_contrib and
>> > se->avg.usage_avg_contrib is the priority scaling and whether or not
>> > runqueue waiting time is counted. se->avg.load_avg_contrib can only
>> > reach se->load.weight when running on the fastest cpu at the highest
>> > frequency, but it is now scale-invariant so we have a much better idea
>> > about how much load we are pulling when load-balancing two cpus running
>> > at different frequencies. The load-balance code-path still has to be
>> > audited to see if anything blows up due to the scaling. I haven't
>> > finished doing that yet. This patch set doesn't include patches to
>> > address such issues (yet). IMHO, by scaling runnable_avg_sum we can more
>> > easily make the existing load-balancing code do the right thing.
>> >
>> > For both options we have to go through the existing load-balancing code
>> > to either change it to use the scale-invariant metric (running_avg_sum)
>> > when appropriate or to fix bits that don't work properly with a
>> > scale-invariant runnable_avg_sum and reuse the existing code. I think
>> > the latter is less intrusive, but I might be wrong.
>> >
>> > Opinions?
>>
>> /me votes #2, I think the example in the reply is a false one, an always
>> running task will/should ramp up the cpufreq and get us at full speed
>> (and yes I'm aware of the case where you're memory bound and raising the
>> cpu freq isn't going to actually improve performance, but I'm not sure
>> we want to get/be that smart, esp. at this stage).
>
> Okay, and agreed that memory bound task smarts are out of scope for the
> time being.

^ permalink raw reply	[flat|nested] 40+ messages in thread

* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-10-02 20:34       ` Peter Zijlstra
  2014-10-08 11:00         ` Morten Rasmussen
@ 2014-10-08 11:38         ` Vincent Guittot
  2014-10-08 14:05           ` Morten Rasmussen
  2014-10-10  9:07           ` Peter Zijlstra
  1 sibling, 2 replies; 40+ messages in thread
From: Vincent Guittot @ 2014-10-08 11:38 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, mingo, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Nicolas Pitre, Mike Turquette, rjw,
	linux-kernel

On 2 October 2014 22:34, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Sep 25, 2014 at 06:23:43PM +0100, Morten Rasmussen wrote:
>
>> > Why haven't you used arch_scale_freq_capacity which has a similar
>> > purpose in scaling the CPU capacity except the additional sched_domain
>> > pointer argument?
>>
>> To be honest I'm not happy with introducing another arch-function
>> either and I'm happy to change that. It wasn't really clear to me which
>> functions that would remain after your cpu_capacity rework patches, so I
>> added this one. Now that we have most of the patches for capacity
>> scaling and scale-invariant load-tracking on the table I think we have a
>> better chance of figuring out which ones are needed and exactly how they
>> are supposed to work.
>>
>> arch_scale_load_capacity() compensates for both frequency scaling and
>> micro-architectural differences, while arch_scale_freq_capacity() only
>> for frequency. As long as we can use arch_scale_cpu_capacity() to
>> provide the micro-architecture scaling we can just do the scaling in two
>> operations rather than one similar to how it is done for capacity in
>> update_cpu_capacity(). I can fix that in the next version. It will cost
>> an extra function call and multiplication though.
>>
>> To make sure that runnable_avg_{sum, period} are still bounded by
>> LOAD_AVG_MAX, arch_scale_{cpu,freq}_capacity() must both return a factor
>> in the range 0..SCHED_CAPACITY_SCALE.
>
> I would certainly like some words in the Changelog on how and that the
> math is still free of overflows. Clearly you've thought about it, so
> please feel free to elucidate the rest of us :-)
>
>> > If we take the example of an always running task, its runnable_avg_sum
>> > should stay at the LOAD_AVG_MAX value whatever the frequency of the
>> > CPU on which it runs. But your change links the max value of
>> > runnable_avg_sum with the current frequency of the CPU, so an always
>> > running task will have a load contribution of 25%.
>> > Your proposed scaling is fine with usage_avg_sum, which reflects the
>> > effective running time on the CPU, but the runnable_avg_sum should be
>> > able to reach LOAD_AVG_MAX whatever the current frequency is
>>
>> I don't think it makes sense to scale one metric and not the other. You
>> will end up with two very different (potentially opposite) views of the
>> cpu load/utilization situation in many scenarios. As I see it,
>> scale-invariance and load-balancing with scale-invariance present can be
>> done in two ways:
>>
>> 1. Leave runnable_avg_sum unscaled and scale running_avg_sum.
>> se->avg.load_avg_contrib will remain unscaled and so will
>> cfs_rq->runnable_load_avg, cfs_rq->blocked_load_avg, and
>> weighted_cpuload(). Essentially all the existing load-balancing code
>> will continue to use unscaled load. When we want to improve cpu
>> utilization and energy-awareness we will have to bypass most of this
>> code as it is likely to lead us in the wrong direction since it has a
>> potentially wrong view of the cpu load due to the lack of
>> scale-invariance.
>>
>> 2. Scale both runnable_avg_sum and running_avg_sum. All existing load
>> metrics including weighted_cpuload() are scaled and thus more accurate.
>> The difference between se->avg.load_avg_contrib and
>> se->avg.usage_avg_contrib is the priority scaling and whether or not
>> runqueue waiting time is counted. se->avg.load_avg_contrib can only
>> reach se->load.weight when running on the fastest cpu at the highest
>> frequency, but it is now scale-invariant so we have a much better idea
>> about how much load we are pulling when load-balancing two cpus running
>> at different frequencies. The load-balance code-path still has to be
>> audited to see if anything blows up due to the scaling. I haven't
>> finished doing that yet. This patch set doesn't include patches to
>> address such issues (yet). IMHO, by scaling runnable_avg_sum we can more
>> easily make the existing load-balancing code do the right thing.
>>
>> For both options we have to go through the existing load-balancing code
>> to either change it to use the scale-invariant metric (running_avg_sum)
>> when appropriate or to fix bits that don't work properly with a
>> scale-invariant runnable_avg_sum and reuse the existing code. I think
>> the latter is less intrusive, but I might be wrong.
>>
>> Opinions?
>
> /me votes #2, I think the example in the reply is a false one, an always
> running task will/should ramp up the cpufreq and get us at full speed

I have in mind some systems where the max achievable freq of a core
depends on how many cores are running simultaneously because of some
HW constraint like max current. In this case, the CPU might not reach
max frequency even with an always running task.
Then, besides frequency scaling, there is the uarch invariance
introduced by patch 4, which will produce similar behavior in the load.

> (and yes I'm aware of the case where you're memory bound and raising the
> cpu freq isn't going to actually improve performance, but I'm not sure
> we want to get/be that smart, esp. at this stage).


* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-10-08  0:50   ` Yuyang Du
@ 2014-10-08 12:54     ` Dietmar Eggemann
  2014-10-10  9:16       ` Peter Zijlstra
  2014-10-10  9:14     ` Peter Zijlstra
  1 sibling, 1 reply; 40+ messages in thread
From: Dietmar Eggemann @ 2014-10-08 12:54 UTC (permalink / raw)
  To: Yuyang Du, Morten Rasmussen
  Cc: peterz, mingo, pjt, bsegall, vincent.guittot, nicolas.pitre,
	mturquette, rjw, linux-kernel

Hi Yuyang,

On 08/10/14 01:50, Yuyang Du wrote:
> Hi Morten,
> 
> Sorry for jumping in late.
> 
> The problem seems to be self-evident. But for the implementation to be
> equally attractive it needs to account for every freq change for every task;
> anything less than that makes it less attractive.
> 
> But this should be very hard. The Intel architecture has limitations on
> capturing all the freq changes in software, and intel_pstate also issues
> no notification.

We encountered this missing notification of the current frequency on
Intel systems (e.g. i5-3320M) using the intel_pstate driver while
testing this patch-set. The arch_scale_set_curr_freq call in
__cpufreq_notify_transition [[PATCH 2/7] cpufreq: Architecture specific
callback for frequency changes] will not work on such a system.

In our internal testing, we placed arch_scale_set_curr_freq(cpu->cpu,
sample->freq) into intel_pstate_timer_func [intel_pstate.c] to get the
current frequency for a cpu.
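
For reference, the placement looked roughly like this. Treat it as a
sketch only: the surrounding driver code is paraphrased from the
intel_pstate.c of that time and may not match it exactly, and
arch_scale_set_curr_freq() is the callback introduced by [PATCH 2/7]:

static void intel_pstate_timer_func(unsigned long __data)
{
	struct cpudata *cpu = (struct cpudata *) __data;
	struct sample *sample;

	intel_pstate_sample(cpu);
	sample = &cpu->sample;

	/* test hack: report the sampled frequency to the scheduler,
	 * since __cpufreq_notify_transition() never fires here */
	arch_scale_set_curr_freq(cpu->cpu, sample->freq);

	intel_pstate_adjust_busy_pstate(cpu);
	intel_pstate_set_sample_time(cpu);
}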

The arch_scale_set_max_freq call in cpufreq_set_policy
[drivers/cpufreq/cpufreq.c] still works although the driver exposes the
max turbo pstate and not the max pstate. That's an additional problem
because we don't want to use turbo states for frequency scaling.

> 
> For every task, this makes updating the entire queue in load tracking
> all the more needed, so once again, ping maintainers for the rewrite patches :)
> 
> Thanks,
> Yuyang
> 

[...]



* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-10-08 11:21           ` Vincent Guittot
@ 2014-10-08 13:53             ` Morten Rasmussen
  2014-10-08 14:08               ` Vincent Guittot
  0 siblings, 1 reply; 40+ messages in thread
From: Morten Rasmussen @ 2014-10-08 13:53 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Nicolas Pitre, Mike Turquette, rjw,
	linux-kernel

On Wed, Oct 08, 2014 at 12:21:45PM +0100, Vincent Guittot wrote:
> On 8 October 2014 13:00, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Thu, Oct 02, 2014 at 09:34:28PM +0100, Peter Zijlstra wrote:
> >> On Thu, Sep 25, 2014 at 06:23:43PM +0100, Morten Rasmussen wrote:
> >>
> >> > > Why haven't you used arch_scale_freq_capacity which has a similar
> >> > > purpose in scaling the CPU capacity, except for the additional
> >> > > sched_domain pointer argument?
> >> >
> >> > To be honest I'm not happy with introducing another arch-function
> >> > either and I'm happy to change that. It wasn't really clear to me which
> >> > functions would remain after your cpu_capacity rework patches, so I
> >> > added this one. Now that we have most of the patches for capacity
> >> > scaling and scale-invariant load-tracking on the table I think we have a
> >> > better chance of figuring out which ones are needed and exactly how they
> >> > are supposed to work.
> >> >
> >> > arch_scale_load_capacity() compensates for both frequency scaling and
> >> > micro-architectural differences, while arch_scale_freq_capacity() only
> >> > for frequency. As long as we can use arch_scale_cpu_capacity() to
> >> > provide the micro-architecture scaling we can just do the scaling in two
> >> > operations rather than one similar to how it is done for capacity in
> >> > update_cpu_capacity(). I can fix that in the next version. It will cost
> >> > an extra function call and multiplication though.
> >> >
> >> > To make sure that runnable_avg_{sum, period} are still bounded by
> >> > LOAD_AVG_MAX, arch_scale_{cpu,freq}_capacity() must both return a factor
> >> > in the range 0..SCHED_CAPACITY_SCALE.
> >>
> >> I would certainly like some words in the Changelog on how and that the
> >> math is still free of overflows. Clearly you've thought about it, so
> >> please feel free to elucidate the rest of us :-)
> >
> > Sure. The easiest way to avoid introducing overflows is to ensure that
> > we always scale by a factor <= 1.0. That should be true as long as
> > arch_scale_{cpu,freq}_capacity() never returns anything greater than
> > SCHED_CAPACITY_SCALE (= 1024 = 1.0).
> 
> the current ARM arch_scale_cpu is in the range [0..1536] which is free
> of overflow AFAICT

If I'm not mistaken, that will cause an overflow in
__update_task_entity_contrib():

static inline void __update_task_entity_contrib(struct sched_entity *se)
{
        u32 contrib;
        /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
        contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
        contrib /= (se->avg.avg_period + 1);
        se->avg.load_avg_contrib = scale_load(contrib);
}

With arch_scale_cpu_capacity() > 1024 se->avg.runnable_avg_sum is no
longer bounded by LOAD_AVG_MAX = 47742. scale_load_down(se->load.weight)
== se->load.weight <= 88761.

	47742 * 88761 = 4237627662 (2^32 = 4294967296)

To avoid overflow se->avg.runnable_avg_sum must be less than 2^32/88761
= 48388, which means that arch_scale_cpu_capacity() must be in the range
0..48388*1024/47742 = 0..1037.
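
The arithmetic is easy to double-check with a trivial standalone
program (a hypothetical helper, not kernel code):

#include <stdio.h>
#include <stdint.h>

int main(void)
{
	const uint64_t load_avg_max = 47742; /* LOAD_AVG_MAX */
	const uint64_t max_weight = 88761;   /* scale_load_down() of nice -20 weight */
	const uint64_t u32_limit = 1ULL << 32;

	/* worst-case product in __update_task_entity_contrib() */
	printf("product: %llu (2^32 = %llu)\n",
	       (unsigned long long)(load_avg_max * max_weight),
	       (unsigned long long)u32_limit);

	/* largest runnable_avg_sum that still avoids the overflow */
	const uint64_t max_sum = u32_limit / max_weight; /* 48388 */
	printf("max runnable_avg_sum: %llu\n", (unsigned long long)max_sum);

	/* resulting upper bound on arch_scale_cpu_capacity() */
	printf("max scaling factor: %llu\n",
	       (unsigned long long)(max_sum * 1024 / load_avg_max)); /* 1037 */

	return 0;
}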

I also think it is easier to have a fixed defined max scaling factor,
but that might just be me.

Regarding the ARM arch_scale_cpu_capacity() implementation, I think that
can be changed to fit the 0..1024 range easily. Currently, it will only
report a capacity other than the default (1024) for big.LITTLE systems,
and actually enabling it (which requires a certain property to be set in
the device tree) leads to broken load-balancing decisions. We have
discussed that several times
in the past. I wouldn't recommend enabling it until the load-balance
code can deal with big.LITTLE compute capacities correctly. This is also
why it isn't implemented by ARM64.


* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-10-08 11:38         ` Vincent Guittot
@ 2014-10-08 14:05           ` Morten Rasmussen
  2014-10-10  9:07           ` Peter Zijlstra
  1 sibling, 0 replies; 40+ messages in thread
From: Morten Rasmussen @ 2014-10-08 14:05 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Nicolas Pitre, Mike Turquette, rjw,
	linux-kernel

On Wed, Oct 08, 2014 at 12:38:40PM +0100, Vincent Guittot wrote:
> On 2 October 2014 22:34, Peter Zijlstra <peterz@infradead.org> wrote:
> > On Thu, Sep 25, 2014 at 06:23:43PM +0100, Morten Rasmussen wrote:
> >
> >> > Why haven't you used arch_scale_freq_capacity which has a similar
> >> > purpose in scaling the CPU capacity, except for the additional
> >> > sched_domain pointer argument?
> >>
> >> To be honest I'm not happy with introducing another arch-function
> >> either and I'm happy to change that. It wasn't really clear to me which
> >> functions would remain after your cpu_capacity rework patches, so I
> >> added this one. Now that we have most of the patches for capacity
> >> scaling and scale-invariant load-tracking on the table I think we have a
> >> better chance of figuring out which ones are needed and exactly how they
> >> are supposed to work.
> >>
> >> arch_scale_load_capacity() compensates for both frequency scaling and
> >> micro-architectural differences, while arch_scale_freq_capacity() only
> >> for frequency. As long as we can use arch_scale_cpu_capacity() to
> >> provide the micro-architecture scaling we can just do the scaling in two
> >> operations rather than one similar to how it is done for capacity in
> >> update_cpu_capacity(). I can fix that in the next version. It will cost
> >> an extra function call and multiplication though.
> >>
> >> To make sure that runnable_avg_{sum, period} are still bounded by
> >> LOAD_AVG_MAX, arch_scale_{cpu,freq}_capacity() must both return a factor
> >> in the range 0..SCHED_CAPACITY_SCALE.
> >
> > I would certainly like some words in the Changelog on how and that the
> > math is still free of overflows. Clearly you've thought about it, so
> > please feel free to elucidate the rest of us :-)
> >
> >> > If we take the example of an always running task, its runnable_avg_sum
> >> > should stay at the LOAD_AVG_MAX value whatever the frequency of the
> >> > CPU on which it runs. But your change links the max value of
> >> > runnable_avg_sum with the current frequency of the CPU, so an always
> >> > running task will have a load contribution of 25%.
> >> > Your proposed scaling is fine with usage_avg_sum, which reflects the
> >> > effective running time on the CPU, but the runnable_avg_sum should be
> >> > able to reach LOAD_AVG_MAX whatever the current frequency is.
> >>
> >> I don't think it makes sense to scale one metric and not the other. You
> >> will end up with two very different (potentially opposite) views of the
> >> cpu load/utilization situation in many scenarios. As I see it,
> >> scale-invariance and load-balancing with scale-invariance present can be
> >> done in two ways:
> >>
> >> 1. Leave runnable_avg_sum unscaled and scale running_avg_sum.
> >> se->avg.load_avg_contrib will remain unscaled and so will
> >> cfs_rq->runnable_load_avg, cfs_rq->blocked_load_avg, and
> >> weighted_cpuload(). Essentially all the existing load-balancing code
> >> will continue to use unscaled load. When we want to improve cpu
> >> utilization and energy-awareness we will have to bypass most of this
> >> code as it is likely to lead us in the wrong direction since it has a
> >> potentially wrong view of the cpu load due to the lack of
> >> scale-invariance.
> >>
> >> 2. Scale both runnable_avg_sum and running_avg_sum. All existing load
> >> metrics including weighted_cpuload() are scaled and thus more accurate.
> >> The difference between se->avg.load_avg_contrib and
> >> se->avg.usage_avg_contrib is the priority scaling and whether or not
> >> runqueue waiting time is counted. se->avg.load_avg_contrib can only
> >> reach se->load.weight when running on the fastest cpu at the highest
> >> frequency, but it is now scale-invariant so we have a much better idea
> >> about how much load we are pulling when load-balancing two cpus running
> >> at different frequencies. The load-balance code-path still has to be
> >> audited to see if anything blows up due to the scaling. I haven't
> >> finished doing that yet. This patch set doesn't include patches to
> >> address such issues (yet). IMHO, by scaling runnable_avg_sum we can more
> >> easily make the existing load-balancing code do the right thing.
> >>
> >> For both options we have to go through the existing load-balancing code
> >> to either change it to use the scale-invariant metric (running_avg_sum)
> >> when appropriate or to fix bits that don't work properly with a
> >> scale-invariant runnable_avg_sum and reuse the existing code. I think
> >> the latter is less intrusive, but I might be wrong.
> >>
> >> Opinions?
> >
> > /me votes #2, I think the example in the reply is a false one, an always
> > running task will/should ramp up the cpufreq and get us at full speed
> 
> I have in mind some systems where the max achievable freq of a core
> depends on how many cores are running simultaneously because of some
> HW constraint like max current. In this case, the CPU might not reach
> max frequency even with an always running task.

If we compare scale-invariant task load to the current frequency-scaled
compute capacity of the cpu when making load-balancing decisions, as I
described in my other reply, that shouldn't be a problem.
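
As a worked example (hypothetical numbers): if a hardware constraint
caps a cpu at 50% of the system's maximum frequency, its
frequency-scaled compute capacity is 512. An always-running task on
that cpu also accrues a scale-invariant contribution of at most roughly
512, so comparing the two still correctly reports the cpu as fully
utilized.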

> Then, besides frequency scaling, there is the uarch invariance
> introduced by patch 4, which will produce similar behavior in the load.

I don't quite follow. When we make task load frequency and uarch
invariant, we must scale compute capacity accordingly. So compute
capacity is bigger for big cores and smaller for little cores.


* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-10-08 13:53             ` Morten Rasmussen
@ 2014-10-08 14:08               ` Vincent Guittot
  2014-10-08 14:16                 ` Morten Rasmussen
  0 siblings, 1 reply; 40+ messages in thread
From: Vincent Guittot @ 2014-10-08 14:08 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Nicolas Pitre, Mike Turquette, rjw,
	linux-kernel

On 8 October 2014 15:53, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Wed, Oct 08, 2014 at 12:21:45PM +0100, Vincent Guittot wrote:
>> On 8 October 2014 13:00, Morten Rasmussen <morten.rasmussen@arm.com> wrote:

>> >
>> > Sure. The easiest way to avoid introducing overflows is to ensure that
>> > we always scale by a factor <= 1.0. That should be true as long as
>> > arch_scale_{cpu,freq}_capacity() never returns anything greater than
>> > SCHED_CAPACITY_SCALE (= 1024 = 1.0).
>>
>> the current ARM arch_scale_cpu is in the range [0..1536] which is free
>> of overflow AFAICT
>
> If I'm not mistaken, that will cause an overflow in
> __update_task_entity_contrib():
>
> static inline void __update_task_entity_contrib(struct sched_entity *se)
> {
>         u32 contrib;
>         /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
>         contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
>         contrib /= (se->avg.avg_period + 1);
>         se->avg.load_avg_contrib = scale_load(contrib);
> }
>
> With arch_scale_cpu_capacity() > 1024 se->avg.runnable_avg_sum is no
> longer bounded by LOAD_AVG_MAX = 47742. scale_load_down(se->load.weight)
> == se->load.weight <= 88761.
>
>         47742 * 88761 = 4237627662 (2^32 = 4294967296)
>
> To avoid overflow se->avg.runnable_avg_sum must be less than 2^32/88761
> = 48388, which means that arch_scale_cpu_capacity() must be in the range
> 0..48388*1024/47742 = 0..1037.
>
> I also think it is easier to have a fixed defined max scaling factor,
> but that might just be me.

OK, the overflow comes with adding uarch invariance into the runnable load average

>
> Regarding the ARM arch_scale_cpu_capacity() implementation, I think that
> can be changed to fit the 0..1024 range easily. Currently, it will only
> report a capacity other than the default (1024) for big.LITTLE systems,
> and actually enabling it (which requires a certain property to be set in
> the device tree) leads to broken load-balancing decisions. We have
> discussed that several times

Only the 1-task-per-CPU case is broken; on the other hand, it better
handles the overloaded case where we have more tasks than CPUs, as well
as other mid-range use cases, by putting more tasks on the big cluster.

> in the past. I wouldn't recommend enabling it until the load-balance
> code can deal with big.LITTLE compute capacities correctly. This is also
> why it isn't implemented by ARM64.


* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-10-08 14:08               ` Vincent Guittot
@ 2014-10-08 14:16                 ` Morten Rasmussen
  0 siblings, 0 replies; 40+ messages in thread
From: Morten Rasmussen @ 2014-10-08 14:16 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Nicolas Pitre, Mike Turquette, rjw,
	linux-kernel

On Wed, Oct 08, 2014 at 03:08:04PM +0100, Vincent Guittot wrote:
> On 8 October 2014 15:53, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > On Wed, Oct 08, 2014 at 12:21:45PM +0100, Vincent Guittot wrote:
> >> On 8 October 2014 13:00, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> 
> >> >
> >> > Sure. The easiest way to avoid introducing overflows is to ensure that
> >> > we always scale by a factor <= 1.0. That should be true as long as
> >> > arch_scale_{cpu,freq}_capacity() never returns anything greater than
> >> > SCHED_CAPACITY_SCALE (= 1024 = 1.0).
> >>
> >> the current ARM arch_scale_cpu is in the range [0..1536] which is free
> >> of overflow AFAICT
> >
> > If I'm not mistaken, that will cause an overflow in
> > __update_task_entity_contrib():
> >
> > static inline void __update_task_entity_contrib(struct sched_entity *se)
> > {
> >         u32 contrib;
> >         /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
> >         contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
> >         contrib /= (se->avg.avg_period + 1);
> >         se->avg.load_avg_contrib = scale_load(contrib);
> > }
> >
> > With arch_scale_cpu_capacity() > 1024 se->avg.runnable_avg_sum is no
> > longer bounded by LOAD_AVG_MAX = 47742. scale_load_down(se->load.weight)
> > == se->load.weight <= 88761.
> >
> >         47742 * 88761 = 4237627662 (2^32 = 4294967296)
> >
> > To avoid overflow se->avg.runnable_avg_sum must be less than 2^32/88761
> > = 48388, which means that arch_scale_cpu_capacity() must be in the range
> > 0..48388*1024/47742 = 0..1037.
> >
> > I also think it is easier to have a fixed defined max scaling factor,
> > but that might just be me.
> 
> OK,  overflow comes with adding uarch invariance into runnable load average
> 
> >
> > Regarding the ARM arch_scale_cpu_capacity() implementation, I think that
> > can be changed to fit the 0..1024 range easily. Currently, it will only
> > report a capacity other than the default (1024) for big.LITTLE systems,
> > and actually enabling it (which requires a certain property to be set in
> > the device tree) leads to broken load-balancing decisions. We have
> > discussed that several times
> 
> Only the 1-task-per-CPU case is broken; on the other hand, it better
> handles the overloaded case where we have more tasks than CPUs, as well
> as other mid-range use cases, by putting more tasks on the big cluster.

Yes, agreed. My point was just to say that it shouldn't cause a lot of
harm changing the range of arch_scale_cpu_capacity() for ARM. We need to
fix things anyway.


* Re: [PATCH RFC 2/2] cpufreq: arm_big_little: provide cpu capacity
  2014-10-08  6:26       ` [PATCH RFC 2/2] cpufreq: arm_big_little: provide cpu capacity Mike Turquette
@ 2014-10-08 15:48         ` Morten Rasmussen
       [not found]           ` <20141008223732.4379.78047@quantum>
  0 siblings, 1 reply; 40+ messages in thread
From: Morten Rasmussen @ 2014-10-08 15:48 UTC (permalink / raw)
  To: Mike Turquette
  Cc: peterz, mingo, Dietmar Eggemann, pjt, bsegall, vincent.guittot,
	nicolas.pitre, rjw, linux-kernel, tuukka.tikkanen

On Wed, Oct 08, 2014 at 07:26:12AM +0100, Mike Turquette wrote:
> Move the cpu capacity bits out of arch/arm/ and into the CPUfreq driver.
> Not all ARM devices will use CPUfreq and it is unsafe to assume as such
> in topology.c.
> 
> Instead, use the new capacity_ops introduced into CFS. If this code is
> generic enough then it could be factored and shared via a header to make
> it easier for other CPUfreq drivers to take advantage of it.
> 
> Signed-off-by: Mike Turquette <mturquette@linaro.org>
> ---
> This approach simply builds on top of Morten's series. I am not sure
> that the per-cpu method is the best way to go in the future. And if so I
> imagine that the CPUfreq core could provide everything except for the
> cpu_eff part.
> 
> In general I think that the overlap between CPUfreq drivers and
> arch/arm/kernel/topology.c is something that needs to be addressed soon,
> as both pieces of code are re-inventing parts of each other.

I think it would be easier to deal with the capacity scaling if we split
it into two scaling factors (like Vincent proposed by using the existing
arch_scale_{cpu,freq}_capacity() functions). Then whatever handles
frequency scaling doesn't have to coordinate with uarch scaling. Both
would be factors in the range 0..1024. We would just multiply the two
scaling factors in fair.c when needed (and shift as necessary).
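
Roughly like this (a minimal sketch, assuming both arch functions
return factors in the range 0..SCHED_CAPACITY_SCALE; the helper name is
made up and this is not code from any posted patch):

static u64 scale_load_delta(u64 delta, struct sched_domain *sd, int cpu)
{
	unsigned long fcap = arch_scale_freq_capacity(sd, cpu); /* 0..1024 */
	unsigned long ccap = arch_scale_cpu_capacity(sd, cpu);  /* 0..1024 */

	/* one multiply and shift per factor keeps the intermediate
	 * products well within 64 bit */
	delta = (delta * fcap) >> SCHED_CAPACITY_SHIFT;
	delta = (delta * ccap) >> SCHED_CAPACITY_SHIFT;

	return delta;
}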

Wouldn't that make it simple enough that we can implement it in a
generic way in arch/*/topology.c with a bit of help from the cpufreq
framework without involving any drivers directly? Or is that just
wishful thinking? :)


* Re: [PATCH RFC 1/2] sched: cfs: introduce capacity_ops
       [not found]           ` <20141008232836.4379.3339@quantum>
@ 2014-10-09  9:00             ` Peter Zijlstra
       [not found]               ` <20141009173433.4379.58492@quantum>
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2014-10-09  9:00 UTC (permalink / raw)
  To: Mike Turquette
  Cc: Morten Rasmussen, mingo, dietmar.eggemann, pjt, bsegall,
	vincent.guittot, nicolas.pitre, rjw, linux-kernel,
	tuukka.tikkanen

On Wed, Oct 08, 2014 at 04:28:36PM -0700, Mike Turquette wrote:
> > Yeah, like hell no. We do not want modules to affect scheduler
> > behaviour.
> 
> If a CPUfreq driver is the best place to know how frequency affects the
> capacity of a CPU for a given system, are you suggesting that we must
> compile that code into the kernel image instead of it being a loadable
> module?

Ideally we'll end up doing away with the cpufreq policy modules and
integrating the entire thing into the scheduler.


* Re: [PATCH RFC 2/2] cpufreq: arm_big_little: provide cpu capacity
       [not found]           ` <20141008223732.4379.78047@quantum>
@ 2014-10-09  9:02             ` Peter Zijlstra
       [not found]               ` <20141009172513.4379.56718@quantum>
  0 siblings, 1 reply; 40+ messages in thread
From: Peter Zijlstra @ 2014-10-09  9:02 UTC (permalink / raw)
  To: Mike Turquette
  Cc: Morten Rasmussen, mingo, Dietmar Eggemann, pjt, bsegall,
	vincent.guittot, nicolas.pitre, rjw, linux-kernel,
	tuukka.tikkanen

On Wed, Oct 08, 2014 at 03:37:32PM -0700, Mike Turquette wrote:
> It creates a dependency such that any ARM platform that wants to have
> frequency-invariant load must use CPUfreq. I don't think we want that
> dependency. CPUfreq is a likely back-end for many devices, but not all.
> 
> Consider near-future ARM devices that support ACPI for power management
> with no corresponding CPUfreq driver. For example if the CPPC driver is
> not hooked up to CPUfreq, platforms using CPPC will not jive with the
> ARM arch hook that depends on CPUfreq.

Oh crap no, CPPC will not add yet another interface to cpu frequency
stuff.


* Re: [PATCH RFC 2/2] cpufreq: arm_big_little: provide cpu capacity
       [not found]               ` <20141009172513.4379.56718@quantum>
@ 2014-10-09 17:38                 ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2014-10-09 17:38 UTC (permalink / raw)
  To: Mike Turquette
  Cc: Morten Rasmussen, mingo, Dietmar Eggemann, pjt, bsegall,
	vincent.guittot, nicolas.pitre, rjw, linux-kernel,
	tuukka.tikkanen

On Thu, Oct 09, 2014 at 10:25:13AM -0700, Mike Turquette wrote:
> Quoting Peter Zijlstra (2014-10-09 02:02:52)
> > On Wed, Oct 08, 2014 at 03:37:32PM -0700, Mike Turquette wrote:
> > > It creates a dependency such that any ARM platform that wants to have
> > > frequency-invariant load must use CPUfreq. I don't think we want that
> > > dependency. CPUfreq is a likely back-end for many devices, but not all.
> > > 
> > > Consider near-future ARM devices that support ACPI for power management
> > > with no corresponding CPUfreq driver. For example if the CPPC driver is
> > > not hooked up to CPUfreq, platforms using CPPC will not jive with the
> > > ARM arch hook that depends on CPUfreq.
> > 
> > Oh crap no, CPPC will not add yet another interface to cpu frequency
> > stuff.
> 
> Right.
> 
> So let's say the ARM arch hook creates a dependency on CPUfreq to scale
> capacity as a function of cpu frequency (as it does in the ARM scale
> invariance series).
> 
> Then let's say that a hypothetical ARM platform named "foo" uses CPPC
> and not CPUfreq to scale frequency. Foo's implementation of CPPC does
> not use any of the full-auto or hw-auto stuff. It allows the OS to
> request minimum performance levels and the like.
> 
> In this case, how can foo take advantage of the scale invariant stuff?
> 
> Also, feel free to replace "CPPC" with "anything other than CPUfreq".
> The problem is a general one and not specific to CPPC or ACPI.

Well, answer #1 is that you simply should not ever bypass cpufreq for
setting cpu frequencies (be this the existing cpufreq or the future
integrated cpufreq).

Answer #2 is that if you were allowed to create a second infrastructure
and you're not calling the right scheduler hooks right along with it,
you're buggy.

In short, your problem, not mine.


* Re: [PATCH RFC 1/2] sched: cfs: introduce capacity_ops
       [not found]               ` <20141009173433.4379.58492@quantum>
@ 2014-10-09 19:00                 ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2014-10-09 19:00 UTC (permalink / raw)
  To: Mike Turquette
  Cc: Morten Rasmussen, mingo, dietmar.eggemann, pjt, bsegall,
	vincent.guittot, nicolas.pitre, rjw, linux-kernel,
	tuukka.tikkanen

On Thu, Oct 09, 2014 at 10:34:33AM -0700, Mike Turquette wrote:
> Quoting Peter Zijlstra (2014-10-09 02:00:24)
> > On Wed, Oct 08, 2014 at 04:28:36PM -0700, Mike Turquette wrote:
> > > > Yeah, like hell no. We do not want modules to affect scheduler
> > > > behaviour.
> > > 
> > > If a CPUfreq driver is the best place to know how frequency affects the
> > > capacity of a CPU for a given system, are you suggesting that we must
> > > compile that code into the kernel image instead of it being a loadable
> > > module?
> > 
> > Ideally we'll end up doing away with the cpufreq policy modules and
> > integrating the entire thing into the scheduler.
> 
> I said "CPUfreq driver", not "CPUfreq governor". Certainly the scheduler
> can pick a capacity state/p-state/frequency and, in doing so, replace
> the CPUfreq policy bits.
> 
> My question above was about the necessity to select the right CPUfreq
> driver at compile-time and lose support for those drivers to be loadable
> modules. Sounds like a bad idea.

Drivers should not care one way or another.


* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-10-08 11:38         ` Vincent Guittot
  2014-10-08 14:05           ` Morten Rasmussen
@ 2014-10-10  9:07           ` Peter Zijlstra
  1 sibling, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2014-10-10  9:07 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Morten Rasmussen, mingo, Dietmar Eggemann, Paul Turner,
	Benjamin Segall, Nicolas Pitre, Mike Turquette, rjw,
	linux-kernel

On Wed, Oct 08, 2014 at 01:38:40PM +0200, Vincent Guittot wrote:
> I have in mind some systems where the max achievable freq of a core
> depends on how many cores are running simultaneously because of some
> HW constraint like max current. In this case, the CPU might not reach
> max frequency even with an always running task.
> Then, besides frequency scaling, there is the uarch invariance
> introduced by patch 4, which will produce similar behavior in the load.

This is a 'common' issue. x86 and powerpc also suffer from this. It can be
the result of either thermal or power thresholds.




* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-10-08  0:50   ` Yuyang Du
  2014-10-08 12:54     ` Dietmar Eggemann
@ 2014-10-10  9:14     ` Peter Zijlstra
  1 sibling, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2014-10-10  9:14 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Morten Rasmussen, mingo, dietmar.eggemann, pjt, bsegall,
	vincent.guittot, nicolas.pitre, mturquette, rjw, linux-kernel

On Wed, Oct 08, 2014 at 08:50:07AM +0800, Yuyang Du wrote:
> Hi Morten,
> 
> Sorry for jumping in late.
> 
> The problem seems to be self-evident. But for the implementation to be
> equally attractive it needs to account for every freq change for every task;
> anything less than that makes it less attractive.

I'm not entirely sure that is indeed required.

> But this should be very hard. The Intel architecture has limitations on
> capturing all the freq changes in software, and intel_pstate also issues
> no notification.

So the current Intel arch takes P-states as hints and then can mostly
lower the actual frequency, right? In that case, still accounting at the
higher frequency is not a problem; the hardware conserves more power
than we know, but that's the right kind of error to have :-)

Fully automatic hardware is basically hardware without DVFS capability,
so we should not bother at all and simply disable this scaling.

> For every task, this makes updating the entire queue in load tracking
> all the more needed, so once again, ping maintainers for the rewrite patches :)

Could you remind me what your latest version is; please reply to those
patches with a new email so that I can more conveniently locate it.


* Re: [PATCH 1/7] sched: Introduce scale-invariant load tracking
  2014-10-08 12:54     ` Dietmar Eggemann
@ 2014-10-10  9:16       ` Peter Zijlstra
  0 siblings, 0 replies; 40+ messages in thread
From: Peter Zijlstra @ 2014-10-10  9:16 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Yuyang Du, Morten Rasmussen, mingo, pjt, bsegall,
	vincent.guittot, nicolas.pitre, mturquette, rjw, linux-kernel

On Wed, Oct 08, 2014 at 01:54:58PM +0100, Dietmar Eggemann wrote:

> > But this should be very hard. The Intel architecture has limitations on
> > capturing all the freq changes in software, and intel_pstate also issues
> > no notification.
> 
> We encountered this missing notification of the current frequency on
> Intel systems (e.g. i5-3320M) using the intel_pstate driver while
> testing this patch-set. The arch_scale_set_curr_freq call in
> __cpufreq_notify_transition [[PATCH 2/7] cpufreq: Architecture specific
> callback for frequency changes] will not work on such a system.
> 
> In our internal testing, we placed arch_scale_set_curr_freq(cpu->cpu,
> sample->freq) into intel_pstate_timer_func [intel_pstate.c] to get the
> current frequency for a cpu.
> 
> The arch_scale_set_max_freq call in cpufreq_set_policy
> [drivers/cpufreq/cpufreq.c] still works although the driver exposes the
> max turbo pstate and not the max pstate. That's an additional problem
> because we don't want to use turbo states for frequency scaling.

Right, so when we pull the policy part into the scheduler, intel_pstate
will revert to just another driver without such logic and things should
just work.

But yes, currently it also implements policy; that needs to go away.


end of thread

Thread overview: 40+ messages
2014-09-22 16:24 [PATCH 0/7] sched: Scale-invariant per-entity load-tracking Morten Rasmussen
2014-09-22 16:24 ` [PATCH 1/7] sched: Introduce scale-invariant load tracking Morten Rasmussen
2014-09-25 13:48   ` Vincent Guittot
2014-09-25 17:23     ` Morten Rasmussen
2014-09-26  7:36       ` Vincent Guittot
2014-09-26  9:38         ` Morten Rasmussen
2014-10-02 20:34       ` Peter Zijlstra
2014-10-08 11:00         ` Morten Rasmussen
2014-10-08 11:21           ` Vincent Guittot
2014-10-08 13:53             ` Morten Rasmussen
2014-10-08 14:08               ` Vincent Guittot
2014-10-08 14:16                 ` Morten Rasmussen
2014-10-08 11:38         ` Vincent Guittot
2014-10-08 14:05           ` Morten Rasmussen
2014-10-10  9:07           ` Peter Zijlstra
2014-10-08  0:50   ` Yuyang Du
2014-10-08 12:54     ` Dietmar Eggemann
2014-10-10  9:16       ` Peter Zijlstra
2014-10-10  9:14     ` Peter Zijlstra
2014-09-22 16:24 ` [PATCH 2/7] cpufreq: Architecture specific callback for frequency changes Morten Rasmussen
2014-10-08  6:07   ` Mike Turquette
2014-10-08  6:26     ` [PATCH RFC 0/2] introduce capacity_ops to CFS Mike Turquette
2014-10-08  6:26       ` [PATCH RFC 1/2] sched: cfs: introduce capacity_ops Mike Turquette
2014-10-08  8:37         ` Peter Zijlstra
     [not found]           ` <20141008232836.4379.3339@quantum>
2014-10-09  9:00             ` Peter Zijlstra
     [not found]               ` <20141009173433.4379.58492@quantum>
2014-10-09 19:00                 ` Peter Zijlstra
2014-10-08  6:26       ` [PATCH RFC 2/2] cpufreq: arm_big_little: provide cpu capacity Mike Turquette
2014-10-08 15:48         ` Morten Rasmussen
     [not found]           ` <20141008223732.4379.78047@quantum>
2014-10-09  9:02             ` Peter Zijlstra
     [not found]               ` <20141009172513.4379.56718@quantum>
2014-10-09 17:38                 ` Peter Zijlstra
2014-09-22 16:24 ` [PATCH 3/7] arm: Frequency invariant scheduler load-tracking support Morten Rasmussen
2014-09-22 16:24 ` [PATCH 4/7] arm: Micro-architecture invariant load tracking support Morten Rasmussen
2014-09-22 16:24 ` [PATCH 5/7] sched: Implement usage tracking Morten Rasmussen
2014-09-22 16:24 ` [PATCH 6/7] sched: Make sched entity usage tracking scale-invariant Morten Rasmussen
2014-09-22 17:13   ` bsegall
2014-09-23 13:35     ` Morten Rasmussen
2014-10-02 21:04       ` Peter Zijlstra
2014-09-22 16:24 ` [PATCH 7/7] sched: Track sched_entity usage contributions Morten Rasmussen
2014-09-22 17:09   ` bsegall
2014-09-23 13:59     ` Morten Rasmussen
