* [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling
@ 2015-02-04 18:30 Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 01/48] sched: add utilization_avg_contrib Morten Rasmussen
                   ` (48 more replies)
  0 siblings, 49 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

Several techniques for saving energy through various scheduler
modifications have been proposed in the past; however, most of them
have not been universally beneficial across use-cases and platforms.
For example, consolidating tasks on fewer cpus is an effective way to
save energy on some platforms, while it can make things worse on
others.

This proposal, which is inspired by the Ksummit workshop discussions in
2013 [1], takes a different approach by using a (relatively) simple
platform energy cost model to guide scheduling decisions. Given
platform-specific cost data, the model can estimate the energy
implications of scheduling decisions. So instead of blindly applying
scheduling techniques that may or may not work for the current
use-case, the scheduler can make informed energy-aware decisions. We
believe this approach provides a methodology that can be adapted to any
platform, including heterogeneous systems such as ARM big.LITTLE. The
model considers cpus only, i.e. no peripherals, GPU or memory. Model
data includes power consumption at each P-state and C-state.
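
For illustration only, the sketch below shows roughly what per-group
energy data of that kind could look like. The struct and field names
are hypothetical (they are not the data structures introduced later in
this series):

struct example_cap_state {
	unsigned long cap;	/* compute capacity at this P-state */
	unsigned long power;	/* busy power at this P-state */
};

struct example_idle_state {
	unsigned long power;	/* power consumed in this C-state */
};

struct example_group_energy {
	unsigned int nr_cap_states;
	struct example_cap_state *cap_states;	/* one entry per P-state */
	unsigned int nr_idle_states;
	struct example_idle_state *idle_states;	/* one entry per C-state */
};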

This is an RFC and there are some loose ends that have not been
addressed here or in the code yet. The model and its infrastructure are
in place in the scheduler and are being used for load-balancing
decisions. The energy model data is hardcoded, the load-balancing
heuristics are still under development, and there are some limitations
still to be addressed. However, the main idea is presented here: the
use of an energy model for scheduling decisions.

RFCv3 is a consolidation of the latest energy model related patches and
previously posted patch sets related to capacity and utilization
tracking [2][3] to show where we are heading. [2] and [3] have been
rebased onto v3.19-rc7 with a few minor modifications. Large parts of
the energy model code and its use in the scheduler have been rewritten
and simplified. The patch set consists of three main parts (more
details further down):

Patch 1-11:  sched: consolidation of CPU capacity and usage [2] (rebase)

Patch 12-19: sched: frequency and cpu invariant per-entity load-tracking
                    and other load-tracking bits [3] (rebase)

Patch 20-48: sched: Energy cost model for energy-aware scheduling (RFCv3)

Test results for ARM TC2 (2xA15+3xA7) with cpufreq enabled:

sysbench: Single task running for 3 seconds.
rt-app [4]: 5 medium (~50%) periodic tasks
rt-app [4]: 2 light (~10%) periodic tasks

Average numbers for 20 runs per test.

Energy		sysbench	rt-app medium	rt-app light
Mainline	100*		100		100
EA		279		88		63

* Sensitive to task placement on big.LITTLE. Mainline may put the task
  on either cpu due to its lack of compute capacity awareness, while EA
  consistently puts heavy tasks on big cpus. The EA energy increase came
  with a 2.65x _increase_ in performance (throughput).

[1] http://etherpad.osuosl.org/energy-aware-scheduling-ks-2013 (search
for 'cost')
[2] https://lkml.org/lkml/2015/1/15/136
[3] https://lkml.org/lkml/2014/12/2/328
[4] https://wiki.linaro.org/WorkingGroups/PowerManagement/Resources/Tools/WorkloadGen

Changes:

RFCv3:

'sched: Energy cost model for energy-aware scheduling' changes:
RFCv2->RFCv3:

(1) Remove frequency- and cpu-invariant load/utilization patches since
    this is now provided by [2] and [3].

(2) Remove system-wide sched_energy to make the code easier to
    understand; as a result, single socket systems are not supported (yet).

(3) Remove wake-up energy. Extra complexity that wasn't fully justified.
    Idle-state awareness introduced recently in mainline may be
    sufficient.

(4) Remove procfs interface for energy data to make the patch-set
    smaller.

(5) Rework energy-aware load balancing code.

    In RFCv2 we only attempted to pick the source cpu in an energy-aware
    fashion. In addition to finding the most energy-inefficient source
    cpu during the load-balancing action, RFCv3 also introduces
    energy-aware selection of the tasks moved between cpus, as well as
    support for managing the 'tipping point' - the threshold where we
    switch away from energy-model-based load balancing to conventional
    load balancing (a rough sketch follows below).
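
    As a rough sketch of what such a tipping-point test could look like
    (purely illustrative; the helper name and the 10% margin are
    assumptions, not the predicate implemented in these patches), using
    get_cpu_usage() and capacity_orig_of() introduced later in the
    series:

    static bool example_below_tipping_point(struct sched_domain *sd)
    {
            int cpu;

            /* Stay energy-aware while every cpu still has some headroom. */
            for_each_cpu(cpu, sched_domain_span(sd)) {
                    if (get_cpu_usage(cpu) * 100 >
                        capacity_orig_of(cpu) * 90)
                            return false;
            }

            return true;
    }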

'sched: frequency and cpu invariant per-entity load-tracking and other
load-tracking bits' [3] changes:

(1) Remove blocked load from load tracking.

(2) Remove cpu-invariant load tracking.

    Both (1) and (2) require changes to the existing load-balance code
    which haven't been done yet. These are therefore left out until that
    has been addressed.

(3) One patch renamed.

'sched: consolidation of CPU capacity and usage' [2] changes:

(1) Fixed conflict when rebasing to v3.19-rc7.

(2) One patch subject changed slightly.


RFC v2:
 - Extended documentation:
   - Cover the energy model in greater detail.
   - Recipe for deriving platform energy model.
 - Replaced Kconfig with sched feature (jump label).
 - Add unweighted load tracking.
 - Use unweighted load as task/cpu utilization.
 - Support for multiple idle states per sched_group. cpuidle integration
   still missing.
 - Changed energy aware functionality in select_idle_sibling().
 - Experimental energy aware load-balance support.


Dietmar Eggemann (17):
  sched: Make load tracking frequency scale-invariant
  sched: Make usage tracking cpu scale-invariant
  arm: vexpress: Add CPU clock-frequencies to TC2 device-tree
  arm: Cpu invariant scheduler load-tracking support
  sched: Get rid of scaling usage by cpu_capacity_orig
  sched: Introduce energy data structures
  sched: Allocate and initialize energy data structures
  arm: topology: Define TC2 energy and provide it to the scheduler
  sched: Infrastructure to query if load balancing is energy-aware
  sched: Introduce energy awareness into update_sg_lb_stats
  sched: Introduce energy awareness into update_sd_lb_stats
  sched: Introduce energy awareness into find_busiest_group
  sched: Introduce energy awareness into find_busiest_queue
  sched: Introduce energy awareness into detach_tasks
  sched: Tipping point from energy-aware to conventional load balancing
  sched: Skip cpu as lb src which has one task and capacity gte the dst
    cpu
  sched: Turn off fast idling of cpus on a partially loaded system

Morten Rasmussen (23):
  sched: Track group sched_entity usage contributions
  sched: Make sched entity usage tracking frequency-invariant
  cpufreq: Architecture specific callback for frequency changes
  arm: Frequency invariant scheduler load-tracking support
  sched: Track blocked utilization contributions
  sched: Include blocked utilization in usage tracking
  sched: Documentation for scheduler energy cost model
  sched: Make energy awareness a sched feature
  sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
  sched: Compute cpu capacity available at current frequency
  sched: Relocated get_cpu_usage()
  sched: Use capacity_curr to cap utilization in get_cpu_usage()
  sched: Highest energy aware balancing sched_domain level pointer
  sched: Calculate energy consumption of sched_group
  sched: Extend sched_group_energy to test load-balancing decisions
  sched: Estimate energy impact of scheduling decisions
  sched: Energy-aware wake-up task placement
  sched: Bias new task wakeups towards higher capacity cpus
  sched, cpuidle: Track cpuidle state index in the scheduler
  sched: Count number of shallower idle-states in struct
    sched_group_energy
  sched: Determine the current sched_group idle-state
  sched: Enable active migration for cpus of lower capacity
  sched: Disable energy-unfriendly nohz kicks

Vincent Guittot (8):
  sched: add utilization_avg_contrib
  sched: remove frequency scaling from cpu_capacity
  sched: make scale_rt invariant with frequency
  sched: add per rq cpu_capacity_orig
  sched: get CPU's usage statistic
  sched: replace capacity_factor by usage
  sched: add SD_PREFER_SIBLING for SMT level
  sched: move cfs task on a CPU with higher capacity

 Documentation/scheduler/sched-energy.txt   | 359 +++++++++++
 arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts |   5 +
 arch/arm/kernel/topology.c                 | 218 +++++--
 drivers/cpufreq/cpufreq.c                  |  10 +-
 include/linux/sched.h                      |  43 +-
 kernel/sched/core.c                        | 119 +++-
 kernel/sched/debug.c                       |  12 +-
 kernel/sched/fair.c                        | 935 ++++++++++++++++++++++++-----
 kernel/sched/features.h                    |   6 +
 kernel/sched/idle.c                        |   2 +
 kernel/sched/sched.h                       |  75 ++-
 11 files changed, 1559 insertions(+), 225 deletions(-)
 create mode 100644 Documentation/scheduler/sched-energy.txt

-- 
1.9.1



* [RFCv3 PATCH 01/48] sched: add utilization_avg_contrib
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-11  8:50   ` Preeti U Murthy
  2015-02-04 18:30 ` [RFCv3 PATCH 02/48] sched: Track group sched_entity usage contributions Morten Rasmussen
                   ` (47 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel, Paul Turner, Ben Segall

From: Vincent Guittot <vincent.guittot@linaro.org>

Add new statistics which reflect the average time a task is running on the CPU
and the sum of the running times of the tasks on a runqueue. The latter is
named utilization_load_avg.

This patch is based on the usage metric that was proposed in the first
versions of the per-entity load tracking patch set by Paul Turner
<pjt@google.com> but that was removed afterwards. This version differs from
the original one in that it is not linked to task_group.

The rq's utilization_load_avg will be used to check whether a rq is
overloaded, instead of trying to compute how many tasks a group of CPUs can
handle.

Rename runnable_avg_period into avg_period as it is now used with both
runnable_avg_sum and running_avg_sum.

Add some descriptions of the variables to explain their differences.
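
As a rough worked example of the new metric (the numbers are made up for
illustration; the actual update code is in the diff below), a task that has
been running for about half of its decayed window ends up with a
utilization_avg_contrib of roughly half of SCHED_LOAD_SCALE:

	/* Illustrative only, mirroring __update_task_entity_utilization(). */
	u32 running_avg_sum = 23871;	/* roughly LOAD_AVG_MAX / 2 */
	u32 avg_period = 47742;		/* roughly LOAD_AVG_MAX */
	u32 contrib;

	contrib = running_avg_sum * 1024;	/* assuming SCHED_LOAD_SCALE == 1024 */
	contrib /= (avg_period + 1);		/* ~512, i.e. ~50% of the scale */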

cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h | 21 ++++++++++++---
 kernel/sched/debug.c  | 10 ++++---
 kernel/sched/fair.c   | 74 ++++++++++++++++++++++++++++++++++++++++-----------
 kernel/sched/sched.h  |  8 +++++-
 4 files changed, 89 insertions(+), 24 deletions(-)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8db31ef..e220a91 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1111,15 +1111,28 @@ struct load_weight {
 };
 
 struct sched_avg {
+	u64 last_runnable_update;
+	s64 decay_count;
+	/*
+	 * utilization_avg_contrib describes the amount of time that a
+	 * sched_entity is running on a CPU. It is based on running_avg_sum
+	 * and is scaled in the range [0..SCHED_LOAD_SCALE].
+	 * load_avg_contrib described the amount of time that a sched_entity
+	 * is runnable on a rq. It is based on both runnable_avg_sum and the
+	 * weight of the task.
+	 */
+	unsigned long load_avg_contrib, utilization_avg_contrib;
 	/*
 	 * These sums represent an infinite geometric series and so are bound
 	 * above by 1024/(1-y).  Thus we only need a u32 to store them for all
 	 * choices of y < 1-2^(-32)*1024.
+	 * running_avg_sum reflects the time that the sched_entity is
+	 * effectively running on the CPU.
+	 * runnable_avg_sum represents the amount of time a sched_entity is on
+	 * a runqueue which includes the running time that is monitored by
+	 * running_avg_sum.
 	 */
-	u32 runnable_avg_sum, runnable_avg_period;
-	u64 last_runnable_update;
-	s64 decay_count;
-	unsigned long load_avg_contrib;
+	u32 runnable_avg_sum, avg_period, running_avg_sum;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 92cc520..3033aaa 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -71,7 +71,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	if (!se) {
 		struct sched_avg *avg = &cpu_rq(cpu)->avg;
 		P(avg->runnable_avg_sum);
-		P(avg->runnable_avg_period);
+		P(avg->avg_period);
 		return;
 	}
 
@@ -94,7 +94,7 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->load.weight);
 #ifdef CONFIG_SMP
 	P(se->avg.runnable_avg_sum);
-	P(se->avg.runnable_avg_period);
+	P(se->avg.avg_period);
 	P(se->avg.load_avg_contrib);
 	P(se->avg.decay_count);
 #endif
@@ -214,6 +214,8 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 			cfs_rq->runnable_load_avg);
 	SEQ_printf(m, "  .%-30s: %ld\n", "blocked_load_avg",
 			cfs_rq->blocked_load_avg);
+	SEQ_printf(m, "  .%-30s: %ld\n", "utilization_load_avg",
+			cfs_rq->utilization_load_avg);
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_contrib",
 			cfs_rq->tg_load_contrib);
@@ -635,8 +637,10 @@ void proc_sched_show_task(struct task_struct *p, struct seq_file *m)
 	P(se.load.weight);
 #ifdef CONFIG_SMP
 	P(se.avg.runnable_avg_sum);
-	P(se.avg.runnable_avg_period);
+	P(se.avg.running_avg_sum);
+	P(se.avg.avg_period);
 	P(se.avg.load_avg_contrib);
+	P(se.avg.utilization_avg_contrib);
 	P(se.avg.decay_count);
 #endif
 	P(policy);
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 40667cb..29adcbb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -670,6 +670,7 @@ static int select_idle_sibling(struct task_struct *p, int cpu);
 static unsigned long task_h_load(struct task_struct *p);
 
 static inline void __update_task_entity_contrib(struct sched_entity *se);
+static inline void __update_task_entity_utilization(struct sched_entity *se);
 
 /* Give new task start runnable values to heavy its load in infant time */
 void init_task_runnable_average(struct task_struct *p)
@@ -678,9 +679,10 @@ void init_task_runnable_average(struct task_struct *p)
 
 	p->se.avg.decay_count = 0;
 	slice = sched_slice(task_cfs_rq(p), &p->se) >> 10;
-	p->se.avg.runnable_avg_sum = slice;
-	p->se.avg.runnable_avg_period = slice;
+	p->se.avg.runnable_avg_sum = p->se.avg.running_avg_sum = slice;
+	p->se.avg.avg_period = slice;
 	__update_task_entity_contrib(&p->se);
+	__update_task_entity_utilization(&p->se);
 }
 #else
 void init_task_runnable_average(struct task_struct *p)
@@ -1674,7 +1676,7 @@ static u64 numa_get_avg_runtime(struct task_struct *p, u64 *period)
 		*period = now - p->last_task_numa_placement;
 	} else {
 		delta = p->se.avg.runnable_avg_sum;
-		*period = p->se.avg.runnable_avg_period;
+		*period = p->se.avg.avg_period;
 	}
 
 	p->last_sum_exec_runtime = runtime;
@@ -2500,7 +2502,8 @@ static u32 __compute_runnable_contrib(u64 n)
  */
 static __always_inline int __update_entity_runnable_avg(u64 now,
 							struct sched_avg *sa,
-							int runnable)
+							int runnable,
+							int running)
 {
 	u64 delta, periods;
 	u32 runnable_contrib;
@@ -2526,7 +2529,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	sa->last_runnable_update = now;
 
 	/* delta_w is the amount already accumulated against our next period */
-	delta_w = sa->runnable_avg_period % 1024;
+	delta_w = sa->avg_period % 1024;
 	if (delta + delta_w >= 1024) {
 		/* period roll-over */
 		decayed = 1;
@@ -2539,7 +2542,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		delta_w = 1024 - delta_w;
 		if (runnable)
 			sa->runnable_avg_sum += delta_w;
-		sa->runnable_avg_period += delta_w;
+		if (running)
+			sa->running_avg_sum += delta_w;
+		sa->avg_period += delta_w;
 
 		delta -= delta_w;
 
@@ -2549,20 +2554,26 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 
 		sa->runnable_avg_sum = decay_load(sa->runnable_avg_sum,
 						  periods + 1);
-		sa->runnable_avg_period = decay_load(sa->runnable_avg_period,
+		sa->running_avg_sum = decay_load(sa->running_avg_sum,
+						  periods + 1);
+		sa->avg_period = decay_load(sa->avg_period,
 						     periods + 1);
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		runnable_contrib = __compute_runnable_contrib(periods);
 		if (runnable)
 			sa->runnable_avg_sum += runnable_contrib;
-		sa->runnable_avg_period += runnable_contrib;
+		if (running)
+			sa->running_avg_sum += runnable_contrib;
+		sa->avg_period += runnable_contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
 	if (runnable)
 		sa->runnable_avg_sum += delta;
-	sa->runnable_avg_period += delta;
+	if (running)
+		sa->running_avg_sum += delta;
+	sa->avg_period += delta;
 
 	return decayed;
 }
@@ -2578,6 +2589,8 @@ static inline u64 __synchronize_entity_decay(struct sched_entity *se)
 		return 0;
 
 	se->avg.load_avg_contrib = decay_load(se->avg.load_avg_contrib, decays);
+	se->avg.utilization_avg_contrib =
+		decay_load(se->avg.utilization_avg_contrib, decays);
 	se->avg.decay_count = 0;
 
 	return decays;
@@ -2614,7 +2627,7 @@ static inline void __update_tg_runnable_avg(struct sched_avg *sa,
 
 	/* The fraction of a cpu used by this cfs_rq */
 	contrib = div_u64((u64)sa->runnable_avg_sum << NICE_0_SHIFT,
-			  sa->runnable_avg_period + 1);
+			  sa->avg_period + 1);
 	contrib -= cfs_rq->tg_runnable_contrib;
 
 	if (abs(contrib) > cfs_rq->tg_runnable_contrib / 64) {
@@ -2667,7 +2680,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable);
+	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
+			runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2685,7 +2699,7 @@ static inline void __update_task_entity_contrib(struct sched_entity *se)
 
 	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
 	contrib = se->avg.runnable_avg_sum * scale_load_down(se->load.weight);
-	contrib /= (se->avg.runnable_avg_period + 1);
+	contrib /= (se->avg.avg_period + 1);
 	se->avg.load_avg_contrib = scale_load(contrib);
 }
 
@@ -2704,6 +2718,27 @@ static long __update_entity_load_avg_contrib(struct sched_entity *se)
 	return se->avg.load_avg_contrib - old_contrib;
 }
 
+
+static inline void __update_task_entity_utilization(struct sched_entity *se)
+{
+	u32 contrib;
+
+	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
+	contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
+	contrib /= (se->avg.avg_period + 1);
+	se->avg.utilization_avg_contrib = scale_load(contrib);
+}
+
+static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
+{
+	long old_contrib = se->avg.utilization_avg_contrib;
+
+	if (entity_is_task(se))
+		__update_task_entity_utilization(se);
+
+	return se->avg.utilization_avg_contrib - old_contrib;
+}
+
 static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
 						 long load_contrib)
 {
@@ -2720,7 +2755,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 					  int update_cfs_rq)
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
-	long contrib_delta;
+	long contrib_delta, utilization_delta;
 	u64 now;
 
 	/*
@@ -2732,18 +2767,22 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq))
+	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+					cfs_rq->curr == se))
 		return;
 
 	contrib_delta = __update_entity_load_avg_contrib(se);
+	utilization_delta = __update_entity_utilization_avg_contrib(se);
 
 	if (!update_cfs_rq)
 		return;
 
-	if (se->on_rq)
+	if (se->on_rq) {
 		cfs_rq->runnable_load_avg += contrib_delta;
-	else
+		cfs_rq->utilization_load_avg += utilization_delta;
+	} else {
 		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+	}
 }
 
 /*
@@ -2818,6 +2857,7 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	}
 
 	cfs_rq->runnable_load_avg += se->avg.load_avg_contrib;
+	cfs_rq->utilization_load_avg += se->avg.utilization_avg_contrib;
 	/* we force update consideration on load-balancer moves */
 	update_cfs_rq_blocked_load(cfs_rq, !wakeup);
 }
@@ -2836,6 +2876,7 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 	update_cfs_rq_blocked_load(cfs_rq, !sleep);
 
 	cfs_rq->runnable_load_avg -= se->avg.load_avg_contrib;
+	cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
 	if (sleep) {
 		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
@@ -3173,6 +3214,7 @@ set_next_entity(struct cfs_rq *cfs_rq, struct sched_entity *se)
 		 */
 		update_stats_wait_end(cfs_rq, se);
 		__dequeue_entity(cfs_rq, se);
+		update_entity_load_avg(se, 1);
 	}
 
 	update_stats_curr_start(cfs_rq, se);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9a2a45c..17a3b6b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -362,8 +362,14 @@ struct cfs_rq {
 	 * Under CFS, load is tracked on a per-entity basis and aggregated up.
 	 * This allows for the description of both thread and group usage (in
 	 * the FAIR_GROUP_SCHED case).
+	 * runnable_load_avg is the sum of the load_avg_contrib of the
+	 * sched_entities on the rq.
+	 * blocked_load_avg is similar to runnable_load_avg except that its
+	 * the blocked sched_entities on the rq.
+	 * utilization_load_avg is the sum of the average running time of the
+	 * sched_entities on the rq.
 	 */
-	unsigned long runnable_load_avg, blocked_load_avg;
+	unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
 	atomic64_t decay_counter;
 	u64 last_decay;
 	atomic_long_t removed_load;
-- 
1.9.1



* [RFCv3 PATCH 02/48] sched: Track group sched_entity usage contributions
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 01/48] sched: add utilization_avg_contrib Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 03/48] sched: remove frequency scaling from cpu_capacity Morten Rasmussen
                   ` (46 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel, Paul Turner, Ben Segall

Adds usage contribution tracking for group entities. Unlike
se->avg.load_avg_contrib, se->avg.utilization_avg_contrib for group
entities is the sum of se->avg.utilization_avg_contrib for all entities on the
group runqueue. It is _not_ influenced in any way by the task group
h_load. Hence it represents the actual cpu usage of the group, not
its intended load contribution, which may differ significantly from the
utilization on lightly utilized systems.

cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/debug.c | 2 ++
 kernel/sched/fair.c  | 3 +++
 2 files changed, 5 insertions(+)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 3033aaa..9dce8b5 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -94,8 +94,10 @@ static void print_cfs_group_stats(struct seq_file *m, int cpu, struct task_group
 	P(se->load.weight);
 #ifdef CONFIG_SMP
 	P(se->avg.runnable_avg_sum);
+	P(se->avg.running_avg_sum);
 	P(se->avg.avg_period);
 	P(se->avg.load_avg_contrib);
+	P(se->avg.utilization_avg_contrib);
 	P(se->avg.decay_count);
 #endif
 #undef PN
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 29adcbb..fad93d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2735,6 +2735,9 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
 
 	if (entity_is_task(se))
 		__update_task_entity_utilization(se);
+	else
+		se->avg.utilization_avg_contrib =
+					group_cfs_rq(se)->utilization_load_avg;
 
 	return se->avg.utilization_avg_contrib - old_contrib;
 }
-- 
1.9.1



* [RFCv3 PATCH 03/48] sched: remove frequency scaling from cpu_capacity
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 01/48] sched: add utilization_avg_contrib Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 02/48] sched: Track group sched_entity usage contributions Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 04/48] sched: Make sched entity usage tracking frequency-invariant Morten Rasmussen
                   ` (45 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Vincent Guittot <vincent.guittot@linaro.org>

Now that arch_scale_cpu_capacity has been introduced to scale the original
capacity, arch_scale_freq_capacity is no longer used (it was
previously used by the ARM architecture). Remove arch_scale_freq_capacity from the
computation of cpu_capacity. The frequency invariance will be handled in the
load tracking and not in the CPU capacity. arch_scale_freq_capacity will be
revisited for scaling load with the current frequency of the CPUs in a later
patch.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 7 -------
 1 file changed, 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fad93d8..35fd296 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6030,13 +6030,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
 	sdg->sgc->capacity_orig = capacity;
 
-	if (sched_feat(ARCH_CAPACITY))
-		capacity *= arch_scale_freq_capacity(sd, cpu);
-	else
-		capacity *= default_scale_capacity(sd, cpu);
-
-	capacity >>= SCHED_CAPACITY_SHIFT;
-
 	capacity *= scale_rt_capacity(cpu);
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
-- 
1.9.1



* [RFCv3 PATCH 04/48] sched: Make sched entity usage tracking frequency-invariant
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (2 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 03/48] sched: remove frequency scaling from cpu_capacity Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 05/48] sched: make scale_rt invariant with frequency Morten Rasmussen
                   ` (44 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel, Paul Turner, Ben Segall

Apply frequency scale-invariance correction factor to usage tracking.
Each segment of the running_load_avg geometric series is now scaled by the
current frequency so the utilization_avg_contrib of each entity will be
invariant with frequency scaling. As a result, utilization_load_avg which is
the sum of utilization_avg_contrib, becomes invariant too. So the usage level
that is returned by get_cpu_usage stays relative to the max frequency, as does
the cpu_capacity it is compared against.
Then, we want to keep the load tracking values in a 32-bit type, which implies
that the max value of {runnable|running}_avg_sum must be lower than
2^32/88761=48388 (88761 is the max weight of a task). As LOAD_AVG_MAX = 47742,
arch_scale_freq_capacity must return a value less than
(48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE = 1024).
So we define the range to [0..SCHED_CAPACITY_SCALE] in order to avoid overflow.
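
The bound quoted above can be double-checked with a few lines of arithmetic
(illustrative only; the constants are the ones stated in the text):

	/* Illustrative check of the overflow bound. */
	u64 max_weight = 88761;					/* max task weight */
	u64 max_sum = div64_u64(1ULL << 32, max_weight);	/* = 48388 */
	u64 load_avg_max = 47742;				/* LOAD_AVG_MAX */
	u64 max_scale = div64_u64(max_sum << SCHED_CAPACITY_SHIFT,
				  load_avg_max);		/* = 1037 > 1024 */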

cc: Paul Turner <pjt@google.com>
cc: Ben Segall <bsegall@google.com>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 21 ++++++++++++++-------
 1 file changed, 14 insertions(+), 7 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 35fd296..b6fb7c4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2472,6 +2472,8 @@ static u32 __compute_runnable_contrib(u64 n)
 	return contrib + runnable_avg_yN_sum[n];
 }
 
+unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
 /*
  * We can represent the historical contribution to runnable average as the
  * coefficients of a geometric series.  To do this we sub-divide our runnable
@@ -2500,7 +2502,7 @@ static u32 __compute_runnable_contrib(u64 n)
  *   load_avg = u_0` + y*(u_0 + u_1*y + u_2*y^2 + ... )
  *            = u_0 + u_1*y + u_2*y^2 + ... [re-labeling u_i --> u_{i+1}]
  */
-static __always_inline int __update_entity_runnable_avg(u64 now,
+static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 							struct sched_avg *sa,
 							int runnable,
 							int running)
@@ -2508,6 +2510,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	u64 delta, periods;
 	u32 runnable_contrib;
 	int delta_w, decayed = 0;
+	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
 
 	delta = now - sa->last_runnable_update;
 	/*
@@ -2543,7 +2546,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		if (runnable)
 			sa->runnable_avg_sum += delta_w;
 		if (running)
-			sa->running_avg_sum += delta_w;
+			sa->running_avg_sum += delta_w * scale_freq
+				>> SCHED_CAPACITY_SHIFT;
 		sa->avg_period += delta_w;
 
 		delta -= delta_w;
@@ -2564,7 +2568,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 		if (runnable)
 			sa->runnable_avg_sum += runnable_contrib;
 		if (running)
-			sa->running_avg_sum += runnable_contrib;
+			sa->running_avg_sum += runnable_contrib * scale_freq
+				>> SCHED_CAPACITY_SHIFT;
 		sa->avg_period += runnable_contrib;
 	}
 
@@ -2572,7 +2577,8 @@ static __always_inline int __update_entity_runnable_avg(u64 now,
 	if (runnable)
 		sa->runnable_avg_sum += delta;
 	if (running)
-		sa->running_avg_sum += delta;
+		sa->running_avg_sum += delta * scale_freq
+			>> SCHED_CAPACITY_SHIFT;
 	sa->avg_period += delta;
 
 	return decayed;
@@ -2680,8 +2686,8 @@ static inline void __update_group_entity_contrib(struct sched_entity *se)
 
 static inline void update_rq_runnable_avg(struct rq *rq, int runnable)
 {
-	__update_entity_runnable_avg(rq_clock_task(rq), &rq->avg, runnable,
-			runnable);
+	__update_entity_runnable_avg(rq_clock_task(rq), cpu_of(rq), &rq->avg,
+			runnable, runnable);
 	__update_tg_runnable_avg(&rq->avg, &rq->cfs);
 }
 #else /* CONFIG_FAIR_GROUP_SCHED */
@@ -2759,6 +2765,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 {
 	struct cfs_rq *cfs_rq = cfs_rq_of(se);
 	long contrib_delta, utilization_delta;
+	int cpu = cpu_of(rq_of(cfs_rq));
 	u64 now;
 
 	/*
@@ -2770,7 +2777,7 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 	else
 		now = cfs_rq_clock_task(group_cfs_rq(se));
 
-	if (!__update_entity_runnable_avg(now, &se->avg, se->on_rq,
+	if (!__update_entity_runnable_avg(now, cpu, &se->avg, se->on_rq,
 					cfs_rq->curr == se))
 		return;
 
-- 
1.9.1



* [RFCv3 PATCH 05/48] sched: make scale_rt invariant with frequency
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (3 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 04/48] sched: Make sched entity usage tracking frequency-invariant Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 06/48] sched: add per rq cpu_capacity_orig Morten Rasmussen
                   ` (43 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Vincent Guittot <vincent.guittot@linaro.org>

The average running time of RT tasks is used to estimate the remaining compute
capacity for CFS tasks. This remaining capacity is the original capacity scaled
down by a factor (aka scale_rt_capacity). This estimation of available capacity
must also be invariant with frequency scaling.

A frequency scaling factor is applied to the running time of the RT tasks
when computing scale_rt_capacity.

In sched_rt_avg_update, we scale the RT execution time like below:
rq->rt_avg += rt_delta * arch_scale_freq_capacity() >> SCHED_CAPACITY_SHIFT

Then, scale_rt_capacity can be summarized by:
scale_rt_capacity = SCHED_CAPACITY_SCALE -
		((rq->rt_avg << SCHED_CAPACITY_SHIFT) / period)

We can optimize by removing the right and left shifts in the computation of
rq->rt_avg and scale_rt_capacity.

The call to arch_scale_freq_capacity in the rt scheduling path might be
a concern for RT folks because I'm not sure whether we can rely on
arch_scale_freq_capacity to be short and efficient.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 17 +++++------------
 kernel/sched/sched.h |  4 +++-
 2 files changed, 8 insertions(+), 13 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b6fb7c4..cfe3aea 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5992,7 +5992,7 @@ unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 static unsigned long scale_rt_capacity(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
-	u64 total, available, age_stamp, avg;
+	u64 total, used, age_stamp, avg;
 	s64 delta;
 
 	/*
@@ -6008,19 +6008,12 @@ static unsigned long scale_rt_capacity(int cpu)
 
 	total = sched_avg_period() + delta;
 
-	if (unlikely(total < avg)) {
-		/* Ensures that capacity won't end up being negative */
-		available = 0;
-	} else {
-		available = total - avg;
-	}
+	used = div_u64(avg, total);
 
-	if (unlikely((s64)total < SCHED_CAPACITY_SCALE))
-		total = SCHED_CAPACITY_SCALE;
+	if (likely(used < SCHED_CAPACITY_SCALE))
+		return SCHED_CAPACITY_SCALE - used;
 
-	total >>= SCHED_CAPACITY_SHIFT;
-
-	return div_u64(available, total);
+	return 1;
 }
 
 static void update_cpu_capacity(struct sched_domain *sd, int cpu)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 17a3b6b..e61f00e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1356,9 +1356,11 @@ static inline int hrtick_enabled(struct rq *rq)
 
 #ifdef CONFIG_SMP
 extern void sched_avg_update(struct rq *rq);
+extern unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+
 static inline void sched_rt_avg_update(struct rq *rq, u64 rt_delta)
 {
-	rq->rt_avg += rt_delta;
+	rq->rt_avg += rt_delta * arch_scale_freq_capacity(NULL, cpu_of(rq));
 	sched_avg_update(rq);
 }
 #else
-- 
1.9.1



* [RFCv3 PATCH 06/48] sched: add per rq cpu_capacity_orig
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (4 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 05/48] sched: make scale_rt invariant with frequency Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 07/48] sched: get CPU's usage statistic Morten Rasmussen
                   ` (42 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Vincent Guittot <vincent.guittot@linaro.org>

This new field cpu_capacity_orig reflects the original capacity of a CPU
before being altered by rt tasks and/or IRQ.

The cpu_capacity_orig will be used:
- to detect when the capacity of a CPU has been noticeably reduced so we can
  trigger load balancing to look for a CPU with better capacity. As an
  example, we can detect when a CPU handles a significant amount of irq
  (with CONFIG_IRQ_TIME_ACCOUNTING) but is seen as an idle CPU by the
  scheduler whereas CPUs which are really idle are available.
- to evaluate the available capacity for CFS tasks

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Kamalesh Babulal <kamalesh@linux.vnet.ibm.com>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c  | 2 +-
 kernel/sched/fair.c  | 8 +++++++-
 kernel/sched/sched.h | 1 +
 3 files changed, 9 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e628cb1..48f9053 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -7212,7 +7212,7 @@ void __init sched_init(void)
 #ifdef CONFIG_SMP
 		rq->sd = NULL;
 		rq->rd = NULL;
-		rq->cpu_capacity = SCHED_CAPACITY_SCALE;
+		rq->cpu_capacity = rq->cpu_capacity_orig = SCHED_CAPACITY_SCALE;
 		rq->post_schedule = 0;
 		rq->active_balance = 0;
 		rq->next_balance = jiffies;
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cfe3aea..3fdad38 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4351,6 +4351,11 @@ static unsigned long capacity_of(int cpu)
 	return cpu_rq(cpu)->cpu_capacity;
 }
 
+static unsigned long capacity_orig_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity_orig;
+}
+
 static unsigned long cpu_avg_load_per_task(int cpu)
 {
 	struct rq *rq = cpu_rq(cpu);
@@ -6028,6 +6033,7 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
+	cpu_rq(cpu)->cpu_capacity_orig = capacity;
 	sdg->sgc->capacity_orig = capacity;
 
 	capacity *= scale_rt_capacity(cpu);
@@ -6082,7 +6088,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 * Runtime updates will correct capacity_orig.
 			 */
 			if (unlikely(!rq->sd)) {
-				capacity_orig += capacity_of(cpu);
+				capacity_orig += capacity_orig_of(cpu);
 				capacity += capacity_of(cpu);
 				continue;
 			}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e61f00e..09bb18b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -604,6 +604,7 @@ struct rq {
 	struct sched_domain *sd;
 
 	unsigned long cpu_capacity;
+	unsigned long cpu_capacity_orig;
 
 	unsigned char idle_balance;
 	/* For active balancing */
-- 
1.9.1



* [RFCv3 PATCH 07/48] sched: get CPU's usage statistic
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (5 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 06/48] sched: add per rq cpu_capacity_orig Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 08/48] sched: replace capacity_factor by usage Morten Rasmussen
                   ` (41 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Vincent Guittot <vincent.guittot@linaro.org>

Monitor the usage level of each group of each sched_domain level. The usage is
the portion of cpu_capacity_orig that is currently used on a CPU or group of
CPUs. We use the utilization_load_avg to evaluate the usage level of each
group.

The utilization_load_avg only takes into account the running time of the CFS
tasks on a CPU, with a maximum value of SCHED_LOAD_SCALE when the CPU is fully
utilized. Nevertheless, we must cap utilization_load_avg, which can be
temporarily greater than SCHED_LOAD_SCALE after the migration of a task onto
this CPU and until the metrics are stabilized.

The utilization_load_avg is in the range [0..SCHED_LOAD_SCALE] to reflect the
running load on the CPU, whereas the available capacity for the CFS tasks is in
the range [0..cpu_capacity_orig]. In order to test whether a CPU is fully
utilized by CFS tasks, we have to scale the utilization into the
cpu_capacity_orig range of the CPU to get the usage of the latter. The usage
can then be compared with the available capacity (i.e. cpu_capacity) to deduce
the usage level of a CPU.

The frequency scaling invariance of the usage is not taken into account in this
patch; it will be solved in another patch which will deal with frequency
scaling invariance of the running_load_avg.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 29 +++++++++++++++++++++++++++++
 1 file changed, 29 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3fdad38..7ec48db 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4769,6 +4769,33 @@ static int select_idle_sibling(struct task_struct *p, int target)
 done:
 	return target;
 }
+/*
+ * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must capacity so we can compare the
+ * usage with the capacity of the CPU that is available for CFS task (ie
+ * cpu_capacity).
+ * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * CPU. It represents the amount of utilization of a CPU in the range
+ * [0..SCHED_LOAD_SCALE].  The usage of a CPU can't be higher than the full
+ * capacity of the CPU because it's about the running time on this CPU.
+ * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
+ * because of unfortunate rounding in avg_period and running_load_avg or just
+ * after migrating tasks until the average stabilizes with the new running
+ * time. So we need to check that the usage stays into the range
+ * [0..cpu_capacity_orig] and cap if necessary.
+ * Without capping the usage, a group could be seen as overloaded (CPU0 usage
+ * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
+ */
+static int get_cpu_usage(int cpu)
+{
+	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+	unsigned long capacity = capacity_orig_of(cpu);
+
+	if (usage >= SCHED_LOAD_SCALE)
+		return capacity;
+
+	return (usage * capacity) >> SCHED_LOAD_SHIFT;
+}
 
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
@@ -5895,6 +5922,7 @@ struct sg_lb_stats {
 	unsigned long sum_weighted_load; /* Weighted load of group's tasks */
 	unsigned long load_per_task;
 	unsigned long group_capacity;
+	unsigned long group_usage; /* Total usage of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
@@ -6243,6 +6271,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 			load = source_load(i, load_idx);
 
 		sgs->group_load += load;
+		sgs->group_usage += get_cpu_usage(i);
 		sgs->sum_nr_running += rq->cfs.h_nr_running;
 
 		if (rq->nr_running > 1)
-- 
1.9.1



* [RFCv3 PATCH 08/48] sched: replace capacity_factor by usage
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (6 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 07/48] sched: get CPU's usage statistic Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 09/48] sched: add SD_PREFER_SIBLING for SMT level Morten Rasmussen
                   ` (40 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Vincent Guittot <vincent.guittot@linaro.org>

The scheduler tries to compute how many tasks a group of CPUs can handle by
assuming that a task's load is SCHED_LOAD_SCALE and a CPU's capacity is
SCHED_CAPACITY_SCALE. group_capacity_factor divides the capacity of the group
by SCHED_LOAD_SCALE to estimate how many tasks can run in the group. Then, it
compares this value with the sum of nr_running to decide if the group is
overloaded or not. But group_capacity_factor hardly works for SMT systems;
it sometimes works for big cores but fails to do the right thing for
little cores.

Below are two examples to illustrate the problem that this patch solves:

1 - If the original capacity of a CPU is less than SCHED_CAPACITY_SCALE
(640 as an example), a group of 3 CPUs will have a max capacity_factor of 2
(div_round_closest(3x640/1024) = 2), which means that it will be seen as
overloaded even if we have only one task per CPU.

2 - If the original capacity of a CPU is greater than SCHED_CAPACITY_SCALE
(1512 as an example), a group of 4 CPUs will have a capacity_factor of 4
(at max, thanks to the fix [0] for SMT systems that prevents the appearance
of ghost CPUs), but if one CPU is fully used by rt tasks (and its capacity is
reduced to nearly nothing), the capacity factor of the group will still be 4
(div_round_closest(3*1512/1024) = 5, which is capped to 4 with [0]).
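
The rounding in example 1, for instance, can be reproduced directly
(illustrative only):

	/* Example 1 above: a group of 3 CPUs of original capacity 640. */
	unsigned int capacity_factor = DIV_ROUND_CLOSEST(3 * 640, 1024);
	/* = 2, so one task per CPU already makes the group look overloaded. */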

So, this patch tries to solve this issue by removing capacity_factor and
replacing it with the two following metrics:
- The CPU capacity available for CFS tasks, which is already used by
  load_balance.
- The usage of the CPU by the CFS tasks. For the latter,
  utilization_avg_contrib has been re-introduced to compute the usage of a CPU
  by CFS tasks.

group_capacity_factor and group_has_free_capacity have been removed and
replaced by group_no_capacity. We compare the number of tasks with the number
of CPUs and we evaluate the level of utilization of the CPUs to decide whether
a group is overloaded or has capacity to handle more tasks.

For SD_PREFER_SIBLING, a group is tagged overloaded if it has more than 1 task
so it will be selected in priority (among the overloaded groups). Since [1],
SD_PREFER_SIBLING is no longer concerned by the computation of
load_above_capacity because local is not overloaded.

Finally, sched_group->sched_group_capacity->capacity_orig has been removed
because it is no longer used during load balancing.

[1] https://lkml.org/lkml/2014/8/12/295

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
[Fixed merge conflict on v3.19-rc6: Morten Rasmussen
<morten.rasmussen@arm.com>]
---
 kernel/sched/core.c  |  12 ----
 kernel/sched/fair.c  | 152 +++++++++++++++++++++++++--------------------------
 kernel/sched/sched.h |   2 +-
 3 files changed, 75 insertions(+), 91 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 48f9053..252011d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5442,17 +5442,6 @@ static int sched_domain_debug_one(struct sched_domain *sd, int cpu, int level,
 			break;
 		}
 
-		/*
-		 * Even though we initialize ->capacity to something semi-sane,
-		 * we leave capacity_orig unset. This allows us to detect if
-		 * domain iteration is still funny without causing /0 traps.
-		 */
-		if (!group->sgc->capacity_orig) {
-			printk(KERN_CONT "\n");
-			printk(KERN_ERR "ERROR: domain->cpu_capacity not set\n");
-			break;
-		}
-
 		if (!cpumask_weight(sched_group_cpus(group))) {
 			printk(KERN_CONT "\n");
 			printk(KERN_ERR "ERROR: empty group\n");
@@ -5937,7 +5926,6 @@ build_overlap_sched_groups(struct sched_domain *sd, int cpu)
 		 * die on a /0 trap.
 		 */
 		sg->sgc->capacity = SCHED_CAPACITY_SCALE * cpumask_weight(sg_span);
-		sg->sgc->capacity_orig = sg->sgc->capacity;
 
 		/*
 		 * Make sure the first group of this domain contains the
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7ec48db..52c494f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5924,11 +5924,10 @@ struct sg_lb_stats {
 	unsigned long group_capacity;
 	unsigned long group_usage; /* Total usage of the group */
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
-	unsigned int group_capacity_factor;
 	unsigned int idle_cpus;
 	unsigned int group_weight;
 	enum group_type group_type;
-	int group_has_free_capacity;
+	int group_no_capacity;
 #ifdef CONFIG_NUMA_BALANCING
 	unsigned int nr_numa_running;
 	unsigned int nr_preferred_running;
@@ -6062,7 +6061,6 @@ static void update_cpu_capacity(struct sched_domain *sd, int cpu)
 	capacity >>= SCHED_CAPACITY_SHIFT;
 
 	cpu_rq(cpu)->cpu_capacity_orig = capacity;
-	sdg->sgc->capacity_orig = capacity;
 
 	capacity *= scale_rt_capacity(cpu);
 	capacity >>= SCHED_CAPACITY_SHIFT;
@@ -6078,7 +6076,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 {
 	struct sched_domain *child = sd->child;
 	struct sched_group *group, *sdg = sd->groups;
-	unsigned long capacity, capacity_orig;
+	unsigned long capacity;
 	unsigned long interval;
 
 	interval = msecs_to_jiffies(sd->balance_interval);
@@ -6090,7 +6088,7 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 		return;
 	}
 
-	capacity_orig = capacity = 0;
+	capacity = 0;
 
 	if (child->flags & SD_OVERLAP) {
 		/*
@@ -6110,19 +6108,15 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 			 * Use capacity_of(), which is set irrespective of domains
 			 * in update_cpu_capacity().
 			 *
-			 * This avoids capacity/capacity_orig from being 0 and
-			 * causing divide-by-zero issues on boot.
-			 *
-			 * Runtime updates will correct capacity_orig.
+			 * This avoids capacity from being 0 and causing
+			 * divide-by-zero issues on boot.
 			 */
 			if (unlikely(!rq->sd)) {
-				capacity_orig += capacity_orig_of(cpu);
 				capacity += capacity_of(cpu);
 				continue;
 			}
 
 			sgc = rq->sd->groups->sgc;
-			capacity_orig += sgc->capacity_orig;
 			capacity += sgc->capacity;
 		}
 	} else  {
@@ -6133,39 +6127,24 @@ void update_group_capacity(struct sched_domain *sd, int cpu)
 
 		group = child->groups;
 		do {
-			capacity_orig += group->sgc->capacity_orig;
 			capacity += group->sgc->capacity;
 			group = group->next;
 		} while (group != child->groups);
 	}
 
-	sdg->sgc->capacity_orig = capacity_orig;
 	sdg->sgc->capacity = capacity;
 }
 
 /*
- * Try and fix up capacity for tiny siblings, this is needed when
- * things like SD_ASYM_PACKING need f_b_g to select another sibling
- * which on its own isn't powerful enough.
- *
- * See update_sd_pick_busiest() and check_asym_packing().
+ * Check whether the capacity of the rq has been noticeably reduced by side
+ * activity. The imbalance_pct is used for the threshold.
+ * Return true is the capacity is reduced
  */
 static inline int
-fix_small_capacity(struct sched_domain *sd, struct sched_group *group)
+check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
 {
-	/*
-	 * Only siblings can have significantly less than SCHED_CAPACITY_SCALE
-	 */
-	if (!(sd->flags & SD_SHARE_CPUCAPACITY))
-		return 0;
-
-	/*
-	 * If ~90% of the cpu_capacity is still there, we're good.
-	 */
-	if (group->sgc->capacity * 32 > group->sgc->capacity_orig * 29)
-		return 1;
-
-	return 0;
+	return ((rq->cpu_capacity * sd->imbalance_pct) <
+				(rq->cpu_capacity_orig * 100));
 }
 
 /*
@@ -6203,37 +6182,54 @@ static inline int sg_imbalanced(struct sched_group *group)
 }
 
 /*
- * Compute the group capacity factor.
- *
- * Avoid the issue where N*frac(smt_capacity) >= 1 creates 'phantom' cores by
- * first dividing out the smt factor and computing the actual number of cores
- * and limit unit capacity with that.
+ * group_has_capacity returns true if the group has spare capacity that could
+ * be used by some tasks. We consider that a group has spare capacity if the
+ * number of task is smaller than the number of CPUs or if the usage is lower
+ * than the available capacity for CFS tasks. For the latter, we use a
+ * threshold to stabilize the state, to take into account the variance of the
+ * tasks' load and to return true if the available capacity in meaningful for
+ * the load balancer. As an example, an available capacity of 1% can appear
+ * but it doesn't make any benefit for the load balance.
  */
-static inline int sg_capacity_factor(struct lb_env *env, struct sched_group *group)
+static inline bool
+group_has_capacity(struct lb_env *env, struct sg_lb_stats *sgs)
 {
-	unsigned int capacity_factor, smt, cpus;
-	unsigned int capacity, capacity_orig;
+	if ((sgs->group_capacity * 100) >
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
 
-	capacity = group->sgc->capacity;
-	capacity_orig = group->sgc->capacity_orig;
-	cpus = group->group_weight;
+	if (sgs->sum_nr_running < sgs->group_weight)
+		return true;
 
-	/* smt := ceil(cpus / capacity), assumes: 1 < smt_capacity < 2 */
-	smt = DIV_ROUND_UP(SCHED_CAPACITY_SCALE * cpus, capacity_orig);
-	capacity_factor = cpus / smt; /* cores */
+	return false;
+}
 
-	capacity_factor = min_t(unsigned,
-		capacity_factor, DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE));
-	if (!capacity_factor)
-		capacity_factor = fix_small_capacity(env->sd, group);
+/*
+ *  group_is_overloaded returns true if the group has more tasks than it can
+ *  handle. We consider that a group is overloaded if the number of tasks is
+ *  greater than the number of CPUs and the tasks already use all available
+ *  capacity for CFS tasks. For the latter, we use a threshold to stabilize
+ *  the state, to take into account the variance of tasks' load and to return
+ *  true if available capacity is no more meaningful for load balancer
+ */
+static inline bool
+group_is_overloaded(struct lb_env *env, struct sg_lb_stats *sgs)
+{
+	if (sgs->sum_nr_running <= sgs->group_weight)
+		return false;
 
-	return capacity_factor;
+	if ((sgs->group_capacity * 100) <
+			(sgs->group_usage * env->sd->imbalance_pct))
+		return true;
+
+	return false;
 }
 
-static enum group_type
-group_classify(struct sched_group *group, struct sg_lb_stats *sgs)
+static enum group_type group_classify(struct lb_env *env,
+		struct sched_group *group,
+		struct sg_lb_stats *sgs)
 {
-	if (sgs->sum_nr_running > sgs->group_capacity_factor)
+	if (sgs->group_no_capacity)
 		return group_overloaded;
 
 	if (sg_imbalanced(group))
@@ -6294,11 +6290,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->load_per_task = sgs->sum_weighted_load / sgs->sum_nr_running;
 
 	sgs->group_weight = group->group_weight;
-	sgs->group_capacity_factor = sg_capacity_factor(env, group);
-	sgs->group_type = group_classify(group, sgs);
 
-	if (sgs->group_capacity_factor > sgs->sum_nr_running)
-		sgs->group_has_free_capacity = 1;
+	sgs->group_no_capacity = group_is_overloaded(env, sgs);
+	sgs->group_type = group_classify(env, group, sgs);
 }
 
 /**
@@ -6420,18 +6414,19 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 		/*
 		 * In case the child domain prefers tasks go to siblings
-		 * first, lower the sg capacity factor to one so that we'll try
+		 * first, lower the sg capacity to one so that we'll try
 		 * and move all the excess tasks away. We lower the capacity
 		 * of a group only if the local group has the capacity to fit
-		 * these excess tasks, i.e. nr_running < group_capacity_factor. The
-		 * extra check prevents the case where you always pull from the
-		 * heaviest group when it is already under-utilized (possible
-		 * with a large weight task outweighs the tasks on the system).
+		 * these excess tasks. The extra check prevents the case where
+		 * you always pull from the heaviest group when it is already
+		 * under-utilized (possible with a large weight task outweighs
+		 * the tasks on the system).
 		 */
 		if (prefer_sibling && sds->local &&
-		    sds->local_stat.group_has_free_capacity) {
-			sgs->group_capacity_factor = min(sgs->group_capacity_factor, 1U);
-			sgs->group_type = group_classify(sg, sgs);
+		    group_has_capacity(env, &sds->local_stat) &&
+		    (sgs->sum_nr_running > 1)) {
+			sgs->group_no_capacity = 1;
+			sgs->group_type = group_overloaded;
 		}
 
 		if (update_sd_pick_busiest(env, sds, sg, sgs)) {
@@ -6611,11 +6606,12 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	 */
 	if (busiest->group_type == group_overloaded &&
 	    local->group_type   == group_overloaded) {
-		load_above_capacity =
-			(busiest->sum_nr_running - busiest->group_capacity_factor);
-
-		load_above_capacity *= (SCHED_LOAD_SCALE * SCHED_CAPACITY_SCALE);
-		load_above_capacity /= busiest->group_capacity;
+		load_above_capacity = busiest->sum_nr_running *
+					SCHED_LOAD_SCALE;
+		if (load_above_capacity > busiest->group_capacity)
+			load_above_capacity -= busiest->group_capacity;
+		else
+			load_above_capacity = ~0UL;
 	}
 
 	/*
@@ -6678,6 +6674,7 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
+	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
 		return sds.busiest;
@@ -6698,8 +6695,8 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 		goto force_balance;
 
 	/* SD_BALANCE_NEWIDLE trumps SMP nice when underutilized */
-	if (env->idle == CPU_NEWLY_IDLE && local->group_has_free_capacity &&
-	    !busiest->group_has_free_capacity)
+	if (env->idle == CPU_NEWLY_IDLE && group_has_capacity(env, local) &&
+	    busiest->group_no_capacity)
 		goto force_balance;
 
 	/*
@@ -6758,7 +6755,7 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	int i;
 
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
-		unsigned long capacity, capacity_factor, wl;
+		unsigned long capacity, wl;
 		enum fbq_type rt;
 
 		rq = cpu_rq(i);
@@ -6787,9 +6784,6 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			continue;
 
 		capacity = capacity_of(i);
-		capacity_factor = DIV_ROUND_CLOSEST(capacity, SCHED_CAPACITY_SCALE);
-		if (!capacity_factor)
-			capacity_factor = fix_small_capacity(env->sd, group);
 
 		wl = weighted_cpuload(i);
 
@@ -6797,7 +6791,9 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 		 * When comparing with imbalance, use weighted_cpuload()
 		 * which is not scaled with the cpu capacity.
 		 */
-		if (capacity_factor && rq->nr_running == 1 && wl > env->imbalance)
+
+		if (rq->nr_running == 1 && wl > env->imbalance &&
+		    !check_cpu_capacity(rq, env->sd))
 			continue;
 
 		/*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 09bb18b..e402133 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -796,7 +796,7 @@ struct sched_group_capacity {
 	 * CPU capacity of this group, SCHED_LOAD_SCALE being max capacity
 	 * for a single CPU.
 	 */
-	unsigned int capacity, capacity_orig;
+	unsigned int capacity;
 	unsigned long next_update;
 	int imbalance; /* XXX unrelated to capacity but shared group state */
 	/*
-- 
1.9.1


* [RFCv3 PATCH 09/48] sched: add SD_PREFER_SIBLING for SMT level
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (7 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 08/48] sched: replace capacity_factor by usage Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 10/48] sched: move cfs task on a CPU with higher capacity Morten Rasmussen
                   ` (39 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Vincent Guittot <vincent.guittot@linaro.org>

Add the SD_PREFER_SIBLING flag for SMT level in order to ensure that
the scheduler will put at least 1 task per core.

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Preeti U. Murthy <preeti@linux.vnet.ibm.com>
---
 kernel/sched/core.c | 1 +
 1 file changed, 1 insertion(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 252011d..a00a4c3 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6236,6 +6236,7 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 	 */
 
 	if (sd->flags & SD_SHARE_CPUCAPACITY) {
+		sd->flags |= SD_PREFER_SIBLING;
 		sd->imbalance_pct = 110;
 		sd->smt_gain = 1178; /* ~15% */
 
-- 
1.9.1


* [RFCv3 PATCH 10/48] sched: move cfs task on a CPU with higher capacity
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (8 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 09/48] sched: add SD_PREFER_SIBLING for SMT level Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 11/48] sched: Make load tracking frequency scale-invariant Morten Rasmussen
                   ` (38 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Vincent Guittot <vincent.guittot@linaro.org>

When a CPU is used to handle a lot of IRQs or some RT tasks, the remaining
capacity for CFS tasks can be significantly reduced. Once we detect such a
situation by comparing cpu_capacity_orig and cpu_capacity, we trigger an idle
load balance to check whether it is worth moving its tasks to an idle CPU.

Once the idle load balance has selected the busiest CPU, it will look for an
active load balance in only two cases:
- there is only 1 task on the busiest CPU.
- we haven't been able to move a task off the busiest rq.

A CPU with a reduced capacity is included in the 1st case, and it is worth
actively migrating its task if the idle CPU has full capacity. This test has
been added in need_active_balance.

As a side note, this will not generate more spurious ilbs because we already
trigger an ilb if there is more than 1 busy cpu. If this cpu is the only one
that has a task, we will trigger the ilb once to migrate the task.

The nohz_kick_needed function has been cleaned up a bit while adding the new
test.

env.src_cpu and env.src_rq must be set unconditionally because they are used
in need_active_balance, which is called even if busiest->nr_running equals 1.
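
The capacity-reduction test relies on check_cpu_capacity(), introduced
earlier in the series and used again in nohz_kick_needed() below. A minimal
sketch of what that check amounts to (the exact definition may differ
slightly):

	/*
	 * Return true if the remaining capacity left for CFS tasks is
	 * significantly lower than the original capacity of the cpu.
	 */
	static inline int check_cpu_capacity(struct rq *rq, struct sched_domain *sd)
	{
		return ((rq->cpu_capacity * sd->imbalance_pct) <
					(rq->cpu_capacity_orig * 100));
	}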

Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 74 ++++++++++++++++++++++++++++++++++++++---------------
 1 file changed, 53 insertions(+), 21 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 52c494f..bd73f26 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6841,6 +6841,28 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}
 
+	/*
+	 * The dst_cpu is idle and the src_cpu CPU has only 1 CFS task.
+	 * It's worth migrating the task if the src_cpu's capacity is reduced
+	 * because of other sched_class or IRQs whereas capacity stays
+	 * available on dst_cpu.
+	 */
+	if ((env->idle != CPU_NOT_IDLE) &&
+			(env->src_rq->cfs.h_nr_running == 1)) {
+		unsigned long src_eff_capacity, dst_eff_capacity;
+
+		dst_eff_capacity = 100;
+		dst_eff_capacity *= capacity_of(env->dst_cpu);
+		dst_eff_capacity *= capacity_orig_of(env->src_cpu);
+
+		src_eff_capacity = sd->imbalance_pct;
+		src_eff_capacity *= capacity_of(env->src_cpu);
+		src_eff_capacity *= capacity_orig_of(env->dst_cpu);
+
+		if (src_eff_capacity < dst_eff_capacity)
+			return 1;
+	}
+
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
 
@@ -6940,6 +6962,9 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 
 	schedstat_add(sd, lb_imbalance[idle], env.imbalance);
 
+	env.src_cpu = busiest->cpu;
+	env.src_rq = busiest;
+
 	ld_moved = 0;
 	if (busiest->nr_running > 1) {
 		/*
@@ -6949,8 +6974,6 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		 * correctly treated as an imbalance.
 		 */
 		env.flags |= LBF_ALL_PINNED;
-		env.src_cpu   = busiest->cpu;
-		env.src_rq    = busiest;
 		env.loop_max  = min(sysctl_sched_nr_migrate, busiest->nr_running);
 
 more_balance:
@@ -7650,22 +7673,25 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 
 /*
  * Current heuristic for kicking the idle load balancer in the presence
- * of an idle cpu is the system.
+ * of an idle cpu in the system.
  *   - This rq has more than one task.
- *   - At any scheduler domain level, this cpu's scheduler group has multiple
- *     busy cpu's exceeding the group's capacity.
+ *   - This rq has at least one CFS task and the capacity of the CPU is
+ *     significantly reduced because of RT tasks or IRQs.
+ *   - At parent of LLC scheduler domain level, this cpu's scheduler group has
+ *     multiple busy cpu.
  *   - For SD_ASYM_PACKING, if the lower numbered cpu's in the scheduler
  *     domain span are idle.
  */
-static inline int nohz_kick_needed(struct rq *rq)
+static inline bool nohz_kick_needed(struct rq *rq)
 {
 	unsigned long now = jiffies;
 	struct sched_domain *sd;
 	struct sched_group_capacity *sgc;
 	int nr_busy, cpu = rq->cpu;
+	bool kick = false;
 
 	if (unlikely(rq->idle_balance))
-		return 0;
+		return false;
 
        /*
 	* We may be recently in ticked or tickless idle mode. At the first
@@ -7679,38 +7705,44 @@ static inline int nohz_kick_needed(struct rq *rq)
 	 * balancing.
 	 */
 	if (likely(!atomic_read(&nohz.nr_cpus)))
-		return 0;
+		return false;
 
 	if (time_before(now, nohz.next_balance))
-		return 0;
+		return false;
 
 	if (rq->nr_running >= 2)
-		goto need_kick;
+		return true;
 
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_busy, cpu));
-
 	if (sd) {
 		sgc = sd->groups->sgc;
 		nr_busy = atomic_read(&sgc->nr_busy_cpus);
 
-		if (nr_busy > 1)
-			goto need_kick_unlock;
+		if (nr_busy > 1) {
+			kick = true;
+			goto unlock;
+		}
+
 	}
 
-	sd = rcu_dereference(per_cpu(sd_asym, cpu));
+	sd = rcu_dereference(rq->sd);
+	if (sd) {
+		if ((rq->cfs.h_nr_running >= 1) &&
+				check_cpu_capacity(rq, sd)) {
+			kick = true;
+			goto unlock;
+		}
+	}
 
+	sd = rcu_dereference(per_cpu(sd_asym, cpu));
 	if (sd && (cpumask_first_and(nohz.idle_cpus_mask,
 				  sched_domain_span(sd)) < cpu))
-		goto need_kick_unlock;
+		kick = true;
 
+unlock:
 	rcu_read_unlock();
-	return 0;
-
-need_kick_unlock:
-	rcu_read_unlock();
-need_kick:
-	return 1;
+	return kick;
 }
 #else
 static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle) { }
-- 
1.9.1


* [RFCv3 PATCH 11/48] sched: Make load tracking frequency scale-invariant
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (9 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 10/48] sched: move cfs task on a CPU with higher capacity Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 12/48] sched: Make usage tracking cpu scale-invariant Morten Rasmussen
                   ` (37 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Apply frequency scale-invariance correction factor to load tracking.
Each segment of the sched_avg::runnable_avg_sum geometric series is now
scaled by the current frequency so the sched_avg::load_avg_contrib of each
entity will be invariant with frequency scaling. As a result,
cfs_rq::runnable_load_avg which is the sum of sched_avg::load_avg_contrib,
becomes invariant too. So the load level that is returned by
weighted_cpuload, stays relative to the max frequency of the cpu.

Then, we want to keep the load tracking values in a 32-bit type, which
implies that the max value of sched_avg::{runnable|running}_avg_sum must
be lower than 2^32/88761 = 48388 (88761 is the max weight of a task). As
LOAD_AVG_MAX = 47742, arch_scale_freq_capacity must return a value less
than (48388/47742) << SCHED_CAPACITY_SHIFT = 1037 (SCHED_CAPACITY_SCALE =
1024). So we define the range as [0..SCHED_CAPACITY_SCALE] in order to
avoid overflow.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Acked-by: Vincent Guittot <vincent.guittot@linaro.org>
---
 kernel/sched/fair.c | 28 ++++++++++++++++------------
 1 file changed, 16 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bd73f26..e9a26b1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2507,9 +2507,9 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 							int runnable,
 							int running)
 {
-	u64 delta, periods;
-	u32 runnable_contrib;
-	int delta_w, decayed = 0;
+	u64 delta, scaled_delta, periods;
+	u32 runnable_contrib, scaled_runnable_contrib;
+	int delta_w, scaled_delta_w, decayed = 0;
 	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
 
 	delta = now - sa->last_runnable_update;
@@ -2543,11 +2543,12 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 		 * period and accrue it.
 		 */
 		delta_w = 1024 - delta_w;
+		scaled_delta_w = (delta_w * scale_freq) >> SCHED_CAPACITY_SHIFT;
+
 		if (runnable)
-			sa->runnable_avg_sum += delta_w;
+			sa->runnable_avg_sum += scaled_delta_w;
 		if (running)
-			sa->running_avg_sum += delta_w * scale_freq
-				>> SCHED_CAPACITY_SHIFT;
+			sa->running_avg_sum += scaled_delta_w;
 		sa->avg_period += delta_w;
 
 		delta -= delta_w;
@@ -2565,20 +2566,23 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 
 		/* Efficiently calculate \sum (1..n_period) 1024*y^i */
 		runnable_contrib = __compute_runnable_contrib(periods);
+		scaled_runnable_contrib = (runnable_contrib * scale_freq)
+						>> SCHED_CAPACITY_SHIFT;
+
 		if (runnable)
-			sa->runnable_avg_sum += runnable_contrib;
+			sa->runnable_avg_sum += scaled_runnable_contrib;
 		if (running)
-			sa->running_avg_sum += runnable_contrib * scale_freq
-				>> SCHED_CAPACITY_SHIFT;
+			sa->running_avg_sum += scaled_runnable_contrib;
 		sa->avg_period += runnable_contrib;
 	}
 
 	/* Remainder of delta accrued against u_0` */
+	scaled_delta = (delta * scale_freq) >> SCHED_CAPACITY_SHIFT;
+
 	if (runnable)
-		sa->runnable_avg_sum += delta;
+		sa->runnable_avg_sum += scaled_delta;
 	if (running)
-		sa->running_avg_sum += delta * scale_freq
-			>> SCHED_CAPACITY_SHIFT;
+		sa->running_avg_sum += scaled_delta;
 	sa->avg_period += delta;
 
 	return decayed;
-- 
1.9.1


* [RFCv3 PATCH 12/48] sched: Make usage tracking cpu scale-invariant
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (10 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 11/48] sched: Make load tracking frequency scale-invariant Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-03-23 14:46   ` Peter Zijlstra
  2015-02-04 18:30 ` [RFCv3 PATCH 13/48] cpufreq: Architecture specific callback for frequency changes Morten Rasmussen
                   ` (36 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Besides the existing frequency scale-invariance correction factor, apply a
cpu scale-invariance correction factor to usage tracking.

Cpu scale-invariance takes into consideration cpu performance deviations due
to micro-architectural differences (i.e. instructions per second) between
cpus in HMP systems (e.g. big.LITTLE), as well as differences in the
frequency value of the highest OPP between cpus in SMP systems.

Each segment of the sched_avg::running_avg_sum geometric series is now
scaled by the cpu performance factor too, so the
sched_avg::utilization_avg_contrib of each entity will be invariant with
respect to the particular cpu of the HMP/SMP system it is gathered on.

So the usage level that is returned by get_cpu_usage stays relative to
the max cpu performance of the system.
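
As an illustration, with the TC2 cpu capacities derived later in this series
(roughly 1024 for a Cortex-A15 and 430 for a Cortex-A7), a task running
continuously on an A7 at its highest OPP converges towards a utilization
contribution of about 430 (= 1024 * cpu_perf(A7) / cpu_perf(A15)) rather
than 1024, reflecting that the same task would only use a fraction of an
A15's capacity.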

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 16 +++++++++++++++-
 1 file changed, 15 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e9a26b1..5375ab1 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2473,6 +2473,7 @@ static u32 __compute_runnable_contrib(u64 n)
 }
 
 unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu);
+unsigned long __weak arch_scale_cpu_capacity(struct sched_domain *sd, int cpu);
 
 /*
  * We can represent the historical contribution to runnable average as the
@@ -2511,6 +2512,7 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 	u32 runnable_contrib, scaled_runnable_contrib;
 	int delta_w, scaled_delta_w, decayed = 0;
 	unsigned long scale_freq = arch_scale_freq_capacity(NULL, cpu);
+	unsigned long scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
 
 	delta = now - sa->last_runnable_update;
 	/*
@@ -2547,6 +2549,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 
 		if (runnable)
 			sa->runnable_avg_sum += scaled_delta_w;
+
+		scaled_delta_w *= scale_cpu;
+		scaled_delta_w >>= SCHED_CAPACITY_SHIFT;
+
 		if (running)
 			sa->running_avg_sum += scaled_delta_w;
 		sa->avg_period += delta_w;
@@ -2571,6 +2577,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 
 		if (runnable)
 			sa->runnable_avg_sum += scaled_runnable_contrib;
+
+		scaled_runnable_contrib *= scale_cpu;
+		scaled_runnable_contrib >>= SCHED_CAPACITY_SHIFT;
+
 		if (running)
 			sa->running_avg_sum += scaled_runnable_contrib;
 		sa->avg_period += runnable_contrib;
@@ -2581,6 +2591,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
 
 	if (runnable)
 		sa->runnable_avg_sum += scaled_delta;
+
+	scaled_delta *= scale_cpu;
+	scaled_delta >>= SCHED_CAPACITY_SHIFT;
+
 	if (running)
 		sa->running_avg_sum += scaled_delta;
 	sa->avg_period += delta;
@@ -6014,7 +6028,7 @@ unsigned long __weak arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
 
 static unsigned long default_scale_cpu_capacity(struct sched_domain *sd, int cpu)
 {
-	if ((sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
 		return sd->smt_gain / sd->span_weight;
 
 	return SCHED_CAPACITY_SCALE;
-- 
1.9.1


* [RFCv3 PATCH 13/48] cpufreq: Architecture specific callback for frequency changes
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (11 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 12/48] sched: Make usage tracking cpu scale-invariant Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 14/48] arm: Frequency invariant scheduler load-tracking support Morten Rasmussen
                   ` (35 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel, Morten Rasmussen,
	Viresh Kumar

From: Morten Rasmussen <Morten.Rasmussen@arm.com>

Architectures that don't have any other means for tracking cpu frequency
changes need a callback from cpufreq to implement a scaling factor to
enable scale-invariant per-entity load-tracking in the scheduler.

To compute the scale invariance correction factor the architecture would
need to know both the max frequency and the current frequency. This
patch defines weak functions for setting both from cpufreq.

Related architecture specific functions use weak function definitions.
The same approach is followed here.

These callbacks can be used to implement frequency scaling of cpu
capacity later.

Cc: Rafael J. Wysocki <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 drivers/cpufreq/cpufreq.c | 10 +++++++++-
 1 file changed, 9 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index 46bed4f..951df85 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -278,6 +278,10 @@ static inline void adjust_jiffies(unsigned long val, struct cpufreq_freqs *ci)
 }
 #endif
 
+void __weak arch_scale_set_curr_freq(int cpu, unsigned long freq) {}
+
+void __weak arch_scale_set_max_freq(int cpu, unsigned long freq) {}
+
 static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
 		struct cpufreq_freqs *freqs, unsigned int state)
 {
@@ -315,6 +319,7 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
 		pr_debug("FREQ: %lu - CPU: %lu\n",
 			 (unsigned long)freqs->new, (unsigned long)freqs->cpu);
 		trace_cpu_frequency(freqs->new, freqs->cpu);
+		arch_scale_set_curr_freq(freqs->cpu, freqs->new);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
@@ -2178,7 +2183,7 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
 				struct cpufreq_policy *new_policy)
 {
 	struct cpufreq_governor *old_gov;
-	int ret;
+	int ret, cpu;
 
 	pr_debug("setting new policy for CPU %u: %u - %u kHz\n",
 		 new_policy->cpu, new_policy->min, new_policy->max);
@@ -2216,6 +2221,9 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
 	policy->min = new_policy->min;
 	policy->max = new_policy->max;
 
+	for_each_cpu(cpu, policy->cpus)
+		arch_scale_set_max_freq(cpu, policy->max);
+
 	pr_debug("new min and max freqs are %u - %u kHz\n",
 		 policy->min, policy->max);
 
-- 
1.9.1


* [RFCv3 PATCH 14/48] arm: Frequency invariant scheduler load-tracking support
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (12 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 13/48] cpufreq: Architecture specific callback for frequency changes Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-03-23 13:39   ` Peter Zijlstra
  2015-02-04 18:30 ` [RFCv3 PATCH 15/48] arm: vexpress: Add CPU clock-frequencies to TC2 device-tree Morten Rasmussen
                   ` (34 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel, Morten Rasmussen,
	Russell King

From: Morten Rasmussen <Morten.Rasmussen@arm.com>

Implements an arch-specific function to provide the scheduler with a
frequency scaling correction factor for more accurate load-tracking. The
factor is:

	current_freq(cpu) * SCHED_CAPACITY_SCALE / max_freq(cpu)

This implementation only provides frequency invariance. No
micro-architecture invariance yet.
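
As an illustration with made-up operating points, a cpu whose maximum
frequency is 1000 MHz currently running at 800 MHz would report a factor of
800000 * 1024 / 1000000 = 819 (cpufreq frequencies are in kHz), while a cpu
running at its maximum frequency always reports 1024.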

Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/arm/kernel/topology.c | 33 +++++++++++++++++++++++++++++++++
 1 file changed, 33 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 08b7847..a1274e6 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -169,6 +169,39 @@ static void update_cpu_capacity(unsigned int cpu)
 		cpu, arch_scale_cpu_capacity(NULL, cpu));
 }
 
+/*
+ * Scheduler load-tracking scale-invariance
+ *
+ * Provides the scheduler with a scale-invariance correction factor that
+ * compensates for frequency scaling.
+ */
+
+static DEFINE_PER_CPU(atomic_long_t, cpu_curr_freq);
+static DEFINE_PER_CPU(atomic_long_t, cpu_max_freq);
+
+/* cpufreq callback function setting current cpu frequency */
+void arch_scale_set_curr_freq(int cpu, unsigned long freq)
+{
+	atomic_long_set(&per_cpu(cpu_curr_freq, cpu), freq);
+}
+
+/* cpufreq callback function setting max cpu frequency */
+void arch_scale_set_max_freq(int cpu, unsigned long freq)
+{
+	atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
+}
+
+unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
+{
+	unsigned long curr = atomic_long_read(&per_cpu(cpu_curr_freq, cpu));
+	unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
+
+	if (!curr || !max)
+		return SCHED_CAPACITY_SCALE;
+
+	return (curr * SCHED_CAPACITY_SCALE) / max;
+}
+
 #else
 static inline void parse_dt_topology(void) {}
 static inline void update_cpu_capacity(unsigned int cpuid) {}
-- 
1.9.1


* [RFCv3 PATCH 15/48] arm: vexpress: Add CPU clock-frequencies to TC2 device-tree
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (13 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 14/48] arm: Frequency invariant scheduler load-tracking support Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 16/48] arm: Cpu invariant scheduler load-tracking support Morten Rasmussen
                   ` (33 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel, Jon Medhurst, Russell King

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

To enable the parsing of clock frequency and cpu efficiency values inside
parse_dt_topology [arch/arm/kernel/topology.c], used to scale the relative
capacity of the cpus, the clock-frequency property has to be provided within
the cpu nodes of the dts file.

The patch is a copy of commit 8f15973ef8c3 ("ARM: vexpress: Add CPU
clock-frequencies to TC2 device-tree") taken from Linaro Stable Kernel
(LSK) massaged into mainline.

Cc: Jon Medhurst <tixy@linaro.org>
Cc: Russell King <linux@arm.linux.org.uk>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
index 33920df..43841b5 100644
--- a/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
+++ b/arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts
@@ -39,6 +39,7 @@
 			reg = <0>;
 			cci-control-port = <&cci_control1>;
 			cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+			clock-frequency = <1000000000>;
 		};
 
 		cpu1: cpu@1 {
@@ -47,6 +48,7 @@
 			reg = <1>;
 			cci-control-port = <&cci_control1>;
 			cpu-idle-states = <&CLUSTER_SLEEP_BIG>;
+			clock-frequency = <1000000000>;
 		};
 
 		cpu2: cpu@2 {
@@ -55,6 +57,7 @@
 			reg = <0x100>;
 			cci-control-port = <&cci_control2>;
 			cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+			clock-frequency = <800000000>;
 		};
 
 		cpu3: cpu@3 {
@@ -63,6 +66,7 @@
 			reg = <0x101>;
 			cci-control-port = <&cci_control2>;
 			cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+			clock-frequency = <800000000>;
 		};
 
 		cpu4: cpu@4 {
@@ -71,6 +75,7 @@
 			reg = <0x102>;
 			cci-control-port = <&cci_control2>;
 			cpu-idle-states = <&CLUSTER_SLEEP_LITTLE>;
+			clock-frequency = <800000000>;
 		};
 
 		idle-states {
-- 
1.9.1


* [RFCv3 PATCH 16/48] arm: Cpu invariant scheduler load-tracking support
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (14 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 15/48] arm: vexpress: Add CPU clock-frequencies to TC2 device-tree Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 17/48] sched: Get rid of scaling usage by cpu_capacity_orig Morten Rasmussen
                   ` (32 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel, Russell King

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Reuses the existing infrastructure for cpu_scale to provide the scheduler
with a cpu scaling correction factor for more accurate load-tracking.
This factor comprises a micro-architectural part, which is based on the
cpu efficiency value of a cpu, as well as a platform-wide max frequency
part, which relates to the dtb property clock-frequency of a cpu node.

The calculation of cpu_scale, return value of arch_scale_cpu_capacity,
changes from:

    capacity / middle_capacity

    with capacity = (clock_frequency >> 20) * cpu_efficiency

to:

    SCHED_CAPACITY_SCALE * cpu_perf / max_cpu_perf

The range of the cpu_scale value changes from
[0..3*SCHED_CAPACITY_SCALE/2] to [0..SCHED_CAPACITY_SCALE].

The functionality to calculate the middle_capacity which corresponds to an
'average' cpu has been taken out since the scaling is now done
differently.

In the case that either the cpu efficiency or the clock-frequency value
for a cpu is missing, no cpu scaling is done for any cpu.

The platform-wide max frequency part of the factor should not be confused
with the frequency invariant scheduler load-tracking support which deals
with frequency related scaling due to DVFS functionality on a cpu.
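
As an illustration with the TC2 clock-frequency values added earlier in this
series (1 GHz for the Cortex-A15s, 800 MHz for the Cortex-A7s) and assuming
the in-tree efficiency values (3891 for Cortex-A15, 2048 for Cortex-A7), the
new calculation gives:

	cpu_perf(A15) = (1000000000 >> 20) * 3891 = 953 * 3891 = 3708123
	cpu_perf(A7)  = (800000000 >> 20) * 2048  = 762 * 2048 = 1560576

	cpu_scale(A15) = 1024 * 3708123 / 3708123 = 1024
	cpu_scale(A7)  = 1024 * 1560576 / 3708123 = 430  (integer arithmetic)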

Cc: Russell King <linux@arm.linux.org.uk>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 arch/arm/kernel/topology.c | 64 +++++++++++++++++-----------------------------
 1 file changed, 23 insertions(+), 41 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index a1274e6..34ecbdc 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -62,9 +62,7 @@ struct cpu_efficiency {
  * Table of relative efficiency of each processors
  * The efficiency value must fit in 20bit and the final
  * cpu_scale value must be in the range
- *   0 < cpu_scale < 3*SCHED_CAPACITY_SCALE/2
- * in order to return at most 1 when DIV_ROUND_CLOSEST
- * is used to compute the capacity of a CPU.
+ *   0 < cpu_scale < SCHED_CAPACITY_SCALE.
  * Processors that are not defined in the table,
  * use the default SCHED_CAPACITY_SCALE value for cpu_scale.
  */
@@ -77,24 +75,18 @@ static const struct cpu_efficiency table_efficiency[] = {
 static unsigned long *__cpu_capacity;
 #define cpu_capacity(cpu)	__cpu_capacity[cpu]
 
-static unsigned long middle_capacity = 1;
+static unsigned long max_cpu_perf;
 
 /*
  * Iterate all CPUs' descriptor in DT and compute the efficiency
- * (as per table_efficiency). Also calculate a middle efficiency
- * as close as possible to  (max{eff_i} - min{eff_i}) / 2
- * This is later used to scale the cpu_capacity field such that an
- * 'average' CPU is of middle capacity. Also see the comments near
- * table_efficiency[] and update_cpu_capacity().
+ * (as per table_efficiency). Calculate the max cpu performance too.
  */
+
 static void __init parse_dt_topology(void)
 {
 	const struct cpu_efficiency *cpu_eff;
 	struct device_node *cn = NULL;
-	unsigned long min_capacity = ULONG_MAX;
-	unsigned long max_capacity = 0;
-	unsigned long capacity = 0;
-	int cpu = 0;
+	int cpu = 0, i = 0;
 
 	__cpu_capacity = kcalloc(nr_cpu_ids, sizeof(*__cpu_capacity),
 				 GFP_NOWAIT);
@@ -102,6 +94,7 @@ static void __init parse_dt_topology(void)
 	for_each_possible_cpu(cpu) {
 		const u32 *rate;
 		int len;
+		unsigned long cpu_perf;
 
 		/* too early to use cpu->of_node */
 		cn = of_get_cpu_node(cpu, NULL);
@@ -124,46 +117,35 @@ static void __init parse_dt_topology(void)
 			continue;
 		}
 
-		capacity = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
-
-		/* Save min capacity of the system */
-		if (capacity < min_capacity)
-			min_capacity = capacity;
-
-		/* Save max capacity of the system */
-		if (capacity > max_capacity)
-			max_capacity = capacity;
-
-		cpu_capacity(cpu) = capacity;
+		cpu_perf = ((be32_to_cpup(rate)) >> 20) * cpu_eff->efficiency;
+		cpu_capacity(cpu) = cpu_perf;
+		max_cpu_perf = max(max_cpu_perf, cpu_perf);
+		i++;
 	}
 
-	/* If min and max capacities are equals, we bypass the update of the
-	 * cpu_scale because all CPUs have the same capacity. Otherwise, we
-	 * compute a middle_capacity factor that will ensure that the capacity
-	 * of an 'average' CPU of the system will be as close as possible to
-	 * SCHED_CAPACITY_SCALE, which is the default value, but with the
-	 * constraint explained near table_efficiency[].
-	 */
-	if (4*max_capacity < (3*(max_capacity + min_capacity)))
-		middle_capacity = (min_capacity + max_capacity)
-				>> (SCHED_CAPACITY_SHIFT+1);
-	else
-		middle_capacity = ((max_capacity / 3)
-				>> (SCHED_CAPACITY_SHIFT-1)) + 1;
-
+	if (i < num_possible_cpus())
+		max_cpu_perf = 0;
 }
 
 /*
  * Look for a customed capacity of a CPU in the cpu_capacity table during the
  * boot. The update of all CPUs is in O(n^2) for heteregeneous system but the
- * function returns directly for SMP system.
+ * function returns directly for SMP systems or if there is no complete set
+ * of cpu efficiency, clock frequency data for each cpu.
  */
 static void update_cpu_capacity(unsigned int cpu)
 {
-	if (!cpu_capacity(cpu))
+	unsigned long capacity = cpu_capacity(cpu);
+
+	if (!capacity || !max_cpu_perf) {
+		cpu_capacity(cpu) = 0;
 		return;
+	}
+
+	capacity *= SCHED_CAPACITY_SCALE;
+	capacity /= max_cpu_perf;
 
-	set_capacity_scale(cpu, cpu_capacity(cpu) / middle_capacity);
+	set_capacity_scale(cpu, capacity);
 
 	pr_info("CPU%u: update cpu_capacity %lu\n",
 		cpu, arch_scale_cpu_capacity(NULL, cpu));
-- 
1.9.1


* [RFCv3 PATCH 17/48] sched: Get rid of scaling usage by cpu_capacity_orig
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (15 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 16/48] arm: Cpu invariant scheduler load-tracking support Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
       [not found]   ` <OFFC493540.15A92099-ON48257E35.0026F60C-48257E35.0027A5FB@zte.com.cn>
  2015-02-04 18:30 ` [RFCv3 PATCH 18/48] sched: Track blocked utilization contributions Morten Rasmussen
                   ` (31 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Besides being frequency invariant, cfs_rq::utilization_load_avg is now also
cpu invariant (uarch plus max system frequency), i.e. both frequency and cpu
scaling happen as part of the load tracking itself.
So cfs_rq::utilization_load_avg does not have to be scaled by the original
capacity of the cpu again.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 5 ++---
 1 file changed, 2 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5375ab1..a85c34b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4807,12 +4807,11 @@ static int select_idle_sibling(struct task_struct *p, int target)
 static int get_cpu_usage(int cpu)
 {
 	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
-	unsigned long capacity = capacity_orig_of(cpu);
 
 	if (usage >= SCHED_LOAD_SCALE)
-		return capacity;
+		return capacity_orig_of(cpu);
 
-	return (usage * capacity) >> SCHED_LOAD_SHIFT;
+	return usage;
 }
 
 /*
-- 
1.9.1


* [RFCv3 PATCH 18/48] sched: Track blocked utilization contributions
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (16 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 17/48] sched: Get rid of scaling usage by cpu_capacity_orig Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-03-23 14:08   ` Peter Zijlstra
  2015-02-04 18:30 ` [RFCv3 PATCH 19/48] sched: Include blocked utilization in usage tracking Morten Rasmussen
                   ` (30 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

Introduces the blocked utilization, the utilization counterpart to
cfs_rq->utilization_load_avg. It is the sum of the utilization contributions
of sched_entities that were recently on the cfs_rq and are currently
blocked. Combined with the sum of the contributions of entities currently
on the cfs_rq or currently running (cfs_rq->utilization_load_avg), this can
provide a more stable average view of the cpu usage.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c  | 30 +++++++++++++++++++++++++++++-
 kernel/sched/sched.h |  8 ++++++--
 2 files changed, 35 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index a85c34b..0fc8963 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2775,6 +2775,15 @@ static inline void subtract_blocked_load_contrib(struct cfs_rq *cfs_rq,
 		cfs_rq->blocked_load_avg = 0;
 }
 
+static inline void subtract_utilization_blocked_contrib(struct cfs_rq *cfs_rq,
+						long utilization_contrib)
+{
+	if (likely(utilization_contrib < cfs_rq->utilization_blocked_avg))
+		cfs_rq->utilization_blocked_avg -= utilization_contrib;
+	else
+		cfs_rq->utilization_blocked_avg = 0;
+}
+
 static inline u64 cfs_rq_clock_task(struct cfs_rq *cfs_rq);
 
 /* Update a sched_entity's runnable average */
@@ -2810,6 +2819,8 @@ static inline void update_entity_load_avg(struct sched_entity *se,
 		cfs_rq->utilization_load_avg += utilization_delta;
 	} else {
 		subtract_blocked_load_contrib(cfs_rq, -contrib_delta);
+		subtract_utilization_blocked_contrib(cfs_rq,
+							-utilization_delta);
 	}
 }
 
@@ -2827,14 +2838,20 @@ static void update_cfs_rq_blocked_load(struct cfs_rq *cfs_rq, int force_update)
 		return;
 
 	if (atomic_long_read(&cfs_rq->removed_load)) {
-		unsigned long removed_load;
+		unsigned long removed_load, removed_utilization;
 		removed_load = atomic_long_xchg(&cfs_rq->removed_load, 0);
+		removed_utilization =
+			atomic_long_xchg(&cfs_rq->removed_utilization, 0);
 		subtract_blocked_load_contrib(cfs_rq, removed_load);
+		subtract_utilization_blocked_contrib(cfs_rq,
+							removed_utilization);
 	}
 
 	if (decays) {
 		cfs_rq->blocked_load_avg = decay_load(cfs_rq->blocked_load_avg,
 						      decays);
+		cfs_rq->utilization_blocked_avg =
+			decay_load(cfs_rq->utilization_blocked_avg, decays);
 		atomic64_add(decays, &cfs_rq->decay_counter);
 		cfs_rq->last_decay = now;
 	}
@@ -2881,6 +2898,8 @@ static inline void enqueue_entity_load_avg(struct cfs_rq *cfs_rq,
 	/* migrated tasks did not contribute to our blocked load */
 	if (wakeup) {
 		subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+		subtract_utilization_blocked_contrib(cfs_rq,
+					se->avg.utilization_avg_contrib);
 		update_entity_load_avg(se, 0);
 	}
 
@@ -2907,6 +2926,8 @@ static inline void dequeue_entity_load_avg(struct cfs_rq *cfs_rq,
 	cfs_rq->utilization_load_avg -= se->avg.utilization_avg_contrib;
 	if (sleep) {
 		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+		cfs_rq->utilization_blocked_avg +=
+						se->avg.utilization_avg_contrib;
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
 	} /* migrations, e.g. sleep=0 leave decay_count == 0 */
 }
@@ -4927,6 +4948,8 @@ migrate_task_rq_fair(struct task_struct *p, int next_cpu)
 		se->avg.decay_count = -__synchronize_entity_decay(se);
 		atomic_long_add(se->avg.load_avg_contrib,
 						&cfs_rq->removed_load);
+		atomic_long_add(se->avg.utilization_avg_contrib,
+					&cfs_rq->removed_utilization);
 	}
 
 	/* We have migrated, no longer consider this task hot */
@@ -7942,6 +7965,8 @@ static void switched_from_fair(struct rq *rq, struct task_struct *p)
 	if (se->avg.decay_count) {
 		__synchronize_entity_decay(se);
 		subtract_blocked_load_contrib(cfs_rq, se->avg.load_avg_contrib);
+		subtract_utilization_blocked_contrib(cfs_rq,
+					se->avg.utilization_avg_contrib);
 	}
 #endif
 }
@@ -8001,6 +8026,7 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #ifdef CONFIG_SMP
 	atomic64_set(&cfs_rq->decay_counter, 1);
 	atomic_long_set(&cfs_rq->removed_load, 0);
+	atomic_long_set(&cfs_rq->removed_utilization, 0);
 #endif
 }
 
@@ -8053,6 +8079,8 @@ static void task_move_group_fair(struct task_struct *p, int queued)
 		 */
 		se->avg.decay_count = atomic64_read(&cfs_rq->decay_counter);
 		cfs_rq->blocked_load_avg += se->avg.load_avg_contrib;
+		cfs_rq->utilization_blocked_avg +=
+						se->avg.utilization_avg_contrib;
 #endif
 	}
 }
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e402133..208237f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -368,11 +368,15 @@ struct cfs_rq {
 	 * the blocked sched_entities on the rq.
 	 * utilization_load_avg is the sum of the average running time of the
 	 * sched_entities on the rq.
+	 * utilization_blocked_avg is the utilization equivalent of
+	 * blocked_load_avg, i.e. the sum of running contributions of blocked
+	 * sched_entities associated with the rq.
 	 */
-	unsigned long runnable_load_avg, blocked_load_avg, utilization_load_avg;
+	unsigned long runnable_load_avg, blocked_load_avg;
+	unsigned long utilization_load_avg, utilization_blocked_avg;
 	atomic64_t decay_counter;
 	u64 last_decay;
-	atomic_long_t removed_load;
+	atomic_long_t removed_load, removed_utilization;
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* Required to track per-cpu representation of a task_group */
-- 
1.9.1


* [RFCv3 PATCH 19/48] sched: Include blocked utilization in usage tracking
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (17 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 18/48] sched: Track blocked utilization contributions Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 20/48] sched: Documentation for scheduler energy cost model Morten Rasmussen
                   ` (29 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

Add the blocked utilization contribution to group sched_entity
utilization (se->avg.utilization_avg_contrib) and to get_cpu_usage().
With this change cpu usage now includes recent usage by currently
non-runnable tasks, hence it provides a more stable view of the cpu
usage. It does, however, also mean that the meaning of usage is changed:
A cpu may be momentarily idle while usage >0. It can no longer be
assumed that cpu usage >0 implies runnable tasks on the rq.
cfs_rq->utilization_load_avg or nr_running should be used instead to get
the current rq status.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 8 +++++---
 1 file changed, 5 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0fc8963..33d3d81 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -2761,7 +2761,8 @@ static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
 		__update_task_entity_utilization(se);
 	else
 		se->avg.utilization_avg_contrib =
-					group_cfs_rq(se)->utilization_load_avg;
+				group_cfs_rq(se)->utilization_load_avg +
+				group_cfs_rq(se)->utilization_blocked_avg;
 
 	return se->avg.utilization_avg_contrib - old_contrib;
 }
@@ -4828,11 +4829,12 @@ static int select_idle_sibling(struct task_struct *p, int target)
 static int get_cpu_usage(int cpu)
 {
 	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+	unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
 
-	if (usage >= SCHED_LOAD_SCALE)
+	if (usage + blocked >= SCHED_LOAD_SCALE)
 		return capacity_orig_of(cpu);
 
-	return usage;
+	return usage + blocked;
 }
 
 /*
-- 
1.9.1


* [RFCv3 PATCH 20/48] sched: Documentation for scheduler energy cost model
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (18 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 19/48] sched: Include blocked utilization in usage tracking Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 21/48] sched: Make energy awareness a sched feature Morten Rasmussen
                   ` (28 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

This documentation patch provides an overview of the experimental
scheduler energy costing model, associated data structures, and a
reference recipe on how platforms can be characterized to derive energy
models.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 Documentation/scheduler/sched-energy.txt | 359 +++++++++++++++++++++++++++++++
 1 file changed, 359 insertions(+)
 create mode 100644 Documentation/scheduler/sched-energy.txt

diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
new file mode 100644
index 0000000..c179df0
--- /dev/null
+++ b/Documentation/scheduler/sched-energy.txt
@@ -0,0 +1,359 @@
+Energy cost model for energy-aware scheduling (EXPERIMENTAL)
+
+Introduction
+=============
+
+The basic energy model uses platform energy data stored in sched_group_energy
+data structures attached to the sched_groups in the sched_domain hierarchy. The
+energy cost model offers two functions that can be used to guide scheduling
+decisions:
+
+1.	static unsigned int sched_group_energy(struct energy_env *eenv)
+2.	static int energy_diff(struct energy_env *eenv)
+
+sched_group_energy() estimates the energy consumed by all cpus in a specific
+sched_group including any shared resources owned exclusively by this group of
+cpus. Resources shared with other cpus are excluded (e.g. later level caches).
+
+energy_diff() estimates the total energy impact of a utilization change. That
+is, adding, removing, or migrating utilization (tasks).
+
+Both functions use a struct energy_env to specify the scenario to be evaluated:
+
+	struct energy_env {
+		struct sched_group      *sg_top;
+		struct sched_group      *sg_cap;
+		int                     usage_delta;
+		int                     src_cpu;
+		int                     dst_cpu;
+		int                     energy;
+	};
+
+sg_top: sched_group to be evaluated. Not used by energy_diff().
+
+sg_cap: sched_group covering the cpus in the same frequency domain. Set by
+sched_group_energy().
+
+usage_delta: Amount of utilization to be added, removed, or migrated.
+
+src_cpu: Source cpu from where 'usage_delta' utilization is removed. Should be
+-1 if no source (e.g. task wake-up).
+
+dst_cpu: Destination cpu where 'usage_delta' utilization is added. Should be -1
+if utilization is removed (e.g. terminating tasks).
+
+energy: Result of sched_group_energy().
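+
+As a usage example (with hypothetical values; the scheduler fills in these
+fields internally), evaluating the energy impact of placing a waking task
+with an expected utilization of ~20% of SCHED_LOAD_SCALE (1024) on cpu 2
+could be expressed as:
+
+	struct energy_env eenv = {
+		.usage_delta	= 205,	/* ~20% of SCHED_LOAD_SCALE */
+		.src_cpu	= -1,	/* task wake-up: no source cpu */
+		.dst_cpu	= 2,	/* candidate destination cpu */
+	};
+
+	energy_diff(&eenv);
+
+The return value of energy_diff() reflects the estimated energy impact of
+adding that utilization to cpu 2.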
+
+The metric used to represent utilization is the actual per-entity running time
+averaged over time using a geometric series. Very similar to the existing
+per-entity load-tracking, but _not_ scaled by task priority, and capped by the
+capacity of the cpu. The latter property does mean that utilization may
+underestimate the compute requirements for tasks on fully/over-utilized cpus.
+The greatest potential for energy savings without affecting performance too
+much is in scenarios where the system isn't fully utilized. If the system is
+deemed fully utilized, load-balancing should be done with task load (which
+includes task priority) instead, in the interest of fairness and performance.
+
+
+Background and Terminology
+===========================
+
+To make it clear from the start:
+
+energy = [joule] (resource like a battery on powered devices)
+power = energy/time = [joule/second] = [watt]
+
+The goal of energy-aware scheduling is to minimize energy, while still getting
+the job done. That is, we want to maximize:
+
+	performance [inst/s]
+	--------------------
+	    power [W]
+
+which is equivalent to minimizing:
+
+	energy [J]
+	-----------
+	instruction
+
+while still getting 'good' performance. It is essentially an alternative
+optimization objective to the current performance-only objective for the
+scheduler. This alternative considers two objectives: energy-efficiency and
+performance. Hence, there needs to be a user controllable knob to switch the
+objective. Since it is early days, this is currently a sched_feature
+(ENERGY_AWARE).
+
+The idea behind introducing an energy cost model is to allow the scheduler to
+evaluate the implications of its decisions rather than applying energy-saving
+techniques blindly that may only have positive effects on some platforms. At
+the same time, the energy cost model must be as simple as possible to minimize
+the scheduler latency impact.
+
+Platform topology
+------------------
+
+The system topology (cpus, caches, and NUMA information, not peripherals) is
+represented in the scheduler by the sched_domain hierarchy which has
+sched_groups attached at each level that covers one or more cpus (see
+sched-domains.txt for more details). To add energy awareness to the scheduler
+we need to consider power and frequency domains.
+
+Power domain:
+
+A power domain is a part of the system that can be powered on/off
+independently. Power domains are typically organized in a hierarchy where you
+may be able to power down just a cpu or a group of cpus along with any
+associated resources (e.g.  shared caches). Powering up a cpu means that all
+power domains it is a part of in the hierarchy must be powered up. Hence, it is
+more expensive to power up the first cpu that belongs to a higher level power
+domain than powering up additional cpus in the same high level domain. Two
+level power domain hierarchy example:
+
+		Power source
+		         +-------------------------------+----...
+per group PD		 G                               G
+		         |           +----------+        |
+		    +--------+-------| Shared   |  (other groups)
+per-cpu PD	    G        G       | resource |
+		    |        |       +----------+
+		+-------+ +-------+
+		| CPU 0 | | CPU 1 |
+		+-------+ +-------+
+
+Frequency domain:
+
+Frequency domains (P-states) typically cover the same group of cpus as one of
+the power domain levels. That is, there might be several smaller power domains
+sharing the same frequency (P-state) or there might be a power domain spanning
+multiple frequency domains.
+
+From a scheduling point of view there is no need to know the actual frequencies
+[Hz]. All the scheduler cares about is the compute capacity available at the
+current state (P-state) the cpu is in and any other available states. For that
+reason, and to also factor in any cpu micro-architecture differences, compute
+capacity scaling states are called 'capacity states' in this document. For SMP
+systems this is equivalent to P-states. For mixed micro-architecture systems
+(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
+performance relative to the other cpus in the system.
+
+Energy modelling:
+------------------
+
+Due to the hierarchical nature of the power domains, the most obvious way to
+model energy costs is therefore to associate power and energy costs with
+domains (groups of cpus). Energy costs of shared resources are associated with
+the group of cpus that share the resources, only the cost of powering the
+cpu itself and any private resources (e.g. private L1 caches) is associated
+with the per-cpu groups (lowest level).
+
+For example, for an SMP system with per-cpu power domains and a cluster level
+(group of cpus) power domain we get the overall energy costs to be:
+
+	energy = energy_cluster + n * energy_cpu
+
+where 'n' is the number of cpus powered up and energy_cluster is the cost paid
+as soon as any cpu in the cluster is powered up.
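+
+As a purely illustrative instance (numbers invented for this example), a
+cluster cost of 5 and a per-cpu cost of 10 bogo-Joules gives 5 + 2 * 10 = 25
+for two cpus powered up, and 5 + 3 * 10 = 35 for three.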
+
+The power and frequency domains can naturally be mapped onto the existing
+sched_domain hierarchy and sched_groups by adding the necessary data to the
+existing data structures.
+
+The energy model considers energy consumption from two contributors (shown in
+the illustration below):
+
+1. Busy energy: Energy consumed while a cpu and the higher level groups that it
+belongs to are busy running tasks. Busy energy is associated with the state of
+the cpu, not an event. The time the cpu spends in this state varies. Thus, the
+most obvious platform parameter for this contribution is busy power
+(energy/time).
+
+2. Idle energy: Energy consumed while a cpu and higher level groups that it
+belongs to are idle (in a C-state). Like busy energy, idle energy is associated
+with the state of the cpu. Thus, the platform parameter for this contribution
+is idle power (energy/time).
+
+Energy consumed during transitions from an idle-state (C-state) to a busy state
+(P-state), or going the other way, is ignored by the model to simplify the
+energy model calculations.
+
+
+	Power
+	^
+	|            busy->idle             idle->busy
+	|            transition             transition
+	|
+	|                _                      __
+	|               / \                    /  \__________________
+	|______________/   \                  /
+	|                   \                /
+	|  Busy              \    Idle      /        Busy
+	|  low P-state        \____________/         high P-state
+	|
+	+------------------------------------------------------------> time
+
+Busy    |--------------|                          |-----------------|
+
+Wakeup                 |------|            |------|
+
+Idle                          |------------|
+
+
+The basic algorithm
+====================
+
+The basic idea is to determine the total energy impact when utilization is
+added or removed by estimating the impact at each level in the sched_domain
+hierarchy starting from the bottom (sched_group contains just a single cpu).
+The energy cost comes from busy time (sched_group is awake because one or more
+cpus are busy) and idle time (in an idle-state). Energy model numbers account
+for energy costs associated with all cpus in the sched_group as a group.
+
+	for_each_domain(cpu, sd) {
+		sg = sched_group_of(cpu)
+		energy_before = curr_util(sg) * busy_power(sg)
+				+ (1-curr_util(sg)) * idle_power(sg)
+		energy_after = new_util(sg) * busy_power(sg)
+				+ (1-new_util(sg)) * idle_power(sg)
+		energy_diff += energy_before - energy_after
+
+	}
+
+	return energy_diff
+
+{curr, new}_util: The cpu utilization at the lowest level and the overall
+non-idle time for the entire group for higher levels. Utilization is in the
+range 0.0 to 1.0 in the pseudo-code.
+
+busy_power: The power consumption of the sched_group.
+
+idle_power: The power consumption of the sched_group when idle.
+
+Note: It is a fundamental assumption that the utilization is (roughly) scale
+invariant. Task utilization tracking factors in any frequency scaling and
+performance scaling differences due to different cpu microarchitectures such
+that task utilization can be used across the entire system.
+
+
+Platform energy data
+=====================
+
+struct sched_group_energy can be attached to sched_groups in the sched_domain
+hierarchy and has the following members:
+
+cap_states:
+	List of struct capacity_state representing the supported capacity states
+	(P-states). struct capacity_state has two members: cap and power, which
+	represent the compute capacity and the busy power of the state. The
+	list must be ordered by capacity low->high.
+
+nr_cap_states:
+	Number of capacity states in cap_states list.
+
+idle_states:
+	List of struct idle_state containing the power cost of each idle-state
+	supported by the sched_group. Note that the energy model calculations
+	will use this table to determine idle power even if no idle state is
+	actually entered by cpuidle, for example if latency constraints prevent
+	the group from entering a coupled state or if no idle-states are
+	supported at all. Hence, the first entry of the list must be the power
+	consumed when the group is idle but no idle state has actually been
+	entered ('active idle'). This state may be left out for groups
+	containing a single cpu if the cpu is guaranteed to enter an idle-state
+	whenever it is idle.
+
+nr_idle_states:
+	Number of idle states in idle_states list.
+
+idle_states_below:
+	Number of idle-states below current level. Filled by generic code, not
+	to be provided by the platform.
+
+There are no unit requirements for the energy cost data. Data can be normalized
+to any reference, however, the normalization must be consistent across all
+energy cost data. That is, one bogo-joule/watt must be the same quantity for
+all data, but we don't care what it actually is.
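+
+As a minimal sketch (hypothetical names and invented numbers) of how a
+platform might populate these structures; real values must come from the
+characterization described in the next section:
+
+	static struct capacity_state cap_states_cpu[] = {
+		{ .cap =  512, .power = 140, }, /* low P-state */
+		{ .cap = 1024, .power = 400, }, /* high P-state */
+	};
+
+	static struct idle_state idle_states_cpu[] = {
+		{ .power = 10 }, /* active idle (no idle-state entered) */
+		{ .power =  0 }, /* WFI */
+	};
+
+	static struct sched_group_energy energy_cpu = {
+		.nr_cap_states  = ARRAY_SIZE(cap_states_cpu),
+		.cap_states     = cap_states_cpu,
+		.nr_idle_states = ARRAY_SIZE(idle_states_cpu),
+		.idle_states    = idle_states_cpu,
+	};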
+
+A recipe for platform characterization
+=======================================
+
+Obtaining the actual model data for a particular platform requires some way of
+measuring power/energy. There isn't a tool to help with this (yet). This
+section provides a recipe for use as reference. It covers the steps used to
+characterize the ARM TC2 development platform. This sort of measurement is
+expected to be done anyway when tuning cpuidle and cpufreq for a given
+platform.
+
+The energy model needs two types of data (struct sched_group_energy holds
+these) for each sched_group where energy costs should be taken into account:
+
+1. Capacity state information
+
+A list containing the compute capacity and power consumption when fully
+utilized attributed to the group as a whole for each available capacity state.
+At the lowest level (group contains just a single cpu) this is the power of the
+cpu alone without including power consumed by resources shared with other cpus.
+It needs to fit the basic modelling approach described in the "Background
+and Terminology" section:
+
+	energy_system = energy_shared + n * energy_cpu
+
+for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at
+the lowest level. 'energy_shared' is included at the next level which
+represents the group of cpus among which the resources are shared.
+
+This model is, of course, a simplification of reality. Thus, power/energy
+attributions might not always exactly represent how the hardware is designed.
+Also, busy power is likely to depend on the workload. It is therefore
+recommended to use a representative mix of workloads when characterizing the
+capacity states.
+
+If the group has no capacity scaling support, the list will contain a single
+state where power is the busy power attributed to the group. The capacity
+should be set to a default value (1024).
+
+When frequency domains include multiple power domains, the group representing
+the frequency domain and all child groups share capacity states. This must be
+indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at
+all levels that share the capacity states must provide a capacity state list
+in which the power values reflect the contribution of that individual group.
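+
+On ARM TC2, for example, each cluster forms a single frequency domain spanning
+several per-cpu power domains, so SD_SHARE_CAP_STATES is set at the MC level
+and both the per-cpu (MC level) and per-cluster (DIE level) groups carry
+capacity state lists, with the power values split into per-cpu and
+cluster-only contributions respectively (see the TC2 example patch later in
+this series).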
+
+2. Idle power information
+
+Stored in the idle_states list. The power number is the group idle power
+consumption in each idle state, as well as when the group is idle but has not
+entered an idle-state ('active idle' as mentioned earlier). Due to the way the
+energy model is defined, the idle power of the deepest group idle state can
+alternatively be accounted for in the parent group busy power. In that case the
+group idle state power values are offset such that the idle power of the
+deepest state is zero. This is less intuitive, but easier to measure, since
+the idle power consumed by the group and the busy/idle power of the parent
+group cannot be distinguished without per-group measurement points.
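+
+For illustration (with invented numbers): if a group measures 25 in its
+shallow idle state and 10 in its deepest state, the offset table would list
+15 and 0, with the remaining 10 accounted for in the parent group's busy
+power instead.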
+
+Measuring capacity states and idle power:
+
+The capacity states' capacity and power can be estimated by running a benchmark
+workload at each available capacity state. By restricting the benchmark to run
+on subsets of cpus it is possible to extrapolate the power consumption of
+shared resources.
+
+ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a
+shared L2 cache. TC2 has on-chip energy counters per cluster. Running a
+benchmark workload on just one cpu in a cluster means that power is consumed in
+the cluster (higher level group) and a single cpu (lowest level group). Adding
+another benchmark task to another cpu increases the power consumption by the
+amount consumed by the additional cpu. Hence, it is possible to extrapolate the
+cluster busy power.
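+
+As a hedged numeric illustration (values invented): if the cluster energy
+counter reports 600 units with the benchmark running on one cpu and 800 with
+it running on two cpus, the per-cpu busy power is roughly 800 - 600 = 200 and
+the cluster (shared) busy power roughly 600 - 200 = 400.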
+
+For platforms that don't have energy counters or equivalent instrumentation
+built-in, it may be possible to use an external DAQ to acquire similar data.
+
+If the benchmark includes some performance score (for example sysbench cpu
+benchmark), this can be used to record the compute capacity.
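+
+For example (hypothetical scores): if the benchmark scores 2000 events/s at
+the highest capacity state and 1000 events/s at a lower one, the capacities
+could be recorded as 1024 and 512 respectively, normalizing the highest score
+to the default capacity value.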
+
+Measuring idle power requires insight into the idle state implementation on the
+particular platform, specifically whether the platform has coupled idle-states
+(or package states). To measure non-coupled per-cpu idle-states it is necessary
+to keep one cpu busy so that any shared resources stay alive; this isolates the
+idle power of the cpu from the idle/busy power of the shared resources. The cpu
+can be tricked into entering different per-cpu idle states by disabling the
+other states. Based on various combinations of measurements with specific cpus
+kept busy and specific idle-states disabled, it is possible to extrapolate the
+idle-state power.
-- 
1.9.1



* [RFCv3 PATCH 21/48] sched: Make energy awareness a sched feature
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (19 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 20/48] sched: Documentation for scheduler energy cost model Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:30 ` [RFCv3 PATCH 22/48] sched: Introduce energy data structures Morten Rasmussen
                   ` (27 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

This patch introduces the ENERGY_AWARE sched feature, which is
implemented using jump labels when SCHED_DEBUG is defined. It is
statically set to false when SCHED_DEBUG is not defined. Hence this doesn't
allow energy awareness to be enabled without SCHED_DEBUG. This
sched_feature knob will be replaced later with a more appropriate
control knob when things have matured a bit.

ENERGY_AWARE is based on per-entity load-tracking, hence FAIR_GROUP_SCHED
must be enabled. This dependency isn't checked at compile time yet.
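
With SCHED_DEBUG enabled, the feature can be toggled at runtime through the
usual sched_features debugfs interface, for example (assuming debugfs is
mounted at /sys/kernel/debug):

	echo ENERGY_AWARE > /sys/kernel/debug/sched_features
	echo NO_ENERGY_AWARE > /sys/kernel/debug/sched_features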

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c     | 6 ++++++
 kernel/sched/features.h | 6 ++++++
 2 files changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 33d3d81..2557774 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4554,6 +4554,7 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 
 	return wl;
 }
+
 #else
 
 static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
@@ -4563,6 +4564,11 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 
 #endif
 
+static inline bool energy_aware(void)
+{
+	return sched_feat(ENERGY_AWARE);
+}
+
 static int wake_wide(struct task_struct *p)
 {
 	int factor = this_cpu_read(sd_llc_size);
diff --git a/kernel/sched/features.h b/kernel/sched/features.h
index 90284d1..199ee3a 100644
--- a/kernel/sched/features.h
+++ b/kernel/sched/features.h
@@ -83,3 +83,9 @@ SCHED_FEAT(NUMA_FAVOUR_HIGHER, true)
  */
 SCHED_FEAT(NUMA_RESIST_LOWER, false)
 #endif
+
+/*
+ * Energy aware scheduling. Use platform energy model to guide scheduling
+ * decisions optimizing for energy efficiency.
+ */
+SCHED_FEAT(ENERGY_AWARE, false)
-- 
1.9.1



* [RFCv3 PATCH 22/48] sched: Introduce energy data structures
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (20 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 21/48] sched: Make energy awareness a sched feature Morten Rasmussen
@ 2015-02-04 18:30 ` Morten Rasmussen
  2015-02-04 18:31 ` [RFCv3 PATCH 23/48] sched: Allocate and initialize " Morten Rasmussen
                   ` (26 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:30 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

The struct sched_group_energy represents the per sched_group related
data which is needed for energy aware scheduling. It contains:

  (1) atomic reference counter for scheduler internal bookkeeping of
      data allocation and freeing
  (2) number of elements of the idle state array
  (3) pointer to the idle state array which comprises 'power consumption'
      for each idle state
  (4) number of elements of the capacity state array
  (5) pointer to the capacity state array which comprises 'compute
      capacity and power consumption' tuples for each capacity state

Allocation and freeing of struct sched_group_energy utilizes the existing
infrastructure of the scheduler which is currently used for the other sd
hierarchy data structures (e.g. struct sched_domain) as well. That's why
struct sd_data is provisioned with a per cpu struct sched_group_energy
double pointer.

The struct sched_group obtains a pointer to a struct sched_group_energy.

The function pointer sched_domain_energy_f is introduced into struct
sched_domain_topology_level which will allow the arch to pass a particular
struct sched_group_energy from the topology shim layer into the scheduler
core.

The function pointer sched_domain_energy_f has an 'int cpu' parameter
since the folding of two adjacent sd levels via sd degenerate doesn't work
for all sd levels. That is, it is not possible, for example, to use this
feature to provide per-cpu energy at sd level DIE on ARM's TC2 platform.

It was discussed that the folding of sd levels approach is preferable
over the cpu parameter approach, simply because the user (the arch
specifying the sd topology table) can introduce fewer errors. But since
that approach does not work here, the 'int cpu' parameter is the only
way out. It's
possible to use the folding of sd levels approach for
sched_domain_flags_f and the cpu parameter approach for the
sched_domain_energy_f at the same time though. With the use of the
'int cpu' parameter, an extra check function has to be provided to make
sure that all cpus spanned by a sched group are provisioned with the same
energy data.
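
An arch is then expected to wire a sched_domain_energy_f function into its
topology table; as a rough sketch (the actual ARM TC2 wiring appears in a
later patch in this series):

	{ cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },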

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 include/linux/sched.h | 20 ++++++++++++++++++++
 kernel/sched/sched.h  |  1 +
 2 files changed, 21 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index e220a91..2ea93fb 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -944,6 +944,23 @@ struct sched_domain_attr {
 
 extern int sched_domain_level_max;
 
+struct capacity_state {
+	unsigned long cap;	/* compute capacity */
+	unsigned long power;	/* power consumption at this compute capacity */
+};
+
+struct idle_state {
+	unsigned long power;	 /* power consumption in this idle state */
+};
+
+struct sched_group_energy {
+	atomic_t ref;
+	unsigned int nr_idle_states;	/* number of idle states */
+	struct idle_state *idle_states;	/* ptr to idle state array */
+	unsigned int nr_cap_states;	/* number of capacity states */
+	struct capacity_state *cap_states; /* ptr to capacity state array */
+};
+
 struct sched_group;
 
 struct sched_domain {
@@ -1042,6 +1059,7 @@ bool cpus_share_cache(int this_cpu, int that_cpu);
 
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 typedef int (*sched_domain_flags_f)(void);
+typedef const struct sched_group_energy *(*sched_domain_energy_f)(int cpu);
 
 #define SDTL_OVERLAP	0x01
 
@@ -1049,11 +1067,13 @@ struct sd_data {
 	struct sched_domain **__percpu sd;
 	struct sched_group **__percpu sg;
 	struct sched_group_capacity **__percpu sgc;
+	struct sched_group_energy **__percpu sge;
 };
 
 struct sched_domain_topology_level {
 	sched_domain_mask_f mask;
 	sched_domain_flags_f sd_flags;
+	sched_domain_energy_f energy;
 	int		    flags;
 	int		    numa_level;
 	struct sd_data      data;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 208237f..0e9dcc6 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -817,6 +817,7 @@ struct sched_group {
 
 	unsigned int group_weight;
 	struct sched_group_capacity *sgc;
+	struct sched_group_energy *sge;
 
 	/*
 	 * The CPUs this group covers.
-- 
1.9.1



* [RFCv3 PATCH 23/48] sched: Allocate and initialize energy data structures
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (21 preceding siblings ...)
  2015-02-04 18:30 ` [RFCv3 PATCH 22/48] sched: Introduce energy data structures Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
       [not found]   ` <OF29F384AC.37929D8E-ON48257E35.002FCB0C-48257E35.003156FE@zte.com.cn>
  2015-02-04 18:31 ` [RFCv3 PATCH 24/48] sched: Introduce SD_SHARE_CAP_STATES sched_domain flag Morten Rasmussen
                   ` (25 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

The per sched group sched_group_energy structure plus the related
idle_state and capacity_state arrays are allocated like the other sched
domain (sd) hierarchy data structures. This includes the freeing of
sched_group_energy structures which are not used.

One problem is that the number of elements of the idle_state and the
capacity_state arrays is not fixed and has to be retrieved in
__sdt_alloc() to allocate memory for the sched_group_energy structure and
the two arrays in one chunk. The array pointers (idle_states and
cap_states) are initialized here to point to the correct place inside the
memory chunk.

The new function init_sched_energy() initializes the sched_group_energy
structure and the two arrays in case the sd topology level contains energy
information.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/core.c  | 71 +++++++++++++++++++++++++++++++++++++++++++++++++++-
 kernel/sched/sched.h | 33 ++++++++++++++++++++++++
 2 files changed, 103 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index a00a4c3..031ea48 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5707,6 +5707,7 @@ static void free_sched_domain(struct rcu_head *rcu)
 		free_sched_groups(sd->groups, 1);
 	} else if (atomic_dec_and_test(&sd->groups->ref)) {
 		kfree(sd->groups->sgc);
+		kfree(sd->groups->sge);
 		kfree(sd->groups);
 	}
 	kfree(sd);
@@ -5965,6 +5966,8 @@ static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
 		*sg = *per_cpu_ptr(sdd->sg, cpu);
 		(*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu);
 		atomic_set(&(*sg)->sgc->ref, 1); /* for claim_allocations */
+		(*sg)->sge = *per_cpu_ptr(sdd->sge, cpu);
+		atomic_set(&(*sg)->sge->ref, 1); /* for claim_allocations */
 	}
 
 	return cpu;
@@ -6054,6 +6057,28 @@ static void init_sched_groups_capacity(int cpu, struct sched_domain *sd)
 	atomic_set(&sg->sgc->nr_busy_cpus, sg->group_weight);
 }
 
+static void init_sched_energy(int cpu, struct sched_domain *sd,
+			      struct sched_domain_topology_level *tl)
+{
+	struct sched_group *sg = sd->groups;
+	struct sched_group_energy *energy = sg->sge;
+	sched_domain_energy_f fn = tl->energy;
+	struct cpumask *mask = sched_group_cpus(sg);
+
+	if (!fn || !fn(cpu))
+		return;
+
+	if (cpumask_weight(mask) > 1)
+		check_sched_energy_data(cpu, fn, mask);
+
+	energy->nr_idle_states = fn(cpu)->nr_idle_states;
+	memcpy(energy->idle_states, fn(cpu)->idle_states,
+	       energy->nr_idle_states*sizeof(struct idle_state));
+	energy->nr_cap_states = fn(cpu)->nr_cap_states;
+	memcpy(energy->cap_states, fn(cpu)->cap_states,
+	       energy->nr_cap_states*sizeof(struct capacity_state));
+}
+
 /*
  * Initializers for schedule domains
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -6144,6 +6169,9 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
 
 	if (atomic_read(&(*per_cpu_ptr(sdd->sgc, cpu))->ref))
 		*per_cpu_ptr(sdd->sgc, cpu) = NULL;
+
+	if (atomic_read(&(*per_cpu_ptr(sdd->sge, cpu))->ref))
+		*per_cpu_ptr(sdd->sge, cpu) = NULL;
 }
 
 #ifdef CONFIG_NUMA
@@ -6609,10 +6637,24 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 		if (!sdd->sgc)
 			return -ENOMEM;
 
+		sdd->sge = alloc_percpu(struct sched_group_energy *);
+		if (!sdd->sge)
+			return -ENOMEM;
+
 		for_each_cpu(j, cpu_map) {
 			struct sched_domain *sd;
 			struct sched_group *sg;
 			struct sched_group_capacity *sgc;
+			struct sched_group_energy *sge;
+			sched_domain_energy_f fn = tl->energy;
+			unsigned int nr_idle_states = 0;
+			unsigned int nr_cap_states = 0;
+
+			if (fn && fn(j)) {
+				nr_idle_states = fn(j)->nr_idle_states;
+				nr_cap_states = fn(j)->nr_cap_states;
+				BUG_ON(!nr_idle_states || !nr_cap_states);
+			}
 
 		       	sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
@@ -6636,6 +6678,26 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 				return -ENOMEM;
 
 			*per_cpu_ptr(sdd->sgc, j) = sgc;
+
+			sge = kzalloc_node(sizeof(struct sched_group_energy) +
+				nr_idle_states*sizeof(struct idle_state) +
+				nr_cap_states*sizeof(struct capacity_state),
+				GFP_KERNEL, cpu_to_node(j));
+
+			if (!sge)
+				return -ENOMEM;
+
+			sge->idle_states = (struct idle_state *)
+					   ((void *)&sge->cap_states +
+					    sizeof(sge->cap_states));
+
+			sge->cap_states = (struct capacity_state *)
+					  ((void *)&sge->cap_states +
+					   sizeof(sge->cap_states) +
+					   nr_idle_states*
+					   sizeof(struct idle_state));
+
+			*per_cpu_ptr(sdd->sge, j) = sge;
 		}
 	}
 
@@ -6664,6 +6726,8 @@ static void __sdt_free(const struct cpumask *cpu_map)
 				kfree(*per_cpu_ptr(sdd->sg, j));
 			if (sdd->sgc)
 				kfree(*per_cpu_ptr(sdd->sgc, j));
+			if (sdd->sge)
+				kfree(*per_cpu_ptr(sdd->sge, j));
 		}
 		free_percpu(sdd->sd);
 		sdd->sd = NULL;
@@ -6671,6 +6735,8 @@ static void __sdt_free(const struct cpumask *cpu_map)
 		sdd->sg = NULL;
 		free_percpu(sdd->sgc);
 		sdd->sgc = NULL;
+		free_percpu(sdd->sge);
+		sdd->sge = NULL;
 	}
 }
 
@@ -6756,10 +6822,13 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 
 	/* Calculate CPU capacity for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
+		struct sched_domain_topology_level *tl = sched_domain_topology;
+
 		if (!cpumask_test_cpu(i, cpu_map))
 			continue;
 
-		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent, tl++) {
+			init_sched_energy(i, sd, tl);
 			claim_allocations(i, sd);
 			init_sched_groups_capacity(i, sd);
 		}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 0e9dcc6..86cf6b2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -854,6 +854,39 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
 
 extern int group_balance_cpu(struct sched_group *sg);
 
+/*
+ * Check that the per-cpu provided sd energy data is consistent for all cpus
+ * within the mask.
+ */
+static inline void check_sched_energy_data(int cpu, sched_domain_energy_f fn,
+					   const struct cpumask *cpumask)
+{
+	struct cpumask mask;
+	int i;
+
+	cpumask_xor(&mask, cpumask, get_cpu_mask(cpu));
+
+	for_each_cpu(i, &mask) {
+		int y;
+
+		BUG_ON(fn(i)->nr_idle_states != fn(cpu)->nr_idle_states);
+
+		for (y = 0; y < (fn(i)->nr_idle_states); y++) {
+			BUG_ON(fn(i)->idle_states[y].power !=
+					fn(cpu)->idle_states[y].power);
+		}
+
+		BUG_ON(fn(i)->nr_cap_states != fn(cpu)->nr_cap_states);
+
+		for (y = 0; y < (fn(i)->nr_cap_states); y++) {
+			BUG_ON(fn(i)->cap_states[y].cap !=
+					fn(cpu)->cap_states[y].cap);
+			BUG_ON(fn(i)->cap_states[y].power !=
+					fn(cpu)->cap_states[y].power);
+		}
+	}
+}
+
 #else
 
 static inline void sched_ttwu_pending(void) { }
-- 
1.9.1



* [RFCv3 PATCH 24/48] sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (22 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 23/48] sched: Allocate and initialize " Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-02-04 18:31 ` [RFCv3 PATCH 25/48] arm: topology: Define TC2 energy and provide it to the scheduler Morten Rasmussen
                   ` (24 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel, Russell King

cpufreq is currently keeping it a secret which cpus are sharing a
clock source. The scheduler needs to know about clock domains as well
to become more energy aware. The SD_SHARE_CAP_STATES domain flag
indicates whether cpus belonging to the sched_domain share capacity
states (P-states).

There is no connection with cpufreq (yet). The flag must be set by
the arch specific topology code.

cc: Russell King <linux@arm.linux.org.uk>
cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/arm/kernel/topology.c |  3 ++-
 include/linux/sched.h      |  1 +
 kernel/sched/core.c        | 10 +++++++---
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 34ecbdc..fdbe784 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -292,7 +292,8 @@ void store_cpu_topology(unsigned int cpuid)
 
 static inline int cpu_corepower_flags(void)
 {
-	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
+	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN | \
+	       SD_SHARE_CAP_STATES;
 }
 
 static struct sched_domain_topology_level arm_topology[] = {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 2ea93fb..78b6eb7 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -912,6 +912,7 @@ enum cpu_idle_type {
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 #define SD_NUMA			0x4000	/* cross-node balancing */
+#define SD_SHARE_CAP_STATES	0x8000  /* Domain members share capacity state */
 
 #ifdef CONFIG_SCHED_SMT
 static inline int cpu_smt_flags(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 031ea48..c49f3ee 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5522,7 +5522,8 @@ static int sd_degenerate(struct sched_domain *sd)
 			 SD_BALANCE_EXEC |
 			 SD_SHARE_CPUCAPACITY |
 			 SD_SHARE_PKG_RESOURCES |
-			 SD_SHARE_POWERDOMAIN)) {
+			 SD_SHARE_POWERDOMAIN |
+			 SD_SHARE_CAP_STATES)) {
 		if (sd->groups != sd->groups->next)
 			return 0;
 	}
@@ -5554,7 +5555,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_SHARE_CPUCAPACITY |
 				SD_SHARE_PKG_RESOURCES |
 				SD_PREFER_SIBLING |
-				SD_SHARE_POWERDOMAIN);
+				SD_SHARE_POWERDOMAIN |
+				SD_SHARE_CAP_STATES);
 		if (nr_node_ids == 1)
 			pflags &= ~SD_SERIALIZE;
 	}
@@ -6190,6 +6192,7 @@ static int sched_domains_curr_level;
  * SD_SHARE_PKG_RESOURCES - describes shared caches
  * SD_NUMA                - describes NUMA topologies
  * SD_SHARE_POWERDOMAIN   - describes shared power domain
+ * SD_SHARE_CAP_STATES    - describes shared capacity states
  *
  * Odd one out:
  * SD_ASYM_PACKING        - describes SMT quirks
@@ -6199,7 +6202,8 @@ static int sched_domains_curr_level;
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA |			\
 	 SD_ASYM_PACKING |		\
-	 SD_SHARE_POWERDOMAIN)
+	 SD_SHARE_POWERDOMAIN |		\
+	 SD_SHARE_CAP_STATES)
 
 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl, int cpu)
-- 
1.9.1



* [RFCv3 PATCH 25/48] arm: topology: Define TC2 energy and provide it to the scheduler
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (23 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 24/48] sched: Introduce SD_SHARE_CAP_STATES sched_domain flag Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-02-04 18:31 ` [RFCv3 PATCH 26/48] sched: Compute cpu capacity available at current frequency Morten Rasmussen
                   ` (23 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel, Russell King

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

This patch is only here to be able to test the provisioning of energy related
data from an arch topology shim layer to the scheduler. Since there is no
code today which deals with extracting energy related data from the dtb or
acpi and processing it in the topology shim layer, the content of the
sched_group_energy structures as well as the idle_state and capacity_state
arrays is hard-coded here.

This patch defines the sched_group_energy structure as well as the
idle_state and capacity_state array for the cluster (relates to sched
groups (sgs) in DIE sched domain level) and for the core (relates to sgs
in MC sd level) for a Cortex A7 as well as for a Cortex A15.
It further provides related implementations of the sched_domain_energy_f
functions (cpu_cluster_energy() and cpu_core_energy()).

To be able to propagate this information from the topology shim layer to
the scheduler, the elements of the arm_topology[] table have been
provisioned with the appropriate sched_domain_energy_f functions.

cc: Russell King <linux@arm.linux.org.uk>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 arch/arm/kernel/topology.c | 118 +++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 115 insertions(+), 3 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index fdbe784..d3811e5 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -290,6 +290,119 @@ void store_cpu_topology(unsigned int cpuid)
 		cpu_topology[cpuid].socket_id, mpidr);
 }
 
+/*
+ * ARM TC2 specific energy cost model data. There are no unit requirements for
+ * the data. Data can be normalized to any reference point, but the
+ * normalization must be consistent. That is, one bogo-joule/watt must be the
+ * same quantity for all data, but we don't care what it is.
+ */
+static struct idle_state idle_states_cluster_a7[] = {
+	 { .power = 25 }, /* WFI */
+	 { .power = 10 }, /* cluster-sleep-l */
+	};
+
+static struct idle_state idle_states_cluster_a15[] = {
+	 { .power = 70 }, /* WFI */
+	 { .power = 25 }, /* cluster-sleep-b */
+	};
+
+static struct capacity_state cap_states_cluster_a7[] = {
+	/* Cluster only power */
+	 { .cap =  150, .power = 2967, }, /*  350 MHz */
+	 { .cap =  172, .power = 2792, }, /*  400 MHz */
+	 { .cap =  215, .power = 2810, }, /*  500 MHz */
+	 { .cap =  258, .power = 2815, }, /*  600 MHz */
+	 { .cap =  301, .power = 2919, }, /*  700 MHz */
+	 { .cap =  344, .power = 2847, }, /*  800 MHz */
+	 { .cap =  387, .power = 3917, }, /*  900 MHz */
+	 { .cap =  430, .power = 4905, }, /* 1000 MHz */
+	};
+
+static struct capacity_state cap_states_cluster_a15[] = {
+	/* Cluster only power */
+	 { .cap =  426, .power =  7920, }, /*  500 MHz */
+	 { .cap =  512, .power =  8165, }, /*  600 MHz */
+	 { .cap =  597, .power =  8172, }, /*  700 MHz */
+	 { .cap =  682, .power =  8195, }, /*  800 MHz */
+	 { .cap =  768, .power =  8265, }, /*  900 MHz */
+	 { .cap =  853, .power =  8446, }, /* 1000 MHz */
+	 { .cap =  938, .power = 11426, }, /* 1100 MHz */
+	 { .cap = 1024, .power = 15200, }, /* 1200 MHz */
+	};
+
+static struct sched_group_energy energy_cluster_a7 = {
+	  .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a7),
+	  .idle_states    = idle_states_cluster_a7,
+	  .nr_cap_states  = ARRAY_SIZE(cap_states_cluster_a7),
+	  .cap_states     = cap_states_cluster_a7,
+};
+
+static struct sched_group_energy energy_cluster_a15 = {
+	  .nr_idle_states = ARRAY_SIZE(idle_states_cluster_a15),
+	  .idle_states    = idle_states_cluster_a15,
+	  .nr_cap_states  = ARRAY_SIZE(cap_states_cluster_a15),
+	  .cap_states     = cap_states_cluster_a15,
+};
+
+static struct idle_state idle_states_core_a7[] = {
+	 { .power = 0 }, /* WFI */
+	};
+
+static struct idle_state idle_states_core_a15[] = {
+	 { .power = 0 }, /* WFI */
+	};
+
+static struct capacity_state cap_states_core_a7[] = {
+	/* Power per cpu */
+	 { .cap =  150, .power =  187, }, /*  350 MHz */
+	 { .cap =  172, .power =  275, }, /*  400 MHz */
+	 { .cap =  215, .power =  334, }, /*  500 MHz */
+	 { .cap =  258, .power =  407, }, /*  600 MHz */
+	 { .cap =  301, .power =  447, }, /*  700 MHz */
+	 { .cap =  344, .power =  549, }, /*  800 MHz */
+	 { .cap =  387, .power =  761, }, /*  900 MHz */
+	 { .cap =  430, .power = 1024, }, /* 1000 MHz */
+	};
+
+static struct capacity_state cap_states_core_a15[] = {
+	/* Power per cpu */
+	 { .cap =  426, .power = 2021, }, /*  500 MHz */
+	 { .cap =  512, .power = 2312, }, /*  600 MHz */
+	 { .cap =  597, .power = 2756, }, /*  700 MHz */
+	 { .cap =  682, .power = 3125, }, /*  800 MHz */
+	 { .cap =  768, .power = 3524, }, /*  900 MHz */
+	 { .cap =  853, .power = 3846, }, /* 1000 MHz */
+	 { .cap =  938, .power = 5177, }, /* 1100 MHz */
+	 { .cap = 1024, .power = 6997, }, /* 1200 MHz */
+	};
+
+static struct sched_group_energy energy_core_a7 = {
+	  .nr_idle_states = ARRAY_SIZE(idle_states_core_a7),
+	  .idle_states    = idle_states_core_a7,
+	  .nr_cap_states  = ARRAY_SIZE(cap_states_core_a7),
+	  .cap_states     = cap_states_core_a7,
+};
+
+static struct sched_group_energy energy_core_a15 = {
+	  .nr_idle_states = ARRAY_SIZE(idle_states_core_a15),
+	  .idle_states    = idle_states_core_a15,
+	  .nr_cap_states  = ARRAY_SIZE(cap_states_core_a15),
+	  .cap_states     = cap_states_core_a15,
+};
+
+/* sd energy functions */
+static inline const struct sched_group_energy *cpu_cluster_energy(int cpu)
+{
+	return cpu_topology[cpu].socket_id ? &energy_cluster_a7 :
+			&energy_cluster_a15;
+}
+
+static inline const struct sched_group_energy *cpu_core_energy(int cpu)
+{
+	return cpu_topology[cpu].socket_id ? &energy_core_a7 :
+			&energy_core_a15;
+}
+
 static inline int cpu_corepower_flags(void)
 {
 	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN | \
@@ -298,10 +411,9 @@ static inline int cpu_corepower_flags(void)
 
 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
-	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
-	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
+	{ cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },
 #endif
-	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ cpu_cpu_mask, 0, cpu_cluster_energy, SD_INIT_NAME(DIE) },
 	{ NULL, },
 };
 
-- 
1.9.1



* [RFCv3 PATCH 26/48] sched: Compute cpu capacity available at current frequency
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (24 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 25/48] arm: topology: Define TC2 energy and provide it to the scheduler Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-02-04 18:31 ` [RFCv3 PATCH 27/48] sched: Relocated get_cpu_usage() Morten Rasmussen
                   ` (22 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

capacity_orig_of() returns the max available compute capacity of a cpu.
For scale-invariant utilization tracking and energy-aware scheduling
decisions it is useful to know the compute capacity available at the
current OPP of a cpu.
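
For example (assuming the usual SCHED_CAPACITY_SCALE of 1024): a cpu with
capacity_orig = 1024 running at half its maximum frequency, i.e. with
arch_scale_freq_capacity() returning 512, has a current capacity of
1024 * 512 >> 10 = 512.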

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 2557774..70acb4c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4564,6 +4564,17 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 
 #endif
 
+/*
+ * Returns the current capacity of cpu after applying both
+ * cpu and freq scaling.
+ */
+static unsigned long capacity_curr_of(int cpu)
+{
+	return cpu_rq(cpu)->cpu_capacity_orig *
+	       arch_scale_freq_capacity(NULL, cpu)
+	       >> SCHED_CAPACITY_SHIFT;
+}
+
 static inline bool energy_aware(void)
 {
 	return sched_feat(ENERGY_AWARE);
-- 
1.9.1



* [RFCv3 PATCH 27/48] sched: Relocated get_cpu_usage()
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (25 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 26/48] sched: Compute cpu capacity available at current frequency Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-02-04 18:31 ` [RFCv3 PATCH 28/48] sched: Use capacity_curr to cap utilization in get_cpu_usage() Morten Rasmussen
                   ` (21 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

Move get_cpu_usage() to an earlier position in fair.c.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 56 ++++++++++++++++++++++++++---------------------------
 1 file changed, 28 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 70acb4c..071310a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4575,6 +4575,34 @@ static unsigned long capacity_curr_of(int cpu)
 	       >> SCHED_CAPACITY_SHIFT;
 }
 
+/*
+ * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
+ * tasks. The unit of the return value must capacity so we can compare the
+ * usage with the capacity of the CPU that is available for CFS task (ie
+ * cpu_capacity).
+ * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
+ * CPU. It represents the amount of utilization of a CPU in the range
+ * [0..SCHED_LOAD_SCALE].  The usage of a CPU can't be higher than the full
+ * capacity of the CPU because it's about the running time on this CPU.
+ * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
+ * because of unfortunate rounding in avg_period and running_load_avg or just
+ * after migrating tasks until the average stabilizes with the new running
+ * time. So we need to check that the usage stays into the range
+ * [0..cpu_capacity_orig] and cap if necessary.
+ * Without capping the usage, a group could be seen as overloaded (CPU0 usage
+ * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
+ */
+static int get_cpu_usage(int cpu)
+{
+	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
+	unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
+
+	if (usage + blocked >= SCHED_LOAD_SCALE)
+		return capacity_orig_of(cpu);
+
+	return usage + blocked;
+}
+
 static inline bool energy_aware(void)
 {
 	return sched_feat(ENERGY_AWARE);
@@ -4827,34 +4855,6 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	return target;
 }
 /*
- * get_cpu_usage returns the amount of capacity of a CPU that is used by CFS
- * tasks. The unit of the return value must capacity so we can compare the
- * usage with the capacity of the CPU that is available for CFS task (ie
- * cpu_capacity).
- * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
- * CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE].  The usage of a CPU can't be higher than the full
- * capacity of the CPU because it's about the running time on this CPU.
- * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
- * because of unfortunate rounding in avg_period and running_load_avg or just
- * after migrating tasks until the average stabilizes with the new running
- * time. So we need to check that the usage stays into the range
- * [0..cpu_capacity_orig] and cap if necessary.
- * Without capping the usage, a group could be seen as overloaded (CPU0 usage
- * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
- */
-static int get_cpu_usage(int cpu)
-{
-	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
-	unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
-
-	if (usage + blocked >= SCHED_LOAD_SCALE)
-		return capacity_orig_of(cpu);
-
-	return usage + blocked;
-}
-
-/*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
  * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
-- 
1.9.1



* [RFCv3 PATCH 28/48] sched: Use capacity_curr to cap utilization in get_cpu_usage()
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (26 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 27/48] sched: Relocated get_cpu_usage() Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-03-23 16:14   ` Peter Zijlstra
  2015-02-04 18:31 ` [RFCv3 PATCH 29/48] sched: Highest energy aware balancing sched_domain level pointer Morten Rasmussen
                   ` (20 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

With scale-invariant usage tracking get_cpu_usage() should never return
a usage above the current compute capacity of the cpu (capacity_curr).
The scaling of the utilization tracking contributions should generally
cause the cpu utilization to saturate at capacity_curr, but it may
temporarily exceed this value in certain situations. This patch changes
the cap from capacity_orig to capacity_curr.
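
For example, a cpu currently running at half its maximum frequency has
capacity_curr of roughly 512 (with capacity_orig = 1024), so its reported
usage is now capped at 512 rather than 1024.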

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 11 ++++++-----
 1 file changed, 6 insertions(+), 5 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 071310a..872ae0e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4582,13 +4582,13 @@ static unsigned long capacity_curr_of(int cpu)
  * cpu_capacity).
  * cfs.utilization_load_avg is the sum of running time of runnable tasks on a
  * CPU. It represents the amount of utilization of a CPU in the range
- * [0..SCHED_LOAD_SCALE].  The usage of a CPU can't be higher than the full
+ * [0..capacity_curr]. The usage of a CPU can't be higher than the current
  * capacity of the CPU because it's about the running time on this CPU.
- * Nevertheless, cfs.utilization_load_avg can be higher than SCHED_LOAD_SCALE
+ * Nevertheless, cfs.utilization_load_avg can be higher than capacity_curr
  * because of unfortunate rounding in avg_period and running_load_avg or just
  * after migrating tasks until the average stabilizes with the new running
  * time. So we need to check that the usage stays into the range
- * [0..cpu_capacity_orig] and cap if necessary.
+ * [0..cpu_capacity_curr] and cap if necessary.
  * Without capping the usage, a group could be seen as overloaded (CPU0 usage
  * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
  */
@@ -4596,9 +4596,10 @@ static int get_cpu_usage(int cpu)
 {
 	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
 	unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
+	unsigned long capacity_curr = capacity_curr_of(cpu);
 
-	if (usage + blocked >= SCHED_LOAD_SCALE)
-		return capacity_orig_of(cpu);
+	if (usage + blocked >= capacity_curr)
+		return capacity_curr;
 
 	return usage + blocked;
 }
-- 
1.9.1



* [RFCv3 PATCH 29/48] sched: Highest energy aware balancing sched_domain level pointer
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (27 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 28/48] sched: Use capacity_curr to cap utilization in get_cpu_usage() Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-03-23 16:16   ` Peter Zijlstra
       [not found]   ` <OF5977496A.A21A7B96-ON48257E35.002EC23C-48257E35.00324DAD@zte.com.cn>
  2015-02-04 18:31 ` [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group Morten Rasmussen
                   ` (19 subsequent siblings)
  48 siblings, 2 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_ea, points to the highest level at which an energy
model is provided. At this level and all levels below, all sched_groups
have energy model data attached.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c  | 11 ++++++++++-
 kernel/sched/sched.h |  1 +
 2 files changed, 11 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c49f3ee..e47febf 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5741,11 +5741,12 @@ DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain *, sd_busy);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_PER_CPU(struct sched_domain *, sd_ea);
 
 static void update_top_cache_domain(int cpu)
 {
 	struct sched_domain *sd;
-	struct sched_domain *busy_sd = NULL;
+	struct sched_domain *busy_sd = NULL, *ea_sd = NULL;
 	int id = cpu;
 	int size = 1;
 
@@ -5766,6 +5767,14 @@ static void update_top_cache_domain(int cpu)
 
 	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
 	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
+
+	for_each_domain(cpu, sd) {
+		if (sd->groups->sge)
+			ea_sd = sd;
+		else
+			break;
+	}
+	rcu_assign_pointer(per_cpu(sd_ea, cpu), ea_sd);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 86cf6b2..dedf0ec 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -793,6 +793,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_busy);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(struct sched_domain *, sd_ea);
 
 struct sched_group_capacity {
 	atomic_t ref;
-- 
1.9.1



* [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (28 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 29/48] sched: Highest energy aware balancing sched_domain level pointer Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-03-13 22:54   ` Sai Gurrappadi
  2015-03-20 18:40   ` Sai Gurrappadi
  2015-02-04 18:31 ` [RFCv3 PATCH 31/48] sched: Extend sched_group_energy to test load-balancing decisions Morten Rasmussen
                   ` (18 subsequent siblings)
  48 siblings, 2 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

For energy-aware load-balancing decisions it is necessary to know the
energy consumption estimates of groups of cpus. This patch introduces a
basic function, sched_group_energy(), which estimates the energy
consumption of the cpus in the group and any resources shared by the
members of the group.
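
As a rough worked example of the per-group terms computed by this function
(numbers invented): with group_util = 512 (half busy), a busy power of 400 at
the selected capacity state and an idle power of 50, the busy contribution is
(512 * 400) >> 10 = 200 and the idle contribution (1024 - 512) * 50 >> 10 = 25,
giving 225 for that group.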

NOTE: The function has five levels of indentation and breaks the 80
character limit. Refactoring is necessary.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 143 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 872ae0e..d12aa63 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4609,6 +4609,149 @@ static inline bool energy_aware(void)
 	return sched_feat(ENERGY_AWARE);
 }
 
+/*
+ * cpu_norm_usage() returns the cpu usage relative to it's current capacity,
+ * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
+ * energy calculations. Using the scale-invariant usage returned by
+ * get_cpu_usage() and approximating scale-invariant usage by:
+ *
+ *   usage ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time
+ *
+ * the normalized usage can be found using capacity_curr.
+ *
+ *   capacity_curr = capacity_orig * curr_freq/max_freq
+ *
+ *   norm_usage = running_time/time ~ usage/capacity_curr
+ */
+static inline unsigned long cpu_norm_usage(int cpu)
+{
+	unsigned long capacity_curr = capacity_curr_of(cpu);
+
+	return (get_cpu_usage(cpu) << SCHED_CAPACITY_SHIFT)/capacity_curr;
+}
+
+static unsigned group_max_usage(struct sched_group *sg)
+{
+	int i;
+	int max_usage = 0;
+
+	for_each_cpu(i, sched_group_cpus(sg))
+		max_usage = max(max_usage, get_cpu_usage(i));
+
+	return max_usage;
+}
+
+/*
+ * group_norm_usage() returns the approximated group usage relative to it's
+ * current capacity (busy ratio) in the range [0..SCHED_LOAD_SCALE] for use in
+ * energy calculations. Since task executions may or may not overlap in time in
+ * the group the true normalized usage is between max(cpu_norm_usage(i)) and
+ * sum(cpu_norm_usage(i)) when iterating over all cpus in the group, i. The
+ * latter is used as the estimate as it leads to a more pessimistic energy
+ * estimate (more busy).
+ */
+static unsigned group_norm_usage(struct sched_group *sg)
+{
+	int i;
+	unsigned long usage_sum = 0;
+
+	for_each_cpu(i, sched_group_cpus(sg))
+		usage_sum += cpu_norm_usage(i);
+
+	if (usage_sum > SCHED_CAPACITY_SCALE)
+		return SCHED_CAPACITY_SCALE;
+	return usage_sum;
+}
+
+static int find_new_capacity(struct sched_group *sg,
+		struct sched_group_energy *sge)
+{
+	int idx;
+	unsigned long util = group_max_usage(sg);
+
+	for (idx = 0; idx < sge->nr_cap_states; idx++) {
+		if (sge->cap_states[idx].cap >= util)
+			return idx;
+	}
+
+	return idx;
+}
+
+/*
+ * sched_group_energy(): Returns absolute energy consumption of cpus belonging
+ * to the sched_group including shared resources shared only by members of the
+ * group. Iterates over all cpus in the hierarchy below the sched_group starting
+ * from the bottom working it's way up before going to the next cpu until all
+ * cpus are covered at all levels. The current implementation is likely to
+ * gather the same usage statistics multiple times. This can probably be done in
+ * a faster but more complex way.
+ */
+static unsigned int sched_group_energy(struct sched_group *sg_top)
+{
+	struct sched_domain *sd;
+	int cpu, total_energy = 0;
+	struct cpumask visit_cpus;
+	struct sched_group *sg;
+
+	WARN_ON(!sg_top->sge);
+
+	cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
+
+	while (!cpumask_empty(&visit_cpus)) {
+		struct sched_group *sg_shared_cap = NULL;
+
+		cpu = cpumask_first(&visit_cpus);
+
+		/*
+		 * Is the group utilization affected by cpus outside this
+		 * sched_group?
+		 */
+		sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
+		if (sd && sd->parent)
+			sg_shared_cap = sd->parent->groups;
+
+		for_each_domain(cpu, sd) {
+			sg = sd->groups;
+
+			/* Has this sched_domain already been visited? */
+			if (sd->child && cpumask_first(sched_group_cpus(sg)) != cpu)
+				break;
+
+			do {
+				struct sched_group *sg_cap_util;
+				unsigned group_util;
+				int sg_busy_energy, sg_idle_energy;
+				int cap_idx;
+
+				if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
+					sg_cap_util = sg_shared_cap;
+				else
+					sg_cap_util = sg;
+
+				cap_idx = find_new_capacity(sg_cap_util, sg->sge);
+				group_util = group_norm_usage(sg);
+				sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
+										>> SCHED_CAPACITY_SHIFT;
+				sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
+										>> SCHED_CAPACITY_SHIFT;
+
+				total_energy += sg_busy_energy + sg_idle_energy;
+
+				if (!sd->child)
+					cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
+
+				if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(sg_top)))
+					goto next_cpu;
+
+			} while (sg = sg->next, sg != sd->groups);
+		}
+next_cpu:
+		continue;
+	}
+
+	return total_energy;
+}
+
 static int wake_wide(struct task_struct *p)
 {
 	int factor = this_cpu_read(sd_llc_size);
-- 
1.9.1



* [RFCv3 PATCH 31/48] sched: Extend sched_group_energy to test load-balancing decisions
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (29 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
       [not found]   ` <OF081FBA75.F80B8844-ON48257E37.00261E89-48257E37.00267F24@zte.com.cn>
  2015-02-04 18:31 ` [RFCv3 PATCH 32/48] sched: Estimate energy impact of scheduling decisions Morten Rasmussen
                   ` (17 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

Extended sched_group_energy() to support energy prediction with usage
(tasks) added/removed from a specific cpu or migrated between a pair of
cpus. Useful for load-balancing decision making.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 90 +++++++++++++++++++++++++++++++++++++++--------------
 1 file changed, 66 insertions(+), 24 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d12aa63..07c84af 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4592,23 +4592,44 @@ static unsigned long capacity_curr_of(int cpu)
  * Without capping the usage, a group could be seen as overloaded (CPU0 usage
  * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
  */
-static int get_cpu_usage(int cpu)
+static int __get_cpu_usage(int cpu, int delta)
 {
+	int sum;
 	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
 	unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
 	unsigned long capacity_curr = capacity_curr_of(cpu);
 
-	if (usage + blocked >= capacity_curr)
+	sum = usage + blocked + delta;
+
+	if (sum < 0)
+		return 0;
+
+	if (sum >= capacity_curr)
 		return capacity_curr;
 
-	return usage + blocked;
+	return sum;
 }
 
+static int get_cpu_usage(int cpu)
+{
+	return __get_cpu_usage(cpu, 0);
+}
+
+
 static inline bool energy_aware(void)
 {
 	return sched_feat(ENERGY_AWARE);
 }
 
+struct energy_env {
+	struct sched_group	*sg_top;
+	struct sched_group	*sg_cap;
+	int			usage_delta;
+	int			src_cpu;
+	int			dst_cpu;
+	int			energy;
+};
+
 /*
  * cpu_norm_usage() returns the cpu usage relative to it's current capacity,
  * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
@@ -4623,20 +4644,38 @@ static inline bool energy_aware(void)
  *
  *   norm_usage = running_time/time ~ usage/capacity_curr
  */
-static inline unsigned long cpu_norm_usage(int cpu)
+static inline unsigned long __cpu_norm_usage(int cpu, int delta)
 {
 	unsigned long capacity_curr = capacity_curr_of(cpu);
 
-	return (get_cpu_usage(cpu) << SCHED_CAPACITY_SHIFT)/capacity_curr;
+	return (__get_cpu_usage(cpu, delta) << SCHED_CAPACITY_SHIFT)
+							/capacity_curr;
 }
 
-static unsigned group_max_usage(struct sched_group *sg)
+static inline unsigned long cpu_norm_usage(int cpu)
 {
-	int i;
+	return __cpu_norm_usage(cpu, 0);
+}
+
+static inline int calc_usage_delta(struct energy_env *eenv, int cpu)
+{
+	if (cpu == eenv->src_cpu)
+		return -eenv->usage_delta;
+	if (cpu == eenv->dst_cpu)
+		return eenv->usage_delta;
+	return 0;
+}
+
+static unsigned group_max_usage(struct energy_env *eenv,
+					struct sched_group *sg)
+{
+	int i, delta;
 	int max_usage = 0;
 
-	for_each_cpu(i, sched_group_cpus(sg))
-		max_usage = max(max_usage, get_cpu_usage(i));
+	for_each_cpu(i, sched_group_cpus(sg)) {
+		delta = calc_usage_delta(eenv, i);
+		max_usage = max(max_usage, __get_cpu_usage(i, delta));
+	}
 
 	return max_usage;
 }
@@ -4650,24 +4689,27 @@ static unsigned group_max_usage(struct sched_group *sg)
  * latter is used as the estimate as it leads to a more pessimistic energy
  * estimate (more busy).
  */
-static unsigned group_norm_usage(struct sched_group *sg)
+static unsigned group_norm_usage(struct energy_env *eenv,
+					struct sched_group *sg)
 {
-	int i;
+	int i, delta;
 	unsigned long usage_sum = 0;
 
-	for_each_cpu(i, sched_group_cpus(sg))
-		usage_sum += cpu_norm_usage(i);
+	for_each_cpu(i, sched_group_cpus(sg)) {
+		delta = calc_usage_delta(eenv, i);
+		usage_sum += __cpu_norm_usage(i, delta);
+	}
 
 	if (usage_sum > SCHED_CAPACITY_SCALE)
 		return SCHED_CAPACITY_SCALE;
 	return usage_sum;
 }
 
-static int find_new_capacity(struct sched_group *sg,
+static int find_new_capacity(struct energy_env *eenv,
 		struct sched_group_energy *sge)
 {
 	int idx;
-	unsigned long util = group_max_usage(sg);
+	unsigned long util = group_max_usage(eenv, eenv->sg_cap);
 
 	for (idx = 0; idx < sge->nr_cap_states; idx++) {
 		if (sge->cap_states[idx].cap >= util)
@@ -4686,16 +4728,16 @@ static int find_new_capacity(struct sched_group *sg,
  * gather the same usage statistics multiple times. This can probably be done in
  * a faster but more complex way.
  */
-static unsigned int sched_group_energy(struct sched_group *sg_top)
+static unsigned int sched_group_energy(struct energy_env *eenv)
 {
 	struct sched_domain *sd;
 	int cpu, total_energy = 0;
 	struct cpumask visit_cpus;
 	struct sched_group *sg;
 
-	WARN_ON(!sg_top->sge);
+	WARN_ON(!eenv->sg_top->sge);
 
-	cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
+	cpumask_copy(&visit_cpus, sched_group_cpus(eenv->sg_top));
 
 	while (!cpumask_empty(&visit_cpus)) {
 		struct sched_group *sg_shared_cap = NULL;
@@ -4718,18 +4760,17 @@ static unsigned int sched_group_energy(struct sched_group *sg_top)
 				break;
 
 			do {
-				struct sched_group *sg_cap_util;
 				unsigned group_util;
 				int sg_busy_energy, sg_idle_energy;
 				int cap_idx;
 
 				if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
-					sg_cap_util = sg_shared_cap;
+					eenv->sg_cap = sg_shared_cap;
 				else
-					sg_cap_util = sg;
+					eenv->sg_cap = sg;
 
-				cap_idx = find_new_capacity(sg_cap_util, sg->sge);
-				group_util = group_norm_usage(sg);
+				cap_idx = find_new_capacity(eenv, sg->sge);
+				group_util = group_norm_usage(eenv, sg);
 				sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
 										>> SCHED_CAPACITY_SHIFT;
 				sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
@@ -4740,7 +4781,7 @@ static unsigned int sched_group_energy(struct sched_group *sg_top)
 				if (!sd->child)
 					cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
 
-				if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(sg_top)))
+				if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(eenv->sg_top)))
 					goto next_cpu;
 
 			} while (sg = sg->next, sg != sd->groups);
@@ -4749,6 +4790,7 @@ static unsigned int sched_group_energy(struct sched_group *sg_top)
 		continue;
 	}
 
+	eenv->energy = total_energy;
 	return total_energy;
 }
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 32/48] sched: Estimate energy impact of scheduling decisions
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (30 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 31/48] sched: Extend sched_group_energy to test load-balancing decisions Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-02-04 18:31 ` [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement Morten Rasmussen
                   ` (16 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

Add a generic energy-aware helper function, energy_diff(), that
estimates the energy impact of adding, removing, or migrating
utilization in the system.
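
As a rough user-space sketch of the idea (not the kernel implementation;
the two-group "topology", the power numbers and the group_energy() stub
below are made-up assumptions), the before/after accounting boils down
to:

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_CAPACITY_SCALE	(1 << SCHED_CAPACITY_SHIFT)

struct group {
	long util;	/* current group utilization, 0..1024 */
	long p_busy;	/* busy power at the assumed P-state */
	long p_idle;	/* idle power at the assumed C-state */
};

/* Same shape as the sg_busy_energy/sg_idle_energy sum in sched_group_energy() */
static long group_energy(const struct group *g, long delta)
{
	long util = g->util + delta;

	if (util < 0)
		util = 0;
	if (util > SCHED_CAPACITY_SCALE)
		util = SCHED_CAPACITY_SCALE;

	return (util * g->p_busy + (SCHED_CAPACITY_SCALE - util) * g->p_idle)
						>> SCHED_CAPACITY_SHIFT;
}

int main(void)
{
	struct group src = { .util = 600, .p_busy = 800, .p_idle = 20 };
	struct group dst = { .util = 100, .p_busy = 300, .p_idle = 10 };
	long usage_delta = 150;	/* utilization being migrated src -> dst */

	long before = group_energy(&src, 0) + group_energy(&dst, 0);
	long after  = group_energy(&src, -usage_delta) +
		      group_energy(&dst, usage_delta);

	/* < 0 means the migration is expected to save energy */
	printf("energy_diff = %ld\n", after - before);
	return 0;
}

For a wake-up (src_cpu == -1) only the destination side sees the delta,
and for a removal (dst_cpu == -1) only the source side, matching the
add/remove/migrate semantics described above.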

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 51 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 07c84af..b371f32 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4794,6 +4794,57 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
 	return total_energy;
 }
 
+/*
+ * energy_diff(): Estimate the energy impact of changing the utilization
+ * distribution. eenv specifies the change: utilisation amount, source, and
+ * destination cpu. Source or destination cpu may be -1 in which case the
+ * utilization is removed from or added to the system (e.g. task wake-up). If
+ * both are specified, the utilization is migrated.
+ */
+static int energy_diff(struct energy_env *eenv)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg;
+	int sd_cpu = -1, energy_before = 0, energy_after = 0;
+
+	struct energy_env eenv_before = {
+		.usage_delta	= 0,
+		.src_cpu	= eenv->src_cpu,
+		.dst_cpu	= eenv->dst_cpu,
+	};
+
+	if (eenv->src_cpu == eenv->dst_cpu)
+		return 0;
+
+	sd_cpu = (eenv->src_cpu != -1) ? eenv->src_cpu : eenv->dst_cpu;
+	sd = rcu_dereference(per_cpu(sd_ea, sd_cpu));
+
+	if (!sd)
+		return 0; /* Error */
+
+	sg = sd->groups;
+	do {
+		if (eenv->src_cpu != -1 && cpumask_test_cpu(eenv->src_cpu,
+							sched_group_cpus(sg))) {
+			eenv_before.sg_top = eenv->sg_top = sg;
+			energy_before += sched_group_energy(&eenv_before);
+			energy_after += sched_group_energy(eenv);
+
+			/* src_cpu and dst_cpu may belong to the same group */
+			continue;
+		}
+
+		if (eenv->dst_cpu != -1	&& cpumask_test_cpu(eenv->dst_cpu,
+							sched_group_cpus(sg))) {
+			eenv_before.sg_top = eenv->sg_top = sg;
+			energy_before += sched_group_energy(&eenv_before);
+			energy_after += sched_group_energy(eenv);
+		}
+	} while (sg = sg->next, sg != sd->groups);
+
+	return energy_after-energy_before;
+}
+
 static int wake_wide(struct task_struct *p)
 {
 	int factor = this_cpu_read(sd_llc_size);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (31 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 32/48] sched: Estimate energy impact of scheduling decisions Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-03-13 22:47   ` Sai Gurrappadi
                     ` (2 more replies)
  2015-02-04 18:31 ` [RFCv3 PATCH 34/48] sched: Bias new task wakeups towards higher capacity cpus Morten Rasmussen
                   ` (15 subsequent siblings)
  48 siblings, 3 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

Let available compute capacity and estimated energy impact select the
wake-up target cpu when energy-aware scheduling is enabled.
energy_aware_wake_cpu() attempts to find a group of cpus with sufficient
compute capacity to accommodate the task, and then a cpu within that
group with enough spare capacity to handle it. Preference is given to
cpus with enough spare capacity at the current OPP. Finally, the energy
impact of the new target and of the previous task cpu is compared to
select the wake-up target cpu.
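
Purely as an illustration of the spare-capacity preference (the two-cpu
group, its capacities and usages below are hypothetical, and the idle-cpu
shortcut is left out), the per-cpu selection step behaves roughly like
this:

#include <stdio.h>

struct cpu {
	int id;
	long usage;	/* current usage */
	long cap_curr;	/* capacity at the current OPP */
	long cap_orig;	/* capacity at the highest OPP */
};

int main(void)
{
	struct cpu grp[] = {
		{ .id = 0, .usage = 400, .cap_curr = 450, .cap_orig = 1024 },
		{ .id = 1, .usage = 100, .cap_curr = 450, .cap_orig = 1024 },
	};
	long task_util = 150;
	int prev_cpu = 0, target = prev_cpu;
	int i;

	for (i = 0; i < 2; i++) {
		long new_usage = grp[i].usage + task_util;

		if (new_usage > grp[i].cap_orig)
			continue;		/* does not fit at all */

		if (new_usage < grp[i].cap_curr) {
			target = grp[i].id;	/* fits at the current OPP */
			break;
		}

		/* fits, but only by raising the OPP: keep as fallback */
		if (target == prev_cpu)
			target = grp[i].id;
	}

	printf("target cpu: %d\n", target);	/* cpu 1 in this example */
	return 0;
}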

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 90 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b371f32..8713310 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5091,6 +5091,92 @@ static int select_idle_sibling(struct task_struct *p, int target)
 done:
 	return target;
 }
+
+static unsigned long group_max_capacity(struct sched_group *sg)
+{
+	int max_idx;
+
+	if (!sg->sge)
+		return 0;
+
+	max_idx = sg->sge->nr_cap_states-1;
+
+	return sg->sge->cap_states[max_idx].cap;
+}
+
+static inline unsigned long task_utilization(struct task_struct *p)
+{
+	return p->se.avg.utilization_avg_contrib;
+}
+
+static int cpu_overutilized(int cpu, struct sched_domain *sd)
+{
+	return (capacity_orig_of(cpu) * 100) <
+				(get_cpu_usage(cpu) * sd->imbalance_pct);
+}
+
+static int energy_aware_wake_cpu(struct task_struct *p)
+{
+	struct sched_domain *sd;
+	struct sched_group *sg, *sg_target;
+	int target_max_cap = SCHED_CAPACITY_SCALE;
+	int target_cpu = task_cpu(p);
+	int i;
+
+	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
+
+	if (!sd)
+		return -1;
+
+	sg = sd->groups;
+	sg_target = sg;
+	/* Find group with sufficient capacity */
+	do {
+		int sg_max_capacity = group_max_capacity(sg);
+
+		if (sg_max_capacity >= task_utilization(p) &&
+				sg_max_capacity <= target_max_cap) {
+			sg_target = sg;
+			target_max_cap = sg_max_capacity;
+		}
+	} while (sg = sg->next, sg != sd->groups);
+
+	/* Find cpu with sufficient capacity */
+	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
+		int new_usage = get_cpu_usage(i) + task_utilization(p);
+
+		if (new_usage >	capacity_orig_of(i))
+			continue;
+
+		if (new_usage <	capacity_curr_of(i)) {
+			target_cpu = i;
+			if (!cpu_rq(i)->nr_running)
+				break;
+		}
+
+		/* cpu has capacity at higher OPP, keep it as fallback */
+		if (target_cpu == task_cpu(p))
+			target_cpu = i;
+	}
+
+	if (target_cpu != task_cpu(p)) {
+		struct energy_env eenv = {
+			.usage_delta	= task_utilization(p),
+			.src_cpu	= task_cpu(p),
+			.dst_cpu	= target_cpu,
+		};
+
+		/* Not enough spare capacity on previous cpu */
+		if (cpu_overutilized(task_cpu(p), sd))
+			return target_cpu;
+
+		if (energy_diff(&eenv) >= 0)
+			return task_cpu(p);
+	}
+
+	return target_cpu;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -5138,6 +5224,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		prev_cpu = cpu;
 
 	if (sd_flag & SD_BALANCE_WAKE) {
+		if (energy_aware()) {
+			new_cpu = energy_aware_wake_cpu(p);
+			goto unlock;
+		}
 		new_cpu = select_idle_sibling(p, prev_cpu);
 		goto unlock;
 	}
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 34/48] sched: Bias new task wakeups towards higher capacity cpus
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (32 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-03-24 13:33   ` Peter Zijlstra
  2015-02-04 18:31 ` [RFCv3 PATCH 35/48] sched, cpuidle: Track cpuidle state index in the scheduler Morten Rasmussen
                   ` (14 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

Make wake-ups of new tasks (find_idlest_group) aware of any differences
in cpu compute capacity so new tasks don't get handed off to cpus with
lower capacity.
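
A minimal sketch of the added capacity check with made-up numbers (a
~430-capacity little cpu in the local group vs a 1024-capacity big cpu
in the idlest group, imbalance_pct = 125):

#include <stdio.h>

int main(void)
{
	unsigned long this_load = 200, min_load = 210;	/* local vs idlest group */
	unsigned long this_cpu_cap = 430, idlest_cpu_cap = 1024;
	unsigned long imbalance = 112;	/* 100 + (imbalance_pct - 100) / 2 */

	/* load-only check: the local group looks good enough, so stay */
	int stay_local_old = 100 * this_load < imbalance * min_load;
	/* with the capacity check the bigger idlest group wins instead */
	int stay_local_new = stay_local_old && this_cpu_cap >= idlest_cpu_cap;

	printf("stay local: old=%d new=%d\n", stay_local_old, stay_local_new);
	return 0;
}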

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 15 +++++++++++++--
 1 file changed, 13 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8713310..4251e75 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4950,6 +4950,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 {
 	struct sched_group *idlest = NULL, *group = sd->groups;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
+	unsigned long this_cpu_cap = 0, idlest_cpu_cap = 0;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
 
@@ -4957,7 +4958,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load;
+		unsigned long load, avg_load, cpu_capacity;
 		int local_group;
 		int i;
 
@@ -4971,6 +4972,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		/* Tally up the load of all CPUs in the group */
 		avg_load = 0;
+		cpu_capacity = 0;
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -4980,6 +4982,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 				load = target_load(i, load_idx);
 
 			avg_load += load;
+
+			if (cpu_capacity < capacity_of(i))
+				cpu_capacity = capacity_of(i);
 		}
 
 		/* Adjust by relative CPU capacity of the group */
@@ -4987,14 +4992,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		if (local_group) {
 			this_load = avg_load;
+			this_cpu_cap = cpu_capacity;
 		} else if (avg_load < min_load) {
 			min_load = avg_load;
 			idlest = group;
+			idlest_cpu_cap = cpu_capacity;
 		}
 	} while (group = group->next, group != sd->groups);
 
-	if (!idlest || 100*this_load < imbalance*min_load)
+	if (!idlest)
+		return NULL;
+
+	if (100*this_load < imbalance*min_load && this_cpu_cap >= idlest_cpu_cap)
 		return NULL;
+
 	return idlest;
 }
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 35/48] sched, cpuidle: Track cpuidle state index in the scheduler
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (33 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 34/48] sched: Bias new task wakeups towards higher capacity cpus Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-02-04 18:31 ` [RFCv3 PATCH 36/48] sched: Count number of shallower idle-states in struct sched_group_energy Morten Rasmussen
                   ` (13 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

The idle-state of each cpu is currently pointed to by rq->idle_state but
there isn't any information in struct cpuidle_state that can be used to
look up the idle-state energy model data stored in struct
sched_group_energy. For this purpose it is necessary to store the idle
state index as well. Ideally, the idle-state data should be unified.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/idle.c  |  2 ++
 kernel/sched/sched.h | 21 +++++++++++++++++++++
 2 files changed, 23 insertions(+)

diff --git a/kernel/sched/idle.c b/kernel/sched/idle.c
index c47fce7..e46c85c 100644
--- a/kernel/sched/idle.c
+++ b/kernel/sched/idle.c
@@ -149,6 +149,7 @@ static void cpuidle_idle_call(void)
 
 	/* Take note of the planned idle state. */
 	idle_set_state(this_rq(), &drv->states[next_state]);
+	idle_set_state_idx(this_rq(), next_state);
 
 	/*
 	 * Enter the idle state previously returned by the governor decision.
@@ -159,6 +160,7 @@ static void cpuidle_idle_call(void)
 
 	/* The cpu is no longer idle or about to enter idle. */
 	idle_set_state(this_rq(), NULL);
+	idle_set_state_idx(this_rq(), -1);
 
 	if (broadcast)
 		clockevents_notify(CLOCK_EVT_NOTIFY_BROADCAST_EXIT, &dev->cpu);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index dedf0ec..107f478 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -678,6 +678,7 @@ struct rq {
 #ifdef CONFIG_CPU_IDLE
 	/* Must be inspected within a rcu lock section */
 	struct cpuidle_state *idle_state;
+	int idle_state_idx;
 #endif
 };
 
@@ -1274,6 +1275,17 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
 	WARN_ON(!rcu_read_lock_held());
 	return rq->idle_state;
 }
+
+static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
+{
+	rq->idle_state_idx = idle_state_idx;
+}
+
+static inline int idle_get_state_idx(struct rq *rq)
+{
+	WARN_ON(!rcu_read_lock_held());
+	return rq->idle_state_idx;
+}
 #else
 static inline void idle_set_state(struct rq *rq,
 				  struct cpuidle_state *idle_state)
@@ -1284,6 +1296,15 @@ static inline struct cpuidle_state *idle_get_state(struct rq *rq)
 {
 	return NULL;
 }
+
+static inline void idle_set_state_idx(struct rq *rq, int idle_state_idx)
+{
+}
+
+static inline int idle_get_state_idx(struct rq *rq)
+{
+	return -1;
+}
 #endif
 
 extern void sysrq_sched_debug_show(void);
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 36/48] sched: Count number of shallower idle-states in struct sched_group_energy
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (34 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 35/48] sched, cpuidle: Track cpuidle state index in the scheduler Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-03-24 13:14   ` Peter Zijlstra
  2015-02-04 18:31 ` [RFCv3 PATCH 37/48] sched: Determine the current sched_group idle-state Morten Rasmussen
                   ` (12 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

cpuidle associates all idle-states with each cpu, while the energy model
associates them with the sched_group covering the cpus coordinating
entry to the idle-state. To get the idle-state power consumption it is
therefore necessary to translate from the cpuidle idle-state index to
the energy model index. For this purpose it is helpful to know how many
idle-states are listed in lower level sched_groups (in struct
sched_group_energy).

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h |  1 +
 kernel/sched/core.c   | 12 ++++++++++++
 2 files changed, 13 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 78b6eb7..f984b4e 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -958,6 +958,7 @@ struct sched_group_energy {
 	atomic_t ref;
 	unsigned int nr_idle_states;	/* number of idle states */
 	struct idle_state *idle_states;	/* ptr to idle state array */
+	unsigned int idle_states_below; /* Number idle states in lower groups */
 	unsigned int nr_cap_states;	/* number of capacity states */
 	struct capacity_state *cap_states; /* ptr to capacity state array */
 };
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index e47febf..4f52c2e 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6075,6 +6075,7 @@ static void init_sched_energy(int cpu, struct sched_domain *sd,
 	struct sched_group_energy *energy = sg->sge;
 	sched_domain_energy_f fn = tl->energy;
 	struct cpumask *mask = sched_group_cpus(sg);
+	int idle_states_below = 0;
 
 	if (!fn || !fn(cpu))
 		return;
@@ -6082,9 +6083,20 @@ static void init_sched_energy(int cpu, struct sched_domain *sd,
 	if (cpumask_weight(mask) > 1)
 		check_sched_energy_data(cpu, fn, mask);
 
+	/* Figure out the number of true cpuidle states below current group */
+	sd = sd->child;
+	for_each_lower_domain(sd) {
+		idle_states_below += sd->groups->sge->nr_idle_states;
+
+		/* Disregard non-cpuidle 'active' idle states */
+		if (sd->child)
+			idle_states_below--;
+	}
+
 	energy->nr_idle_states = fn(cpu)->nr_idle_states;
 	memcpy(energy->idle_states, fn(cpu)->idle_states,
 	       energy->nr_idle_states*sizeof(struct idle_state));
+	energy->idle_states_below = idle_states_below;
 	energy->nr_cap_states = fn(cpu)->nr_cap_states;
 	memcpy(energy->cap_states, fn(cpu)->cap_states,
 	       energy->nr_cap_states*sizeof(struct capacity_state));
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 37/48] sched: Determine the current sched_group idle-state
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (35 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 36/48] sched: Count number of shallower idle-states in struct sched_group_energy Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
       [not found]   ` <OF1FDC99CD.22435E74-ON48257E37.001BA739-48257E37.001CA5ED@zte.com.cn>
  2015-02-04 18:31 ` [RFCv3 PATCH 38/48] sched: Infrastructure to query if load balancing is energy-aware Morten Rasmussen
                   ` (11 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

To estimate the energy consumption of a sched_group in
sched_group_energy() it is necessary to know which idle-state the group
is in when it is idle. For now, it is assumed that this is the current
idle-state (though it might be wrong). Based on the individual cpu
idle-states, group_idle_state() finds the group idle-state.
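
A user-space sketch of the index arithmetic; the TC2-like setup below
(per-cpu WFI as cpuidle index 0, cluster-off as index 1, cluster-level
idle_states_below = 1 and nr_idle_states = 2) is assumed for
illustration only:

#include <stdio.h>

static int group_idle_state(const int *cpuidle_idx, int nr_cpus,
			    int idle_states_below, int nr_idle_states)
{
	int shallowest = idle_states_below + nr_idle_states;
	int i;

	for (i = 0; i < nr_cpus; i++) {
		int group_idx = cpuidle_idx[i] - idle_states_below + 1;

		if (group_idx <= 0)	/* some cpu is shallower than this level */
			return 0;
		if (group_idx < shallowest)
			shallowest = group_idx;
	}

	return shallowest >= nr_idle_states ? nr_idle_states - 1 : shallowest;
}

int main(void)
{
	int all_cluster_off[] = { 1, 1 };
	int one_cpu_in_wfi[]  = { 1, 0 };

	printf("%d\n", group_idle_state(all_cluster_off, 2, 1, 2));	/* 1 */
	printf("%d\n", group_idle_state(one_cpu_in_wfi, 2, 1, 2));	/* 0 */
	return 0;
}

As soon as one cpu sits in a state shallower than anything modelled at
this level, the group is treated as being in its shallowest state
(index 0).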

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 32 ++++++++++++++++++++++++++++----
 1 file changed, 28 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4251e75..0e95eb5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4719,6 +4719,28 @@ static int find_new_capacity(struct energy_env *eenv,
 	return idx;
 }
 
+static int group_idle_state(struct sched_group *sg)
+{
+	struct sched_group_energy *sge = sg->sge;
+	int shallowest_state = sge->idle_states_below + sge->nr_idle_states;
+	int i;
+
+	for_each_cpu(i, sched_group_cpus(sg)) {
+		int cpuidle_idx = idle_get_state_idx(cpu_rq(i));
+		int group_idx = cpuidle_idx - sge->idle_states_below + 1;
+
+		if (group_idx <= 0)
+			return 0;
+
+		shallowest_state = min(shallowest_state, group_idx);
+	}
+
+	if (shallowest_state >= sge->nr_idle_states)
+		return sge->nr_idle_states - 1;
+
+	return shallowest_state;
+}
+
 /*
  * sched_group_energy(): Returns absolute energy consumption of cpus belonging
  * to the sched_group including shared resources shared only by members of the
@@ -4762,7 +4784,7 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
 			do {
 				unsigned group_util;
 				int sg_busy_energy, sg_idle_energy;
-				int cap_idx;
+				int cap_idx, idle_idx;
 
 				if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
 					eenv->sg_cap = sg_shared_cap;
@@ -4770,11 +4792,13 @@ static unsigned int sched_group_energy(struct energy_env *eenv)
 					eenv->sg_cap = sg;
 
 				cap_idx = find_new_capacity(eenv, sg->sge);
+				idle_idx = group_idle_state(sg);
 				group_util = group_norm_usage(eenv, sg);
 				sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
-										>> SCHED_CAPACITY_SHIFT;
-				sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
-										>> SCHED_CAPACITY_SHIFT;
+								>> SCHED_CAPACITY_SHIFT;
+				sg_idle_energy = ((SCHED_LOAD_SCALE-group_util)
+								* sg->sge->idle_states[idle_idx].power)
+								>> SCHED_CAPACITY_SHIFT;
 
 				total_energy += sg_busy_energy + sg_idle_energy;
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 38/48] sched: Infrastructure to query if load balancing is energy-aware
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (36 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 37/48] sched: Determine the current sched_group idle-state Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-03-24 13:41   ` Peter Zijlstra
  2015-03-24 13:56   ` Peter Zijlstra
  2015-02-04 18:31 ` [RFCv3 PATCH 39/48] sched: Introduce energy awareness into update_sg_lb_stats Morten Rasmussen
                   ` (10 subsequent siblings)
  48 siblings, 2 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Energy-aware load balancing should only happen if the ENERGY_AWARE feature
is turned on and the sched domain on which the load balancing is
performed contains energy data.
During a load balance action we also need to be able to query whether we
should continue to balance energy-aware or whether we have reached the
tipping point, which forces us to fall back to the conventional load
balancing functionality.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0e95eb5..45c784f 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5836,6 +5836,7 @@ struct lb_env {
 
 	enum fbq_type		fbq_type;
 	struct list_head	tasks;
+	bool                    use_ea;
 };
 
 /*
@@ -7348,6 +7349,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
 		.cpus		= cpus,
 		.fbq_type	= all,
 		.tasks		= LIST_HEAD_INIT(env.tasks),
+		.use_ea		= (energy_aware() && sd->groups->sge) ? true : false,
 	};
 
 	/*
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 39/48] sched: Introduce energy awareness into update_sg_lb_stats
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (37 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 38/48] sched: Infrastructure to query if load balancing is energy-aware Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-02-04 18:31 ` [RFCv3 PATCH 40/48] sched: Introduce energy awareness into update_sd_lb_stats Morten Rasmussen
                   ` (9 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

To be able to identify the least efficient (costliest) sched group,
introduce group_eff, the efficiency of the sched group, into sg_lb_stats.
The group efficiency is defined as the ratio between the group usage and
the group energy consumption.
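
A hedged arithmetic example (the usage and energy values below are
invented); the ratio is scaled by 1024 to stay in integer arithmetic:

#include <stdio.h>

int main(void)
{
	unsigned long group_usage  = 300;	/* summed cpu usage of the group */
	unsigned long group_energy = 120;	/* sched_group_energy() result */
	unsigned long group_eff = 1024 * group_usage / group_energy;

	/* higher group_eff = more work done per unit of energy */
	printf("group_eff = %lu\n", group_eff);	/* 2560 */
	return 0;
}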

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 16 ++++++++++++++++
 1 file changed, 16 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 45c784f..bfa335e 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6345,6 +6345,7 @@ struct sg_lb_stats {
 	unsigned long load_per_task;
 	unsigned long group_capacity;
 	unsigned long group_usage; /* Total usage of the group */
+	unsigned long group_eff;
 	unsigned int sum_nr_running; /* Nr tasks running in the group */
 	unsigned int idle_cpus;
 	unsigned int group_weight;
@@ -6715,6 +6716,21 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 	sgs->group_no_capacity = group_is_overloaded(env, sgs);
 	sgs->group_type = group_classify(env, group, sgs);
+
+	if (env->use_ea) {
+		struct energy_env eenv = {
+			.sg_top         = group,
+			.usage_delta    = 0,
+			.src_cpu        = -1,
+			.dst_cpu        = -1,
+		};
+		unsigned long group_energy = sched_group_energy(&eenv);
+
+		if (group_energy)
+			sgs->group_eff = 1024*sgs->group_usage/group_energy;
+		else
+			sgs->group_eff = ULONG_MAX;
+	}
 }
 
 /**
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 40/48] sched: Introduce energy awareness into update_sd_lb_stats
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (38 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 39/48] sched: Introduce energy awareness into update_sg_lb_stats Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-02-04 18:31 ` [RFCv3 PATCH 41/48] sched: Introduce energy awareness into find_busiest_group Morten Rasmussen
                   ` (8 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Energy-aware load balancing has to work alongside the conventional load
based functionality. This includes the tipping point feature, i.e. being
able to fall back from energy aware to the conventional load based
functionality during an ongoing load balancing action.
That is why this patch introduces an additional reference to hold the
least efficient sched group (costliest) as well as its statistics in the
form of an extra sg_lb_stats structure (costliest_stat).
The function update_sd_pick_costliest is used to pick the least
efficient sched group, in parallel to the existing update_sd_pick_busiest.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 21 +++++++++++++++++++++
 1 file changed, 21 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index bfa335e..36f3c77 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6363,12 +6363,14 @@ struct sg_lb_stats {
  */
 struct sd_lb_stats {
 	struct sched_group *busiest;	/* Busiest group in this sd */
+	struct sched_group *costliest;	/* Least efficient group in this sd */
 	struct sched_group *local;	/* Local group in this sd */
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
 
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
+	struct sg_lb_stats costliest_stat;/* Statistics of the least efficient group */
 	struct sg_lb_stats local_stat;	/* Statistics of the local group */
 };
 
@@ -6390,6 +6392,9 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 			.sum_nr_running = 0,
 			.group_type = group_other,
 		},
+		.costliest_stat = {
+			.group_eff = ULONG_MAX,
+		},
 	};
 }
 
@@ -6782,6 +6787,17 @@ static bool update_sd_pick_busiest(struct lb_env *env,
 	return false;
 }
 
+static noinline bool update_sd_pick_costliest(struct sd_lb_stats *sds,
+					      struct sg_lb_stats *sgs)
+{
+	struct sg_lb_stats *costliest = &sds->costliest_stat;
+
+	if (sgs->group_eff < costliest->group_eff)
+		return true;
+
+	return false;
+}
+
 #ifdef CONFIG_NUMA_BALANCING
 static inline enum fbq_type fbq_classify_group(struct sg_lb_stats *sgs)
 {
@@ -6872,6 +6888,11 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 			sds->busiest_stat = *sgs;
 		}
 
+		if (env->use_ea && update_sd_pick_costliest(sds, sgs)) {
+			sds->costliest = sg;
+			sds->costliest_stat = *sgs;
+		}
+
 next_group:
 		/* Now, start updating sd_lb_stats */
 		sds->total_load += sgs->group_load;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 41/48] sched: Introduce energy awareness into find_busiest_group
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (39 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 40/48] sched: Introduce energy awareness into update_sd_lb_stats Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-02-04 18:31 ` [RFCv3 PATCH 42/48] sched: Introduce energy awareness into find_busiest_queue Morten Rasmussen
                   ` (7 subsequent siblings)
  48 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

If the current load balancing operation is still in energy-aware mode
after the sched domain statistics have been gathered, just return the
least efficient (costliest) reference. This implies that the system is
considered balanced when no least efficient sched group was found.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 3 +++
 1 file changed, 3 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 36f3c77..199ffff 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7133,6 +7133,9 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
+	if (env->use_ea)
+		return sds.costliest;
+
 	/* ASYM feature bypasses nice load balance check */
 	if ((env->idle == CPU_IDLE || env->idle == CPU_NEWLY_IDLE) &&
 	    check_asym_packing(env, &sds))
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 42/48] sched: Introduce energy awareness into find_busiest_queue
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (40 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 41/48] sched: Introduce energy awareness into find_busiest_group Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-03-24 15:21   ` Peter Zijlstra
  2015-02-04 18:31 ` [RFCv3 PATCH 43/48] sched: Introduce energy awareness into detach_tasks Morten Rasmussen
                   ` (6 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

If, after the sched domain statistics have been gathered, the current
load balancing operation is still in energy-aware mode and a least
efficient sched group has been found, detect the least efficient cpu by
comparing the cpu efficiency (the ratio between cpu usage and cpu energy
consumption) among all cpus of the least efficient sched group.
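
A small standalone example of the division-free comparison used to track
the minimum (cpu usage and energy values are invented): u_i/e_i < u_j/e_j
is equivalent to u_i*e_j < u_j*e_i for non-negative values, so no
division is needed:

#include <stdio.h>

int main(void)
{
	unsigned long usage[]  = { 500, 100, 300 };
	unsigned long energy[] = { 200,  90, 250 };
	unsigned long min_u = 1024, min_e = 1;	/* initial values as in the patch */
	int costliest = -1;
	int i;

	for (i = 0; i < 3; i++) {
		if (usage[i] * min_e < min_u * energy[i]) {
			min_u = usage[i];
			min_e = energy[i];
			costliest = i;
		}
	}

	/* cpu 1: 100/90 is the smallest usage/energy ratio, i.e. least efficient */
	printf("least efficient cpu: %d\n", costliest);
	return 0;
}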

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 31 +++++++++++++++++++++++++++++++
 1 file changed, 31 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 199ffff..48cd5b5 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7216,6 +7216,37 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 	unsigned long busiest_load = 0, busiest_capacity = 1;
 	int i;
 
+	if (env->use_ea) {
+		struct rq *costliest = NULL;
+		unsigned long costliest_usage = 1024, costliest_energy = 1;
+
+		for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
+			unsigned long usage = get_cpu_usage(i);
+			struct rq *rq = cpu_rq(i);
+			struct sched_domain *sd = rcu_dereference(rq->sd);
+			struct energy_env eenv = {
+				.sg_top = sd->groups,
+				.usage_delta    = 0,
+				.src_cpu        = -1,
+				.dst_cpu        = -1,
+			};
+			unsigned long energy = sched_group_energy(&eenv);
+
+			/*
+			 * We're looking for the minimal cpu efficiency
+			 * min(u_i / e_i), crosswise multiplication leads to
+			 * u_i * e_j < u_j * e_i with j as previous minimum.
+			 */
+			if (usage * costliest_energy < costliest_usage * energy) {
+				costliest_usage = usage;
+				costliest_energy = energy;
+				costliest = rq;
+			}
+		}
+
+		return costliest;
+	}
+
 	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
 		unsigned long capacity, wl;
 		enum fbq_type rt;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 43/48] sched: Introduce energy awareness into detach_tasks
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (41 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 42/48] sched: Introduce energy awareness into find_busiest_queue Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-03-24 15:25   ` Peter Zijlstra
  2015-03-25 23:50   ` Sai Gurrappadi
  2015-02-04 18:31 ` [RFCv3 PATCH 44/48] sched: Tipping point from energy-aware to conventional load balancing Morten Rasmussen
                   ` (5 subsequent siblings)
  48 siblings, 2 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Energy-aware load balancing does not rely on env->imbalance; instead it
evaluates, for each task on the src rq, the system-wide energy
difference of potentially moving it to the dst rq. If this energy
difference is less than zero the task is actually moved from the src to
the dst rq.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 21 ++++++++++++++++++---
 1 file changed, 18 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 48cd5b5..6b79603 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6095,12 +6095,12 @@ static int detach_tasks(struct lb_env *env)
 {
 	struct list_head *tasks = &env->src_rq->cfs_tasks;
 	struct task_struct *p;
-	unsigned long load;
+	unsigned long load = 0;
 	int detached = 0;
 
 	lockdep_assert_held(&env->src_rq->lock);
 
-	if (env->imbalance <= 0)
+	if (!env->use_ea && env->imbalance <= 0)
 		return 0;
 
 	while (!list_empty(tasks)) {
@@ -6121,6 +6121,20 @@ static int detach_tasks(struct lb_env *env)
 		if (!can_migrate_task(p, env))
 			goto next;
 
+		if (env->use_ea) {
+			struct energy_env eenv = {
+				.src_cpu = env->src_cpu,
+				.dst_cpu = env->dst_cpu,
+				.usage_delta = task_utilization(p),
+			};
+			int e_diff = energy_diff(&eenv);
+
+			if (e_diff >= 0)
+				goto next;
+
+			goto detach;
+		}
+
 		load = task_h_load(p);
 
 		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
@@ -6129,6 +6143,7 @@ static int detach_tasks(struct lb_env *env)
 		if ((load / 2) > env->imbalance)
 			goto next;
 
+detach:
 		detach_task(p, env);
 		list_add(&p->se.group_node, &env->tasks);
 
@@ -6149,7 +6164,7 @@ static int detach_tasks(struct lb_env *env)
 		 * We only want to steal up to the prescribed amount of
 		 * weighted load.
 		 */
-		if (env->imbalance <= 0)
+		if (!env->use_ea && env->imbalance <= 0)
 			break;
 
 		continue;
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 44/48] sched: Tipping point from energy-aware to conventional load balancing
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (42 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 43/48] sched: Introduce energy awareness into detach_tasks Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-03-24 15:26   ` Peter Zijlstra
  2015-02-04 18:31 ` [RFCv3 PATCH 45/48] sched: Skip cpu as lb src which has one task and capacity gte the dst cpu Morten Rasmussen
                   ` (4 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Energy-aware load balancing is based on cpu usage, so the upper bound of
its operational range is a fully utilized cpu. Above this tipping point
it makes more sense to use weighted_cpuload to preserve smp_nice.
This patch implements the tipping point detection in update_sg_lb_stats:
if one cpu is over-utilized, the current energy-aware load balance
operation falls back to the conventional weighted load based one.
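
A hedged numerical example of the over-utilization test (capacity and
usage values are invented, imbalance_pct = 125): a cpu counts as
over-utilized once usage * imbalance_pct exceeds capacity_orig * 100:

#include <stdio.h>
#include <stdbool.h>

static bool cpu_overutilized(unsigned long capacity_orig, unsigned long usage,
			     unsigned long imbalance_pct)
{
	return capacity_orig * 100 < usage * imbalance_pct;
}

int main(void)
{
	/* a little cpu with original capacity 430 */
	printf("%d\n", cpu_overutilized(430, 300, 125));	/* 0: 43000 >= 37500 */
	printf("%d\n", cpu_overutilized(430, 400, 125));	/* 1: 43000 <  50000 */
	return 0;
}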

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6b79603..4849bad 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6723,6 +6723,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		sgs->sum_weighted_load += weighted_cpuload(i);
 		if (idle_cpu(i))
 			sgs->idle_cpus++;
+
+		/* If cpu is over-utilized, bail out of ea */
+		if (env->use_ea && cpu_overutilized(i, env->sd))
+			env->use_ea = false;
 	}
 
 	/* Adjust by relative CPU capacity of the group */
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 45/48] sched: Skip cpu as lb src which has one task and capacity gte the dst cpu
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (43 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 44/48] sched: Tipping point from energy-aware to conventional load balancing Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-03-24 15:27   ` Peter Zijlstra
  2015-02-04 18:31 ` [RFCv3 PATCH 46/48] sched: Turn off fast idling of cpus on a partially loaded system Morten Rasmussen
                   ` (3 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

Skip a cpu as the potential src (costliest) cpu in case it has only one
task running and its original capacity is greater than or equal to the
original capacity of the dst cpu.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4849bad..b6e2e92 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7251,6 +7251,10 @@ static struct rq *find_busiest_queue(struct lb_env *env,
 			};
 			unsigned long energy = sched_group_energy(&eenv);
 
+			if (rq->nr_running == 1 && capacity_orig_of(i) >=
+					capacity_orig_of(env->dst_cpu))
+				continue;
+
 			/*
 			 * We're looking for the minimal cpu efficiency
 			 * min(u_i / e_i), crosswise multiplication leads to
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 46/48] sched: Turn off fast idling of cpus on a partially loaded system
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (44 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 45/48] sched: Skip cpu as lb src which has one task and capacity gte the dst cpu Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-03-24 16:01   ` Peter Zijlstra
  2015-02-04 18:31 ` [RFCv3 PATCH 47/48] sched: Enable active migration for cpus of lower capacity Morten Rasmussen
                   ` (2 subsequent siblings)
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

We do not want to miss out on the ability to do energy-aware idle load
balancing if the system is only partially loaded since the operational
range of energy-aware scheduling corresponds to a partially loaded
system. We might want to pull a single remaining task from a potential
src cpu towards an idle destination cpu if the energy model tells us
this is worth doing to save energy.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b6e2e92..92fd1d8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7734,7 +7734,7 @@ static int idle_balance(struct rq *this_rq)
 	this_rq->idle_stamp = rq_clock(this_rq);
 
 	if (this_rq->avg_idle < sysctl_sched_migration_cost ||
-	    !this_rq->rd->overload) {
+	    (!energy_aware() && !this_rq->rd->overload)) {
 		rcu_read_lock();
 		sd = rcu_dereference_check_sched_domain(this_rq->sd);
 		if (sd)
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 47/48] sched: Enable active migration for cpus of lower capacity
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (45 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 46/48] sched: Turn off fast idling of cpus on a partially loaded system Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-03-24 16:02   ` Peter Zijlstra
  2015-02-04 18:31 ` [RFCv3 PATCH 48/48] sched: Disable energy-unfriendly nohz kicks Morten Rasmussen
  2015-04-02 12:43 ` [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Vincent Guittot
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

Add an extra criterion to need_active_balance() to kick off active load
balance if the source cpu is over-utilized and has lower capacity than
the destination cpu.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 92fd1d8..1c248f8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7379,6 +7379,13 @@ static int need_active_balance(struct lb_env *env)
 			return 1;
 	}
 
+	if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
+				env->src_rq->cfs.h_nr_running == 1 &&
+				cpu_overutilized(env->src_cpu, env->sd) &&
+				!cpu_overutilized(env->dst_cpu, env->sd)) {
+			return 1;
+	}
+
 	return unlikely(sd->nr_balance_failed > sd->cache_nice_tries+2);
 }
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* [RFCv3 PATCH 48/48] sched: Disable energy-unfriendly nohz kicks
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (46 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 47/48] sched: Enable active migration for cpus of lower capacity Morten Rasmussen
@ 2015-02-04 18:31 ` Morten Rasmussen
  2015-02-20 19:26   ` Dietmar Eggemann
  2015-04-02 12:43 ` [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Vincent Guittot
  48 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-02-04 18:31 UTC (permalink / raw)
  To: peterz, mingo
  Cc: vincent.guittot, dietmar.eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, juri.lelli, linux-kernel

With energy-aware scheduling enabled, nohz_kick_needed() generates many
nohz idle-balance kicks which lead to nothing when multiple tasks get
packed on a single cpu to save energy. This causes unnecessary wake-ups
and hence wastes energy. Make these conditions depend on !energy_aware()
for now, until the energy-aware nohz story gets sorted out.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c | 7 +++++--
 1 file changed, 5 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 1c248f8..cfe65ae 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8195,6 +8195,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
 	clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
 }
 
+static int cpu_overutilized(int cpu, struct sched_domain *sd);
+
 /*
  * Current heuristic for kicking the idle load balancer in the presence
  * of an idle cpu in the system.
@@ -8234,12 +8236,13 @@ static inline bool nohz_kick_needed(struct rq *rq)
 	if (time_before(now, nohz.next_balance))
 		return false;
 
-	if (rq->nr_running >= 2)
+	sd = rcu_dereference(rq->sd);
+	if (rq->nr_running >= 2 && (!energy_aware() || cpu_overutilized(cpu, sd)))
 		return true;
 
 	rcu_read_lock();
 	sd = rcu_dereference(per_cpu(sd_busy, cpu));
-	if (sd) {
+	if (sd && !energy_aware()) {
 		sgc = sd->groups->sgc;
 		nr_busy = atomic_read(&sgc->nr_busy_cpus);
 
-- 
1.9.1


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 01/48] sched: add utilization_avg_contrib
  2015-02-04 18:30 ` [RFCv3 PATCH 01/48] sched: add utilization_avg_contrib Morten Rasmussen
@ 2015-02-11  8:50   ` Preeti U Murthy
  2015-02-12  1:07     ` Vincent Guittot
  0 siblings, 1 reply; 124+ messages in thread
From: Preeti U Murthy @ 2015-02-11  8:50 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo, vincent.guittot
  Cc: dietmar.eggemann, yuyang.du, mturquette, nico, rjw, juri.lelli,
	linux-kernel, Paul Turner, Ben Segall

On 02/05/2015 12:00 AM, Morten Rasmussen wrote:
> From: Vincent Guittot <vincent.guittot@linaro.org>
> 
> Add new statistics which reflect the average time a task is running on the CPU
> and the sum of these running time of the tasks on a runqueue. The latter is
> named utilization_load_avg.
> 
> This patch is based on the usage metric that was proposed in the 1st
> versions of the per-entity load tracking patchset by Paul Turner
> <pjt@google.com> but that has be removed afterwards. This version differs from
> the original one in the sense that it's not linked to task_group.
> 
> The rq's utilization_load_avg will be used to check if a rq is overloaded or
> not instead of trying to compute how many tasks a group of CPUs can handle.
> 
> Rename runnable_avg_period into avg_period as it is now used with both
> runnable_avg_sum and running_avg_sum
> 
> Add some descriptions of the variables to explain their differences
> 
> cc: Paul Turner <pjt@google.com>
> cc: Ben Segall <bsegall@google.com>
> 
> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  include/linux/sched.h | 21 ++++++++++++---
>  kernel/sched/debug.c  | 10 ++++---
>  kernel/sched/fair.c   | 74 ++++++++++++++++++++++++++++++++++++++++-----------
>  kernel/sched/sched.h  |  8 +++++-
>  4 files changed, 89 insertions(+), 24 deletions(-)

> +static inline void __update_task_entity_utilization(struct sched_entity *se)
> +{
> +	u32 contrib;
> +
> +	/* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
> +	contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
> +	contrib /= (se->avg.avg_period + 1);
> +	se->avg.utilization_avg_contrib = scale_load(contrib);
> +}
> +
> +static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
> +{
> +	long old_contrib = se->avg.utilization_avg_contrib;
> +
> +	if (entity_is_task(se))
> +		__update_task_entity_utilization(se);
> +
> +	return se->avg.utilization_avg_contrib - old_contrib;

When the entity is not a task, shouldn't the utilization_avg_contrib be
updated like this :

se->avg.utilization_avg_contrib = group_cfs_rq(se)->utilization_load_avg
? and then the delta with old_contrib be passed ?

Or is this being updated someplace that I have missed ?

Regards
Preeti U Murthy


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 01/48] sched: add utilization_avg_contrib
  2015-02-11  8:50   ` Preeti U Murthy
@ 2015-02-12  1:07     ` Vincent Guittot
  0 siblings, 0 replies; 124+ messages in thread
From: Vincent Guittot @ 2015-02-12  1:07 UTC (permalink / raw)
  To: Preeti U Murthy
  Cc: Morten Rasmussen, Peter Zijlstra, mingo, Dietmar Eggemann,
	Yuyang Du, Michael Turquette, Nicolas Pitre, rjw, Juri Lelli,
	linux-kernel, Paul Turner, Ben Segall

On 11 February 2015 at 09:50, Preeti U Murthy <preeti@linux.vnet.ibm.com> wrote:
> On 02/05/2015 12:00 AM, Morten Rasmussen wrote:
>> From: Vincent Guittot <vincent.guittot@linaro.org>
>>
>> Add new statistics which reflect the average time a task is running on the CPU
>> and the sum of these running time of the tasks on a runqueue. The latter is
>> named utilization_load_avg.
>>
>> This patch is based on the usage metric that was proposed in the 1st
>> versions of the per-entity load tracking patchset by Paul Turner
>> <pjt@google.com> but that has be removed afterwards. This version differs from
>> the original one in the sense that it's not linked to task_group.
>>
>> The rq's utilization_load_avg will be used to check if a rq is overloaded or
>> not instead of trying to compute how many tasks a group of CPUs can handle.
>>
>> Rename runnable_avg_period into avg_period as it is now used with both
>> runnable_avg_sum and running_avg_sum
>>
>> Add some descriptions of the variables to explain their differences
>>
>> cc: Paul Turner <pjt@google.com>
>> cc: Ben Segall <bsegall@google.com>
>>
>> Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
>> Acked-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> ---
>>  include/linux/sched.h | 21 ++++++++++++---
>>  kernel/sched/debug.c  | 10 ++++---
>>  kernel/sched/fair.c   | 74 ++++++++++++++++++++++++++++++++++++++++-----------
>>  kernel/sched/sched.h  |  8 +++++-
>>  4 files changed, 89 insertions(+), 24 deletions(-)
>
>> +static inline void __update_task_entity_utilization(struct sched_entity *se)
>> +{
>> +     u32 contrib;
>> +
>> +     /* avoid overflowing a 32-bit type w/ SCHED_LOAD_SCALE */
>> +     contrib = se->avg.running_avg_sum * scale_load_down(SCHED_LOAD_SCALE);
>> +     contrib /= (se->avg.avg_period + 1);
>> +     se->avg.utilization_avg_contrib = scale_load(contrib);
>> +}
>> +
>> +static long __update_entity_utilization_avg_contrib(struct sched_entity *se)
>> +{
>> +     long old_contrib = se->avg.utilization_avg_contrib;
>> +
>> +     if (entity_is_task(se))
>> +             __update_task_entity_utilization(se);
>> +
>> +     return se->avg.utilization_avg_contrib - old_contrib;
>
> When the entity is not a task, shouldn't the utilization_avg_contrib be
> updated like this :
>
> se->avg.utilization_avg_contrib = group_cfs_rq(se)->utilization_load_avg
> ? and then the delta with old_contrib be passed ?
>
> Or is this being updated someplace that I have missed ?

Patch 02 handles the contribution of entities which are not tasks.

Regards,
Vincent

>
> Regards
> Preeti U Murthy
>

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 48/48] sched: Disable energy-unfriendly nohz kicks
  2015-02-04 18:31 ` [RFCv3 PATCH 48/48] sched: Disable energy-unfriendly nohz kicks Morten Rasmussen
@ 2015-02-20 19:26   ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-02-20 19:26 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: vincent.guittot, yuyang.du, preeti, mturquette, nico, rjw,
	Juri Lelli, linux-kernel

Hi Morten,

On 04/02/15 18:31, Morten Rasmussen wrote:
> With energy-aware scheduling enabled nohz_kick_needed() generates many
> nohz idle-balance kicks which lead to nothing when multiple tasks get
> packed on a single cpu to save energy. This causes unnecessary wake-ups
> and hence wastes energy. Make these conditions depend on !energy_aware()
> for now until the energy-aware nohz story gets sorted out.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 7 +++++--
>  1 file changed, 5 insertions(+), 2 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 1c248f8..cfe65ae 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -8195,6 +8195,8 @@ static void nohz_idle_balance(struct rq *this_rq, enum cpu_idle_type idle)
>  	clear_bit(NOHZ_BALANCE_KICK, nohz_flags(this_cpu));
>  }
>  
> +static int cpu_overutilized(int cpu, struct sched_domain *sd);
> +
>  /*
>   * Current heuristic for kicking the idle load balancer in the presence
>   * of an idle cpu in the system.
> @@ -8234,12 +8236,13 @@ static inline bool nohz_kick_needed(struct rq *rq)
>  	if (time_before(now, nohz.next_balance))
>  		return false;
>  
> -	if (rq->nr_running >= 2)
> +	sd = rcu_dereference(rq->sd);
> +	if (rq->nr_running >= 2 && (!energy_aware() || cpu_overutilized(cpu, sd)))
>  		return true;

CONFIG_PROVE_RCU checking revealed this one:

[    3.814454] ===============================
[    3.826989] [ INFO: suspicious RCU usage. ]
[    3.839526] 3.19.0-rc7+ #10 Not tainted
[    3.851018] -------------------------------
[    3.863554] kernel/sched/fair.c:8239 suspicious rcu_dereference_check() usage!
[    3.885216]
[    3.885216] other info that might help us debug this:
[    3.885216]
[    3.909236]
[    3.909236] rcu_scheduler_active = 1, debug_locks = 1
[    3.928817] no locks held by kthreadd/437.

The RCU read-side critical section has to be extended to incorporate
this sd = rcu_dereference(rq->sd):

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cfe65aec3237..145360ee6e4a 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8236,11 +8236,13 @@ static inline bool nohz_kick_needed(struct rq *rq)
        if (time_before(now, nohz.next_balance))
                return false;

+       rcu_read_lock();
        sd = rcu_dereference(rq->sd);
-       if (rq->nr_running >= 2 && (!energy_aware() || cpu_overutilized(cpu, sd)))
-               return true;
+       if (rq->nr_running >= 2 && (!energy_aware() || cpu_overutilized(cpu, sd))) {
+               kick = true;
+               goto unlock;
+       }

-       rcu_read_lock();
        sd = rcu_dereference(per_cpu(sd_busy, cpu));
        if (sd && !energy_aware()) {
                sgc = sd->groups->sgc;
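For completeness, the tail of nohz_kick_needed() would then end up looking
roughly like this (sketch only, assuming a local 'kick' initialised to
false and an 'unlock' label in front of the final rcu_read_unlock(), as
implied by the hunk above):

	unlock:
		rcu_read_unlock();
		return kick;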

-- Dietmar

>  
>  	rcu_read_lock();
>  	sd = rcu_dereference(per_cpu(sd_busy, cpu));
> -	if (sd) {
> +	if (sd && !energy_aware()) {
>  		sgc = sd->groups->sgc;
>  		nr_busy = atomic_read(&sgc->nr_busy_cpus);
>  
> 



^ permalink raw reply related	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-02-04 18:31 ` [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement Morten Rasmussen
@ 2015-03-13 22:47   ` Sai Gurrappadi
  2015-03-16 14:47     ` Morten Rasmussen
  2015-03-24 13:00   ` Peter Zijlstra
  2015-03-24 16:35   ` Peter Zijlstra
  2 siblings, 1 reply; 124+ messages in thread
From: Sai Gurrappadi @ 2015-03-13 22:47 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: vincent.guittot, Dietmar Eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, Juri Lelli, linux-kernel

On 02/04/2015 10:31 AM, Morten Rasmussen wrote:
> Let available compute capacity and estimated energy impact select
> wake-up target cpu when energy-aware scheduling is enabled.
> energy_aware_wake_cpu() attempts to find group of cpus with sufficient
> compute capacity to accommodate the task and find a cpu with enough spare
> capacity to handle the task within that group. Preference is given to
> cpus with enough spare capacity at the current OPP. Finally, the energy
> impact of the new target and the previous task cpu is compared to select
> the wake-up target cpu.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 90 +++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 90 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index b371f32..8713310 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5091,6 +5091,92 @@ static int select_idle_sibling(struct task_struct *p, int target)
>  done:
>  	return target;
>  }
> +
> +static unsigned long group_max_capacity(struct sched_group *sg)
> +{
> +	int max_idx;
> +
> +	if (!sg->sge)
> +		return 0;
> +
> +	max_idx = sg->sge->nr_cap_states-1;
> +
> +	return sg->sge->cap_states[max_idx].cap;
> +}
> +
> +static inline unsigned long task_utilization(struct task_struct *p)
> +{
> +	return p->se.avg.utilization_avg_contrib;
> +}
> +
> +static int cpu_overutilized(int cpu, struct sched_domain *sd)
> +{
> +	return (capacity_orig_of(cpu) * 100) <
> +				(get_cpu_usage(cpu) * sd->imbalance_pct);
> +}
> +
> +static int energy_aware_wake_cpu(struct task_struct *p)
> +{
> +	struct sched_domain *sd;
> +	struct sched_group *sg, *sg_target;
> +	int target_max_cap = SCHED_CAPACITY_SCALE;
> +	int target_cpu = task_cpu(p);
> +	int i;
> +
> +	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> +
> +	if (!sd)
> +		return -1;
> +
> +	sg = sd->groups;
> +	sg_target = sg;
> +	/* Find group with sufficient capacity */
> +	do {
> +		int sg_max_capacity = group_max_capacity(sg);
> +
> +		if (sg_max_capacity >= task_utilization(p) &&
> +				sg_max_capacity <= target_max_cap) {
> +			sg_target = sg;
> +			target_max_cap = sg_max_capacity;
> +		}
> +	} while (sg = sg->next, sg != sd->groups);

If a 'small' task suddenly becomes 'big' i.e close to 100% util, the
above loop would still pick the little/small cluster because
task_utilization(p) is upper-bounded by the arch-invariant capacity of
the little CPU/group right?

Also, this heuristic for determining sg_target is a big little
assumption. I don't think it is necessarily correct to assume that this
is true for all platforms. This heuristic should be derived from the
energy model for the platform instead.

> +
> +	/* Find cpu with sufficient capacity */
> +	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> +		int new_usage = get_cpu_usage(i) + task_utilization(p);

Isn't this double accounting the task's usage in case task_cpu(p)
belongs to sg_target?

> +
> +		if (new_usage >	capacity_orig_of(i))
> +			continue;
> +
> +		if (new_usage <	capacity_curr_of(i)) {
> +			target_cpu = i;
> +			if (!cpu_rq(i)->nr_running)
> +				break;
> +		}
> +
> +		/* cpu has capacity at higher OPP, keep it as fallback */
> +		if (target_cpu == task_cpu(p))
> +			target_cpu = i;
> +	}
> +
> +	if (target_cpu != task_cpu(p)) {
> +		struct energy_env eenv = {
> +			.usage_delta	= task_utilization(p),
> +			.src_cpu	= task_cpu(p),
> +			.dst_cpu	= target_cpu,
> +		};
> +
> +		/* Not enough spare capacity on previous cpu */
> +		if (cpu_overutilized(task_cpu(p), sd))
> +			return target_cpu;
> +
> +		if (energy_diff(&eenv) >= 0)
> +			return task_cpu(p);
> +	}
> +
> +	return target_cpu;
> +}
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -5138,6 +5224,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  		prev_cpu = cpu;
>  
>  	if (sd_flag & SD_BALANCE_WAKE) {
> +		if (energy_aware()) {
> +			new_cpu = energy_aware_wake_cpu(p);
> +			goto unlock;
> +		}
>  		new_cpu = select_idle_sibling(p, prev_cpu);
>  		goto unlock;
>  	}
> 

-Sai

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
  2015-02-04 18:31 ` [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group Morten Rasmussen
@ 2015-03-13 22:54   ` Sai Gurrappadi
  2015-03-16 14:15     ` Morten Rasmussen
  2015-03-20 18:40   ` Sai Gurrappadi
  1 sibling, 1 reply; 124+ messages in thread
From: Sai Gurrappadi @ 2015-03-13 22:54 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: vincent.guittot, Dietmar Eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, Juri Lelli, linux-kernel, Peter Boonstoppel

On 02/04/2015 10:31 AM, Morten Rasmussen wrote:
> For energy-aware load-balancing decisions it is necessary to know the
> energy consumption estimates of groups of cpus. This patch introduces a
> basic function, sched_group_energy(), which estimates the energy
> consumption of the cpus in the group and any resources shared by the
> members of the group.
> 
>> NOTE: The function has five levels of indentation and breaks the 80
> character limit. Refactoring is necessary.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++++++
>  1 file changed, 143 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 872ae0e..d12aa63 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -4609,6 +4609,149 @@ static inline bool energy_aware(void)
>  	return sched_feat(ENERGY_AWARE);
>  }
>  
> +/*
> + * cpu_norm_usage() returns the cpu usage relative to it's current capacity,
> + * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
> + * energy calculations. Using the scale-invariant usage returned by
> + * get_cpu_usage() and approximating scale-invariant usage by:
> + *
> + *   usage ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time
> + *
> + * the normalized usage can be found using capacity_curr.
> + *
> + *   capacity_curr = capacity_orig * curr_freq/max_freq
> + *
> + *   norm_usage = running_time/time ~ usage/capacity_curr
> + */
> +static inline unsigned long cpu_norm_usage(int cpu)
> +{
> +	unsigned long capacity_curr = capacity_curr_of(cpu);
> +
> +	return (get_cpu_usage(cpu) << SCHED_CAPACITY_SHIFT)/capacity_curr;
> +}
> +
> +static unsigned group_max_usage(struct sched_group *sg)
> +{
> +	int i;
> +	int max_usage = 0;
> +
> +	for_each_cpu(i, sched_group_cpus(sg))
> +		max_usage = max(max_usage, get_cpu_usage(i));
> +
> +	return max_usage;
> +}
> +
> +/*
> + * group_norm_usage() returns the approximated group usage relative to it's
> + * current capacity (busy ratio) in the range [0..SCHED_LOAD_SCALE] for use in
> + * energy calculations. Since task executions may or may not overlap in time in
> + * the group the true normalized usage is between max(cpu_norm_usage(i)) and
> + * sum(cpu_norm_usage(i)) when iterating over all cpus in the group, i. The
> + * latter is used as the estimate as it leads to a more pessimistic energy
> + * estimate (more busy).
> + */
> +static unsigned group_norm_usage(struct sched_group *sg)
> +{
> +	int i;
> +	unsigned long usage_sum = 0;
> +
> +	for_each_cpu(i, sched_group_cpus(sg))
> +		usage_sum += cpu_norm_usage(i);
> +
> +	if (usage_sum > SCHED_CAPACITY_SCALE)
> +		return SCHED_CAPACITY_SCALE;
> +	return usage_sum;
> +}
> +
> +static int find_new_capacity(struct sched_group *sg,
> +		struct sched_group_energy *sge)
> +{
> +	int idx;
> +	unsigned long util = group_max_usage(sg);
> +
> +	for (idx = 0; idx < sge->nr_cap_states; idx++) {
> +		if (sge->cap_states[idx].cap >= util)
> +			return idx;
> +	}
> +
> +	return idx;
> +}
> +
> +/*
> + * sched_group_energy(): Returns absolute energy consumption of cpus belonging
> + * to the sched_group including shared resources shared only by members of the
> + * group. Iterates over all cpus in the hierarchy below the sched_group starting
> + * from the bottom working it's way up before going to the next cpu until all
> + * cpus are covered at all levels. The current implementation is likely to
> + * gather the same usage statistics multiple times. This can probably be done in
> + * a faster but more complex way.
> + */
> +static unsigned int sched_group_energy(struct sched_group *sg_top)
> +{
> +	struct sched_domain *sd;
> +	int cpu, total_energy = 0;
> +	struct cpumask visit_cpus;
> +	struct sched_group *sg;
> +
> +	WARN_ON(!sg_top->sge);
> +
> +	cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
> +
> +	while (!cpumask_empty(&visit_cpus)) {
> +		struct sched_group *sg_shared_cap = NULL;
> +
> +		cpu = cpumask_first(&visit_cpus);
> +
> +		/*
> +		 * Is the group utilization affected by cpus outside this
> +		 * sched_group?
> +		 */
> +		sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
> +		if (sd && sd->parent)
> +			sg_shared_cap = sd->parent->groups;

The above bit looks like it avoids supporting SD_SHARE_CAP_STATES for
the top level sd (!sd->parent). Is it because there is no group that
spans all the CPUs spanned by this sd? It seems like sg_cap is just
being used as a proxy for the cpumask of CPUs to check for max_usage.

> +
> +		for_each_domain(cpu, sd) {
> +			sg = sd->groups;
> +
> +			/* Has this sched_domain already been visited? */
> +			if (sd->child && cpumask_first(sched_group_cpus(sg)) != cpu)
> +				break;
> +
> +			do {
> +				struct sched_group *sg_cap_util;
> +				unsigned group_util;
> +				int sg_busy_energy, sg_idle_energy;
> +				int cap_idx;
> +
> +				if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
> +					sg_cap_util = sg_shared_cap;
> +				else
> +					sg_cap_util = sg;
> +
> +				cap_idx = find_new_capacity(sg_cap_util, sg->sge);
> +				group_util = group_norm_usage(sg);
> +				sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
> +										>> SCHED_CAPACITY_SHIFT;
> +				sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
> +										>> SCHED_CAPACITY_SHIFT;
> +
> +				total_energy += sg_busy_energy + sg_idle_energy;
> +
> +				if (!sd->child)
> +					cpumask_xor(&visit_cpus, &visit_cpus, sched_group_cpus(sg));
> +
> +				if (cpumask_equal(sched_group_cpus(sg), sched_group_cpus(sg_top)))
> +					goto next_cpu;
> +
> +			} while (sg = sg->next, sg != sd->groups);
> +		}
> +next_cpu:
> +		continue;
> +	}
> +
> +	return total_energy;
> +}
> +
>  static int wake_wide(struct task_struct *p)
>  {
>  	int factor = this_cpu_read(sd_llc_size);
> 

-Sai

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
  2015-03-13 22:54   ` Sai Gurrappadi
@ 2015-03-16 14:15     ` Morten Rasmussen
  2015-03-23 16:47       ` Peter Zijlstra
  0 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-16 14:15 UTC (permalink / raw)
  To: Sai Gurrappadi
  Cc: peterz, mingo, vincent.guittot, Dietmar Eggemann, yuyang.du,
	preeti, mturquette, nico, rjw, Juri Lelli, linux-kernel,
	Peter Boonstoppel

On Fri, Mar 13, 2015 at 10:54:25PM +0000, Sai Gurrappadi wrote:
> On 02/04/2015 10:31 AM, Morten Rasmussen wrote:
> > For energy-aware load-balancing decisions it is necessary to know the
> > energy consumption estimates of groups of cpus. This patch introduces a
> > basic function, sched_group_energy(), which estimates the energy
> > consumption of the cpus in the group and any resources shared by the
> > members of the group.
> > 
> > NOTE: The function has five levels of indentation and breaks the 80
> > character limit. Refactoring is necessary.
> > 
> > cc: Ingo Molnar <mingo@redhat.com>
> > cc: Peter Zijlstra <peterz@infradead.org>
> > 
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  kernel/sched/fair.c | 143 ++++++++++++++++++++++++++++++++++++++++++++++++++++
> >  1 file changed, 143 insertions(+)
> > 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 872ae0e..d12aa63 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4609,6 +4609,149 @@ static inline bool energy_aware(void)
> >  	return sched_feat(ENERGY_AWARE);
> >  }
> >  
> > +/*
> > + * cpu_norm_usage() returns the cpu usage relative to it's current capacity,
> > + * i.e. it's busy ratio, in the range [0..SCHED_LOAD_SCALE] which is useful for
> > + * energy calculations. Using the scale-invariant usage returned by
> > + * get_cpu_usage() and approximating scale-invariant usage by:
> > + *
> > + *   usage ~ (curr_freq/max_freq)*1024 * capacity_orig/1024 * running_time/time
> > + *
> > + * the normalized usage can be found using capacity_curr.
> > + *
> > + *   capacity_curr = capacity_orig * curr_freq/max_freq
> > + *
> > + *   norm_usage = running_time/time ~ usage/capacity_curr
> > + */
> > +static inline unsigned long cpu_norm_usage(int cpu)
> > +{
> > +	unsigned long capacity_curr = capacity_curr_of(cpu);
> > +
> > +	return (get_cpu_usage(cpu) << SCHED_CAPACITY_SHIFT)/capacity_curr;
> > +}
> > +
> > +static unsigned group_max_usage(struct sched_group *sg)
> > +{
> > +	int i;
> > +	int max_usage = 0;
> > +
> > +	for_each_cpu(i, sched_group_cpus(sg))
> > +		max_usage = max(max_usage, get_cpu_usage(i));
> > +
> > +	return max_usage;
> > +}
> > +
> > +/*
> > + * group_norm_usage() returns the approximated group usage relative to it's
> > + * current capacity (busy ratio) in the range [0..SCHED_LOAD_SCALE] for use in
> > + * energy calculations. Since task executions may or may not overlap in time in
> > + * the group the true normalized usage is between max(cpu_norm_usage(i)) and
> > + * sum(cpu_norm_usage(i)) when iterating over all cpus in the group, i. The
> > + * latter is used as the estimate as it leads to a more pessimistic energy
> > + * estimate (more busy).
> > + */
> > +static unsigned group_norm_usage(struct sched_group *sg)
> > +{
> > +	int i;
> > +	unsigned long usage_sum = 0;
> > +
> > +	for_each_cpu(i, sched_group_cpus(sg))
> > +		usage_sum += cpu_norm_usage(i);
> > +
> > +	if (usage_sum > SCHED_CAPACITY_SCALE)
> > +		return SCHED_CAPACITY_SCALE;
> > +	return usage_sum;
> > +}
> > +
> > +static int find_new_capacity(struct sched_group *sg,
> > +		struct sched_group_energy *sge)
> > +{
> > +	int idx;
> > +	unsigned long util = group_max_usage(sg);
> > +
> > +	for (idx = 0; idx < sge->nr_cap_states; idx++) {
> > +		if (sge->cap_states[idx].cap >= util)
> > +			return idx;
> > +	}
> > +
> > +	return idx;
> > +}
> > +
> > +/*
> > + * sched_group_energy(): Returns absolute energy consumption of cpus belonging
> > + * to the sched_group including shared resources shared only by members of the
> > + * group. Iterates over all cpus in the hierarchy below the sched_group starting
> > + * from the bottom working it's way up before going to the next cpu until all
> > + * cpus are covered at all levels. The current implementation is likely to
> > + * gather the same usage statistics multiple times. This can probably be done in
> > + * a faster but more complex way.
> > + */
> > +static unsigned int sched_group_energy(struct sched_group *sg_top)
> > +{
> > +	struct sched_domain *sd;
> > +	int cpu, total_energy = 0;
> > +	struct cpumask visit_cpus;
> > +	struct sched_group *sg;
> > +
> > +	WARN_ON(!sg_top->sge);
> > +
> > +	cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
> > +
> > +	while (!cpumask_empty(&visit_cpus)) {
> > +		struct sched_group *sg_shared_cap = NULL;
> > +
> > +		cpu = cpumask_first(&visit_cpus);
> > +
> > +		/*
> > +		 * Is the group utilization affected by cpus outside this
> > +		 * sched_group?
> > +		 */
> > +		sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
> > +		if (sd && sd->parent)
> > +			sg_shared_cap = sd->parent->groups;
> 
> The above bit looks like it avoids supporting SD_SHARE_CAP_STATES for
> the top level sd (!sd->parent). Is it because there is no group that
> spans all the CPUs spanned by this sd? It seems like sg_cap is just
> being used as a proxy for the cpumask of CPUs to check for max_usage.

You are absolutely right. The current code is broken for system
topologies where all cpus share the same clock source. To be honest, it
is actually worse than that and you already pointed out the reason. We
don't have a way of representing top level contributions to power
consumption in RFCv3, as we don't have sched_group spanning all cpus in
single cluster system. For example, we can't represent L2 cache and
interconnect power consumption on such systems.

In RFCv2 we had a system wide sched_group dangling by itself for that
purpose. We chose to remove that in this rewrite as it led to messy
code. In my opinion, a more elegant solution is to introduce an
additional sched_domain above the current top level which has a single
sched_group spanning all cpus in the system. That should fix the
SD_SHARE_CAP_STATES problem and allow us to attach power data for the
top level.

It is on the todo list to add that extra sched_domain/group. In the
meantime a workaround could be to just use the domain mask instead for
checking max_usage, as sketched below.
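A sketch of that interim workaround (hypothetical, not part of the patch
set): make the max-usage lookup operate on a cpumask, so callers can pass
sched_domain_span(sd) when there is no parent domain and
sched_group_cpus(sg_shared_cap) otherwise.

	/* Hypothetical cpumask-based variant of group_max_usage(). */
	static int mask_max_usage(const struct cpumask *mask)
	{
		int i, max_usage = 0;

		for_each_cpu(i, mask)
			max_usage = max(max_usage, get_cpu_usage(i));

		return max_usage;
	}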

Thanks,
Morten


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-03-13 22:47   ` Sai Gurrappadi
@ 2015-03-16 14:47     ` Morten Rasmussen
  2015-03-18 20:15       ` Sai Gurrappadi
  2015-03-24 13:00       ` Peter Zijlstra
  0 siblings, 2 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-16 14:47 UTC (permalink / raw)
  To: Sai Gurrappadi
  Cc: peterz, mingo, vincent.guittot, Dietmar Eggemann, yuyang.du,
	preeti, mturquette, nico, rjw, Juri Lelli, linux-kernel

On Fri, Mar 13, 2015 at 10:47:16PM +0000, Sai Gurrappadi wrote:
> On 02/04/2015 10:31 AM, Morten Rasmussen wrote:
> > +static int energy_aware_wake_cpu(struct task_struct *p)
> > +{
> > +	struct sched_domain *sd;
> > +	struct sched_group *sg, *sg_target;
> > +	int target_max_cap = SCHED_CAPACITY_SCALE;
> > +	int target_cpu = task_cpu(p);
> > +	int i;
> > +
> > +	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> > +
> > +	if (!sd)
> > +		return -1;
> > +
> > +	sg = sd->groups;
> > +	sg_target = sg;
> > +	/* Find group with sufficient capacity */
> > +	do {
> > +		int sg_max_capacity = group_max_capacity(sg);
> > +
> > +		if (sg_max_capacity >= task_utilization(p) &&
> > +				sg_max_capacity <= target_max_cap) {
> > +			sg_target = sg;
> > +			target_max_cap = sg_max_capacity;
> > +		}
> > +	} while (sg = sg->next, sg != sd->groups);
> 
> If a 'small' task suddenly becomes 'big' i.e close to 100% util, the
> above loop would still pick the little/small cluster because
> task_utilization(p) is upper-bounded by the arch-invariant capacity of
> the little CPU/group right?

Yes. Usually the 'big thread stuck on little cpu' problem gets
eventually resolved by load_balance() called from periodic load-balance,
idle-balance, or nohz idle-balance. But you are right, there should be
some margin added to that comparison (capacity > task_utilization(p) +
margin).
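A minimal sketch of what such a margin could look like in the group
selection loop (EA_CAP_MARGIN is an invented placeholder here, not a value
from the patch set):

	/* Hypothetical: require some headroom above the task's utilization. */
	#define EA_CAP_MARGIN	128	/* ~12.5% of SCHED_CAPACITY_SCALE, made up */

	if (sg_max_capacity >= task_utilization(p) + EA_CAP_MARGIN &&
			sg_max_capacity <= target_max_cap) {
		sg_target = sg;
		target_max_cap = sg_max_capacity;
	}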

We want to have a sort of bias tunable that will affect decisions like
this one such that if you are willing to sacrifice some energy you can
put tasks on big cpus even if they could be left on a little cpu. I
think this margin above could be linked to that tunable if not the
tunable itself.

Another problem with the sched_group selection above is that we don't
consider the current utilization when choosing the group. Small tasks
end up in the 'little' group even if it is already fully utilized and
the 'big' group might have spare capacity. We could have additional
checks here, but it would add to latency and it would eventually get
sorted out by periodic/idle balance anyway. The task would have to live
with the suboptimal placement for a while though which isn't ideal.

> 
> Also, this heuristic for determining sg_target is a big little
> assumption. I don't think it is necessarily correct to assume that this
> is true for all platforms. This heuristic should be derived from the
> energy model for the platform instead.

I have had the same thought, but I ended up making that assumption since
it holds for the few platforms I have data for and it simplified
the code a lot. I think we can potentially remove this assumption later.

> 
> > +
> > +	/* Find cpu with sufficient capacity */
> > +	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> > +		int new_usage = get_cpu_usage(i) + task_utilization(p);
> 
> Isn't this double accounting the task's usage in case task_cpu(p)
> belongs to sg_target?

Again you are right. We could make the + task_utilization(p) conditional
on i != task_cpu(p). One argument against doing that is that in
select_task_rq_fair() task_utilization(p) hasn't been decayed yet while
its blocked load on the previous cpu (rq) has. If the task has been gone
for a long time, its blocked contribution may have decayed to zero and
therefore be a poor estimate of the utilization increase caused by
putting the task back on the previous cpu. Particularly if we still use
the non-decayed task_utilization(p) to estimate the utilization increase
on other cpus (!task_cpu(p)). In the interest of responsiveness and not
trying to squeeze tasks back onto the previous cpu which might soon run
out of capacity when utilization increases we could leave it as a sort
of performance bias.

In any case it deserves a comment in the code I think. Roughly, the
conditional variant would look like the sketch below.
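	/*
	 * Hypothetical sketch only: don't add the task's contribution again
	 * on its previous cpu, where its (possibly decayed) blocked
	 * contribution is already included in get_cpu_usage().
	 */
	int new_usage = get_cpu_usage(i);

	if (i != task_cpu(p))
		new_usage += task_utilization(p);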

Thanks,
Morten

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-03-16 14:47     ` Morten Rasmussen
@ 2015-03-18 20:15       ` Sai Gurrappadi
  2015-03-27 16:37         ` Morten Rasmussen
  2015-03-24 13:00       ` Peter Zijlstra
  1 sibling, 1 reply; 124+ messages in thread
From: Sai Gurrappadi @ 2015-03-18 20:15 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: peterz, mingo, vincent.guittot, Dietmar Eggemann, yuyang.du,
	preeti, mturquette, nico, rjw, Juri Lelli, linux-kernel,
	Peter Boonstoppel

On 03/16/2015 07:47 AM, Morten Rasmussen wrote:
> On Fri, Mar 13, 2015 at 10:47:16PM +0000, Sai Gurrappadi wrote:
>> On 02/04/2015 10:31 AM, Morten Rasmussen wrote:
>>> +static int energy_aware_wake_cpu(struct task_struct *p)
>>> +{
>>> +	struct sched_domain *sd;
>>> +	struct sched_group *sg, *sg_target;
>>> +	int target_max_cap = SCHED_CAPACITY_SCALE;
>>> +	int target_cpu = task_cpu(p);
>>> +	int i;
>>> +
>>> +	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
>>> +
>>> +	if (!sd)
>>> +		return -1;
>>> +
>>> +	sg = sd->groups;
>>> +	sg_target = sg;
>>> +	/* Find group with sufficient capacity */
>>> +	do {
>>> +		int sg_max_capacity = group_max_capacity(sg);
>>> +
>>> +		if (sg_max_capacity >= task_utilization(p) &&
>>> +				sg_max_capacity <= target_max_cap) {
>>> +			sg_target = sg;
>>> +			target_max_cap = sg_max_capacity;
>>> +		}
>>> +	} while (sg = sg->next, sg != sd->groups);
>>
>> If a 'small' task suddenly becomes 'big' i.e close to 100% util, the
>> above loop would still pick the little/small cluster because
>> task_utilization(p) is upper-bounded by the arch-invariant capacity of
>> the little CPU/group right?
> 
> Yes. Usually the 'big thread stuck on little cpu' problem gets
> eventually resolved by load_balance() called from periodic load-balance,
> idle-balance, or nohz idle-balance. But you are right, there should be
> some margin added to that comparison (capacity > task_utilization(p) +
> margin).
> 
> We want to have a sort of bias tunable that will affect decisions like
> this one such that if you are willing to sacrifice some energy you can
> put tasks on big cpus even if they could be left on a little cpu. I
> think this margin above could be linked to that tunable if not the
> tunable itself.
> 
> Another problem with the sched_group selection above is that we don't
> consider the current utilization when choosing the group. Small tasks
> end up in the 'little' group even if it is already fully utilized and
> the 'big' group might have spare capacity. We could have additional
> checks here, but it would add to latency and it would eventually get
> sorted out by periodic/idle balance anyway. The task would have to live
> with the suboptimal placement for a while though which isn't ideal.
> 
>>
>> Also, this heuristic for determining sg_target is a big little
>> assumption. I don't think it is necessarily correct to assume that this
>> is true for all platforms. This heuristic should be derived from the
>> energy model for the platform instead.
> 
> I have had the same thought, but I ended up making that assumption since
> it holds for the few platforms I have data for and it simplified
> the code a lot. I think we can potentially remove this assumption later.
> 
>>
>>> +
>>> +	/* Find cpu with sufficient capacity */
>>> +	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
>>> +		int new_usage = get_cpu_usage(i) + task_utilization(p);
>>
>> Isn't this double accounting the task's usage in case task_cpu(p)
>> belongs to sg_target?
> 
> Again you are right. We could make the + task_utilization(p) conditional
> on i != task_cpu(p). One argument against doing that is that in
> select_task_rq_fair() task_utilization(p) hasn't been decayed yet while
> its blocked load on the previous cpu (rq) has. If the task has been gone
> for a long time, its blocked contribution may have decayed to zero and
> therefore be a poor estimate of the utilization increase caused by
> putting the task back on the previous cpu. Particularly if we still use
> the non-decayed task_utilization(p) to estimate the utilization increase
> on other cpus (!task_cpu(p)). In the interest of responsiveness and not
> trying to squeeze tasks back onto the previous cpu which might soon run
> out of capacity when utilization increases we could leave it as a sort
> of performance bias.
> 
> In any case it deserves a comment in the code I think.

I think it makes sense to use the non-decayed value of the task's
contrib. on wake but I am not sure if we should do this 2x accounting
all the time.

Another slightly related issue is that NOHZ could cause blocked rq sums
to remain stale for long periods if there aren't frequent enough
idle/nohz-idle-balances. This would cause the above bit and
energy_diff() to compute incorrect values.

-Sai

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
  2015-02-04 18:31 ` [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group Morten Rasmussen
  2015-03-13 22:54   ` Sai Gurrappadi
@ 2015-03-20 18:40   ` Sai Gurrappadi
  2015-03-27 15:58     ` Morten Rasmussen
  1 sibling, 1 reply; 124+ messages in thread
From: Sai Gurrappadi @ 2015-03-20 18:40 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: vincent.guittot, Dietmar Eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, Juri Lelli, linux-kernel, Peter Boonstoppel

On 02/04/2015 10:31 AM, Morten Rasmussen wrote:
> +/*
> + * sched_group_energy(): Returns absolute energy consumption of cpus belonging
> + * to the sched_group including shared resources shared only by members of the
> + * group. Iterates over all cpus in the hierarchy below the sched_group starting
> + * from the bottom working it's way up before going to the next cpu until all
> + * cpus are covered at all levels. The current implementation is likely to
> + * gather the same usage statistics multiple times. This can probably be done in
> + * a faster but more complex way.
> + */
> +static unsigned int sched_group_energy(struct sched_group *sg_top)
> +{
> +	struct sched_domain *sd;
> +	int cpu, total_energy = 0;
> +	struct cpumask visit_cpus;
> +	struct sched_group *sg;
> +
> +	WARN_ON(!sg_top->sge);
> +
> +	cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
> +
> +	while (!cpumask_empty(&visit_cpus)) {
> +		struct sched_group *sg_shared_cap = NULL;
> +
> +		cpu = cpumask_first(&visit_cpus);
> +
> +		/*
> +		 * Is the group utilization affected by cpus outside this
> +		 * sched_group?
> +		 */
> +		sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
> +		if (sd && sd->parent)
> +			sg_shared_cap = sd->parent->groups;
> +
> +		for_each_domain(cpu, sd) {
> +			sg = sd->groups;
> +
> +			/* Has this sched_domain already been visited? */
> +			if (sd->child && cpumask_first(sched_group_cpus(sg)) != cpu)
> +				break;
> +
> +			do {
> +				struct sched_group *sg_cap_util;
> +				unsigned group_util;
> +				int sg_busy_energy, sg_idle_energy;
> +				int cap_idx;
> +
> +				if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
> +					sg_cap_util = sg_shared_cap;
> +				else
> +					sg_cap_util = sg;
> +
> +				cap_idx = find_new_capacity(sg_cap_util, sg->sge);
> +				group_util = group_norm_usage(sg);
> +				sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
> +										>> SCHED_CAPACITY_SHIFT;
> +				sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
> +										>> SCHED_CAPACITY_SHIFT;
> +
> +				total_energy += sg_busy_energy + sg_idle_energy;

Should normalize group_util with the newly found capacity instead of
capacity_curr.
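A sketch of that change (hypothetical helper only; the real fix may look
different):

	/*
	 * Hypothetical: busy ratio relative to an explicit capacity, e.g.
	 * sg->sge->cap_states[cap_idx].cap as returned by find_new_capacity(),
	 * instead of capacity_curr_of().
	 */
	static inline unsigned long cpu_norm_usage_at(int cpu, unsigned long cap)
	{
		return (get_cpu_usage(cpu) << SCHED_CAPACITY_SHIFT) / cap;
	}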

-Sai

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 14/48] arm: Frequency invariant scheduler load-tracking support
  2015-02-04 18:30 ` [RFCv3 PATCH 14/48] arm: Frequency invariant scheduler load-tracking support Morten Rasmussen
@ 2015-03-23 13:39   ` Peter Zijlstra
  2015-03-24  9:41     ` Morten Rasmussen
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-23 13:39 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel, Russell King

On Wed, Feb 04, 2015 at 06:30:51PM +0000, Morten Rasmussen wrote:
> +/* cpufreq callback function setting current cpu frequency */
> +void arch_scale_set_curr_freq(int cpu, unsigned long freq)
> +{
> +	atomic_long_set(&per_cpu(cpu_curr_freq, cpu), freq);
> +}
> +
> +/* cpufreq callback function setting max cpu frequency */
> +void arch_scale_set_max_freq(int cpu, unsigned long freq)
> +{
> +	atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
> +}
> +
> +unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> +{
> +	unsigned long curr = atomic_long_read(&per_cpu(cpu_curr_freq, cpu));
> +	unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
> +
> +	if (!curr || !max)
> +		return SCHED_CAPACITY_SCALE;
> +
> +	return (curr * SCHED_CAPACITY_SCALE) / max;
> +}

so I've no idea how many cycles an (integer) division takes on ARM; but
doesn't it make sense to do this division (once) in
arch_scale_set_curr_freq() instead of every time we need the result?

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 18/48] sched: Track blocked utilization contributions
  2015-02-04 18:30 ` [RFCv3 PATCH 18/48] sched: Track blocked utilization contributions Morten Rasmussen
@ 2015-03-23 14:08   ` Peter Zijlstra
  2015-03-24  9:43     ` Morten Rasmussen
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-23 14:08 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:30:55PM +0000, Morten Rasmussen wrote:
> Introduces the blocked utilization, the utilization counter-part to
> cfs_rq->utilization_load_avg. It is the sum of sched_entity utilization
> contributions of entities that were recently on the cfs_rq that are
> currently blocked. Combined with sum of contributions of entities
> currently on the cfs_rq or currently running
> (cfs_rq->utilization_load_avg) this can provide a more stable average
> view of the cpu usage.

So it would be nice if you add performance numbers for all these patches
that add accounting muck.. 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 12/48] sched: Make usage tracking cpu scale-invariant
  2015-02-04 18:30 ` [RFCv3 PATCH 12/48] sched: Make usage tracking cpu scale-invariant Morten Rasmussen
@ 2015-03-23 14:46   ` Peter Zijlstra
  2015-03-23 19:19     ` Dietmar Eggemann
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-23 14:46 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:30:49PM +0000, Morten Rasmussen wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 
> Besides the existing frequency scale-invariance correction factor, apply
> cpu scale-invariance correction factor to usage tracking.
> 
> Cpu scale-invariance takes cpu performance deviations due to
> micro-architectural differences (i.e. instructions per seconds) between
> cpus in HMP systems (e.g. big.LITTLE) and differences in the frequency
> value of the highest OPP between cpus in SMP systems into consideration.
> 
> Each segment of the sched_avg::running_avg_sum geometric series is now
> scaled by the cpu performance factor too so the
> sched_avg::utilization_avg_contrib of each entity will be invariant from
> the particular cpu of the HMP/SMP system it is gathered on.
> 
> So the usage level that is returned by get_cpu_usage stays relative to
> the max cpu performance of the system.

> @@ -2547,6 +2549,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
>  
>  		if (runnable)
>  			sa->runnable_avg_sum += scaled_delta_w;
> +
> +		scaled_delta_w *= scale_cpu;
> +		scaled_delta_w >>= SCHED_CAPACITY_SHIFT;
> +
>  		if (running)
>  			sa->running_avg_sum += scaled_delta_w;
>  		sa->avg_period += delta_w;

Maybe help remind me why we want this asymmetry between runnable and
running in terms of scaling?

The above talks about why we want running scaled with the cpu metric,
but it forgets to tell me why we do not want to scale runnable.

(even if I were to have a vague recollection it seems like a good thing
to write down someplace ;-).

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 28/48] sched: Use capacity_curr to cap utilization in get_cpu_usage()
  2015-02-04 18:31 ` [RFCv3 PATCH 28/48] sched: Use capacity_curr to cap utilization in get_cpu_usage() Morten Rasmussen
@ 2015-03-23 16:14   ` Peter Zijlstra
  2015-03-24 11:36     ` Morten Rasmussen
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-23 16:14 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:05PM +0000, Morten Rasmussen wrote:

> @@ -4596,9 +4596,10 @@ static int get_cpu_usage(int cpu)
>  {
>  	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
>  	unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
> +	unsigned long capacity_curr = capacity_curr_of(cpu);
>  
> -	if (usage + blocked >= SCHED_LOAD_SCALE)
> -		return capacity_orig_of(cpu);
> +	if (usage + blocked >= capacity_curr)
> +		return capacity_curr;

It makes more sense to do return capacity_curr_of(), since that defers
the computation capacity_curr_of() does to the point where its actually
required, instead of making it unconditional.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 29/48] sched: Highest energy aware balancing sched_domain level pointer
  2015-02-04 18:31 ` [RFCv3 PATCH 29/48] sched: Highest energy aware balancing sched_domain level pointer Morten Rasmussen
@ 2015-03-23 16:16   ` Peter Zijlstra
  2015-03-24 10:52     ` Morten Rasmussen
       [not found]   ` <OF5977496A.A21A7B96-ON48257E35.002EC23C-48257E35.00324DAD@zte.com.cn>
  1 sibling, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-23 16:16 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:06PM +0000, Morten Rasmussen wrote:
> Add another member to the family of per-cpu sched_domain shortcut
> pointers. This one, sd_ea, points to the highest level at which energy
> model is provided. At this level and all levels below all sched_groups
> have energy model data attached.

Here might be a good point to discuss what it means for the topology to
have partial energy model information.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
  2015-03-16 14:15     ` Morten Rasmussen
@ 2015-03-23 16:47       ` Peter Zijlstra
  2015-03-23 20:21         ` Dietmar Eggemann
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-23 16:47 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Sai Gurrappadi, mingo, vincent.guittot, Dietmar Eggemann,
	yuyang.du, preeti, mturquette, nico, rjw, Juri Lelli,
	linux-kernel, Peter Boonstoppel

On Mon, Mar 16, 2015 at 02:15:46PM +0000, Morten Rasmussen wrote:
> You are absolutely right. The current code is broken for system
> topologies where all cpus share the same clock source. To be honest, it
> is actually worse than that and you already pointed out the reason. We
> don't have a way of representing top level contributions to power
> consumption in RFCv3, as we don't have sched_group spanning all cpus in
> single cluster system. For example, we can't represent L2 cache and
> interconnect power consumption on such systems.
> 
> In RFCv2 we had a system wide sched_group dangling by itself for that
> purpose. We chose to remove that in this rewrite as it led to messy
> code. In my opinion, a more elegant solution is to introduce an
> additional sched_domain above the current top level which has a single
> sched_group spanning all cpus in the system. That should fix the
> SD_SHARE_CAP_STATES problem and allow us to attach power data for the
> top level.

Maybe remind us why this needs to be tied to sched_groups ? Why can't we
attach the energy information to the domains?

There is an additional problem with groups you've not yet discovered and
that is overlapping groups. Certain NUMA topologies result in this.
There the sum of cpus over the groups is greater than the total cpus in
the domain.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 12/48] sched: Make usage tracking cpu scale-invariant
  2015-03-23 14:46   ` Peter Zijlstra
@ 2015-03-23 19:19     ` Dietmar Eggemann
       [not found]       ` <OF8A3E3617.0D4400A5-ON48257E3A.001B38D9-48257E3A.002379A4@zte.com.cn>
  0 siblings, 1 reply; 124+ messages in thread
From: Dietmar Eggemann @ 2015-03-23 19:19 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: mingo, vincent.guittot, yuyang.du, preeti, mturquette, nico, rjw,
	Juri Lelli, linux-kernel

On 23/03/15 14:46, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:30:49PM +0000, Morten Rasmussen wrote:
>> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>
>> Besides the existing frequency scale-invariance correction factor, apply
>> cpu scale-invariance correction factor to usage tracking.
>>
>> Cpu scale-invariance takes cpu performance deviations due to
>> micro-architectural differences (i.e. instructions per seconds) between
>> cpus in HMP systems (e.g. big.LITTLE) and differences in the frequency
>> value of the highest OPP between cpus in SMP systems into consideration.
>>
>> Each segment of the sched_avg::running_avg_sum geometric series is now
>> scaled by the cpu performance factor too so the
>> sched_avg::utilization_avg_contrib of each entity will be invariant from
>> the particular cpu of the HMP/SMP system it is gathered on.
>>
>> So the usage level that is returned by get_cpu_usage stays relative to
>> the max cpu performance of the system.
>
>> @@ -2547,6 +2549,10 @@ static __always_inline int __update_entity_runnable_avg(u64 now, int cpu,
>>
>>   		if (runnable)
>>   			sa->runnable_avg_sum += scaled_delta_w;
>> +
>> +		scaled_delta_w *= scale_cpu;
>> +		scaled_delta_w >>= SCHED_CAPACITY_SHIFT;
>> +
>>   		if (running)
>>   			sa->running_avg_sum += scaled_delta_w;
>>   		sa->avg_period += delta_w;
>
> Maybe help remind me why we want this asymmetry between runnable and
> running in terms of scaling?

In the previous patch-set https://lkml.org/lkml/2014/12/2/332 we 
cpu-scaled both (sched_avg::runnable_avg_sum (load) and 
sched_avg::running_avg_sum (utilization)) but during the review Vincent 
pointed out that a cpu-scaled invariant load signal messes up 
load-balancing based on s[dg]_lb_stats::avg_load in overload scenarios.

avg_load = load/capacity and load can't be simply replaced here by 
'cpu-scale invariant load' (which is load*capacity).
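As a concrete illustration (numbers invented): take a little cpu with
capacity 430 carrying runnable load 512. Plain avg_load = 512/430 ~ 1.19,
which correctly signals overload. If load were pre-scaled by capacity,
i.e. 512 * 430/1024 = 215, then avg_load = 215/430 = 0.5 and the overload
is no longer visible to the balancer.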

> The above talks about why we want running scaled with the cpu metric,
> but it forgets to tell me why we do not want to scale runnable.

Yes, I will add the missing explanation to this patch.

> (even if I were to have a vague recollection it seems like a good thing
> to write down someplace ;-).

Definitely true.

Back in December last year we talked about adding the now missing
cpu-scale invariant load signal to the end of the patch-set (the part
which should contain the more experimental bits). I guess we haven't done
this simply because of the modifications around s[dg]_lb_stats::avg_load
which would then be needed.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
  2015-03-23 16:47       ` Peter Zijlstra
@ 2015-03-23 20:21         ` Dietmar Eggemann
  2015-03-24 10:44           ` Morten Rasmussen
  0 siblings, 1 reply; 124+ messages in thread
From: Dietmar Eggemann @ 2015-03-23 20:21 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: Sai Gurrappadi, mingo, vincent.guittot, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel,
	Peter Boonstoppel

On 23/03/15 16:47, Peter Zijlstra wrote:
> On Mon, Mar 16, 2015 at 02:15:46PM +0000, Morten Rasmussen wrote:
>> You are absolutely right. The current code is broken for system
>> topologies where all cpus share the same clock source. To be honest, it
>> is actually worse than that and you already pointed out the reason. We
>> don't have a way of representing top level contributions to power
>> consumption in RFCv3, as we don't have sched_group spanning all cpus in
>> single cluster system. For example, we can't represent L2 cache and
>> interconnect power consumption on such systems.
>>
>> In RFCv2 we had a system wide sched_group dangling by itself for that
>> purpose. We chose to remove that in this rewrite as it led to messy
>> code. In my opinion, a more elegant solution is to introduce an
>> additional sched_domain above the current top level which has a single
>> sched_group spanning all cpus in the system. That should fix the
>> SD_SHARE_CAP_STATES problem and allow us to attach power data for the
>> top level.
>
> Maybe remind us why this needs to be tied to sched_groups ? Why can't we
> attach the energy information to the domains?

Currently on our 2 cluster (big.LITTLE) system (cluster0: big cpus, 
cluster1: little cpus) we attach energy information onto all sg's in MC 
(cpu/core related energy data) and DIE sd level (cluster related energy 
data).

For an MC level (cpus sharing the same u-arch) attaching the energy 
information onto the sd is clearly much easier than attaching it onto
the individual sg's.

But on DIE level when we want to figure out the cluster energy data for 
a cluster represented by an sg other than the first sg (sg0) then we
would have to access its cluster energy data via the DIE sd of one of 
the cpus of this cluster. I haven't seen code actually doing that in CFS.

IMHO, the current code is always iterating over the sg's of the sd and 
accessing either sg (sched_group) or sg->sgc (sched_group_capacity) 
data. Our energy data follows the sched_group_capacity example.

> There is an additional problem with groups you've not yet discovered and
> that is overlapping groups. Certain NUMA topologies result in this.
> There the sum of cpus over the groups is greater than the total cpus in
> the domain.

Yeah, we haven't tried EAS on such a system nor did we enable 
FORCE_SD_OVERLAP sched feature for a long time.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 14/48] arm: Frequency invariant scheduler load-tracking support
  2015-03-23 13:39   ` Peter Zijlstra
@ 2015-03-24  9:41     ` Morten Rasmussen
  0 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-24  9:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, Dietmar Eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel, Russell King

On Mon, Mar 23, 2015 at 01:39:44PM +0000, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:30:51PM +0000, Morten Rasmussen wrote:
> > +/* cpufreq callback function setting current cpu frequency */
> > +void arch_scale_set_curr_freq(int cpu, unsigned long freq)
> > +{
> > +	atomic_long_set(&per_cpu(cpu_curr_freq, cpu), freq);
> > +}
> > +
> > +/* cpufreq callback function setting max cpu frequency */
> > +void arch_scale_set_max_freq(int cpu, unsigned long freq)
> > +{
> > +	atomic_long_set(&per_cpu(cpu_max_freq, cpu), freq);
> > +}
> > +
> > +unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
> > +{
> > +	unsigned long curr = atomic_long_read(&per_cpu(cpu_curr_freq, cpu));
> > +	unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));
> > +
> > +	if (!curr || !max)
> > +		return SCHED_CAPACITY_SCALE;
> > +
> > +	return (curr * SCHED_CAPACITY_SCALE) / max;
> > +}
> 
> so I've no idea how many cycles an (integer) division takes on ARM; but
> doesn't it make sense to do this division (once) in
> arch_scale_set_curr_freq() instead of every time we need the result?

It does. In fact I have already prepared that fix for v4. Integer
division is expensive on ARM and we call this function a lot.
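A sketch of that direction (hypothetical code, not the actual v4 patch):

	/* Hypothetical: store the precomputed capacity ratio per cpu. */
	static DEFINE_PER_CPU(atomic_long_t, cpu_freq_capacity);

	/* cpufreq callback function setting current cpu frequency */
	void arch_scale_set_curr_freq(int cpu, unsigned long freq)
	{
		unsigned long max = atomic_long_read(&per_cpu(cpu_max_freq, cpu));

		if (!max)
			return;

		atomic_long_set(&per_cpu(cpu_freq_capacity, cpu),
				(freq * SCHED_CAPACITY_SCALE) / max);
	}

	unsigned long arch_scale_freq_capacity(struct sched_domain *sd, int cpu)
	{
		unsigned long curr = atomic_long_read(&per_cpu(cpu_freq_capacity, cpu));

		return curr ? curr : SCHED_CAPACITY_SCALE;
	}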

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 18/48] sched: Track blocked utilization contributions
  2015-03-23 14:08   ` Peter Zijlstra
@ 2015-03-24  9:43     ` Morten Rasmussen
  2015-03-24 16:07       ` Peter Zijlstra
  0 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-24  9:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, Dietmar Eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel

On Mon, Mar 23, 2015 at 02:08:01PM +0000, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:30:55PM +0000, Morten Rasmussen wrote:
> > Introduces the blocked utilization, the utilization counter-part to
> > cfs_rq->utilization_load_avg. It is the sum of sched_entity utilization
> > contributions of entities that were recently on the cfs_rq that are
> > currently blocked. Combined with sum of contributions of entities
> > currently on the cfs_rq or currently running
> > (cfs_rq->utilization_load_avg) this can provide a more stable average
> > view of the cpu usage.
> 
> So it would be nice if you add performance numbers for all these patches
> that add accounting muck.. 

Total scheduler latency (as in hackbench?), individual function
latencies, or something else?

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
  2015-03-23 20:21         ` Dietmar Eggemann
@ 2015-03-24 10:44           ` Morten Rasmussen
  2015-03-24 16:10             ` Peter Zijlstra
  0 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-24 10:44 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Peter Zijlstra, Sai Gurrappadi, mingo, vincent.guittot,
	yuyang.du, preeti, mturquette, nico, rjw, Juri Lelli,
	linux-kernel, Peter Boonstoppel

On Mon, Mar 23, 2015 at 08:21:45PM +0000, Dietmar Eggemann wrote:
> On 23/03/15 16:47, Peter Zijlstra wrote:
> > On Mon, Mar 16, 2015 at 02:15:46PM +0000, Morten Rasmussen wrote:
> >> You are absolutely right. The current code is broken for system
> >> topologies where all cpus share the same clock source. To be honest, it
> >> is actually worse than that and you already pointed out the reason. We
> >> don't have a way of representing top level contributions to power
> >> consumption in RFCv3, as we don't have sched_group spanning all cpus in
> >> single cluster system. For example, we can't represent L2 cache and
> >> interconnect power consumption on such systems.
> >>
> >> In RFCv2 we had a system wide sched_group dangling by itself for that
> >> purpose. We chose to remove that in this rewrite as it led to messy
> >> code. In my opinion, a more elegant solution is to introduce an
> >> additional sched_domain above the current top level which has a single
> >> sched_group spanning all cpus in the system. That should fix the
> >> SD_SHARE_CAP_STATES problem and allow us to attach power data for the
> >> top level.
> >
> > Maybe remind us why this needs to be tied to sched_groups ? Why can't we
> > attach the energy information to the domains?
> 
> Currently on our 2 cluster (big.LITTLE) system (cluster0: big cpus, 
> cluster1: little cpus) we attach energy information onto all sg's in MC 
> (cpu/core related energy data) and DIE sd level (cluster related energy 
> data).
> 
> For an MC level (cpus sharing the same u-arch) attaching the energy 
> information onto the sd is clearly much easier then attaching it onto 
> the individual sg's.

In the current domain hierarchy you don't have domains with just one cpu
in them. If you attach the per-cpu energy data to the MC level domain
which spans the whole cluster, you break the current idea of attaching
the information to the cpumask it is associated with (currently a
sched_group, but it could be a sched_domain as we discuss here). You
would have to either introduce a level of single cpu domains at the
lowest level or move away from the idea of attaching data to the cpumask
that is associated with it.

Using sched_groups we do already have single cpu groups that we can
attach per-cpu data to, but we are missing a top level group spanning
the entire system for system wide energy data. So from that point of
view groups and domains are equally bad.

> But on DIE level when we want to figure out the cluster energy data for 
> a cluster represented by an sg other than the first sg (sg0) than we 
> would have to access its cluster energy data via the DIE sd of one of 
> the cpus of this cluster. I haven't seen code actually doing that in CFS.
> 
> IMHO, the current code is always iterating over the sg's of the sd and 
> accessing either sg (sched_group) or sg->sgc (sched_group_capacity) 
> data. Our energy data follows the sched_group_capacity example.

Right, using domains we can't directly see sibling domains. Using groups
we can see some sibling groups directly without accessing a different
per-cpu view of the domain hierarchy, but not all of them. We do have to
do the per-cpu thing in some cases similar to how it is currently done
in select_task_rq_fair().

> 
> > There is an additional problem with groups you've not yet discovered and
> > that is overlapping groups. Certain NUMA topologies result in this.
> > There the sum of cpus over the groups is greater than the total cpus in
> > the domain.
> 
> Yeah, we haven't tried EAS on such a system nor did we enable 
> FORCE_SD_OVERLAP sched feature for a long time.

There are things to be discussed for NUMA and energy awareness. I'm not
sure if the NUMA folks would be interested in energy awareness and if
so, how to couple it in a meaningful way with the NUMA scheduling
strategy which involves memory location aspects.

The current patches should provide the basics to enable partial energy
awareness for NUMA systems in the sense that energy aware scheduling
decisions will be made up to the level pointed to by the sd_ea pointer
(somewhat similar to sd_llc). So it should be possible to, for example,
do energy aware scheduling within a NUMA node and let cross NUMA-node
scheduling be done as it is currently done. It is entirely untested, but
that is at least how it is intended to work :)

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 29/48] sched: Highest energy aware balancing sched_domain level pointer
  2015-03-23 16:16   ` Peter Zijlstra
@ 2015-03-24 10:52     ` Morten Rasmussen
  0 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-24 10:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, Dietmar Eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel

On Mon, Mar 23, 2015 at 04:16:06PM +0000, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:31:06PM +0000, Morten Rasmussen wrote:
> > Add another member to the family of per-cpu sched_domain shortcut
> > pointers. This one, sd_ea, points to the highest level at which energy
> > model is provided. At this level and all levels below all sched_groups
> > have energy model data attached.
> 
> Here might be a good point to discuss what it means for the topology to
> have partial energy model information.

Partial energy model information is supported in the sense that you can
have energy model data for just the lower level sched_domains and leave
scheduling between higher level sched_domains to be done in the non-energy-aware
(current) way. For example, have energy-aware scheduling working within
each socket on a multi-socket system and let normal performance
(spreading) load-balancing work between sockets. Or within a NUMA node
and not between NUMA nodes as I mentioned in my reply to the next patch.
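A rough sketch (not part of the posted patches) of how the sd_ea shortcut
could be derived, modelled on the existing sd_llc update and assuming the
per-cpu sd_ea pointer and the sge field used elsewhere in this series:

static void update_energy_aware_domain(int cpu)
{
	struct sched_domain *sd, *ea = NULL;

	for_each_domain(cpu, sd) {
		/* stop at the first level without energy data attached */
		if (!sd->groups->sge)
			break;
		ea = sd;	/* highest level seen with energy data */
	}

	rcu_assign_pointer(per_cpu(sd_ea, cpu), ea);
}

Everything above 'ea' (if anything) would then be balanced the conventional
way, which is what makes the partial model possible.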

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 28/48] sched: Use capacity_curr to cap utilization in get_cpu_usage()
  2015-03-23 16:14   ` Peter Zijlstra
@ 2015-03-24 11:36     ` Morten Rasmussen
  2015-03-24 12:59       ` Peter Zijlstra
  0 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-24 11:36 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, Dietmar Eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel

On Mon, Mar 23, 2015 at 04:14:00PM +0000, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:31:05PM +0000, Morten Rasmussen wrote:
> 
> > @@ -4596,9 +4596,10 @@ static int get_cpu_usage(int cpu)
> >  {
> >  	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
> >  	unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
> > +	unsigned long capacity_curr = capacity_curr_of(cpu);
> >  
> > -	if (usage + blocked >= SCHED_LOAD_SCALE)
> > -		return capacity_orig_of(cpu);
> > +	if (usage + blocked >= capacity_curr)
> > +		return capacity_curr;
> 
> It makes more sense to do return capacity_curr_of(), since that defers
> the computation capacity_curr_of() does to the point where its actually
> required, instead of making it unconditional.

capacity_curr_of() is used in the if-condition itself as well so we need
it unconditionally. No?
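For reference, a sketch of how the function would read with the quoted hunk
applied (assuming the untouched tail of get_cpu_usage() still returns
usage + blocked):

static int get_cpu_usage(int cpu)
{
	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
	unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
	unsigned long capacity_curr = capacity_curr_of(cpu);

	/* capacity_curr is needed for the comparison itself, so the
	 * call cannot be deferred to the return statement. */
	if (usage + blocked >= capacity_curr)
		return capacity_curr;

	return usage + blocked;
}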

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 28/48] sched: Use capacity_curr to cap utilization in get_cpu_usage()
  2015-03-24 11:36     ` Morten Rasmussen
@ 2015-03-24 12:59       ` Peter Zijlstra
  0 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 12:59 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, Dietmar Eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel

On Tue, Mar 24, 2015 at 11:36:51AM +0000, Morten Rasmussen wrote:
> On Mon, Mar 23, 2015 at 04:14:00PM +0000, Peter Zijlstra wrote:
> > On Wed, Feb 04, 2015 at 06:31:05PM +0000, Morten Rasmussen wrote:
> > 
> > > @@ -4596,9 +4596,10 @@ static int get_cpu_usage(int cpu)
> > >  {
> > >  	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
> > >  	unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
> > > +	unsigned long capacity_curr = capacity_curr_of(cpu);
> > >  
> > > -	if (usage + blocked >= SCHED_LOAD_SCALE)
> > > -		return capacity_orig_of(cpu);
> > > +	if (usage + blocked >= capacity_curr)
> > > +		return capacity_curr;
> > 
> > It makes more sense to do return capacity_curr_of(), since that defers
> > the computation capacity_curr_of() does to the point where its actually
> > required, instead of making it unconditional.
> 
> capacity_curr_of() is used in the if-condition itself as well so we need
> it unconditionally. No?

Duh.. I can't read it seems ;-)

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-03-16 14:47     ` Morten Rasmussen
  2015-03-18 20:15       ` Sai Gurrappadi
@ 2015-03-24 13:00       ` Peter Zijlstra
  2015-03-24 15:24         ` Morten Rasmussen
  1 sibling, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 13:00 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Sai Gurrappadi, mingo, vincent.guittot, Dietmar Eggemann,
	yuyang.du, preeti, mturquette, nico, rjw, Juri Lelli,
	linux-kernel

On Mon, Mar 16, 2015 at 02:47:23PM +0000, Morten Rasmussen wrote:
> > Also, this heuristic for determining sg_target is a big little
> > assumption. I don't think it is necessarily correct to assume that this
> > is true for all platforms. This heuristic should be derived from the
> > energy model for the platform instead.
> 
> I have had the same thought, but I ended up making that assumption since
> it holds for the few platforms I have data for and it simplified
> the code a lot. I think we can potentially remove this assumption later.

So the thing is; if we know all the possible behavioural modes of our
system we could pre compute which are applicable to the system at hand.

All we need is a semi formal definition of our model and a friendly
mathematician who can do the bifurcation analysis to find the modal
boundaries and their parameters.

I suspect there is a limited number of modes and parameters, and if we
implement them all we just need to select the right ones; this could save a
lot of runtime computation.

But yes, we don't need to start with that, we can do the brute force
thing for a while.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-02-04 18:31 ` [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement Morten Rasmussen
  2015-03-13 22:47   ` Sai Gurrappadi
@ 2015-03-24 13:00   ` Peter Zijlstra
  2015-03-24 15:42     ` Morten Rasmussen
  2015-03-24 16:35   ` Peter Zijlstra
  2 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 13:00 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:10PM +0000, Morten Rasmussen wrote:
> @@ -5138,6 +5224,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  		prev_cpu = cpu;
>  
>  	if (sd_flag & SD_BALANCE_WAKE) {
> +		if (energy_aware()) {
> +			new_cpu = energy_aware_wake_cpu(p);
> +			goto unlock;
> +		}
>  		new_cpu = select_idle_sibling(p, prev_cpu);
>  		goto unlock;
>  	}

So that is fundamentally wrong I think. We only care about power aware
scheduling when U < 1, after that we should do the normal thing. This
setup does not allow for that.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 36/48] sched: Count number of shallower idle-states in struct sched_group_energy
  2015-02-04 18:31 ` [RFCv3 PATCH 36/48] sched: Count number of shallower idle-states in struct sched_group_energy Morten Rasmussen
@ 2015-03-24 13:14   ` Peter Zijlstra
  2015-03-24 17:13     ` Morten Rasmussen
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 13:14 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:13PM +0000, Morten Rasmussen wrote:
> cpuidle associates all idle-states with each cpu while the energy model
> associates them with the sched_group covering the cpus coordinating
> entry to the idle-state. To get idle-state power consumption it is
> therefore necessary to translate from cpuidle idle-state index to energy
> model index. For this purpose it is helpful to know how many idle-states
> that are listed in lower level sched_groups (in struct
> sched_group_energy).

I think this could use some text to describe how that number is useful.

I suspect it has something to do with bigger domains having more idle
modes (package C-states etc.).

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 34/48] sched: Bias new task wakeups towards higher capacity cpus
  2015-02-04 18:31 ` [RFCv3 PATCH 34/48] sched: Bias new task wakeups towards higher capacity cpus Morten Rasmussen
@ 2015-03-24 13:33   ` Peter Zijlstra
  2015-03-25 18:18     ` Morten Rasmussen
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 13:33 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:11PM +0000, Morten Rasmussen wrote:
> Make wake-ups of new tasks (find_idlest_group) aware of any differences
> > in cpu compute capacity so new tasks don't get handed off to a cpu with
> lower capacity.

That says what; but fails to explain why we want to do this.

Remember Changelogs should answer what+why and if complicated also
reason about how.

> @@ -4971,6 +4972,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  
>  		/* Tally up the load of all CPUs in the group */
>  		avg_load = 0;
> +		cpu_capacity = 0;
>  
>  		for_each_cpu(i, sched_group_cpus(group)) {
>  			/* Bias balancing toward cpus of our domain */
> @@ -4980,6 +4982,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  				load = target_load(i, load_idx);
>  
>  			avg_load += load;
> +
> +			if (cpu_capacity < capacity_of(i))
> +				cpu_capacity = capacity_of(i);
>  		}
>  
>  		/* Adjust by relative CPU capacity of the group */

So basically you're constructing the max_cpu_capacity for that group
here. Might it be clearer to explicitly name/write it as such?

		max_cpu_capacity = max(max_cpu_capacity, capacity_of(i));

> @@ -4987,14 +4992,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
>  
>  		if (local_group) {
>  			this_load = avg_load;
> +			this_cpu_cap = cpu_capacity;
>  		} else if (avg_load < min_load) {
>  			min_load = avg_load;
>  			idlest = group;
> +			idlest_cpu_cap = cpu_capacity;
>  		}
>  	} while (group = group->next, group != sd->groups);
>  
> -	if (!idlest || 100*this_load < imbalance*min_load)
> +	if (!idlest)
> +		return NULL;
> +
> +	if (100*this_load < imbalance*min_load && this_cpu_cap >= idlest_cpu_cap)
>  		return NULL;

And here you then fail to provide an idlest group if the selected group
has less (max) capacity than the current group.

/me goes double check what capacity_of() returns again, yes, this seems
dubious. Suppose we have our two symmetric clusters and for some reason
we've routed all our interrupts to one cluster and every cpu regularly
receives interrupts. This means that the capacity_of() of this IRQ
cluster is always less than the other.

The above modification will result in tasks always being placed on the
other cluster, even though it might be very busy indeed.

If you want to do something like this; one should really add in a
current usage metric or so.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 38/48] sched: Infrastructure to query if load balancing is energy-aware
  2015-02-04 18:31 ` [RFCv3 PATCH 38/48] sched: Infrastructure to query if load balancing is energy-aware Morten Rasmussen
@ 2015-03-24 13:41   ` Peter Zijlstra
  2015-03-24 16:17     ` Dietmar Eggemann
  2015-03-24 13:56   ` Peter Zijlstra
  1 sibling, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 13:41 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:15PM +0000, Morten Rasmussen wrote:

> +		.use_ea		= (energy_aware() && sd->groups->sge) ? true : false,

The return value of a logical and should already be a boolean.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 38/48] sched: Infrastructure to query if load balancing is energy-aware
  2015-02-04 18:31 ` [RFCv3 PATCH 38/48] sched: Infrastructure to query if load balancing is energy-aware Morten Rasmussen
  2015-03-24 13:41   ` Peter Zijlstra
@ 2015-03-24 13:56   ` Peter Zijlstra
  2015-03-24 16:22     ` Dietmar Eggemann
  1 sibling, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 13:56 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:15PM +0000, Morten Rasmussen wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 
> Energy-aware load balancing should only happen if the ENERGY_AWARE feature
> is turned on and the sched domain on which the load balancing is performed
> on contains energy data.
> There is also a need during a load balance action to be able to query if we
> should continue to load balance energy-aware or if we reached the tipping
> point which forces us to fall back to the conventional load balancing
> functionality.

> @@ -7348,6 +7349,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>  		.cpus		= cpus,
>  		.fbq_type	= all,
>  		.tasks		= LIST_HEAD_INIT(env.tasks),
> +		.use_ea		= (energy_aware() && sd->groups->sge) ? true : false,

fwiw, no tipping point in that logic.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 42/48] sched: Introduce energy awareness into find_busiest_queue
  2015-02-04 18:31 ` [RFCv3 PATCH 42/48] sched: Introduce energy awareness into find_busiest_queue Morten Rasmussen
@ 2015-03-24 15:21   ` Peter Zijlstra
  2015-03-24 18:04     ` Dietmar Eggemann
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 15:21 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:19PM +0000, Morten Rasmussen wrote:
> +++ b/kernel/sched/fair.c
> @@ -7216,6 +7216,37 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>  	unsigned long busiest_load = 0, busiest_capacity = 1;
>  	int i;
>  
> +	if (env->use_ea) {
> +		struct rq *costliest = NULL;
> +		unsigned long costliest_usage = 1024, costliest_energy = 1;
> +
> +		for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
> +			unsigned long usage = get_cpu_usage(i);
> +			struct rq *rq = cpu_rq(i);
> +			struct sched_domain *sd = rcu_dereference(rq->sd);
> +			struct energy_env eenv = {
> +				.sg_top = sd->groups,
> +				.usage_delta    = 0,
> +				.src_cpu        = -1,
> +				.dst_cpu        = -1,
> +			};
> +			unsigned long energy = sched_group_energy(&eenv);
> +
> +			/*
> +			 * We're looking for the minimal cpu efficiency
> +			 * min(u_i / e_i), crosswise multiplication leads to
> +			 * u_i * e_j < u_j * e_i with j as previous minimum.
> +			 */
> +			if (usage * costliest_energy < costliest_usage * energy) {
> +				costliest_usage = usage;
> +				costliest_energy = energy;
> +				costliest = rq;
> +			}
> +		}
> +
> +		return costliest;
> +	}
> +
>  	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
>  		unsigned long capacity, wl;
>  		enum fbq_type rt;

So I've thought about parametrizing the whole load balance thing to
avoid things like this.

Irrespective of whether we balance on pure load or another metric we
have the same structure, only different units plugged in.

I've not really spent too much time on it to see what it would look
like, but I think it would be a good avenue to investigate to avoid
patches like this.



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-03-24 13:00       ` Peter Zijlstra
@ 2015-03-24 15:24         ` Morten Rasmussen
  0 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-24 15:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Sai Gurrappadi, mingo, vincent.guittot, Dietmar Eggemann,
	yuyang.du, preeti, mturquette, nico, rjw, Juri Lelli,
	linux-kernel

On Tue, Mar 24, 2015 at 01:00:00PM +0000, Peter Zijlstra wrote:
> On Mon, Mar 16, 2015 at 02:47:23PM +0000, Morten Rasmussen wrote:
> > > Also, this heuristic for determining sg_target is a big little
> > > assumption. I don't think it is necessarily correct to assume that this
> > > is true for all platforms. This heuristic should be derived from the
> > > energy model for the platform instead.
> > 
> > I have had the same thought, but I ended up making that assumption since
> > I holds for the for the few platforms I have data for and it simplified
> > the code a lot. I think we can potentially remove this assumption later.
> 
> So the thing is; if we know all the possible behavioural modes of our
> system we could pre compute which are applicable to the system at hand.
> 
> All we need is a semi formal definition of our model and a friendly
> mathematician who can do the bifurcation analysis to find the modal
> boundaries and their parameters.
> 
> I suspect there is a limited number of modes and parameters and if we
> implement all we just need to select the right ones, this could save a
> lot of runtime computation.
> 
> But yes, we don't need to start with that, we can do the brute force
> thing for a while.

Yes, I think we need to get the basics (which are already quite
complicated) right before we start optimizing. Though, depending on how
crazy the energy model looks like, it should be possible to assign some
energy 'priority' to each group when the sched_domain hierarchy is built
based on the energy model data rather than basing the choice on max
capacity. At least until we find a friendly mathematician willing to
help us out :) 

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 43/48] sched: Introduce energy awareness into detach_tasks
  2015-02-04 18:31 ` [RFCv3 PATCH 43/48] sched: Introduce energy awareness into detach_tasks Morten Rasmussen
@ 2015-03-24 15:25   ` Peter Zijlstra
  2015-03-25 23:50   ` Sai Gurrappadi
  1 sibling, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 15:25 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:20PM +0000, Morten Rasmussen wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 
> Energy-aware load balancing does not rely on env->imbalance but instead it
> evaluates the system-wide energy difference for each task on the src rq by
> potentially moving it to the dst rq. If this energy difference is less
> than zero, the task is actually moved from src to dst rq.

See this is another place where the parameterization makes perfect
sense; the weight imbalance or the energy difference is the 'same' thing
just a different unit.

All we really need is a task -> unit map to make this whole again.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 44/48] sched: Tipping point from energy-aware to conventional load balancing
  2015-02-04 18:31 ` [RFCv3 PATCH 44/48] sched: Tipping point from energy-aware to conventional load balancing Morten Rasmussen
@ 2015-03-24 15:26   ` Peter Zijlstra
  2015-03-24 18:47     ` Dietmar Eggemann
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 15:26 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:21PM +0000, Morten Rasmussen wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 
> Energy-aware load balancing is based on cpu usage, so the upper bound of its
> operational range is a fully utilized cpu. Above this tipping point it
> makes more sense to use weighted_cpuload to preserve smp_nice.
> This patch implements the tipping point detection in update_sg_lb_stats:
> if one cpu is over-utilized, the current energy-aware load balance
> operation will fall back into the conventional weighted load based one.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/fair.c | 4 ++++
>  1 file changed, 4 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6b79603..4849bad 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6723,6 +6723,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  		sgs->sum_weighted_load += weighted_cpuload(i);
>  		if (idle_cpu(i))
>  			sgs->idle_cpus++;
> +
> +		/* If cpu is over-utilized, bail out of ea */
> +		if (env->use_ea && cpu_overutilized(i, env->sd))
> +			env->use_ea = false;
>  	}

I don't immediately see why this is desired. Why would a single
overloaded CPU be reason to quit? It could be the cpus simply aren't
'balanced' right and the group as a whole is still under utilized.

In that case we want to continue the balance pass to reach this
equilibrium.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 45/48] sched: Skip cpu as lb src which has one task and capacity gte the dst cpu
  2015-02-04 18:31 ` [RFCv3 PATCH 45/48] sched: Skip cpu as lb src which has one task and capacity gte the dst cpu Morten Rasmussen
@ 2015-03-24 15:27   ` Peter Zijlstra
  2015-03-25 18:44     ` Dietmar Eggemann
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 15:27 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:22PM +0000, Morten Rasmussen wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 
> Skip cpu as a potential src (costliest) in case it has only one task
> running and its original capacity is greater than or equal to the
> original capacity of the dst cpu.

Again, that's what, but is lacking a why.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-03-24 13:00   ` Peter Zijlstra
@ 2015-03-24 15:42     ` Morten Rasmussen
  2015-03-24 15:53       ` Peter Zijlstra
  0 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-24 15:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, Dietmar Eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel

On Tue, Mar 24, 2015 at 01:00:58PM +0000, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:31:10PM +0000, Morten Rasmussen wrote:
> > @@ -5138,6 +5224,10 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
> >  		prev_cpu = cpu;
> >  
> >  	if (sd_flag & SD_BALANCE_WAKE) {
> > +		if (energy_aware()) {
> > +			new_cpu = energy_aware_wake_cpu(p);
> > +			goto unlock;
> > +		}
> >  		new_cpu = select_idle_sibling(p, prev_cpu);
> >  		goto unlock;
> >  	}
> 
> So that is fundamentally wrong I think. We only care about power aware
> scheduling when U < 1, after that we should do the normal thing. This
> setup does not allow for that.

Right, I agree that we should preferably do the normal thing for U ~= 1.
We can restructure the wake-up path to follow that pattern, but we need
to know U beforehand to choose the right path. U isn't just
get_cpu_usage(prev_cpu) but some broader view of the cpu
utilizations. For example, prev_cpu might be full, but everyone else is
idle so we still want to try to do an energy aware wake-up on some other
cpu. U could be the minimum utilization of all cpus in prev_cpu's
sd_llc, which is somewhat similar to what energy_aware_wake_cpu() does.

I guess energy_aware_wake_cpu() could be refactored to call
select_idle_sibling() if it finds U ~= 1?
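A hypothetical helper (the name and the exact check are made up here, not
taken from the patches) for the kind of broader U test described above
could look like:

static bool prev_llc_full(int prev_cpu)
{
	struct sched_domain *sd = rcu_dereference(per_cpu(sd_llc, prev_cpu));
	int cpu;

	if (!sd)
		return false;

	for_each_cpu(cpu, sched_domain_span(sd)) {
		/* any cpu with spare capacity means U < 1 somewhere */
		if (get_cpu_usage(cpu) < capacity_orig_of(cpu))
			return false;
	}

	return true;
}

select_task_rq_fair() could then take the conventional path when
prev_llc_full() returns true and the energy-aware path otherwise.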

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-03-24 15:42     ` Morten Rasmussen
@ 2015-03-24 15:53       ` Peter Zijlstra
  2015-03-24 17:47         ` Morten Rasmussen
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 15:53 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, Dietmar Eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel

On Tue, Mar 24, 2015 at 03:42:42PM +0000, Morten Rasmussen wrote:
> Right, I agree that we should preferably do the normal thing for U ~= 1.
> We can restructure the wake-up path to follow that pattern, but we need
> to know U beforehand to choose the right path. U isn't just
> get_cpu_usage(prev_cpu) but some broader view of the cpu
> utilizations. For example, prev_cpu might be full, but everyone else is
> idle so we still want to try to do an energy aware wake-up on some other
> cpu. U could be the minimum utilization of all cpus in prev_cpu's
> sd_llc, which is somewhat similar to what energy_aware_wake_cpu() does.

Yeah, or a setting in the root domain set by the regular periodic load
balancer; that already grew some mojo to determine this in a patch I
recently commented on.

> I guess energy_aware_wake_cpu() could be refactored to call
> select_idle_sibling() if it find U ~= 1?

Sure yeah, that's not the hard part I think.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 46/48] sched: Turn off fast idling of cpus on a partially loaded system
  2015-02-04 18:31 ` [RFCv3 PATCH 46/48] sched: Turn off fast idling of cpus on a partially loaded system Morten Rasmussen
@ 2015-03-24 16:01   ` Peter Zijlstra
  0 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 16:01 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:23PM +0000, Morten Rasmussen wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
> 
> We do not want to miss out on the ability to do energy-aware idle load
> balancing if the system is only partially loaded since the operational
> range of energy-aware scheduling corresponds to a partially loaded
> system. We might want to pull a single remaining task from a potential
> src cpu towards an idle destination cpu if the energy model tells us
> this is worth doing to save energy.

Should this not pair with a change to need_active_balance() ?

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 47/48] sched: Enable active migration for cpus of lower capacity
  2015-02-04 18:31 ` [RFCv3 PATCH 47/48] sched: Enable active migration for cpus of lower capacity Morten Rasmussen
@ 2015-03-24 16:02   ` Peter Zijlstra
  0 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 16:02 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:24PM +0000, Morten Rasmussen wrote:
> Add an extra criteria to need_active_balance() to kick off active load
> balance if the source cpu is overutilized and has lower capacity than
> the destination cpus.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> 
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  kernel/sched/fair.c | 7 +++++++
>  1 file changed, 7 insertions(+)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 92fd1d8..1c248f8 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -7379,6 +7379,13 @@ static int need_active_balance(struct lb_env *env)
>  			return 1;
>  	}
>  
> +	if ((capacity_of(env->src_cpu) < capacity_of(env->dst_cpu)) &&
> +				env->src_rq->cfs.h_nr_running == 1 &&
> +				cpu_overutilized(env->src_cpu, env->sd) &&
> +				!cpu_overutilized(env->dst_cpu, env->sd)) {
> +			return 1;
> +	}

Ah, see does this want to get squashed into the previuos patch? Together
they seem to make more sense.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 18/48] sched: Track blocked utilization contributions
  2015-03-24  9:43     ` Morten Rasmussen
@ 2015-03-24 16:07       ` Peter Zijlstra
  2015-03-24 17:44         ` Morten Rasmussen
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 16:07 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, Dietmar Eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel

On Tue, Mar 24, 2015 at 09:43:47AM +0000, Morten Rasmussen wrote:
> On Mon, Mar 23, 2015 at 02:08:01PM +0000, Peter Zijlstra wrote:
> > On Wed, Feb 04, 2015 at 06:30:55PM +0000, Morten Rasmussen wrote:
> > > Introduces the blocked utilization, the utilization counter-part to
> > > cfs_rq->utilization_load_avg. It is the sum of sched_entity utilization
> > > contributions of entities that were recently on the cfs_rq that are
> > > currently blocked. Combined with sum of contributions of entities
> > > currently on the cfs_rq or currently running
> > > (cfs_rq->utilization_load_avg) this can provide a more stable average
> > > view of the cpu usage.
> > 
> > So it would be nice if you add performance numbers for all these patches
> > that add accounting muck.. 
> 
> Total scheduler latency (as in hackbench?), individual function
> latencies, or something else?

Yeah, good question that. Something that is good at running this code a
lot. So dequeue_entity() -> dequeue_entity_load_avg() ->
update_entity_load_avg() -> __update_entity_runnable_avg() seems a
reliable way into here, and IIRC hackbench does a lot of that, so yes,
that might just work.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
  2015-03-24 10:44           ` Morten Rasmussen
@ 2015-03-24 16:10             ` Peter Zijlstra
  2015-03-24 17:39               ` Morten Rasmussen
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 16:10 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Dietmar Eggemann, Sai Gurrappadi, mingo, vincent.guittot,
	yuyang.du, preeti, mturquette, nico, rjw, Juri Lelli,
	linux-kernel, Peter Boonstoppel

On Tue, Mar 24, 2015 at 10:44:24AM +0000, Morten Rasmussen wrote:
> > > Maybe remind us why this needs to be tied to sched_groups ? Why can't we
> > > attach the energy information to the domains?

> In the current domain hierarchy you don't have domains with just one cpu
> in them. If you attach the per-cpu energy data to the MC level domain
> which spans the whole cluster, you break the current idea of attaching
> information to the cpumask (currently sched_group, but could be
> sched_domain as we discuss here) the information is associated with. You
> would have to either introduce a level of single cpu domains at the
> lowest level or move away from the idea of attaching data to the cpumask
> that is associated with it.
> 
> Using sched_groups we do already have single cpu groups that we can
> attach per-cpu data to, but we are missing a top level group spanning
> the entire system for system wide energy data. So from that point of
> view groups and domains are equally bad.

Oh urgh, good point that. Cursed if you do, cursed if you don't. Bugger.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 38/48] sched: Infrastructure to query if load balancing is energy-aware
  2015-03-24 13:41   ` Peter Zijlstra
@ 2015-03-24 16:17     ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-03-24 16:17 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: mingo, vincent.guittot, yuyang.du, preeti, mturquette, nico, rjw,
	Juri Lelli, linux-kernel

On 24/03/15 13:41, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:31:15PM +0000, Morten Rasmussen wrote:
>
>> +		.use_ea		= (energy_aware() && sd->groups->sge) ? true : false,
>
> The return value of a logical and should already be a boolean.

Indeed, thanks for spotting this!



^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 38/48] sched: Infrastructure to query if load balancing is energy-aware
  2015-03-24 13:56   ` Peter Zijlstra
@ 2015-03-24 16:22     ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-03-24 16:22 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: mingo, vincent.guittot, yuyang.du, preeti, mturquette, nico, rjw,
	Juri Lelli, linux-kernel

On 24/03/15 13:56, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:31:15PM +0000, Morten Rasmussen wrote:
>> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>
>> Energy-aware load balancing should only happen if the ENERGY_AWARE feature
>> is turned on and the sched domain on which the load balancing is performed
>> on contains energy data.
>> There is also a need during a load balance action to be able to query if we
>> should continue to load balance energy-aware or if we reached the tipping
>> point which forces us to fall back to the conventional load balancing
>> functionality.
>
>> @@ -7348,6 +7349,7 @@ static int load_balance(int this_cpu, struct rq *this_rq,
>>   		.cpus		= cpus,
>>   		.fbq_type	= all,
>>   		.tasks		= LIST_HEAD_INIT(env.tasks),
>> +		.use_ea		= (energy_aware() && sd->groups->sge) ? true : false,
>
> fwiw, no tipping point in that logic.

Wanted to explain why I added lb_env::use_ea. But mentioning the tipping 
point problem here seems to be a little far-fetched. Will get rid of the 
second sentence in the header.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-02-04 18:31 ` [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement Morten Rasmussen
  2015-03-13 22:47   ` Sai Gurrappadi
  2015-03-24 13:00   ` Peter Zijlstra
@ 2015-03-24 16:35   ` Peter Zijlstra
  2015-03-25 18:01     ` Juri Lelli
  2 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-24 16:35 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: mingo, vincent.guittot, dietmar.eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, juri.lelli, linux-kernel

On Wed, Feb 04, 2015 at 06:31:10PM +0000, Morten Rasmussen wrote:
> +static int energy_aware_wake_cpu(struct task_struct *p)
> +{
> +	struct sched_domain *sd;
> +	struct sched_group *sg, *sg_target;
> +	int target_max_cap = SCHED_CAPACITY_SCALE;
> +	int target_cpu = task_cpu(p);
> +	int i;
> +
> +	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
> +
> +	if (!sd)
> +		return -1;
> +
> +	sg = sd->groups;
> +	sg_target = sg;
> +	/* Find group with sufficient capacity */
> +	do {
> +		int sg_max_capacity = group_max_capacity(sg);
> +
> +		if (sg_max_capacity >= task_utilization(p) &&
> +				sg_max_capacity <= target_max_cap) {
> +			sg_target = sg;
> +			target_max_cap = sg_max_capacity;
> +		}
> +	} while (sg = sg->next, sg != sd->groups);
> +
> +	/* Find cpu with sufficient capacity */
> +	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
> +		int new_usage = get_cpu_usage(i) + task_utilization(p);
> +
> +		if (new_usage >	capacity_orig_of(i))
> +			continue;
> +
> +		if (new_usage <	capacity_curr_of(i)) {
> +			target_cpu = i;
> +			if (!cpu_rq(i)->nr_running)
> +				break;
> +		}
> +
> +		/* cpu has capacity at higher OPP, keep it as fallback */
> +		if (target_cpu == task_cpu(p))
> +			target_cpu = i;
> +	}
> +
> +	if (target_cpu != task_cpu(p)) {
> +		struct energy_env eenv = {
> +			.usage_delta	= task_utilization(p),
> +			.src_cpu	= task_cpu(p),
> +			.dst_cpu	= target_cpu,
> +		};
> +
> +		/* Not enough spare capacity on previous cpu */
> +		if (cpu_overutilized(task_cpu(p), sd))
> +			return target_cpu;
> +
> +		if (energy_diff(&eenv) >= 0)
> +			return task_cpu(p);
> +	}
> +
> +	return target_cpu;
> +}

So while you have some cpufreq -> sched coupling (the capacity_curr
thing) this would be the site where you could provide sched -> cpufreq
coupling, right?

So does it make sense to at least put in the right hooks now? I realize
we'll likely take cpufreq out back and feed it to the bears but
something managing P states will be there whatever we'll call the new
fangled thing and this would be the place to hook it still.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 36/48] sched: Count number of shallower idle-states in struct sched_group_energy
  2015-03-24 13:14   ` Peter Zijlstra
@ 2015-03-24 17:13     ` Morten Rasmussen
  0 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-24 17:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, Dietmar Eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel

On Tue, Mar 24, 2015 at 01:14:39PM +0000, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:31:13PM +0000, Morten Rasmussen wrote:
> > cpuidle associates all idle-states with each cpu while the energy model
> > associates them with the sched_group covering the cpus coordinating
> > entry to the idle-state. To get idle-state power consumption it is
> > therefore necessary to translate from cpuidle idle-state index to energy
> > model index. For this purpose it is helpful to know how many idle-states
> > that are listed in lower level sched_groups (in struct
> > sched_group_energy).
> 
> I think this could use some text to describe how that number is useful.
> 
> I suspect is has something to do with bigger domains having more idle
> modes (package C states etc..).

Close :)

You are not the first to be confused about the idle state representation
and numbering. Maybe I should just change it.

If we take typical ARM idle-states as an example, we have both per-cpu
and per-cluster idle-states. Unlike x86 (IIUC), cluster states are
controlled by cpuidle. All states are represented in the cpuidle state
table for each cpu regardless of whether it is a per-cpu or per-cluster
state. For the energy model we have organized them by attaching the
states to the cpumask representing the cpus that need to coordinate to
enter the state, as this is rather important for estimating energy
consumption.

Idle-state		cpuidle		Energy model table indices
			index		per-cpu sg	per-cluster sg
WFI			0		0		-
Core power-down		1		1		-
Cluster power-down	2		-		0

Cluster power-down is the first (and only in this example) per-cluster
idle-state and is therefore put in the idle-state table for the
sched_group spanning the whole cluster. Since it is first it has index
0. However, the same state has index 2 in cpuidle as it only has a table
per cpu. To do an easy translation from cpuidle index to energy model
idle-state table index it is therefore quite useful to know how many
states are in the tables of the energy model attached to groups at
lower levels. Basically, energy_model_idx = cpuidle_idx - state_below,
which is 2 - 2 = 0 for cluster power-down.

An alternative that could avoid this translation is to have a full table
at each level (3 entries for this example) and insert dummy values on
indices not applicable to the group the table is attached to. For
example insert '0' on index=2 for the per-cpu sg energy model data.

We can't avoid index translation entirely though. We need to know the
cluster power consumption when all cpus are in state 0 or 1 but the
cluster itself is still up (idle but active) to estimate energy
consumption. The energy model therefore has an additional 'active idle'
state for the cluster which sits before the first true idle-state
in the energy model idle-state table. In the example above, active idle
would be per-cluster sg energy model table index 0 and cluster
power-down index 1.
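A minimal sketch of the translation described above ('nr_states_below' is
an assumed name for the counter this patch adds to struct
sched_group_energy; the extra 'active idle' entry is ignored here):

static inline int group_idle_state_idx(struct sched_group_energy *sge,
				       int cpuidle_idx)
{
	/* e.g. cluster power-down: 2 - 2 = 0 in the example above */
	return cpuidle_idx - sge->nr_states_below;
}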

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
  2015-03-24 16:10             ` Peter Zijlstra
@ 2015-03-24 17:39               ` Morten Rasmussen
  2015-03-26 15:23                 ` Dietmar Eggemann
  0 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-24 17:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, Sai Gurrappadi, mingo, vincent.guittot,
	yuyang.du, preeti, mturquette, nico, rjw, Juri Lelli,
	linux-kernel, Peter Boonstoppel

On Tue, Mar 24, 2015 at 04:10:37PM +0000, Peter Zijlstra wrote:
> On Tue, Mar 24, 2015 at 10:44:24AM +0000, Morten Rasmussen wrote:
> > > > Maybe remind us why this needs to be tied to sched_groups ? Why can't we
> > > > attach the energy information to the domains?
> 
> > In the current domain hierarchy you don't have domains with just one cpu
> > in them. If you attach the per-cpu energy data to the MC level domain
> > which spans the whole cluster, you break the current idea of attaching
> > information to the cpumask (currently sched_group, but could be
> > sched_domain as we discuss here) the information is associated with. You
> > would have to either introduce a level of single cpu domains at the
> > lowest level or move away from the idea of attaching data to the cpumask
> > that is associated with it.
> > 
> > Using sched_groups we do already have single cpu groups that we can
> > attach per-cpu data to, but we are missing a top level group spanning
> > the entire system for system wide energy data. So from that point of
> > view groups and domains are equally bad.
> 
> Oh urgh, good point that. Cursed if you do, cursed if you don't. Bugger.

Yeah :( I don't really care which one we choose. Adding another top
level domain with one big group spanning all cpus, but with all SD flags
disabled seems less intrusive than adding a level at the bottom.

Better ideas are very welcome.
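To make the 'extra top level' option concrete, a sketch of what it could
look like in a topology table (the SYS level and its mask wrapper are
hypothetical, not part of the posted patches):

static const struct cpumask *cpu_sys_mask(int cpu)
{
	return cpu_possible_mask;	/* span the whole system */
}

static struct sched_domain_topology_level arm_topology[] = {
#ifdef CONFIG_SCHED_MC
	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
#endif
	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
	/* no SD behaviour flags set, only a container for the
	 * system-wide energy data */
	{ cpu_sys_mask, SD_INIT_NAME(SYS) },
	{ NULL, },
};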

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 18/48] sched: Track blocked utilization contributions
  2015-03-24 16:07       ` Peter Zijlstra
@ 2015-03-24 17:44         ` Morten Rasmussen
  0 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-24 17:44 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, Dietmar Eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel

On Tue, Mar 24, 2015 at 04:07:29PM +0000, Peter Zijlstra wrote:
> On Tue, Mar 24, 2015 at 09:43:47AM +0000, Morten Rasmussen wrote:
> > On Mon, Mar 23, 2015 at 02:08:01PM +0000, Peter Zijlstra wrote:
> > > On Wed, Feb 04, 2015 at 06:30:55PM +0000, Morten Rasmussen wrote:
> > > > Introduces the blocked utilization, the utilization counter-part to
> > > > cfs_rq->utilization_load_avg. It is the sum of sched_entity utilization
> > > > contributions of entities that were recently on the cfs_rq that are
> > > > currently blocked. Combined with sum of contributions of entities
> > > > currently on the cfs_rq or currently running
> > > > (cfs_rq->utilization_load_avg) this can provide a more stable average
> > > > view of the cpu usage.
> > > 
> > > So it would be nice if you add performance numbers for all these patches
> > > that add accounting muck.. 
> > 
> > Total scheduler latency (as in hackbench?), individual function
> > latencies, or something else?
> 
> Yeah, good question that. Something that is good at running this code a
> lot. So dequeue_entity() -> dequeue_entity_load_avg() ->
> update_entity_load_avg() -> __update_entity_runnable_avg() seems a
> reliable way into here, and IIRC hackbench does a lot of that, so yes,
> that might just work.

Hackbench does a lot of that. I used it recently to measure the impact
of the weak arch_scale_*() functions. I will dig out some numbers.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-03-24 15:53       ` Peter Zijlstra
@ 2015-03-24 17:47         ` Morten Rasmussen
  0 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-24 17:47 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, Dietmar Eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel

On Tue, Mar 24, 2015 at 03:53:52PM +0000, Peter Zijlstra wrote:
> On Tue, Mar 24, 2015 at 03:42:42PM +0000, Morten Rasmussen wrote:
> > Right, I agree that we should preferably do the normal thing for U ~= 1.
> > We can restructure the wake-up path to follow that pattern, but we need
> > to know U beforehand to choose the right path. U isn't just
> > get_cpu_usage(prev_cpu) but some broader view of the cpu
> > utilizations. For example, prev_cpu might be full, but everyone else is
> > idle so we still want to try to do an energy aware wake-up on some other
> > cpu. U could be the minimum utilization of all cpus in prev_cpu's
> > sd_llc, which is somewhat similar to what energy_aware_wake_cpu() does.
> 
> Yeah, or a setting in the root domain set by the regular periodic load
> balancer; that already grew some mojo to determine this in a patch I
> recently commented on.

Yes, I was thinking something like that too.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 42/48] sched: Introduce energy awareness into find_busiest_queue
  2015-03-24 15:21   ` Peter Zijlstra
@ 2015-03-24 18:04     ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-03-24 18:04 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: mingo, vincent.guittot, yuyang.du, preeti, mturquette, nico, rjw,
	Juri Lelli, linux-kernel

On 24/03/15 15:21, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:31:19PM +0000, Morten Rasmussen wrote:
>> +++ b/kernel/sched/fair.c
>> @@ -7216,6 +7216,37 @@ static struct rq *find_busiest_queue(struct lb_env *env,
>>   	unsigned long busiest_load = 0, busiest_capacity = 1;
>>   	int i;
>>
>> +	if (env->use_ea) {
>> +		struct rq *costliest = NULL;
>> +		unsigned long costliest_usage = 1024, costliest_energy = 1;
>> +
>> +		for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
>> +			unsigned long usage = get_cpu_usage(i);
>> +			struct rq *rq = cpu_rq(i);
>> +			struct sched_domain *sd = rcu_dereference(rq->sd);
>> +			struct energy_env eenv = {
>> +				.sg_top = sd->groups,
>> +				.usage_delta    = 0,
>> +				.src_cpu        = -1,
>> +				.dst_cpu        = -1,
>> +			};
>> +			unsigned long energy = sched_group_energy(&eenv);
>> +
>> +			/*
>> +			 * We're looking for the minimal cpu efficiency
>> +			 * min(u_i / e_i), crosswise multiplication leads to
>> +			 * u_i * e_j < u_j * e_i with j as previous minimum.
>> +			 */
>> +			if (usage * costliest_energy < costliest_usage * energy) {
>> +				costliest_usage = usage;
>> +				costliest_energy = energy;
>> +				costliest = rq;
>> +			}
>> +		}
>> +
>> +		return costliest;
>> +	}
>> +
>>   	for_each_cpu_and(i, sched_group_cpus(group), env->cpus) {
>>   		unsigned long capacity, wl;
>>   		enum fbq_type rt;
>
> So I've thought about parametrizing the whole load balance thing to
> avoid things like this.
>
> Irrespective of whether we balance on pure load or another metric we
> have the same structure, only different units plugged in.
>
> I've not really spent too much time on it to see what it would look
> like, but I think it would be a good avenue to investigate to avoid
> patches like this.

Yes; although I tried to keep the EAS specific code in lb small, w/o 
such an abstraction the code very quickly becomes pretty ugly.

So far I see the following parameters for such a 'unit':

conv. CFS vs. EAS:

(weighted) load vs. (cpu) usage
(cpu) capacity vs. (group) energy
(per task) imbalance vs. (per task) energy diff

For me this 'unit' is very close to the existing struct lb_env. Not sure 
yet what to do with s[dg]_lb_stats here? So far they obviously only 
contain data necessary for conv. CFS.

But I assume some of the if/else conditions in the lb code path 
(load_balance, fbg, update_s[dg]_lb_stats, fbq, detach_tasks, active lb) 
will stay 'unit' specific.
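A very rough sketch of what such a 'unit' could be, purely to illustrate
the parametrization idea (the structure and field names are invented here,
nothing like this exists in the posted patches):

struct lb_unit {
	/* per-cpu source metric: weighted load vs. cpu usage */
	unsigned long (*cpu_metric)(int cpu);
	/* per-group scale: capacity vs. group energy */
	unsigned long (*group_scale)(struct sched_group *sg);
	/* per-task cost: contribution to imbalance vs. energy diff */
	long (*task_cost)(struct task_struct *p, struct lb_env *env);
};

load_balance() and friends would then operate on one lb_unit instance for
conventional CFS and another for EAS instead of open-coding both paths.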


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 44/48] sched: Tipping point from energy-aware to conventional load balancing
  2015-03-24 15:26   ` Peter Zijlstra
@ 2015-03-24 18:47     ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-03-24 18:47 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: mingo, vincent.guittot, yuyang.du, preeti, mturquette, nico, rjw,
	Juri Lelli, linux-kernel

On 24/03/15 15:26, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:31:21PM +0000, Morten Rasmussen wrote:
>> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>
>> Energy-aware load balancing is based on cpu usage, so the upper bound of its
>> operational range is a fully utilized cpu. Above this tipping point it
>> makes more sense to use weighted_cpuload to preserve smp_nice.
>> This patch implements the tipping point detection in update_sg_lb_stats:
>> if one cpu is over-utilized, the current energy-aware load balance
>> operation will fall back into the conventional weighted load based one.
>>
>> cc: Ingo Molnar <mingo@redhat.com>
>> cc: Peter Zijlstra <peterz@infradead.org>
>>
>> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> ---
>>   kernel/sched/fair.c | 4 ++++
>>   1 file changed, 4 insertions(+)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 6b79603..4849bad 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6723,6 +6723,10 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>>   		sgs->sum_weighted_load += weighted_cpuload(i);
>>   		if (idle_cpu(i))
>>   			sgs->idle_cpus++;
>> +
>> +		/* If cpu is over-utilized, bail out of ea */
>> +		if (env->use_ea && cpu_overutilized(i, env->sd))
>> +			env->use_ea = false;
>>   	}
>
> I don't immediately see why this is desired. Why would a single
> overloaded CPU be reason to quit? It could be the cpus simply aren't
> 'balanced' right and the group as a whole is still under utilized.

We want to play it safe here.

E.g. in a system with more than two clusters, this over-utilized cpu could 
be running more than one high priority task on a cluster with energy 
efficient cpus, and this cluster could still not be the lb src on DIE level 
because a not over-utilized cluster with less energy-efficient cpus (burning 
more energy) could be chosen instead. We could construct cases where the 
other cpus in this 
energy efficient cluster can't help the over-utilized cpu during lb on 
MC level.

I can see that using per-cpu data in code which deals w/ sg's is against 
the sd scalability design where we should rely on per-sg and not per-cpu 
data though.

By bailing out in such a scenario we at least guarantee smpnice provided 
by conv. CFS.

We could also favor an sg with an over-utilized cpu to become the src 
but which one do we pick if there're multiple potential src sg's w/ an 
over-utilized cpu?

>
> In that case we want to continue the balance pass to reach this
> equilibrium.
>


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-03-24 16:35   ` Peter Zijlstra
@ 2015-03-25 18:01     ` Juri Lelli
  2015-03-25 18:14       ` Peter Zijlstra
  0 siblings, 1 reply; 124+ messages in thread
From: Juri Lelli @ 2015-03-25 18:01 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: mingo, vincent.guittot, Dietmar Eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, linux-kernel

Hi Peter,

On 24/03/15 16:35, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:31:10PM +0000, Morten Rasmussen wrote:
>> +static int energy_aware_wake_cpu(struct task_struct *p)
>> +{
>> +	struct sched_domain *sd;
>> +	struct sched_group *sg, *sg_target;
>> +	int target_max_cap = SCHED_CAPACITY_SCALE;
>> +	int target_cpu = task_cpu(p);
>> +	int i;
>> +
>> +	sd = rcu_dereference(per_cpu(sd_ea, task_cpu(p)));
>> +
>> +	if (!sd)
>> +		return -1;
>> +
>> +	sg = sd->groups;
>> +	sg_target = sg;
>> +	/* Find group with sufficient capacity */
>> +	do {
>> +		int sg_max_capacity = group_max_capacity(sg);
>> +
>> +		if (sg_max_capacity >= task_utilization(p) &&
>> +				sg_max_capacity <= target_max_cap) {
>> +			sg_target = sg;
>> +			target_max_cap = sg_max_capacity;
>> +		}
>> +	} while (sg = sg->next, sg != sd->groups);
>> +
>> +	/* Find cpu with sufficient capacity */
>> +	for_each_cpu_and(i, tsk_cpus_allowed(p), sched_group_cpus(sg_target)) {
>> +		int new_usage = get_cpu_usage(i) + task_utilization(p);
>> +
>> +		if (new_usage >	capacity_orig_of(i))
>> +			continue;
>> +
>> +		if (new_usage <	capacity_curr_of(i)) {
>> +			target_cpu = i;
>> +			if (!cpu_rq(i)->nr_running)
>> +				break;
>> +		}
>> +
>> +		/* cpu has capacity at higher OPP, keep it as fallback */
>> +		if (target_cpu == task_cpu(p))
>> +			target_cpu = i;
>> +	}
>> +
>> +	if (target_cpu != task_cpu(p)) {
>> +		struct energy_env eenv = {
>> +			.usage_delta	= task_utilization(p),
>> +			.src_cpu	= task_cpu(p),
>> +			.dst_cpu	= target_cpu,
>> +		};
>> +
>> +		/* Not enough spare capacity on previous cpu */
>> +		if (cpu_overutilized(task_cpu(p), sd))
>> +			return target_cpu;
>> +
>> +		if (energy_diff(&eenv) >= 0)
>> +			return task_cpu(p);
>> +	}
>> +
>> +	return target_cpu;
>> +}

Mike kept working on this since last LPC discussion, and I could
spend some cycles on this thing too lately, reviewing/discussing
wip with him. So, I guess I'll jump into this :).

> 
> So while you have some cpufreq -> sched coupling (the capacity_curr
> thing) this would be the site where you could provide sched -> cpufreq
> coupling, right?
> 

Yes and no, IMHO. It makes perfect sense to trigger cpufreq on the
target_cpu's freq domain, as we know that we are going to add p's
utilization there. Anyway, I was thinking that we could just
rely on triggering points in {en,de}queue_task_fair and task_tick_fair.
We end up calling one of them every time we wake up a task, perform
a load balancing decision or just while running the task itself
(we have to react to tasks' phase changes). This way we should be
able to reduce the number of triggering points and be more general
at the same time.  
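A purely hypothetical sketch of such a triggering point (the hook and the
request_cpu_capacity() interface are made up; no such API exists in these
patches or in mainline cpufreq):

static void cpufreq_sched_update(int cpu)
{
	unsigned long usage = get_cpu_usage(cpu);

	/* ask the P-state backend for the current usage plus some
	 * headroom; called from enqueue/dequeue/tick as described */
	request_cpu_capacity(cpu, usage + (usage >> 2));
}

request_cpu_capacity() stands in for whatever interface the future
sched -> cpufreq coupling ends up providing.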

> So does it make sense to at least put in the right hooks now? I realize
> we'll likely take cpufreq out back and feed it to the bears but
> something managing P states will be there whatever we'll call the new
> fangled thing and this would be the place to hook it still.
> 

We should be able to clean up and post something along this line
fairly soon.

Best,

- Juri


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-03-25 18:01     ` Juri Lelli
@ 2015-03-25 18:14       ` Peter Zijlstra
  2015-03-26 10:21         ` Juri Lelli
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-25 18:14 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Morten Rasmussen, mingo, vincent.guittot, Dietmar Eggemann,
	yuyang.du, preeti, mturquette, nico, rjw, linux-kernel

On Wed, Mar 25, 2015 at 06:01:22PM +0000, Juri Lelli wrote:

> Yes and no, IMHO. It makes perfect sense to trigger cpufreq on the
> target_cpu's freq domain, as we know that we are going to add p's
> utilization there.

Fair point; I mainly wanted to start this discussion so that seems to
have been a success :-)

> Anyway, I was thinking that we could just
> rely on triggering points in {en,de}queue_task_fair and task_tick_fair.
> We end up calling one of them every time we wake-up a task, perform
> a load balancing decision or just while running the task itself
> (we have to react to tasks phase changes). This way we should be
> able to reduce the number of triggering points and be more general
> at the same time.

The one worry I have with that is that it might need to re-compute which
P state to request, where in the above (now trimmed quoted) code we
already figured out which P state we needed to be in; any hook in
enqueue would have forgotten that.

> > So does it make sense to at least put in the right hooks now? I realize
> > we'll likely take cpufreq out back and feed it to the bears but
> > something managing P states will be there whatever we'll call the new
> > fangled thing and this would be the place to hook it still.
> > 
> 
> We should be able to clean up and post something along this line
> fairly soon.

Grand!

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 34/48] sched: Bias new task wakeups towards higher capacity cpus
  2015-03-24 13:33   ` Peter Zijlstra
@ 2015-03-25 18:18     ` Morten Rasmussen
  0 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-25 18:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: mingo, vincent.guittot, Dietmar Eggemann, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel

On Tue, Mar 24, 2015 at 01:33:22PM +0000, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:31:11PM +0000, Morten Rasmussen wrote:
> > Make wake-ups of new tasks (find_idlest_group) aware of any differences
> > in cpu compute capacity so new tasks don't get handed off to a cpu with
> > lower capacity.
> 
> That says what; but fails to explain why we want to do this.
> 
> Remember Changelogs should answer what+why and if complicated also
> reason about how.

Right, I will make sure to fix that.

cpu compute capacity may be different for two reasons: 1. Compute
capacity of some cpus may be reduced due to RT and/or IRQ work
(capacity_of(cpu)), and 2.  Compute capacity of some cpus is lower due
to different cpu microarchitectures (big.LITTLE, impacts both
capacity_of(cpu) and capacity_orig_of(cpu)).

find_idlest_group() is used for SD_BALANCE_{FORK,EXEC} but not
SD_BALANCE_WAKE. That means only for new tasks. Since we can't predict
how they are going to behave we assume that they are cpu intensive. This
is how the load tracking is initialized. If we stick with that assumption,
new tasks should go on the cpus with the most capacity to ensure we don't
slow them down while they are building up their load tracking history.
Putting new tasks on high capacity cpus also improves system responsiveness,
as it may take a while before tasks put on a cpu with too little capacity
get migrated to one with higher capacity.

One could also imagine that one day thermal management constraints in
terms of frequency capping would be visible through capacity in the
scheduler, which would be a third reason for having different compute
capacities.

> 
> > @@ -4971,6 +4972,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >  
> >  		/* Tally up the load of all CPUs in the group */
> >  		avg_load = 0;
> > +		cpu_capacity = 0;
> >  
> >  		for_each_cpu(i, sched_group_cpus(group)) {
> >  			/* Bias balancing toward cpus of our domain */
> > @@ -4980,6 +4982,9 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >  				load = target_load(i, load_idx);
> >  
> >  			avg_load += load;
> > +
> > +			if (cpu_capacity < capacity_of(i))
> > +				cpu_capacity = capacity_of(i);
> >  		}
> >  
> >  		/* Adjust by relative CPU capacity of the group */
> 
> So basically you're constructing the max_cpu_capacity for that group
> here. Might it be clearer to explicitly name/write it as such?
> 
> 		max_cpu_capacity = max(max_cpu_capacity, capacity_of(i));

Agreed.

> 
> > @@ -4987,14 +4992,20 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
> >  
> >  		if (local_group) {
> >  			this_load = avg_load;
> > +			this_cpu_cap = cpu_capacity;
> >  		} else if (avg_load < min_load) {
> >  			min_load = avg_load;
> >  			idlest = group;
> > +			idlest_cpu_cap = cpu_capacity;
> >  		}
> >  	} while (group = group->next, group != sd->groups);
> >  
> > -	if (!idlest || 100*this_load < imbalance*min_load)
> > +	if (!idlest)
> > +		return NULL;
> > +
> > +	if (100*this_load < imbalance*min_load && this_cpu_cap >= idlest_cpu_cap)
> >  		return NULL;
> 
> And here you then fail to provide an idlest group if the selected group
> has less (max) capacity than the current group.
> 
> /me goes double check wth capacity_of() returns again, yes, this seems
> dubious. Suppose we have our two symmetric clusters and for some reason
> we've routed all our interrupts to one cluster and every cpu regularly
> receives interrupts. This means that the capacity_of() of this IRQ
> cluster is always less than the other.
> 
> The above modification will result in tasks always being placed on the
> other cluster, even though it might be very busy indeed.
> 
> If you want to do something like this; one should really add in a
> current usage metric or so.

Right, that if-condition is broken :(

We want to put new tasks on the cpu which has the most spare capacity,
and if nobody has spare capacity then decide based on the existing
(100*this_load < imbalance*min_load) condition. Maybe gather the max
available cpu capacity instead of max_cpu_capacity. I will have to
ponder that a bit and come up with something.
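
Roughly something along these lines (stand-alone sketch only, untested;
the struct, fields and helper are made up for illustration, this is not
a patch):

#include <stdbool.h>

struct group_stats {
	unsigned long avg_load;		/* group load as in find_idlest_group() */
	unsigned long spare_cap;	/* max over group cpus of capacity_of(i) - usage */
};

/* return true if the idlest group should be preferred over the local group */
bool prefer_idlest_group(struct group_stats *local, struct group_stats *idlest,
			 unsigned int imbalance)
{
	/* somebody still has head-room: go for the most spare capacity */
	if (idlest->spare_cap > local->spare_cap)
		return true;

	/* nobody has spare capacity left: fall back to the load comparison */
	return !(100 * local->avg_load < imbalance * idlest->avg_load);
}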

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 45/48] sched: Skip cpu as lb src which has one task and capacity gte the dst cpu
  2015-03-24 15:27   ` Peter Zijlstra
@ 2015-03-25 18:44     ` Dietmar Eggemann
       [not found]       ` <OF9320540C.255228F9-ON48257E37.002A02D1-48257E37.002AB5EE@zte.com.cn>
  0 siblings, 1 reply; 124+ messages in thread
From: Dietmar Eggemann @ 2015-03-25 18:44 UTC (permalink / raw)
  To: Peter Zijlstra, Morten Rasmussen
  Cc: mingo, vincent.guittot, yuyang.du, preeti, mturquette, nico, rjw,
	Juri Lelli, linux-kernel

On 24/03/15 15:27, Peter Zijlstra wrote:
> On Wed, Feb 04, 2015 at 06:31:22PM +0000, Morten Rasmussen wrote:
>> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>
>> Skip cpu as a potential src (costliest) in case it has only one task
>> running and its original capacity is greater than or equal to the
>> original capacity of the dst cpu.
>
> Again, that's what, but is lacking a why.
>

You're right, the 'why' is completely missing.

This is one of our heterogeneous (big.LITTLE) cpu related patches. We 
don't want to end up migrating this single task from a big to a little 
cpu, hence the use of capacity_orig_of(cpu). Our cpu topology makes sure 
that this rule is only active on DIE sd level.

We could replace the use of capacity_orig_of(cpu) w/ capacity_of(cpu) to
take big.LITTLE and RT and/or IRQ work into consideration, which would
then affect the MC sd level too.
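
In pseudo-C, the intended rule looks like this (stand-alone sketch,
struct and fields are made up stand-ins for the kernel's per-cpu data):

#include <stdbool.h>

struct cpu_stat {
	unsigned int nr_running;
	unsigned long capacity_orig;	/* uarch * max-frequency capacity */
	unsigned long capacity;		/* capacity_orig minus RT/IRQ pressure */
};

/* Should 'src' be skipped as a load-balance source towards 'dst'? */
bool skip_lb_src(struct cpu_stat *src, struct cpu_stat *dst)
{
	/*
	 * Don't pull a lone task off a cpu that is at least as capable as
	 * the destination (big -> little at DIE level). Comparing 'capacity'
	 * instead of 'capacity_orig' would also factor in RT/IRQ pressure
	 * and hence apply at MC level too.
	 */
	return src->nr_running == 1 && src->capacity_orig >= dst->capacity_orig;
}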

Will add the 'why' in the next version.


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 43/48] sched: Introduce energy awareness into detach_tasks
  2015-02-04 18:31 ` [RFCv3 PATCH 43/48] sched: Introduce energy awareness into detach_tasks Morten Rasmussen
  2015-03-24 15:25   ` Peter Zijlstra
@ 2015-03-25 23:50   ` Sai Gurrappadi
  2015-03-27 15:03     ` Dietmar Eggemann
  1 sibling, 1 reply; 124+ messages in thread
From: Sai Gurrappadi @ 2015-03-25 23:50 UTC (permalink / raw)
  To: Morten Rasmussen, peterz, mingo
  Cc: vincent.guittot, Dietmar Eggemann, yuyang.du, preeti, mturquette,
	nico, rjw, Juri Lelli, linux-kernel, Peter Boonstoppel

On 02/04/2015 10:31 AM, Morten Rasmussen wrote:
> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>  	while (!list_empty(tasks)) {
> @@ -6121,6 +6121,20 @@ static int detach_tasks(struct lb_env *env)
>  		if (!can_migrate_task(p, env))
>  			goto next;
>  
> +		if (env->use_ea) {
> +			struct energy_env eenv = {
> +				.src_cpu = env->src_cpu,
> +				.dst_cpu = env->dst_cpu,
> +				.usage_delta = task_utilization(p),
> +			};
> +			int e_diff = energy_diff(&eenv);
> +
> +			if (e_diff >= 0)
> +				goto next;
> +
> +			goto detach;
> +		}
> +
>  		load = task_h_load(p);
>  
>  		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
> @@ -6129,6 +6143,7 @@ static int detach_tasks(struct lb_env *env)
>  		if ((load / 2) > env->imbalance)
>  			goto next;
>  
> +detach:
>  		detach_task(p, env);
>  		list_add(&p->se.group_node, &env->tasks);

Hi Morten, Dietmar,

Wouldn't the above energy_diff() use the 'old' value of dst_cpu's util?
Tasks are detached/dequeued in this loop so they have their util
contrib. removed from src_cpu but their contrib. hasn't been added to
dst_cpu yet (happens in attach_tasks).

Thanks,
-Sai

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-03-25 18:14       ` Peter Zijlstra
@ 2015-03-26 10:21         ` Juri Lelli
  2015-03-26 10:41           ` Peter Zijlstra
  0 siblings, 1 reply; 124+ messages in thread
From: Juri Lelli @ 2015-03-26 10:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, mingo, vincent.guittot, Dietmar Eggemann,
	yuyang.du, preeti, mturquette, nico, rjw, linux-kernel

On 25/03/15 18:14, Peter Zijlstra wrote:
> On Wed, Mar 25, 2015 at 06:01:22PM +0000, Juri Lelli wrote:
> 
>> Yes and no, IMHO. It makes perfect sense to trigger cpufreq on the
>> target_cpu's freq domain, as we know that we are going to add p's
>> utilization there.
> 
> Fair point; I mainly wanted to start this discussion so that seems to
> have been a success :-)
> 
>> Anyway, I was thinking that we could just
>> rely on triggering points in {en,de}queue_task_fair and task_tick_fair.
>> We end up calling one of them every time we wake-up a task, perform
>> a load balancing decision or just while running the task itself
>> (we have to react to tasks phase changes). This way we should be
>> able to reduce the number of triggering points and be more general
>> at the same time.
> 
> The one worry I have with that is that it might need to re-compute which
> P state to request, where in the above (now trimmed quoted) code we
> already figured out which P state we needed to be in, any hook in
> enqueue would have forgotten that.
>

Right. And we currently have some of this re-compute needs. The reason
why I thought we could still give a try to this approach comes from a
few points:

 - we can't be fully synchronous yet (cpufreq drivers might sleep) and
   if we rely on some asynchronous entity to actually do the freq change
   we might already defeat the purpose of passing any sort of guidance
   to it (as things can have changed a lot by the time it has to take a
   decision); of course, once we get there things will change :)

 - how do we cope with codepaths that don't rely on usage for taking
   decisions? I guess we'll have to modify those to be able to drive
   cpufreq, or we could just trade some re-compute burden for this need

 - what about other sched classes? I know that this is very premature,
   but I can't help but think that we'll need to do some sort of
   aggregation of requests, and if we put triggers in very specialized
   points we might lose some of the sched classes separation

Anyway, I'd say we try to look at what we have and then move forward
from there :).

Thanks!

- Juri

>>> So does it make sense to at least put in the right hooks now? I realize
>>> we'll likely take cpufreq out back and feed it to the bears but
>>> something managing P states will be there whatever we'll call the new
>>> fangled thing and this would be the place to hook it still.
>>>
>>
>> We should be able to clean up and post something along this line
>> fairly soon.
> 
> Grand!
> 


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-03-26 10:21         ` Juri Lelli
@ 2015-03-26 10:41           ` Peter Zijlstra
  2015-04-27 16:01             ` Michael Turquette
  0 siblings, 1 reply; 124+ messages in thread
From: Peter Zijlstra @ 2015-03-26 10:41 UTC (permalink / raw)
  To: Juri Lelli
  Cc: Morten Rasmussen, mingo, vincent.guittot, Dietmar Eggemann,
	yuyang.du, preeti, mturquette, nico, rjw, linux-kernel

On Thu, Mar 26, 2015 at 10:21:24AM +0000, Juri Lelli wrote:
>  - what about other sched classes? I know that this is very premature,
>    but I can help but thinking that we'll need to do some sort of
>    aggregation of requests, and if we put triggers in very specialized
>    points we might lose some of the sched classes separation

So for deadline we can do P state selection (as you're well aware) based
on the requested utilization. Not sure what to do for fifo/rr though,
they lack much useful information (as always).

Now if we also look ahead to things like the ACPI CPPC stuff we'll see
that CFS and DL place different requirements on the hints. Where CFS
would like to hint a max perf (the hardware going slower due to the code
consisting of mostly stalls is always fine from a best effort energy
pov), the DL stuff would like to hint a min perf, seeing how it 'needs'
to provide a QoS.

So we either need to carry this information along in a 'generic' way
between the various classes or put the hinting in every class.
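
As a toy example of carrying it generically (nothing below is a proposed
interface; the names and units are made up, it just sketches the
aggregation idea):

/* per-class hints, in capacity units (0..1024) */
struct freq_hints {
	unsigned long cfs_max;	/* best-effort ceiling from CFS utilization */
	unsigned long dl_min;	/* QoS floor derived from DL bandwidth */
};

unsigned long pick_capacity(struct freq_hints *h,
			    unsigned long cap_min, unsigned long cap_max)
{
	unsigned long cap = h->cfs_max;

	if (cap < h->dl_min)	/* the DL floor always wins over best effort */
		cap = h->dl_min;

	/* clamp to what the hardware can actually provide */
	if (cap < cap_min)
		cap = cap_min;
	if (cap > cap_max)
		cap = cap_max;

	return cap;
}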

But yes, food for thought for sure.

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
  2015-03-24 17:39               ` Morten Rasmussen
@ 2015-03-26 15:23                 ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-03-26 15:23 UTC (permalink / raw)
  To: Morten Rasmussen, Peter Zijlstra
  Cc: Sai Gurrappadi, mingo, vincent.guittot, yuyang.du, preeti,
	mturquette, nico, rjw, Juri Lelli, linux-kernel,
	Peter Boonstoppel

On 24/03/15 17:39, Morten Rasmussen wrote:
> On Tue, Mar 24, 2015 at 04:10:37PM +0000, Peter Zijlstra wrote:
>> On Tue, Mar 24, 2015 at 10:44:24AM +0000, Morten Rasmussen wrote:
>>>>> Maybe remind us why this needs to be tied to sched_groups ? Why can't we
>>>>> attach the energy information to the domains?
>>
>>> In the current domain hierarchy you don't have domains with just one cpu
>>> in them. If you attach the per-cpu energy data to the MC level domain
>>> which spans the whole cluster, you break the current idea of attaching
>>> information to the cpumask (currently sched_group, but could be
>>> sched_domain as we discuss here) the information is associated with. You
>>> would have to either introduce a level of single cpu domains at the
>>> lowest level or move away from the idea of attaching data to the cpumask
>>> that is associated with it.
>>>
>>> Using sched_groups we do already have single cpu groups that we can
>>> attach per-cpu data to, but we are missing a top level group spanning
>>> the entire system for system wide energy data. So from that point of
>>> view groups and domains are equally bad.
>>
>> Oh urgh, good point that. Cursed if you do, cursed if you don't. Bugger.
> 
> Yeah :( I don't really care which one we choose. Adding another top
> level domain with one big group spanning all cpus, but with all SD flags
> disabled seems less intrusive than adding a level at the bottom.
> 
> Better ideas are very welcome.
> 

I had a stab at integrating such a top level (SYS) domain w/ all known SD
flags disabled. This SYS sd exposes itself w/ all counters set to 0 in
/proc/schedstat.

There are still some kludges in the patch below:

- The need for a new topology SD flag to tell sd_init() that we want to
  reset the default sd configuration. 
- Don't break in build_sched_domains() at the first sd spanning cpu_map
- Don't decay newidle max times in rebalance_domains() by bailing early 
  on SYS sd.

It survived booting on single-cluster (MC-SYS) and dual-cluster
(MC-DIE-SYS) ARM systems.
Would something like this be acceptable?

diff --git a/include/linux/sched.h b/include/linux/sched.h
index f984b4e58865..8fbc9976f5d1 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -904,6 +904,7 @@ enum cpu_idle_type {
 #define SD_BALANCE_FORK                0x0008  /* Balance on fork, clone */
 #define SD_BALANCE_WAKE                0x0010  /* Balance on wakeup */
 #define SD_WAKE_AFFINE         0x0020  /* Wake task to waking CPU */
+#define SD_SHARE_ENERGY                0x0040  /* System-wide energy data */
 #define SD_SHARE_CPUCAPACITY   0x0080  /* Domain members share cpu power */
 #define SD_SHARE_POWERDOMAIN   0x0100  /* Domain members share power domain */
 #define SD_SHARE_PKG_RESOURCES 0x0200  /* Domain members share cpu pkg resources */
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 4f52c2e7484e..d058dc1e639f 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5529,7 +5529,7 @@ static int sd_degenerate(struct sched_domain *sd)
        }
 
        /* Following flags don't use groups */
-       if (sd->flags & (SD_WAKE_AFFINE))
+       if (sd->flags & (SD_WAKE_AFFINE | SD_SHARE_ENERGY))
                return 0;
 
        return 1;
@@ -6215,8 +6215,9 @@ static int sched_domains_curr_level;
  * SD_SHARE_POWERDOMAIN   - describes shared power domain
  * SD_SHARE_CAP_STATES    - describes shared capacity states
  *
- * Odd one out:
+ * Odd two out:
  * SD_ASYM_PACKING        - describes SMT quirks
+ * SD_SHARE_ENERGY        - describes EAS quirks
  */
 #define TOPOLOGY_SD_FLAGS              \
        (SD_SHARE_CPUCAPACITY |         \
@@ -6224,7 +6225,8 @@ static int sched_domains_curr_level;
         SD_NUMA |                      \
         SD_ASYM_PACKING |              \
         SD_SHARE_POWERDOMAIN |         \
-        SD_SHARE_CAP_STATES)
+        SD_SHARE_CAP_STATES |          \
+        SD_SHARE_ENERGY)
 
 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl, int cpu)
@@ -6298,6 +6300,14 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
                sd->cache_nice_tries = 1;
                sd->busy_idx = 2;
 
+       } else if (sd->flags & SD_SHARE_ENERGY) {
+               /* Reset the default configuration completely */
+               memset(sd, 0, sizeof(*sd));
+
+               sd->flags = 1*SD_SHARE_ENERGY;
+#ifdef CONFIG_SCHED_DEBUG
+               sd->name = tl->name;
+#endif
 #ifdef CONFIG_NUMA
        } else if (sd->flags & SD_NUMA) {
                sd->cache_nice_tries = 2;
@@ -6826,8 +6836,6 @@ static int build_sched_domains(const struct cpumask *cpu_map,
                                *per_cpu_ptr(d.sd, i) = sd;
                        if (tl->flags & SDTL_OVERLAP || sched_feat(FORCE_SD_OVERLAP))
                                sd->flags |= SD_OVERLAP;
-                       if (cpumask_equal(cpu_map, sched_domain_span(sd)))
-                               break;
                }
        }
 
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index cfe65aec3237..8d4cc72f4778 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8073,6 +8073,10 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 
        rcu_read_lock();
        for_each_domain(cpu, sd) {
+
+               if (sd->flags & SD_SHARE_ENERGY)
+                       continue;
+
                /*
                 * Decay the newidle max times here because this is a regular
                 * visit to all the domains. Decay ~1% per second.


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 43/48] sched: Introduce energy awareness into detach_tasks
  2015-03-25 23:50   ` Sai Gurrappadi
@ 2015-03-27 15:03     ` Dietmar Eggemann
       [not found]       ` <OFDCE15EEF.2F536D7F-ON48257E37.002565ED-48257E37.0027A8B9@zte.com.cn>
  0 siblings, 1 reply; 124+ messages in thread
From: Dietmar Eggemann @ 2015-03-27 15:03 UTC (permalink / raw)
  To: Sai Gurrappadi, Morten Rasmussen, peterz, mingo
  Cc: vincent.guittot, yuyang.du, preeti, mturquette, nico, rjw,
	Juri Lelli, linux-kernel, Peter Boonstoppel

On 25/03/15 23:50, Sai Gurrappadi wrote:
> On 02/04/2015 10:31 AM, Morten Rasmussen wrote:
>> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>   	while (!list_empty(tasks)) {
>> @@ -6121,6 +6121,20 @@ static int detach_tasks(struct lb_env *env)
>>   		if (!can_migrate_task(p, env))
>>   			goto next;
>>   
>> +		if (env->use_ea) {
>> +			struct energy_env eenv = {
>> +				.src_cpu = env->src_cpu,
>> +				.dst_cpu = env->dst_cpu,
>> +				.usage_delta = task_utilization(p),
>> +			};
>> +			int e_diff = energy_diff(&eenv);
>> +
>> +			if (e_diff >= 0)
>> +				goto next;
>> +
>> +			goto detach;
>> +		}
>> +
>>   		load = task_h_load(p);
>>   
>>   		if (sched_feat(LB_MIN) && load < 16 && !env->sd->nr_balance_failed)
>> @@ -6129,6 +6143,7 @@ static int detach_tasks(struct lb_env *env)
>>   		if ((load / 2) > env->imbalance)
>>   			goto next;
>>   
>> +detach:
>>   		detach_task(p, env);
>>   		list_add(&p->se.group_node, &env->tasks);
> 
> Hi Morten, Dietmar,
> 
> Wouldn't the above energy_diff() use the 'old' value of dst_cpu's util?
> Tasks are detached/dequeued in this loop so they have their util
> contrib. removed from src_cpu but their contrib. hasn't been added to
> dst_cpu yet (happens in attach_tasks).

You're absolutely right, Sai. Thanks for pointing this out! I guess I'll
have to accumulate the usage of the tasks I've detached and add it to the
eenv::usage_delta of the energy_diff() call for the next task.

Something like this (only slightly tested):

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8d4cc72f4778..d0d0e965fd0c 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6097,6 +6097,7 @@ static int detach_tasks(struct lb_env *env)
        struct task_struct *p;
        unsigned long load = 0;
        int detached = 0;
+       int usage_delta = 0;
 
        lockdep_assert_held(&env->src_rq->lock);
 
@@ -6122,16 +6123,19 @@ static int detach_tasks(struct lb_env *env)
                        goto next;
 
                if (env->use_ea) {
+                       int util = task_utilization(p);
                        struct energy_env eenv = {
                                .src_cpu = env->src_cpu,
                                .dst_cpu = env->dst_cpu,
-                               .usage_delta = task_utilization(p),
+                               .usage_delta = usage_delta + util,
                        };
                        int e_diff = energy_diff(&eenv);
 
                        if (e_diff >= 0)
                                goto next;
 
+                       usage_delta += util;
+
                        goto detach;
                }

[...]


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group
  2015-03-20 18:40   ` Sai Gurrappadi
@ 2015-03-27 15:58     ` Morten Rasmussen
  0 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-27 15:58 UTC (permalink / raw)
  To: Sai Gurrappadi
  Cc: peterz, mingo, vincent.guittot, Dietmar Eggemann, yuyang.du,
	preeti, mturquette, nico, rjw, Juri Lelli, linux-kernel,
	Peter Boonstoppel

On Fri, Mar 20, 2015 at 06:40:39PM +0000, Sai Gurrappadi wrote:
> On 02/04/2015 10:31 AM, Morten Rasmussen wrote:
> > +/*
> > + * sched_group_energy(): Returns absolute energy consumption of cpus belonging
> > + * to the sched_group including shared resources shared only by members of the
> > + * group. Iterates over all cpus in the hierarchy below the sched_group starting
> > + * from the bottom working it's way up before going to the next cpu until all
> > + * cpus are covered at all levels. The current implementation is likely to
> > + * gather the same usage statistics multiple times. This can probably be done in
> > + * a faster but more complex way.
> > + */
> > +static unsigned int sched_group_energy(struct sched_group *sg_top)
> > +{
> > +	struct sched_domain *sd;
> > +	int cpu, total_energy = 0;
> > +	struct cpumask visit_cpus;
> > +	struct sched_group *sg;
> > +
> > +	WARN_ON(!sg_top->sge);
> > +
> > +	cpumask_copy(&visit_cpus, sched_group_cpus(sg_top));
> > +
> > +	while (!cpumask_empty(&visit_cpus)) {
> > +		struct sched_group *sg_shared_cap = NULL;
> > +
> > +		cpu = cpumask_first(&visit_cpus);
> > +
> > +		/*
> > +		 * Is the group utilization affected by cpus outside this
> > +		 * sched_group?
> > +		 */
> > +		sd = highest_flag_domain(cpu, SD_SHARE_CAP_STATES);
> > +		if (sd && sd->parent)
> > +			sg_shared_cap = sd->parent->groups;
> > +
> > +		for_each_domain(cpu, sd) {
> > +			sg = sd->groups;
> > +
> > +			/* Has this sched_domain already been visited? */
> > +			if (sd->child && cpumask_first(sched_group_cpus(sg)) != cpu)
> > +				break;
> > +
> > +			do {
> > +				struct sched_group *sg_cap_util;
> > +				unsigned group_util;
> > +				int sg_busy_energy, sg_idle_energy;
> > +				int cap_idx;
> > +
> > +				if (sg_shared_cap && sg_shared_cap->group_weight >= sg->group_weight)
> > +					sg_cap_util = sg_shared_cap;
> > +				else
> > +					sg_cap_util = sg;
> > +
> > +				cap_idx = find_new_capacity(sg_cap_util, sg->sge);
> > +				group_util = group_norm_usage(sg);
> > +				sg_busy_energy = (group_util * sg->sge->cap_states[cap_idx].power)
> > +										>> SCHED_CAPACITY_SHIFT;
> > +				sg_idle_energy = ((SCHED_LOAD_SCALE-group_util) * sg->sge->idle_states[0].power)
> > +										>> SCHED_CAPACITY_SHIFT;
> > +
> > +				total_energy += sg_busy_energy + sg_idle_energy;
> 
> Should normalize group_util with the newly found capacity instead of
> capacity_curr.

You're right. In the next patch, when sched_group_energy() can be used
for energy predictions based on usage deltas, group_util should be
normalized to the new capacity.
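
For reference, a stand-alone model of that estimate with the fix applied
(the capacity/power numbers are made up, this is not the TC2 energy table
or the kernel code):

#include <stdio.h>

#define SCHED_CAPACITY_SHIFT	10
#define SCHED_LOAD_SCALE	(1UL << SCHED_CAPACITY_SHIFT)

struct capacity_state { unsigned long cap; unsigned long power; };

int main(void)
{
	/* hypothetical capacity states of a cluster: (capacity, busy power) */
	struct capacity_state cs[] = { { 430, 2021 }, { 512, 2312 }, { 638, 2756 } };
	unsigned long idle_power = 10;	/* idle_states[0].power */
	unsigned long usage = 300;	/* summed (predicted) usage of the group */
	int cap_idx = 1;		/* state picked by find_new_capacity() */

	/* normalize against the capacity of the *new* state, not capacity_curr */
	unsigned long group_util = (usage << SCHED_CAPACITY_SHIFT) / cs[cap_idx].cap;
	unsigned long busy = (group_util * cs[cap_idx].power) >> SCHED_CAPACITY_SHIFT;
	unsigned long idle = ((SCHED_LOAD_SCALE - group_util) * idle_power)
						>> SCHED_CAPACITY_SHIFT;

	printf("group_util=%lu busy=%lu idle=%lu total=%lu\n",
	       group_util, busy, idle, busy + idle);
	return 0;
}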

Thanks for spotting this mistake.

Morten

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-03-18 20:15       ` Sai Gurrappadi
@ 2015-03-27 16:37         ` Morten Rasmussen
  0 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-03-27 16:37 UTC (permalink / raw)
  To: Sai Gurrappadi
  Cc: peterz, mingo, vincent.guittot, Dietmar Eggemann, yuyang.du,
	preeti, mturquette, nico, rjw, Juri Lelli, linux-kernel,
	Peter Boonstoppel

On Wed, Mar 18, 2015 at 08:15:59PM +0000, Sai Gurrappadi wrote:
> On 03/16/2015 07:47 AM, Morten Rasmussen wrote:
> > Again you are right. We could make the + task_utilization(p) conditional
> > on i != task_cpu(p). One argument against doing that is that in
> > select_task_rq_fair() task_utilization(p) hasn't been decayed yet while
> > it blocked load on the previous cpu (rq) has. If the task has been gone
> > for a long time, its blocked contribution may have decayed to zero and
> > therefore be a poor estimate of the utilization increase caused by
> > putting the task back on the previous cpu. Particularly if we still use
> > the non-decayed task_utilization(p) to estimate the utilization increase
> > on other cpus (!task_cpu(p)). In the interest of responsiveness and not
> > trying to squeeze tasks back onto the previous cpu which might soon run
> > out of capacity when utilization increases we could leave it as a sort
> > of performance bias.
> > 
> > In any case it deserves a comment in the code I think.
> 
> I think it makes sense to use the non-decayed value of the task's
> contrib. on wake but I am not sure if we should do this 2x accounting
> all the time.

If we could just find a way to remove the blocked load contribution and
only use the non-decayed value. I'll have a look and see if I can do
better.

> Another slightly related issue is that NOHZ could cause blocked rq sums
> to remain stale for long periods if there aren't frequent enough
> idle/nohz-idle-balances. This would cause the above bit and
> energy_diff() to compute incorrect values.

I have looked into load tracking behaviour when cpus are in nohz idle.
It is not easy to fix properly. You will either need to put the burden
of updating the blocked load of the nohz-idle cpu on one of the non-idle
cpus and thereby spend precious cycles on busy cpus, or make sure to
kick a nohz-idle cpu to do the updates on a regular basis.

I am experimenting a bit with a third option which is to 'pre-decay' the
blocked load/usage when a cpu enters nohz-idle, based on the predicted
nohz-idle period. When the cpu exits nohz-idle I swap the non-decayed
blocked load back in so it gets decayed properly as if no pre-decay had
happened. If some other cpu running nohz_idle_balance() decides to
update the blocked load, the original is swapped back in as well. It
isn't bulletproof, as nohz_idle_balance() updates from other cpus ruin
the pre-decay, and the prediction used for pre-decay might be wrong. So
I'm not really convinced it is the right way to go.

Any better ideas?

NOHZ full (tickless busy) is a nightmare for accurate load-tracking that
I don't want to face right now.

Morten

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling
  2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (47 preceding siblings ...)
  2015-02-04 18:31 ` [RFCv3 PATCH 48/48] sched: Disable energy-unfriendly nohz kicks Morten Rasmussen
@ 2015-04-02 12:43 ` Vincent Guittot
  2015-04-08 13:33   ` Morten Rasmussen
  48 siblings, 1 reply; 124+ messages in thread
From: Vincent Guittot @ 2015-04-02 12:43 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du,
	Preeti U Murthy, Mike Turquette, Nicolas Pitre, rjw, Juri Lelli,
	linux-kernel

On 4 February 2015 at 19:30, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> Several techniques for saving energy through various scheduler
> modifications have been proposed in the past, however most of the
> techniques have not been universally beneficial for all use-cases and
> platforms. For example, consolidating tasks on fewer cpus is an
> effective way to save energy on some platforms, while it might make
> things worse on others.
>
> This proposal, which is inspired by the Ksummit workshop discussions in
> 2013 [1], takes a different approach by using a (relatively) simple
> platform energy cost model to guide scheduling decisions. By providing
> the model with platform specific costing data the model can provide a
> estimate of the energy implications of scheduling decisions. So instead
> of blindly applying scheduling techniques that may or may not work for
> the current use-case, the scheduler can make informed energy-aware
> decisions. We believe this approach provides a methodology that can be
> adapted to any platform, including heterogeneous systems such as ARM
> big.LITTLE. The model considers cpus only, i.e. no peripherals, GPU or
> memory. Model data includes power consumption at each P-state and
> C-state.
>
> This is an RFC and there are some loose ends that have not been
> addressed here or in the code yet. The model and its infrastructure is
> in place in the scheduler and it is being used for load-balancing
> decisions. The energy model data is hardcoded, the load-balancing
> heuristics are still under development, and there are some limitations
> still to be addressed. However, the main idea is presented here, which
> is the use of an energy model for scheduling decisions.
>
> RFCv3 is a consolidation of the latest energy model related patches and
> previously posted patch sets related to capacity and utilization
> tracking [2][3] to show where we are heading. [2] and [3] have been
> rebased onto v3.19-rc7 with a few minor modifications. Large parts of
> the energy model code and use of the energy model in the scheduler has
> been rewritten and simplified. The patch set consists of three main
> parts (more details further down):
>
> Patch 1-11:  sched: consolidation of CPU capacity and usage [2] (rebase)
>
> Patch 12-19: sched: frequency and cpu invariant per-entity load-tracking
>                     and other load-tracking bits [3] (rebase)
>
> Patch 20-48: sched: Energy cost model for energy-aware scheduling (RFCv3)


Hi Morten,

48 patches is a big number of patches and when I look into your
patchset, some features are quite self-contained. IMHO it would be
worth splitting it into smaller patchsets in order to ease the review
and the regression testing.
From a first look at your patchset, I have found:
-patches 11, 13, 14 and 15 are only linked to frequency scaling invariance
-patches 12, 17 and 17 are only about adding cpu scaling invariance
-patches 18 and 19 are about tracking and adding the blocked
utilization in the CPU usage
-patches 20 to the end are linked to EAS

Regards,
Vincent

>
> Test results for ARM TC2 (2xA15+3xA7) with cpufreq enabled:
>
> sysbench: Single task running for 3 seconds.
> rt-app [4]: 5 medium (~50%) periodic tasks
> rt-app [4]: 2 light (~10%) periodic tasks
>
> Average numbers for 20 runs per test.
>
> Energy          sysbench        rt-app medium   rt-app light
> Mainline        100*            100             100
> EA              279             88              63
>
> * Sensitive to task placement on big.LITTLE. Mainline may put it on
>   either cpu due to it's lack of compute capacity awareness, while EA
>   consistently puts heavy tasks on big cpus. The EA energy increase came
>   with a 2.65x _increase_ in performance (throughput).
>
> [1] http://etherpad.osuosl.org/energy-aware-scheduling-ks-2013 (search
> for 'cost')
> [2] https://lkml.org/lkml/2015/1/15/136
> [3] https://lkml.org/lkml/2014/12/2/328
> [4] https://wiki.linaro.org/WorkingGroups/PowerManagement/Resources/Tools/WorkloadGen
>
> Changes:
>
> RFCv3:
>
> 'sched: Energy cost model for energy-aware scheduling' changes:
> RFCv2->RFCv3:
>
> (1) Remove frequency- and cpu-invariant load/utilization patches since
>     this is now provided by [2] and [3].
>
> (2) Remove system-wide sched_energy to make the code easier to
>     understand, i.e. single socket systems are not supported (yet).
>
> (3) Remove wake-up energy. Extra complexity that wasn't fully justified.
>     Idle-state awareness introduced recently in mainline may be
>     sufficient.
>
> (4) Remove procfs interface for energy data to make the patch-set
>     smaller.
>
> (5) Rework energy-aware load balancing code.
>
>     In RFCv2 we only attempted to pick the source cpu in an energy-aware
>     fashion. In addition to support for finding the most energy
>     inefficient source CPU during the load-balancing action, RFCv3 also
>     introduces the energy-aware based moving of tasks between cpus as
>     well as support for managing the 'tipping point' - the threshold
>     where we switch away from energy model based load balancing to
>     conventional load balancing.
>
> 'sched: frequency and cpu invariant per-entity load-tracking and other
> load-tracking bits' [3]
>
> (1) Remove blocked load from load tracking.
>
> (2) Remove cpu-invariant load tracking.
>
>     Both (1) and (2) require changes to the existing load-balance code
>     which haven't been done yet. These are therefore left out until that
>     has been addressed.
>
> (3) One patch renamed.
>
> 'sched: consolidation of CPU capacity and usage' [2]
>
> (1) Fixed conflict when rebasing to v3.19-rc7.
>
> (2) One patch subject changed slightly.
>
>
> RFC v2:
>  - Extended documentation:
>    - Cover the energy model in greater detail.
>    - Recipe for deriving platform energy model.
>  - Replaced Kconfig with sched feature (jump label).
>  - Add unweighted load tracking.
>  - Use unweighted load as task/cpu utilization.
>  - Support for multiple idle states per sched_group. cpuidle integration
>    still missing.
>  - Changed energy aware functionality in select_idle_sibling().
>  - Experimental energy aware load-balance support.
>
>
> Dietmar Eggemann (17):
>   sched: Make load tracking frequency scale-invariant
>   sched: Make usage tracking cpu scale-invariant
>   arm: vexpress: Add CPU clock-frequencies to TC2 device-tree
>   arm: Cpu invariant scheduler load-tracking support
>   sched: Get rid of scaling usage by cpu_capacity_orig
>   sched: Introduce energy data structures
>   sched: Allocate and initialize energy data structures
>   arm: topology: Define TC2 energy and provide it to the scheduler
>   sched: Infrastructure to query if load balancing is energy-aware
>   sched: Introduce energy awareness into update_sg_lb_stats
>   sched: Introduce energy awareness into update_sd_lb_stats
>   sched: Introduce energy awareness into find_busiest_group
>   sched: Introduce energy awareness into find_busiest_queue
>   sched: Introduce energy awareness into detach_tasks
>   sched: Tipping point from energy-aware to conventional load balancing
>   sched: Skip cpu as lb src which has one task and capacity gte the dst
>     cpu
>   sched: Turn off fast idling of cpus on a partially loaded system
>
> Morten Rasmussen (23):
>   sched: Track group sched_entity usage contributions
>   sched: Make sched entity usage tracking frequency-invariant
>   cpufreq: Architecture specific callback for frequency changes
>   arm: Frequency invariant scheduler load-tracking support
>   sched: Track blocked utilization contributions
>   sched: Include blocked utilization in usage tracking
>   sched: Documentation for scheduler energy cost model
>   sched: Make energy awareness a sched feature
>   sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
>   sched: Compute cpu capacity available at current frequency
>   sched: Relocated get_cpu_usage()
>   sched: Use capacity_curr to cap utilization in get_cpu_usage()
>   sched: Highest energy aware balancing sched_domain level pointer
>   sched: Calculate energy consumption of sched_group
>   sched: Extend sched_group_energy to test load-balancing decisions
>   sched: Estimate energy impact of scheduling decisions
>   sched: Energy-aware wake-up task placement
>   sched: Bias new task wakeups towards higher capacity cpus
>   sched, cpuidle: Track cpuidle state index in the scheduler
>   sched: Count number of shallower idle-states in struct
>     sched_group_energy
>   sched: Determine the current sched_group idle-state
>   sched: Enable active migration for cpus of lower capacity
>   sched: Disable energy-unfriendly nohz kicks
>
> Vincent Guittot (8):
>   sched: add utilization_avg_contrib
>   sched: remove frequency scaling from cpu_capacity
>   sched: make scale_rt invariant with frequency
>   sched: add per rq cpu_capacity_orig
>   sched: get CPU's usage statistic
>   sched: replace capacity_factor by usage
>   sched: add SD_PREFER_SIBLING for SMT level
>   sched: move cfs task on a CPU with higher capacity
>
>  Documentation/scheduler/sched-energy.txt   | 359 +++++++++++
>  arch/arm/boot/dts/vexpress-v2p-ca15_a7.dts |   5 +
>  arch/arm/kernel/topology.c                 | 218 +++++--
>  drivers/cpufreq/cpufreq.c                  |  10 +-
>  include/linux/sched.h                      |  43 +-
>  kernel/sched/core.c                        | 119 +++-
>  kernel/sched/debug.c                       |  12 +-
>  kernel/sched/fair.c                        | 935 ++++++++++++++++++++++++-----
>  kernel/sched/features.h                    |   6 +
>  kernel/sched/idle.c                        |   2 +
>  kernel/sched/sched.h                       |  75 ++-
>  11 files changed, 1559 insertions(+), 225 deletions(-)
>  create mode 100644 Documentation/scheduler/sched-energy.txt
>
> --
> 1.9.1
>

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling
  2015-04-02 12:43 ` [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Vincent Guittot
@ 2015-04-08 13:33   ` Morten Rasmussen
  2015-04-09  7:41     ` Vincent Guittot
  0 siblings, 1 reply; 124+ messages in thread
From: Morten Rasmussen @ 2015-04-08 13:33 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du,
	Preeti U Murthy, Mike Turquette, Nicolas Pitre, rjw, Juri Lelli,
	linux-kernel

Hi Vincent,

On Thu, Apr 02, 2015 at 01:43:31PM +0100, Vincent Guittot wrote:
> On 4 February 2015 at 19:30, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > RFCv3 is a consolidation of the latest energy model related patches and
> > previously posted patch sets related to capacity and utilization
> > tracking [2][3] to show where we are heading. [2] and [3] have been
> > rebased onto v3.19-rc7 with a few minor modifications. Large parts of
> > the energy model code and use of the energy model in the scheduler has
> > been rewritten and simplified. The patch set consists of three main
> > parts (more details further down):
> >
> > Patch 1-11:  sched: consolidation of CPU capacity and usage [2] (rebase)
> >
> > Patch 12-19: sched: frequency and cpu invariant per-entity load-tracking
> >                     and other load-tracking bits [3] (rebase)
> >
> > Patch 20-48: sched: Energy cost model for energy-aware scheduling (RFCv3)
> 
> 
> Hi Morten,
> 
> 48 patches is a big number of patches and when i look into your
> patchset, some feature are quite self contained. IMHO it would be
> worth splitting it in smaller patchsets in order to ease the review
> and the regression test.
> From a 1st look at your patchset , i have found
> -patches 11,13,14 and 15 are only linked to frequency scaling invariance
> -patches 12, 17 and 17 are only about adding cpu scaling invariance
> -patches 18 and 19 are about tracking and adding the blocked
> utilization in the CPU usage
> -patches 20 to the end is linked the EAS

I agree it makes sense to regroup the patches as you suggest. A better
logical ordering should make the reviewing a less daunting task. I'm a
bit hesitant to float many small sets of patches as their role in the
bigger picture would be less clear and hence risk losing the 'why'.
IMHO, it should be as easy (if not easier) to review and pick patches in
a larger set as it is for multiple smaller sets. However, I guess that
is an individual preference, and for automated testing it would be easier
to have them split out.

How about focusing on one (or two) of these smaller patch sets at the
time to minimize the potential confusion and post them separately?

I would still include them in updated mega-postings that include all
the dependencies, so the full story would still be available for those
who are interested. I would of course make it clear which patches are
also posted separately.

Thanks,
Morten

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling
  2015-04-08 13:33   ` Morten Rasmussen
@ 2015-04-09  7:41     ` Vincent Guittot
  2015-04-10 14:46       ` Morten Rasmussen
  0 siblings, 1 reply; 124+ messages in thread
From: Vincent Guittot @ 2015-04-09  7:41 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du,
	Preeti U Murthy, Mike Turquette, Nicolas Pitre, rjw, Juri Lelli,
	linux-kernel

On 8 April 2015 at 15:33, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> Hi Vincent,
>
> On Thu, Apr 02, 2015 at 01:43:31PM +0100, Vincent Guittot wrote:
>> On 4 February 2015 at 19:30, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > RFCv3 is a consolidation of the latest energy model related patches and
>> > previously posted patch sets related to capacity and utilization
>> > tracking [2][3] to show where we are heading. [2] and [3] have been
>> > rebased onto v3.19-rc7 with a few minor modifications. Large parts of
>> > the energy model code and use of the energy model in the scheduler has
>> > been rewritten and simplified. The patch set consists of three main
>> > parts (more details further down):
>> >
>> > Patch 1-11:  sched: consolidation of CPU capacity and usage [2] (rebase)
>> >
>> > Patch 12-19: sched: frequency and cpu invariant per-entity load-tracking
>> >                     and other load-tracking bits [3] (rebase)
>> >
>> > Patch 20-48: sched: Energy cost model for energy-aware scheduling (RFCv3)
>>
>>
>> Hi Morten,
>>
>> 48 patches is a big number of patches and when i look into your
>> patchset, some feature are quite self contained. IMHO it would be
>> worth splitting it in smaller patchsets in order to ease the review
>> and the regression test.
>> From a 1st look at your patchset , i have found
>> -patches 11,13,14 and 15 are only linked to frequency scaling invariance
>> -patches 12, 17 and 17 are only about adding cpu scaling invariance
>> -patches 18 and 19 are about tracking and adding the blocked
>> utilization in the CPU usage
>> -patches 20 to the end is linked the EAS
>
> I agree it makes sense to regroup the patches as you suggest. A better
> logical ordering should make the reviewing a less daunting task. I'm a
> bit hesitant to float many small sets of patches as their role in the
> bigger picture would be less clear and hence risk loosing the 'why'.
> IMHO, it should be as easy (if not easier) to review and pick patches in
> a larger set as it is for multiple smaller sets. However, I guess that

Having self-contained patchsets merged in a larger set can create some
useless dependencies between them, as they modify the same area but for
different goals.

> is individual and for automated testing it would be easier to have them
> split out.
>
> How about focusing on one (or two) of these smaller patch sets at the
> time to minimize the potential confusion and post them separately?

I'm fine with your proposal to start with 1 or 2 smaller patchsets. The
2 following patchsets are, IMHO, the most self-contained and
straightforward ones:
- patches 11,13,14 and 15 are only linked to frequency scaling invariance
- patches 18 and 19 are about tracking and adding the blocked
utilization in the CPU usage

May be we can start with them ?

Regards,
Vincent

>
> I would still include them in updated mega-postings that includes all
> the dependencies so the full story would still available for those who
> are interested. I would of course make it clear which patches that are
> also posted separately.

that's fair enough

>
> Thanks,
> Morten

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling
  2015-04-09  7:41     ` Vincent Guittot
@ 2015-04-10 14:46       ` Morten Rasmussen
  0 siblings, 0 replies; 124+ messages in thread
From: Morten Rasmussen @ 2015-04-10 14:46 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, mingo, Dietmar Eggemann, Yuyang Du,
	Preeti U Murthy, Mike Turquette, Nicolas Pitre, rjw, Juri Lelli,
	linux-kernel

On Thu, Apr 09, 2015 at 08:41:34AM +0100, Vincent Guittot wrote:
> On 8 April 2015 at 15:33, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > Hi Vincent,
> >
> > On Thu, Apr 02, 2015 at 01:43:31PM +0100, Vincent Guittot wrote:
> >> On 4 February 2015 at 19:30, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> >> > RFCv3 is a consolidation of the latest energy model related patches and
> >> > previously posted patch sets related to capacity and utilization
> >> > tracking [2][3] to show where we are heading. [2] and [3] have been
> >> > rebased onto v3.19-rc7 with a few minor modifications. Large parts of
> >> > the energy model code and use of the energy model in the scheduler has
> >> > been rewritten and simplified. The patch set consists of three main
> >> > parts (more details further down):
> >> >
> >> > Patch 1-11:  sched: consolidation of CPU capacity and usage [2] (rebase)
> >> >
> >> > Patch 12-19: sched: frequency and cpu invariant per-entity load-tracking
> >> >                     and other load-tracking bits [3] (rebase)
> >> >
> >> > Patch 20-48: sched: Energy cost model for energy-aware scheduling (RFCv3)
> >>
> >>
> >> Hi Morten,
> >>
> >> 48 patches is a big number of patches and when i look into your
> >> patchset, some feature are quite self contained. IMHO it would be
> >> worth splitting it in smaller patchsets in order to ease the review
> >> and the regression test.
> >> From a 1st look at your patchset , i have found
> >> -patches 11,13,14 and 15 are only linked to frequency scaling invariance
> >> -patches 12, 17 and 17 are only about adding cpu scaling invariance
> >> -patches 18 and 19 are about tracking and adding the blocked
> >> utilization in the CPU usage
> >> -patches 20 to the end is linked the EAS
> >
> > I agree it makes sense to regroup the patches as you suggest. A better
> > logical ordering should make the reviewing a less daunting task. I'm a
> > bit hesitant to float many small sets of patches as their role in the
> > bigger picture would be less clear and hence risk loosing the 'why'.
> > IMHO, it should be as easy (if not easier) to review and pick patches in
> > a larger set as it is for multiple smaller sets. However, I guess that
> 
> Having self contained patchset merged in a larger set  can create so
> useless dependency between them as they modify same area but for
> different goal
> 
> > is individual and for automated testing it would be easier to have them
> > split out.
> >
> > How about focusing on one (or two) of these smaller patch sets at the
> > time to minimize the potential confusion and post them separately?
> 
> I'm fine with your proposal to start with 1 or 2 smaller patchset. The
> 2 following patchset are, IMHO, the ones the most self contained and
> straight forward:
> - patches 11,13,14 and 15 are only linked to frequency scaling invariance
> - patches 18 and 19 are about tracking and adding the blocked
> utilization in the CPU usage
> 
> May be we can start with them ?

Agreed. Those two would form meaningful patch sets. I will fix them and
split them out.

Thanks,
Morten

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-03-26 10:41           ` Peter Zijlstra
@ 2015-04-27 16:01             ` Michael Turquette
  2015-04-28 13:06               ` Peter Zijlstra
  0 siblings, 1 reply; 124+ messages in thread
From: Michael Turquette @ 2015-04-27 16:01 UTC (permalink / raw)
  To: Peter Zijlstra, Juri Lelli
  Cc: Morten Rasmussen, mingo, vincent.guittot, Dietmar Eggemann,
	yuyang.du, preeti, nico, rjw, linux-kernel

Quoting Peter Zijlstra (2015-03-26 03:41:50)
> On Thu, Mar 26, 2015 at 10:21:24AM +0000, Juri Lelli wrote:
> >  - what about other sched classes? I know that this is very premature,
> >    but I can help but thinking that we'll need to do some sort of
> >    aggregation of requests, and if we put triggers in very specialized
> >    points we might lose some of the sched classes separation
> 
> So for deadline we can do P state selection (as you're well aware) based
> on the requested utilization. Not sure what to do for fifo/rr though,
> they lack much useful information (as always).
> 
> Now if we also look ahead to things like the ACPI CPPC stuff we'll see
> that CFS and DL place different requirements on the hints. Where CFS
> would like to hint a max perf (the hardware going slower due to the code
> consisting of mostly stalls is always fine from a best effort energy
> pov), the DL stuff would like to hint a min perf, seeing how it 'needs'
> to provide a QoS.
> 
> So we either need to carry this information along in a 'generic' way
> between the various classes or put the hinting in every class.
> 
> But yes, food for thought for sure.

I am a fan of putting the hints in every class. One idea I've been
considering is that each sched class could have a small, simple cpufreq
governor that expresses its constraints (max for cfs, min qos for dl)
and then the cpufreq core Does The Right Thing.

This would be a multi-governor approach, which requires some surgery to
cpufreq core code, but I like the modularity and maintainability of it
more than having one big super governor that has to satisfy every need.

Regards,
Mike

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement
  2015-04-27 16:01             ` Michael Turquette
@ 2015-04-28 13:06               ` Peter Zijlstra
  0 siblings, 0 replies; 124+ messages in thread
From: Peter Zijlstra @ 2015-04-28 13:06 UTC (permalink / raw)
  To: Michael Turquette
  Cc: Juri Lelli, Morten Rasmussen, mingo, vincent.guittot,
	Dietmar Eggemann, yuyang.du, preeti, nico, rjw, linux-kernel

On Mon, Apr 27, 2015 at 09:01:13AM -0700, Michael Turquette wrote:
> Quoting Peter Zijlstra (2015-03-26 03:41:50)
> > On Thu, Mar 26, 2015 at 10:21:24AM +0000, Juri Lelli wrote:
> > >  - what about other sched classes? I know that this is very premature,
> > >    but I can help but thinking that we'll need to do some sort of
> > >    aggregation of requests, and if we put triggers in very specialized
> > >    points we might lose some of the sched classes separation
> > 
> > So for deadline we can do P state selection (as you're well aware) based
> > on the requested utilization. Not sure what to do for fifo/rr though,
> > they lack much useful information (as always).
> > 
> > Now if we also look ahead to things like the ACPI CPPC stuff we'll see
> > that CFS and DL place different requirements on the hints. Where CFS
> > would like to hint a max perf (the hardware going slower due to the code
> > consisting of mostly stalls is always fine from a best effort energy
> > pov), the DL stuff would like to hint a min perf, seeing how it 'needs'
> > to provide a QoS.
> > 
> > So we either need to carry this information along in a 'generic' way
> > between the various classes or put the hinting in every class.
> > 
> > But yes, food for thought for sure.
> 
> I am a fan of putting the hints in every class. One idea I've been
> considering is that each sched class could have a small, simple cpufreq
> governor that expresses its constraints (max for cfs, min qos for dl)
> and then the cpufreq core Does The Right Thing.
> 
> This would be a multi-governor approach, which requires some surgery to
> cpufreq core code, but I like the modularity and maintainability of it
> more than having one big super governor that has to satisfy every need.

Well, at that point we really don't need cpufreq anymore do we? All
you need is the hardware driver (ACPI P-state, ACPI CPPC etc.).

Because as I understand it, cpufreq currently is mostly the governor
thing (which we'll replace) and some infra for dealing with these head
cases that require scheduling for changing P states (which we can leave
on cpufreq proper for the time being).

Would it not be easier to just start from scratch and convert the (few)
drivers we need to prototype this? Instead of trying to drag the
entirety of cpufreq along just to keep all the drivers?

^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 17/48] sched: Get rid of scaling usage by cpu_capacity_orig
       [not found]   ` <OFFC493540.15A92099-ON48257E35.0026F60C-48257E35.0027A5FB@zte.com.cn>
@ 2015-04-28 16:54     ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-04-28 16:54 UTC (permalink / raw)
  To: pang.xunlei, Morten Rasmussen
  Cc: Juri Lelli, linux-kernel, linux-kernel-owner, mingo, mturquette,
	nico, peterz, preeti, rjw, vincent.guittot, yuyang.du

On 28/04/15 08:12, pang.xunlei@zte.com.cn wrote:
> Morten Rasmussen <morten.rasmussen@arm.com>  wrote 2015-02-05 AM 02:30:54:
>> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>
>> Since now we have besides frequency invariant also cpu (uarch plus max
>> system frequency) invariant cfs_rq::utilization_load_avg both, frequency
>> and cpu scaling happens as part of the load tracking.
>> So cfs_rq::utilization_load_avg does not have to be scaled by the original
>> capacity of the cpu again.
>>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> ---
>>  kernel/sched/fair.c | 5 ++---
>>  1 file changed, 2 insertions(+), 3 deletions(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 5375ab1..a85c34b 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4807,12 +4807,11 @@ static int select_idle_sibling(struct
>> task_struct *p, int target)
>>  static int get_cpu_usage(int cpu)
>>  {
>>     unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
>> -   unsigned long capacity = capacity_orig_of(cpu);
>>  
>>     if (usage >= SCHED_LOAD_SCALE)
> 
> Since utilization_load_avg already has all the scaling applied,
> it can't exceed its original capacity. SCHED_LOAD_SCALE is
> the max original capacity of all, right?
> 
> So, shouldn't this be "if(usage >= orig_capacity)"?

Absolutely, you're right. The usage on cpus which have a smaller orig
capacity (<1024) than the cpus with the highest orig capacity (1024)
has to be limited to their orig capacity.

There is another patch in this series '[RFCv3 PATCH 28/48] sched: Use
capacity_curr to cap utilization in get_cpu_usage()' which changes the
upper bound from SCHED_LOAD_SCALE to capacity_curr_of(cpu) and returns
this current capacity in case the usage (running and blocked) exceeds it.

Our testing in the meantime has shown that this is the wrong approach in
some cases, e.g. when adding more tasks to a cpu and deciding the new
capacity state (OPP) based on get_cpu_usage(). We are likely to change
this to capacity_orig_of(cpu) in the next version.
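
I.e. roughly something like this (sketch of the intended change only,
mirroring the function quoted above; not a posted patch):

static int get_cpu_usage(int cpu)
{
	unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
	unsigned long capacity_orig = capacity_orig_of(cpu);

	/* cap the invariant usage at the cpu's original capacity */
	if (usage >= capacity_orig)
		return capacity_orig;

	return usage;
}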

[...]


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 23/48] sched: Allocate and initialize energy data structures
       [not found]   ` <OF29F384AC.37929D8E-ON48257E35.002FCB0C-48257E35.003156FE@zte.com.cn>
@ 2015-04-29 15:43     ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-04-29 15:43 UTC (permalink / raw)
  To: pang.xunlei, Morten Rasmussen
  Cc: Juri Lelli, linux-kernel, linux-kernel-owner, mingo, mturquette,
	nico, peterz, preeti, rjw, vincent.guittot, yuyang.du

On 28/04/15 09:58, pang.xunlei@zte.com.cn wrote:
> Morten Rasmussen <morten.rasmussen@arm.com>  wrote 2015-02-05 AM 02:31:00:
>> [RFCv3 PATCH 23/48] sched: Allocate and initialize energy data structures
>>
>> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>
>> The per sched group sched_group_energy structure plus the related
>> idle_state and capacity_state arrays are allocated like the other sched
>> domain (sd) hierarchy data structures. This includes the freeing of
>> sched_group_energy structures which are not used.
>>
>> One problem is that the number of elements of the idle_state and the
>> capacity_state arrays is not fixed and has to be retrieved in
>> __sdt_alloc() to allocate memory for the sched_group_energy structure and
>> the two arrays in one chunk. The array pointers (idle_states and
>> cap_states) are initialized here to point to the correct place inside the
>> memory chunk.
>>
>> The new function init_sched_energy() initializes the sched_group_energy
>> structure and the two arrays in case the sd topology level contains energy
>> information.

[...]

>>  
>> +static void init_sched_energy(int cpu, struct sched_domain *sd,
>> +               struct sched_domain_topology_level *tl)
>> +{
>> +   struct sched_group *sg = sd->groups;
>> +   struct sched_group_energy *energy = sg->sge;
>> +   sched_domain_energy_f fn = tl->energy;
>> +   struct cpumask *mask = sched_group_cpus(sg);
>> +
>> +   if (!fn || !fn(cpu))
>> +      return;
> 
> Maybe if there's no valid fn(), we can dec the sched_group_energy's
> reference count, so that it can be freed if no one uses it.

Good catch! We actually want sg->sge to be NULL if there is no
energy function fn or fn returns NULL. We never noticed that this is
not the case since we have tested the whole patch set only with energy
functions available for each sched domain (sd) so far.

All sd's at or below 'struct sched_domain *ea_sd' (the highest level
at which the energy model is provided) have to provide a valid energy
function fn; that check is currently missing as well.

Instead of decrementing the ref count, I could defer the increment from
get_group() to init_sched_energy().

> 
> Also, this function may enter several times for the shared sge,
> there is no need to do the duplicate operation below. Adding
> this would be better?
> 
> if (cpu != group_balance_cpu(sg))
>         return;
> 

That's true.

This snippet gives the functionality on top of this patch (tested on
a two-cluster ARM system with fn set to NULL on MC or DIE sd level, or
both, in arm_topology[] (arch/arm/kernel/topology.c)):

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index c49f3ee928b8..6d9b5327a2b6 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5969,7 +5969,6 @@ static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
                (*sg)->sgc = *per_cpu_ptr(sdd->sgc, cpu);
                atomic_set(&(*sg)->sgc->ref, 1); /* for claim_allocations */
                (*sg)->sge = *per_cpu_ptr(sdd->sge, cpu);
-               atomic_set(&(*sg)->sge->ref, 1); /* for claim_allocations */
        }
 
        return cpu;
@@ -6067,8 +6066,16 @@ static void init_sched_energy(int cpu, struct sched_domain *sd,
        sched_domain_energy_f fn = tl->energy;
        struct cpumask *mask = sched_group_cpus(sg);
 
-       if (!fn || !fn(cpu))
+       if (cpu != group_balance_cpu(sg))
+               return;
+
+       if (!fn || !fn(cpu)) {
+               sg->sge = NULL;
                return;
+       }
+
+       atomic_set(&sg->sge->ref, 1); /* for claim_allocations */
+

[...]


^ permalink raw reply related	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 29/48] sched: Highest energy aware balancing sched_domain level pointer
       [not found]   ` <OF5977496A.A21A7B96-ON48257E35.002EC23C-48257E35.00324DAD@zte.com.cn>
@ 2015-04-29 15:54     ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-04-29 15:54 UTC (permalink / raw)
  To: pang.xunlei, Morten Rasmussen
  Cc: Juri Lelli, linux-kernel, linux-kernel-owner, mingo, mturquette,
	nico, peterz, preeti, rjw, vincent.guittot, yuyang.du

On 28/04/15 10:09, pang.xunlei@zte.com.cn wrote:
> Morten Rasmussen <morten.rasmussen@arm.com>  wrote 2015-02-05 AM 02:31:06:
> 
>> Morten Rasmussen <morten.rasmussen@arm.com>
>> [RFCv3 PATCH 29/48] sched: Highest energy aware balancing
>> sched_domain level pointer
>>
>> Add another member to the family of per-cpu sched_domain shortcut
>> pointers. This one, sd_ea, points to the highest level at which energy
>> model is provided. At this level and all levels below all sched_groups
>> have energy model data attached.

[...]

>> @@ -5766,6 +5767,14 @@ static void update_top_cache_domain(int cpu)
>>  
>>     sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
>>     rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
>> +
>> +   for_each_domain(cpu, sd) {
>> +      if (sd->groups->sge)
> 
> sge is like sgc, I think the test will always return true, no?

True for the current implementation. This code will make more sense once
we integrate the changes you proposed for '[RFCv3 PATCH 23/48] sched:
Allocate and initialize energy data structures' into the next version.
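
With sg->sge being NULL on levels without an energy model, the loop would
then do something like this (sketch only):

        struct sched_domain *ea_sd = NULL;

        for_each_domain(cpu, sd) {
                if (!sd->groups->sge)
                        break;          /* no energy data at this level */
                ea_sd = sd;             /* highest level w/ energy data so far */
        }
        rcu_assign_pointer(per_cpu(sd_ea, cpu), ea_sd);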

[...]


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 37/48] sched: Determine the current sched_group idle-state
       [not found]   ` <OF1FDC99CD.22435E74-ON48257E37.001BA739-48257E37.001CA5ED@zte.com.cn>
@ 2015-04-30 20:17     ` Dietmar Eggemann
       [not found]       ` <OF2F4202E4.8A4AF229-ON48257E38.00312CD4-48257E38.0036ADB6@zte.com.cn>
  0 siblings, 1 reply; 124+ messages in thread
From: Dietmar Eggemann @ 2015-04-30 20:17 UTC (permalink / raw)
  To: pang.xunlei, Morten Rasmussen
  Cc: Juri Lelli, linux-kernel, linux-kernel-owner, mingo, mturquette,
	nico, peterz, preeti, rjw, vincent.guittot, yuyang.du

On 30/04/15 06:12, pang.xunlei@zte.com.cn wrote:
> linux-kernel-owner@vger.kernel.org wrote 2015-02-05 AM 02:31:14:
> 
>> Morten Rasmussen <morten.rasmussen@arm.com>
>>
>> [RFCv3 PATCH 37/48] sched: Determine the current sched_group idle-state
>>
>> To estimate the energy consumption of a sched_group in
>> sched_group_energy() it is necessary to know which idle-state the group
>> is in when it is idle. For now, it is assumed that this is the current
>> idle-state (though it might be wrong). Based on the individual cpu
>> idle-states group_idle_state() finds the group idle-state.

[...]

>> +static int group_idle_state(struct sched_group *sg)
>> +{
>> +   struct sched_group_energy *sge = sg->sge;
>> +   int shallowest_state = sge->idle_states_below + sge->nr_idle_states;
>> +   int i;
>> +
>> +   for_each_cpu(i, sched_group_cpus(sg)) {
>> +      int cpuidle_idx = idle_get_state_idx(cpu_rq(i));
>> +      int group_idx = cpuidle_idx - sge->idle_states_below + 1;
> 
> So cpuidle_idx==0 for core-level (2 C-states, for example) groups
> returns 1?

Yes.

> 
> What does this mean?

So for ARM TC2 and CPU0 (Cortex A15, big) for example:

# cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
WFI		[cpuidle_idx=0]
cluster-sleep-b [cpuidle_idx=1]

group_idx is the system C-state index re-indexed to the sd (e.g. on MC level
WFI is group_idx 1 and on DIE level it's group_idx 0).


For MC (or core-) level groups:

cpu=0 sg_mask=0x1 group_idx=1 cpuidle_idx=0 sge->idle_states_below=0
shallowest_state=1 sge->nr_idle_states=1

group_idle_state() returns 0 in this case because
shallowest_state=1 >= sge->nr_idle_states=1.

This value is then used to index into the sg->sge->idle_states[] to get
the idle power value for the topology bit represented by sg (i.e. for
CPU0). The idle power value for CPU0 is originally specified in the
energy model:

static struct idle_state idle_states_core_a15[] = { { .power = 0 }, };
[arch/arm/kernel/topology.c]


Another example would be to find the idle power value for the LITTLE
cluster if it is in 'cluster-sleep-l' [cpuidle_idx=1]:

cpu=2 sg_mask=0x1c group_idx=1 cpuidle_idx=1 sge->idle_states_below=1
shallowest_state=3 sge->nr_idle_states=2
cpu=3 sg_mask=0x1c group_idx=1 cpuidle_idx=1 sge->idle_states_below=1
shallowest_state=1 sge->nr_idle_states=2
cpu=4 sg_mask=0x1c group_idx=1 cpuidle_idx=1 sge->idle_states_below=1
shallowest_state=1 sge->nr_idle_states=2

group_idle_state() returns 1 (shallowest_state=1) and the idle power
value is 10.

static struct idle_state idle_states_cluster_a7[] = { ..., { .power = 10 }};


To sum it up, group_idle_state() is necessary to transform the C-state index
values (-1, 0, 1) into sg->sge->idle_states[] indexes for the different
sg's (MC, DIE).
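
Schematically, the loop body and the return path do something like this
(simplified sketch of how I read it; corner cases such as cpus which are not
idle at all are left out):

        for_each_cpu(i, sched_group_cpus(sg)) {
                int cpuidle_idx = idle_get_state_idx(cpu_rq(i));
                int group_idx = cpuidle_idx - sge->idle_states_below + 1;

                /* Remember the shallowest idle-state any cpu is in. */
                if (group_idx < shallowest_state)
                        shallowest_state = group_idx;
        }

        /*
         * An index beyond the states known at this level is clamped to the
         * last (deepest) idle_states[] entry of this level.
         */
        if (shallowest_state >= sge->nr_idle_states)
                return sge->nr_idle_states - 1;

        return shallowest_state;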

[...]


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 31/48] sched: Extend sched_group_energy to test load-balancing decisions
       [not found]   ` <OF081FBA75.F80B8844-ON48257E37.00261E89-48257E37.00267F24@zte.com.cn>
@ 2015-04-30 20:26     ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-04-30 20:26 UTC (permalink / raw)
  To: pang.xunlei, Morten Rasmussen
  Cc: Juri Lelli, linux-kernel, linux-kernel-owner, mingo, mturquette,
	nico, peterz, preeti, rjw, vincent.guittot, yuyang.du

On 30/04/15 08:00, pang.xunlei@zte.com.cn wrote:
> linux-kernel-owner@vger.kernel.org wrote 2015-02-05 AM 02:31:08:

[...]

>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index d12aa63..07c84af 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -4592,23 +4592,44 @@ static unsigned long capacity_curr_of(int cpu)
>>   * Without capping the usage, a group could be seen as overloaded (CPU0 usage
>>   * at 121% + CPU1 usage at 80%) whereas CPU1 has 20% of available capacity/
>>   */
>> -static int get_cpu_usage(int cpu)
>> +static int __get_cpu_usage(int cpu, int delta)
>>  {
>> +   int sum;
>>     unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
>>     unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
>>     unsigned long capacity_curr = capacity_curr_of(cpu);
>>  
>> -   if (usage + blocked >= capacity_curr)
>> +   sum = usage + blocked + delta;
>> +
>> +   if (sum < 0)
>> +      return 0;
>> +
>> +   if (sum >= capacity_curr)
>>        return capacity_curr;
> 
> So if the added delta exceeds the curr capacity but not its original capacity,
> which I think would quite often be the case, I guess it would be better to
> allow it to increase its freq and calculate the right energy diff.

Yes, as I mentioned in my answer to [RFCv3 PATCH 17/48], our testing in
the meantime has shown that this capping by capacity_curr is the wrong
approach in some cases and that we are likely to change this to
capacity_orig_of(cpu) in the next version.
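
I.e. something along these lines (untested sketch, only the upper bound
changes):

static int __get_cpu_usage(int cpu, int delta)
{
        int sum;
        unsigned long usage = cpu_rq(cpu)->cfs.utilization_load_avg;
        unsigned long blocked = cpu_rq(cpu)->cfs.utilization_blocked_avg;
        unsigned long capacity_orig = capacity_orig_of(cpu);

        sum = usage + blocked + delta;

        if (sum < 0)
                return 0;

        /*
         * Let the estimate exceed the current capacity (the cpu can raise
         * its OPP), but never its original capacity.
         */
        if (sum >= capacity_orig)
                return capacity_orig;

        return sum;
}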

[...]


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 43/48] sched: Introduce energy awareness into detach_tasks
       [not found]       ` <OFDCE15EEF.2F536D7F-ON48257E37.002565ED-48257E37.0027A8B9@zte.com.cn>
@ 2015-04-30 20:35         ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-04-30 20:35 UTC (permalink / raw)
  To: pang.xunlei
  Cc: Juri Lelli, linux-kernel, linux-kernel-owner, mingo,
	Morten Rasmussen, mturquette, nico, Peter Boonstoppel, peterz,
	preeti, rjw, Sai Gurrappadi, vincent.guittot, yuyang.du

On 30/04/15 08:12, pang.xunlei@zte.com.cn wrote:
> linux-kernel-owner@vger.kernel.org wrote 2015-03-27 PM 11:03:24:
>> Re: [RFCv3 PATCH 43/48] sched: Introduce energy awareness into
> detach_tasks

[...]

>> > Hi Morten, Dietmar,
>> >
>> > Wouldn't the above energy_diff() use the 'old' value of dst_cpu's util?
>> > Tasks are detached/dequeued in this loop so they have their util
>> > contrib. removed from src_cpu but their contrib. hasn't been added to
>> > dst_cpu yet (happens in attach_tasks).
>>
>> You're absolutely right, Sai. Thanks for pointing this out! I guess I rather
>> have to accumulate the usage of the tasks I've detached and add this to the
>> eenv::usage_delta of the energy_diff() call for the next task.
>>
>> Something like this (only slightly tested):
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 8d4cc72f4778..d0d0e965fd0c 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6097,6 +6097,7 @@ static int detach_tasks(struct lb_env *env)
>>         struct task_struct *p;
>>         unsigned long load = 0;
>>         int detached = 0;
>> +       int usage_delta = 0;
>>  
>>         lockdep_assert_held(&env->src_rq->lock);
>>  
>> @@ -6122,16 +6123,19 @@ static int detach_tasks(struct lb_env *env)
>>                         goto next;
>>  
>>                 if (env->use_ea) {
>> +                       int util = task_utilization(p);
>>                         struct energy_env eenv = {
>>                                 .src_cpu = env->src_cpu,
>>                                 .dst_cpu = env->dst_cpu,
>> -                               .usage_delta = task_utilization(p),
>> +                               .usage_delta = usage_delta + util,
>>                         };
>>                         int e_diff = energy_diff(&eenv);
> 
> If any single or the total utilization of the detached tasks exceeds the orig
> capacity of src_cpu, should we bail out, both for performance and to avoid
> reaching the tipping point too easily (because in such cases src_cpu tends
> to end up over-utilized)?

Yes, correct. We have to limit the dst cpu from pulling more than its
remaining capacity worth of usage. I have already integrated a check that
the dst cpu does not become over-utilized by taking on the usage_delta.
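
The kind of check I mean looks roughly like this inside the detach_tasks()
loop (sketch, not the actual code):

                if (env->use_ea) {
                        int util = task_utilization(p);
                        unsigned long dst_usage = get_cpu_usage(env->dst_cpu);
                        unsigned long dst_cap = capacity_orig_of(env->dst_cpu);

                        /*
                         * Don't let dst_cpu take on more usage than its
                         * remaining original capacity, otherwise it ends up
                         * over-utilized itself.
                         */
                        if (dst_usage + usage_delta + util > dst_cap)
                                goto next;

                        /* ... energy_diff() etc. as in the snippet above */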

> Also, as in my reply to "[RFCv3 PATCH 31/48]", when doing
> energy_diff() it should be allowed to exceed the current capacity as long
> as it does not exceed its original capacity (is capacity_of(cpu) better?).

True.

[...]


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 37/48] sched: Determine the current sched_group idle-state
       [not found]       ` <OF2F4202E4.8A4AF229-ON48257E38.00312CD4-48257E38.0036ADB6@zte.com.cn>
@ 2015-05-01 15:09         ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-05-01 15:09 UTC (permalink / raw)
  To: pang.xunlei
  Cc: Juri Lelli, linux-kernel, linux-kernel-owner, mingo,
	Morten Rasmussen, mturquette, nico, peterz, preeti, rjw,
	vincent.guittot, yuyang.du

On 01/05/15 10:56, pang.xunlei@zte.com.cn wrote:
> Hi Dietmar,
> 
> Dietmar Eggemann <dietmar.eggemann@arm.com> wrote 2015-05-01 AM 04:17:51:
>>
>> Re: [RFCv3 PATCH 37/48] sched: Determine the current sched_group
> idle-state
>>
>> On 30/04/15 06:12, pang.xunlei@zte.com.cn wrote:
>> > linux-kernel-owner@vger.kernel.org wrote 2015-02-05 AM 02:31:14:

[...]

> Thanks for explaining this in graphic detail.
> 
> From what I understood, let's just assume ARM TC2 has an extra
> MC-level C-state SPC (assuming its power is 40 for the big).
> 
> Take the power value from "RFCv3 PATCH 25/48":
> static struct idle_state idle_states_cluster_a15[] = {
>                  { .power = 70 }, /* WFI */
>                  { .power = 25 }, /* cluster-sleep-b */
>         };
> 
> static struct idle_state idle_states_core_a15[] = {
>                  { .power = 0 }, /* WFI */
>        };
> 
> Then we will get the following idle energy table?
> static struct idle_state idle_states_core_a15[] = {
>                { .power = 70 }, /* WFI */
>                 { .power = 0 }, /* SPC*/
>         };
> 
> static struct idle_state idle_states_cluster_a15[] = {
>                { .power = 40 }, /* SPC */
>                { .power = 25 }, /* cluster-sleep-b */
>        };
> 
> Is this correct?

Yes. It's the same C-state configuration we have on our ARMv8 big.LITTLE
JUNO (Cortex A57, A53) board.

# cat /sys/devices/system/cpu/cpu0/cpuidle/state*/name
WFI
cpu-sleep-0
cluster-sleep-0

> If this is right, there may be a bug.
> 
> For MC-level CPU0:
> sg_mask=0x1 sge->nr_idle_states=2 sge->idle_states_below=0
> cpuidle_idx=0 group_idx=1 shallowest_state=1
> 
> See, group_idle_state() finally returns 1 as CPU0's MC-level
> idle energy model index, and this is obviously wrong.

Yes, I can see this problem too.

> So, I think for
> "int group_idx = cpuidle_idx - sge->idle_states_below + 1;"
> 
> Maybe we shouldn't add the extra 1 for the lowest level?

Maybe. In any case we will have to resolve this issue in the next
version. Thanks for pointing this out!
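
To make the discussion concrete, one (untested) way to avoid the extra +1 on
the lowest level could be:

                int group_idx = cpuidle_idx - sge->idle_states_below;

                /*
                 * Only levels which have idle-states below them shift the
                 * index by one; the lowest level indexes its states
                 * directly. Whether idle_states_below is the right test
                 * for 'lowest level' still needs to be sorted out.
                 */
                if (sge->idle_states_below)
                        group_idx += 1;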

[...]


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 45/48] sched: Skip cpu as lb src which has one task and capacity gte the dst cpu
       [not found]       ` <OF9320540C.255228F9-ON48257E37.002A02D1-48257E37.002AB5EE@zte.com.cn>
@ 2015-05-05 10:01         ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-05-05 10:01 UTC (permalink / raw)
  To: pang.xunlei
  Cc: Juri Lelli, linux-kernel, linux-kernel-owner, mingo,
	Morten Rasmussen, mturquette, nico, Peter Zijlstra, preeti, rjw,
	vincent.guittot, yuyang.du

On 30/04/15 08:46, pang.xunlei@zte.com.cn wrote:
> linux-kernel-owner@vger.kernel.org wrote 2015-03-26 AM 02:44:48:
> 
>> Dietmar Eggemann <dietmar.eggemann@arm.com>
>>
>> Re: [RFCv3 PATCH 45/48] sched: Skip cpu as lb src which has one task
>> and capacity gte the dst cpu
>>
>> On 24/03/15 15:27, Peter Zijlstra wrote:
>> > On Wed, Feb 04, 2015 at 06:31:22PM +0000, Morten Rasmussen wrote:
>> >> From: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> >>
>> >> Skip cpu as a potential src (costliest) in case it has only one task
>> >> running and its original capacity is greater than or equal to the
>> >> original capacity of the dst cpu.
>> >
>> > Again, that's what, but is lacking a why.
>> >
>>
>> You're right, the 'why' is completely missing.
>>
>> This is one of our heterogeneous (big.LITTLE) cpu related patches. We
>> don't want to end up migrating this single task from a big to a little
>> cpu, hence the use of capacity_orig_of(cpu). Our cpu topology makes sure
>> that this rule is only active on DIE sd level.
> 
> Hi Dietmar,
> 
> Could you tell me the reason why we don't want to end up migrating this
> single task from a big to a little cpu?
> 
> As in my reply to "[RFCv3 PATCH 47/48]", if the task is a small one,
> why couldn't we migrate it to the little cpu to save energy? Especially when
> the cluster has a shared freq, the saving may be appreciable.

If it's a big (always running) task, it should stay on the cpu with the
higher capacity. If it is a small task it will eventually go to sleep
and the wakeup path will take care of placing it onto the right cpu.
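
In code the rule is essentially (sketch of the idea, not the exact hunk):

        /*
         * Never pick a cpu as the busiest (src) cpu if it runs only one
         * task and its original capacity is >= the dst cpu's: migrating
         * that single task cannot gain compute capacity and on big.LITTLE
         * it would move the task big -> little.
         */
        if (rq->nr_running == 1 &&
            capacity_orig_of(cpu_of(rq)) >= capacity_orig_of(env->dst_cpu))
                continue;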

[...]


^ permalink raw reply	[flat|nested] 124+ messages in thread

* Re: [RFCv3 PATCH 12/48] sched: Make usage tracking cpu scale-invariant
       [not found]       ` <OF8A3E3617.0D4400A5-ON48257E3A.001B38D9-48257E3A.002379A4@zte.com.cn>
@ 2015-05-06  9:49         ` Dietmar Eggemann
  0 siblings, 0 replies; 124+ messages in thread
From: Dietmar Eggemann @ 2015-05-06  9:49 UTC (permalink / raw)
  To: pang.xunlei
  Cc: Juri Lelli, linux-kernel, linux-kernel-owner, mingo,
	Morten Rasmussen, mturquette, nico, Peter Zijlstra, preeti, rjw,
	vincent.guittot, yuyang.du

On 03/05/15 07:27, pang.xunlei@zte.com.cn wrote:
> Hi Dietmar,
> 
> Dietmar Eggemann <dietmar.eggemann@arm.com>  wrote 2015-03-24 AM 03:19:41:
>>
>> Re: [RFCv3 PATCH 12/48] sched: Make usage tracking cpu scale-invariant

[...]

>> In the previous patch-set https://lkml.org/lkml/2014/12/2/332 we
>> cpu-scaled both (sched_avg::runnable_avg_sum (load) and
>> sched_avg::running_avg_sum (utilization)) but during the review Vincent
>> pointed out that a cpu-scaled invariant load signal messes up
>> load-balancing based on s[dg]_lb_stats::avg_load in overload scenarios.
>>
>> avg_load = load/capacity and load can't be simply replaced here by
>> 'cpu-scale invariant load' (which is load*capacity).
> 
> I can't see why it shouldn't.
> 
> For "avg_load = load/capacity", "avg_load" stands for how busy the cpu
> works,
> it is actually a value relative to its capacity. The system is seen
> balanced
> for the case that a task runs on a 512-capacity cpu contributing 50% usage,
> and two the same tasks run on the 1024-capacity cpu contributing 50% usage.
> "capacity" in this formula contains uarch capacity, "load" in this formula
> must be an absolute real load, not relative.
> 
> But with current kernel implementation, "load" computed without this patch
> is a relative value. For example, one task (1024 weight) runs on a 1024
> capacity CPU, it gets 256 load contribution(25% on this CPU). When it runs
> on a 512 capacity CPU, it will get the 512 load contribution(50% on ths
> CPU).
> See, currently runnable "load" is relative, so "avg_load" is actually wrong
> and its value equals that of "load". So I think the runnable load should be
> made cpu scale-invariant as well.
> 
> Please point me out if I was wrong.

Cpu-scaled load leads to wrong lb decisions in overload scenarios:

(1) Overload example taken from email thread between Vincent and Morten:
    https://lkml.org/lkml/2014/12/30/114

7 always running tasks, 4 on cluster 0, 3 on cluster 1:

		cluster 0	cluster 1
capacity	1024 (2*512)	1024 (1*1024)
load		4096		3072
scale_load	2048		3072

Simply using cpu-scaled load in the existing lb code would declare
cluster 1 busier than cluster 0, although the compute capacity budget
for one task is higher on cluster 1 (1024/3 = 341) than on cluster 0
(2*512/4 = 256).
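
Spelling out the avg_load = load/capacity arithmetic for (1):

		cluster 0	cluster 1
avg_load	4096/1024 = 4	3072/1024 = 3	-> cluster 0 busier (correct)
w/ scale_load	2048/1024 = 2	3072/1024 = 3	-> cluster 1 busier (wrong)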

(2) A non-overload example does not show this problem:

7 12.5% (scaled to 1024) tasks, 4 on cluster 0, 3 on cluster 1:

		cluster 0	cluster 1
capacity	1024 (2*512)	1024 (1*1024)
load		1024		384
scale_load	512		384

Here cluster 0 is busier whether we take load or cpu-scaled load.

We should continue to use avg_load based on load (maybe recovered from the
scaled load once it is introduced?) for overload scenarios and use
scale_load for non-overload scenarios. Since this hasn't been implemented
yet, we got rid of cpu-scaled load in this RFC.

[...]


^ permalink raw reply	[flat|nested] 124+ messages in thread

end of thread

Thread overview: 124+ messages
2015-02-04 18:30 [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 01/48] sched: add utilization_avg_contrib Morten Rasmussen
2015-02-11  8:50   ` Preeti U Murthy
2015-02-12  1:07     ` Vincent Guittot
2015-02-04 18:30 ` [RFCv3 PATCH 02/48] sched: Track group sched_entity usage contributions Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 03/48] sched: remove frequency scaling from cpu_capacity Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 04/48] sched: Make sched entity usage tracking frequency-invariant Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 05/48] sched: make scale_rt invariant with frequency Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 06/48] sched: add per rq cpu_capacity_orig Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 07/48] sched: get CPU's usage statistic Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 08/48] sched: replace capacity_factor by usage Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 09/48] sched: add SD_PREFER_SIBLING for SMT level Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 10/48] sched: move cfs task on a CPU with higher capacity Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 11/48] sched: Make load tracking frequency scale-invariant Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 12/48] sched: Make usage tracking cpu scale-invariant Morten Rasmussen
2015-03-23 14:46   ` Peter Zijlstra
2015-03-23 19:19     ` Dietmar Eggemann
     [not found]       ` <OF8A3E3617.0D4400A5-ON48257E3A.001B38D9-48257E3A.002379A4@zte.com.cn>
2015-05-06  9:49         ` Dietmar Eggemann
2015-02-04 18:30 ` [RFCv3 PATCH 13/48] cpufreq: Architecture specific callback for frequency changes Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 14/48] arm: Frequency invariant scheduler load-tracking support Morten Rasmussen
2015-03-23 13:39   ` Peter Zijlstra
2015-03-24  9:41     ` Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 15/48] arm: vexpress: Add CPU clock-frequencies to TC2 device-tree Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 16/48] arm: Cpu invariant scheduler load-tracking support Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 17/48] sched: Get rid of scaling usage by cpu_capacity_orig Morten Rasmussen
     [not found]   ` <OFFC493540.15A92099-ON48257E35.0026F60C-48257E35.0027A5FB@zte.com.cn>
2015-04-28 16:54     ` Dietmar Eggemann
2015-02-04 18:30 ` [RFCv3 PATCH 18/48] sched: Track blocked utilization contributions Morten Rasmussen
2015-03-23 14:08   ` Peter Zijlstra
2015-03-24  9:43     ` Morten Rasmussen
2015-03-24 16:07       ` Peter Zijlstra
2015-03-24 17:44         ` Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 19/48] sched: Include blocked utilization in usage tracking Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 20/48] sched: Documentation for scheduler energy cost model Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 21/48] sched: Make energy awareness a sched feature Morten Rasmussen
2015-02-04 18:30 ` [RFCv3 PATCH 22/48] sched: Introduce energy data structures Morten Rasmussen
2015-02-04 18:31 ` [RFCv3 PATCH 23/48] sched: Allocate and initialize " Morten Rasmussen
     [not found]   ` <OF29F384AC.37929D8E-ON48257E35.002FCB0C-48257E35.003156FE@zte.com.cn>
2015-04-29 15:43     ` Dietmar Eggemann
2015-02-04 18:31 ` [RFCv3 PATCH 24/48] sched: Introduce SD_SHARE_CAP_STATES sched_domain flag Morten Rasmussen
2015-02-04 18:31 ` [RFCv3 PATCH 25/48] arm: topology: Define TC2 energy and provide it to the scheduler Morten Rasmussen
2015-02-04 18:31 ` [RFCv3 PATCH 26/48] sched: Compute cpu capacity available at current frequency Morten Rasmussen
2015-02-04 18:31 ` [RFCv3 PATCH 27/48] sched: Relocated get_cpu_usage() Morten Rasmussen
2015-02-04 18:31 ` [RFCv3 PATCH 28/48] sched: Use capacity_curr to cap utilization in get_cpu_usage() Morten Rasmussen
2015-03-23 16:14   ` Peter Zijlstra
2015-03-24 11:36     ` Morten Rasmussen
2015-03-24 12:59       ` Peter Zijlstra
2015-02-04 18:31 ` [RFCv3 PATCH 29/48] sched: Highest energy aware balancing sched_domain level pointer Morten Rasmussen
2015-03-23 16:16   ` Peter Zijlstra
2015-03-24 10:52     ` Morten Rasmussen
     [not found]   ` <OF5977496A.A21A7B96-ON48257E35.002EC23C-48257E35.00324DAD@zte.com.cn>
2015-04-29 15:54     ` Dietmar Eggemann
2015-02-04 18:31 ` [RFCv3 PATCH 30/48] sched: Calculate energy consumption of sched_group Morten Rasmussen
2015-03-13 22:54   ` Sai Gurrappadi
2015-03-16 14:15     ` Morten Rasmussen
2015-03-23 16:47       ` Peter Zijlstra
2015-03-23 20:21         ` Dietmar Eggemann
2015-03-24 10:44           ` Morten Rasmussen
2015-03-24 16:10             ` Peter Zijlstra
2015-03-24 17:39               ` Morten Rasmussen
2015-03-26 15:23                 ` Dietmar Eggemann
2015-03-20 18:40   ` Sai Gurrappadi
2015-03-27 15:58     ` Morten Rasmussen
2015-02-04 18:31 ` [RFCv3 PATCH 31/48] sched: Extend sched_group_energy to test load-balancing decisions Morten Rasmussen
     [not found]   ` <OF081FBA75.F80B8844-ON48257E37.00261E89-48257E37.00267F24@zte.com.cn>
2015-04-30 20:26     ` Dietmar Eggemann
2015-02-04 18:31 ` [RFCv3 PATCH 32/48] sched: Estimate energy impact of scheduling decisions Morten Rasmussen
2015-02-04 18:31 ` [RFCv3 PATCH 33/48] sched: Energy-aware wake-up task placement Morten Rasmussen
2015-03-13 22:47   ` Sai Gurrappadi
2015-03-16 14:47     ` Morten Rasmussen
2015-03-18 20:15       ` Sai Gurrappadi
2015-03-27 16:37         ` Morten Rasmussen
2015-03-24 13:00       ` Peter Zijlstra
2015-03-24 15:24         ` Morten Rasmussen
2015-03-24 13:00   ` Peter Zijlstra
2015-03-24 15:42     ` Morten Rasmussen
2015-03-24 15:53       ` Peter Zijlstra
2015-03-24 17:47         ` Morten Rasmussen
2015-03-24 16:35   ` Peter Zijlstra
2015-03-25 18:01     ` Juri Lelli
2015-03-25 18:14       ` Peter Zijlstra
2015-03-26 10:21         ` Juri Lelli
2015-03-26 10:41           ` Peter Zijlstra
2015-04-27 16:01             ` Michael Turquette
2015-04-28 13:06               ` Peter Zijlstra
2015-02-04 18:31 ` [RFCv3 PATCH 34/48] sched: Bias new task wakeups towards higher capacity cpus Morten Rasmussen
2015-03-24 13:33   ` Peter Zijlstra
2015-03-25 18:18     ` Morten Rasmussen
2015-02-04 18:31 ` [RFCv3 PATCH 35/48] sched, cpuidle: Track cpuidle state index in the scheduler Morten Rasmussen
2015-02-04 18:31 ` [RFCv3 PATCH 36/48] sched: Count number of shallower idle-states in struct sched_group_energy Morten Rasmussen
2015-03-24 13:14   ` Peter Zijlstra
2015-03-24 17:13     ` Morten Rasmussen
2015-02-04 18:31 ` [RFCv3 PATCH 37/48] sched: Determine the current sched_group idle-state Morten Rasmussen
     [not found]   ` <OF1FDC99CD.22435E74-ON48257E37.001BA739-48257E37.001CA5ED@zte.com.cn>
2015-04-30 20:17     ` Dietmar Eggemann
     [not found]       ` <OF2F4202E4.8A4AF229-ON48257E38.00312CD4-48257E38.0036ADB6@zte.com.cn>
2015-05-01 15:09         ` Dietmar Eggemann
2015-02-04 18:31 ` [RFCv3 PATCH 38/48] sched: Infrastructure to query if load balancing is energy-aware Morten Rasmussen
2015-03-24 13:41   ` Peter Zijlstra
2015-03-24 16:17     ` Dietmar Eggemann
2015-03-24 13:56   ` Peter Zijlstra
2015-03-24 16:22     ` Dietmar Eggemann
2015-02-04 18:31 ` [RFCv3 PATCH 39/48] sched: Introduce energy awareness into update_sg_lb_stats Morten Rasmussen
2015-02-04 18:31 ` [RFCv3 PATCH 40/48] sched: Introduce energy awareness into update_sd_lb_stats Morten Rasmussen
2015-02-04 18:31 ` [RFCv3 PATCH 41/48] sched: Introduce energy awareness into find_busiest_group Morten Rasmussen
2015-02-04 18:31 ` [RFCv3 PATCH 42/48] sched: Introduce energy awareness into find_busiest_queue Morten Rasmussen
2015-03-24 15:21   ` Peter Zijlstra
2015-03-24 18:04     ` Dietmar Eggemann
2015-02-04 18:31 ` [RFCv3 PATCH 43/48] sched: Introduce energy awareness into detach_tasks Morten Rasmussen
2015-03-24 15:25   ` Peter Zijlstra
2015-03-25 23:50   ` Sai Gurrappadi
2015-03-27 15:03     ` Dietmar Eggemann
     [not found]       ` <OFDCE15EEF.2F536D7F-ON48257E37.002565ED-48257E37.0027A8B9@zte.com.cn>
2015-04-30 20:35         ` Dietmar Eggemann
2015-02-04 18:31 ` [RFCv3 PATCH 44/48] sched: Tipping point from energy-aware to conventional load balancing Morten Rasmussen
2015-03-24 15:26   ` Peter Zijlstra
2015-03-24 18:47     ` Dietmar Eggemann
2015-02-04 18:31 ` [RFCv3 PATCH 45/48] sched: Skip cpu as lb src which has one task and capacity gte the dst cpu Morten Rasmussen
2015-03-24 15:27   ` Peter Zijlstra
2015-03-25 18:44     ` Dietmar Eggemann
     [not found]       ` <OF9320540C.255228F9-ON48257E37.002A02D1-48257E37.002AB5EE@zte.com.cn>
2015-05-05 10:01         ` Dietmar Eggemann
2015-02-04 18:31 ` [RFCv3 PATCH 46/48] sched: Turn off fast idling of cpus on a partially loaded system Morten Rasmussen
2015-03-24 16:01   ` Peter Zijlstra
2015-02-04 18:31 ` [RFCv3 PATCH 47/48] sched: Enable active migration for cpus of lower capacity Morten Rasmussen
2015-03-24 16:02   ` Peter Zijlstra
2015-02-04 18:31 ` [RFCv3 PATCH 48/48] sched: Disable energy-unfriendly nohz kicks Morten Rasmussen
2015-02-20 19:26   ` Dietmar Eggemann
2015-04-02 12:43 ` [RFCv3 PATCH 00/48] sched: Energy cost model for energy-aware scheduling Vincent Guittot
2015-04-08 13:33   ` Morten Rasmussen
2015-04-09  7:41     ` Vincent Guittot
2015-04-10 14:46       ` Morten Rasmussen
