* [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling
@ 2014-05-23 18:16 Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 01/16] sched: Documentation for scheduler energy cost model Morten Rasmussen
                   ` (15 more replies)
  0 siblings, 16 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

Several techniques for saving energy through various scheduler
modifications have been proposed in the past; however, most of them
have not been universally beneficial for all use-cases and
platforms. For example, consolidating tasks on fewer cpus is an
effective way to save energy on some platforms, while it might make
things worse on others.

This proposal, which is inspired by the Ksummit workshop discussions
last year [1], takes a different approach by using a (relatively) simple
platform energy cost model to guide scheduling decisions. By providing
the model with platform-specific cost data, the model can provide an
estimate of the energy implications of scheduling decisions. So instead
of blindly applying scheduling techniques that may or may not work for
the current use-case, the scheduler can make informed energy-aware
decisions. We believe this approach provides a methodology that can be
adapted to any platform, including heterogeneous systems such as ARM
big.LITTLE. The model considers cpus only. Model data includes power
consumption at each P-state, C-state power consumption, and wake-up
energy costs. However, the energy model could potentially be extended
to guide performance/energy decisions in other subsystems.

The scheduler can use energy_diff_task(cpu, task) to estimate the cost
of placing a task on a specific cpu and compare energy costs of
different cpus.
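
For illustration only (this loop is not code from the patches; p,
best_cpu and candidate_mask are hypothetical names), a placement
decision could compare candidates along these lines:

	int cpu, best_cpu = -1, best_diff = INT_MAX;

	/* Sketch: pick the cpu with the lowest estimated energy delta. */
	for_each_cpu(cpu, candidate_mask) {
		int diff = energy_diff_task(cpu, p);

		if (diff < best_diff) {
			best_diff = diff;
			best_cpu = cpu;
		}
	}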

This is an RFC and there are some loose ends that have not been
addressed here or in the code yet. The model and its infrastructure are
in place in the scheduler and are used for load-balancing decisions.
However, for now this is limited to the select_task_rq_fair() path for
fork/exec/wake balancing; periodic and idle balance have not been
modified yet. There are quite a few dirty hacks in there to tie things
together. To mention a few current limitations:

1. Due to the lack of scale invariant cpu and task utilization, it 
   doesn't work properly with frequency scaling or heterogeneous systems 
   (big.LITTLE).

2. Lacking a proper utilization metric, it is assumed that utilization ==
   load. This is only close to being a reasonable assumption if all
   tasks have nice=0 (see the example after this list).

3. Platform data for the test platform (ARM TC2) has been hardcoded in 
   arch/arm/ code.

4. Support for multiple per cpu C-states is not implemented yet.
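
To illustrate limitation 2 with rough numbers (illustrative only, not
measurements from the patches): a task that is busy 10% of the time at
nice=0 (weight 1024) contributes roughly 0.1*1024 = 102 to
weighted_cpuload(), while the same 10%-busy task at nice=-10 (weight
9548) contributes roughly 0.1*9548 = 955, so the load-as-utilization
proxy would see it as nearly a fully utilized cpu.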

However, the main ideas and the primary focus of this RFC, the energy
model and energy_diff_{load, task}(), are there.

Due to limitation 1, the ARM TC2 platform (2xA15+3xA7) was set up to
disable frequency scaling and set frequencies to eliminate the
big.LITTLE performance difference. That basically turns TC2 into an SMP
platform where a subset of the cpus are less energy-efficient.

Tests using a synthetic workload with seven short running periodic
tasks of different size and period, and the sysbench cpu benchmark with
five threads gave the following results:

cpu energy*	short tasks	sysbench
Mainline	100		100
EA		 50		 97

* Note that these energy savings are _not_ representative of what can be
achieved on a true SMP platform where all cpus are equally 
energy-efficient. There should be benefit for SMP platforms as well, 
however, it will be smaller.

The energy model led to consolidation of the short tasks on the A7
cluster (more energy-efficient), while sysbench made use of all cpus as
the A7s didn't have sufficient compute capacity to handle the five
tasks.

To see how scheduling would behave if all cpus had been A7s, the same
tests were run with the A15s' energy model set to that of the A7s (i.e.
lying about the platform to the scheduler energy model). The scheduling
pattern for the short tasks changed to being consolidated on either the
A7 or the A15 cluster instead of always on the A7, which was expected.
Currently, there are no tools available to easily deduce energy from
traces using a platform energy model, which could have estimated the
energy benefit. Linaro is currently looking into extending the idlestat
tool [3] to do this.

Testing using Android workloads [2] didn't go well due to Android's
extensive use of task priorities combined with limitation 2. Once these
limitations have been addressed, benefit is expected on Android as well,
which is a key target.

The latency overhead induced by the energy model in
select_task_rq_fair() for this unoptimized implementation on TC2 is:

latency		avg (depending on cpu)
Mainline	 2.5 -  4.7 us
EA		10.9 - 16.5 us

However, it should be possible to reduce this significantly.

Patch   1-4: Infrastructure to set up energy model data
Patch   5-9: Bits and pieces needed for the energy model
Patch 10-15: The energy model and scheduler tweaks

This series is based on top of Vincent's topology patches [4].

[1] http://etherpad.osuosl.org/energy-aware-scheduling-ks-2013 (search
for 'cost')
[2] https://lkml.org/lkml/2014/1/7/355
[3] http://git.linaro.org/power/idlestat.git
[4] https://lkml.org/lkml/2014/4/11/137

Dietmar Eggemann (5):
  sched: Introduce sd energy data structures
  sched: Allocate and initialize sched energy
  sched: Add sd energy procfs interface
  arm: topology: Define TC2 sched energy and provide it to scheduler
  sched: Introduce system-wide sched_energy

Morten Rasmussen (11):
  sched: Documentation for scheduler energy cost model
  sched: Introduce CONFIG_SCHED_ENERGY
  sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
  sched, cpufreq: Introduce current cpu compute capacity into scheduler
  sched, cpufreq: Current compute capacity hack for ARM TC2
  sched: Energy model functions
  sched: Task wakeup tracking
  sched: Take task wakeups into account in energy estimates
  sched: Use energy model in select_idle_sibling
  sched: Use energy to guide wakeup task placement
  sched: Disable wake_affine to broaden the scope of wakeup target cpus

 Documentation/scheduler/sched-energy.txt |   66 ++++++
 arch/arm/Kconfig                         |    5 +
 arch/arm/kernel/topology.c               |  120 +++++++++-
 drivers/cpufreq/cpufreq.c                |    8 +
 include/linux/sched.h                    |   30 +++
 kernel/sched/core.c                      |  192 +++++++++++++++-
 kernel/sched/fair.c                      |  359 +++++++++++++++++++++++++++++-
 kernel/sched/sched.h                     |   44 ++++
 8 files changed, 805 insertions(+), 19 deletions(-)
 create mode 100644 Documentation/scheduler/sched-energy.txt

-- 
1.7.9.5




* [RFC PATCH 01/16] sched: Documentation for scheduler energy cost model
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-06-05  8:49   ` Vincent Guittot
  2014-05-23 18:16 ` [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY Morten Rasmussen
                   ` (14 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

This documentation patch provides a brief overview of the experimental
scheduler energy costing model and associated data structures.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 Documentation/scheduler/sched-energy.txt |   66 ++++++++++++++++++++++++++++++
 1 file changed, 66 insertions(+)
 create mode 100644 Documentation/scheduler/sched-energy.txt

diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
new file mode 100644
index 0000000..c6896c0
--- /dev/null
+++ b/Documentation/scheduler/sched-energy.txt
@@ -0,0 +1,66 @@
+Energy cost model for energy-aware scheduling (EXPERIMENTAL)
+
+Introduction
+=============
+The basic energy model uses platform energy data stored in sched_energy data
+structures attached to the sched_groups in the sched_domain hierarchy. The
+energy cost model offers two functions that can be used to guide scheduling
+decisions:
+
+1.	energy_diff_util(cpu, util, wakeups) 
+2.	energy_diff_task(cpu, task)
+
+Both return the energy cost delta caused by adding/removing utilization or a
+task to/from a specific cpu.
+
+CONFIG_SCHED_ENERGY needs to be defined in Kconfig to enable the energy cost
+model and associated data structures.
+
+The basic algorithm
+====================
+The basic idea is to determine the energy cost at each level in sched_domain
+hierarchy based on utilization:
+
+	for_each_domain(cpu, sd) {
+		sg = sched_group_of(cpu)
+		energy_before = curr_util(sg) * busy_power(sg)
+				+ (1 - curr_util(sg)) * idle_power(sg)
+		energy_after = new_util(sg) * busy_power(sg)
+				+ (1 - new_util(sg)) * idle_power(sg)
+				+ new_util(sg) * task_wakeups
+						* wakeup_energy(sg)
+		energy_diff += energy_before - energy_after
+	}
+
+	return energy_diff
+
+Platform energy data
+=====================
+struct sched_energy has the following members:
+
+cap_states:
+	List of struct capacity_state representing the supported capacity states
+	(P-states). struct capacity_state has two members: cap and power, which
+	represent the compute capacity and the busy power of the state. The
+	list must be ordered by capacity, low->high.
+
+nr_cap_states:
+	Number of capacity states in cap_states.
+
+max_capacity:
+	The highest capacity supported by any of the capacity states in 
+	cap_states.
+
+idle_power:
+	Idle power consumption. Will be extended to support multiple C-states 
+	later.
+
+wakeup_energy:
+	Energy cost of wakeup/power-down cycle for the sched_group which this is 
+	attached to. Will be extended to support different costs for different 
+	C-states later.
+
+There are no unit requirements for the energy cost data. Data can be
+normalized with any reference, however, the normalization must be consistent
+across all energy cost data. That is, one bogo-joule/watt must be the same
+quantity for all data, but we don't care what it is.
-- 
1.7.9.5




* [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 01/16] sched: Documentation for scheduler energy cost model Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-06-08  6:03   ` Henrik Austad
  2014-05-23 18:16 ` [RFC PATCH 03/16] sched: Introduce sd energy data structures Morten Rasmussen
                   ` (13 subsequent siblings)
  15 siblings, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

The Energy-aware scheduler implementation is guarded by
CONFIG_SCHED_ENERGY.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/arm/Kconfig |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
index ab438cb..bfc3a85 100644
--- a/arch/arm/Kconfig
+++ b/arch/arm/Kconfig
@@ -1926,6 +1926,11 @@ config XEN
 	help
 	  Say Y if you want to run Linux in a Virtual Machine on Xen on ARM.
 
+config SCHED_ENERGY
+	bool "Energy-aware scheduling (EXPERIMENTAL)"
+	help
+	  Highly experimental energy aware task scheduling.
+
 endmenu
 
 menu "Boot options"
-- 
1.7.9.5




* [RFC PATCH 03/16] sched: Introduce sd energy data structures
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 01/16] sched: Documentation for scheduler energy cost model Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 04/16] sched: Allocate and initialize sched energy Morten Rasmussen
                   ` (12 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

The struct sched_energy represents the per scheduler group data which
is needed for the energy aware scheduler.

It contains a pointer to a struct capacity_state array which contains
(compute capacity, power consumption @ this compute capacity) tuples.

The struct sched_group_energy wraps struct sched_energy and an atomic
reference counter; the latter is used for scheduler-internal bookkeeping
of data allocation and freeing.

Allocation and freeing of struct sched_group_energy uses the existing
infrastructure of the scheduler which is currently used for the other sd
hierarchy data structures (e.g. struct sched_domain).  That's why struct
sd_data is provisioned with a per cpu struct sched_group_energy double
pointer.

The struct sched_group gets a pointer to a struct sched_group_energy.

The function ptr sched_domain_energy_f is introduced into struct
sched_domain_topology_level which allows the arch to pass a particular
struct sched_energy from the topology shim layer into the scheduler
core.

The function ptr sched_domain_energy_f has an 'int cpu' parameter since
the folding of two adjacent sd levels via sd degenerate doesn't work
for all sd levels.  E.g. it is not possible to use this feature to
provide per-cpu sd energy in sd level DIE (former CPU) on ARM's TC2
platform.

It was discussed that the folding of sd levels approach is preferable
over the cpu parameter approach, simply because the user (the arch
specifying the sd topology table) can introduce fewer errors.  But since
it is not working, the 'int cpu' parameter is the only way out.  It is
possible to use the folding of sd levels approach for
sched_domain_flags_f and the cpu parameter approach for
sched_domain_energy_f in the same set-up, though.  With the use of the
'int cpu' parameter, an extra check function has to be provided to make
sure that all cpus spanned by a scheduler building block (e.g. a sched
domain or a group) are provisioned with the same energy data.
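
As a minimal sketch of what an arch shim could provide (illustrative
only, not code from this patch; is_big_cpu() and the energy_core_*
tables are hypothetical names, the actual TC2 functions appear later in
this series):

	/* Illustrative sketch: pick a per-cpu-type energy table. */
	static const struct sched_energy *cpu_core_energy(int cpu)
	{
		return is_big_cpu(cpu) ? &energy_core_big : &energy_core_little;
	}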

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 include/linux/sched.h |   24 ++++++++++++++++++++++++
 kernel/sched/sched.h  |   10 ++++++++++
 2 files changed, 34 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 261a419..4eb149b 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -909,6 +909,21 @@ struct sched_domain_attr {
 
 extern int sched_domain_level_max;
 
+#ifdef CONFIG_SCHED_ENERGY
+struct capacity_state {
+	int cap;	/* compute capacity */
+	int power;	/* power consumption at this compute capacity */
+};
+
+struct sched_energy {
+	long max_capacity;	/* maximal compute capacity */
+	int idle_power;		/* power consumption in idle state */
+	int wakeup_energy;	/* energy for wakeup->sleep cycle (x1024) */
+	int nr_cap_states;	/* number of capacity states */
+	struct capacity_state *cap_states; /* ptr to capacity state array */
+};
+#endif
+
 struct sched_group;
 
 struct sched_domain {
@@ -1007,6 +1022,9 @@ bool cpus_share_cache(int this_cpu, int that_cpu);
 
 typedef const struct cpumask *(*sched_domain_mask_f)(int cpu);
 typedef const int (*sched_domain_flags_f)(void);
+#ifdef CONFIG_SCHED_ENERGY
+typedef const struct sched_energy *(*sched_domain_energy_f)(int cpu);
+#endif
 
 #define SDTL_OVERLAP	0x01
 
@@ -1014,11 +1032,17 @@ struct sd_data {
 	struct sched_domain **__percpu sd;
 	struct sched_group **__percpu sg;
 	struct sched_group_power **__percpu sgp;
+#ifdef CONFIG_SCHED_ENERGY
+	struct sched_group_energy **__percpu sge;
+#endif
 };
 
 struct sched_domain_topology_level {
 	sched_domain_mask_f mask;
 	sched_domain_flags_f sd_flags;
+#ifdef CONFIG_SCHED_ENERGY
+	sched_domain_energy_f energy;
+#endif
 	int		    flags;
 	int		    numa_level;
 	struct sd_data      data;
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 456e492..c566f5e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -755,12 +755,22 @@ struct sched_group_power {
 	unsigned long cpumask[0]; /* iteration mask */
 };
 
+#ifdef CONFIG_SCHED_ENERGY
+struct sched_group_energy {
+	atomic_t ref;
+	struct sched_energy data;
+};
+#endif
+
 struct sched_group {
 	struct sched_group *next;	/* Must be a circular list */
 	atomic_t ref;
 
 	unsigned int group_weight;
 	struct sched_group_power *sgp;
+#ifdef CONFIG_SCHED_ENERGY
+	struct sched_group_energy *sge;
+#endif
 
 	/*
 	 * The CPUs this group covers.
-- 
1.7.9.5




* [RFC PATCH 04/16] sched: Allocate and initialize sched energy
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (2 preceding siblings ...)
  2014-05-23 18:16 ` [RFC PATCH 03/16] sched: Introduce sd energy data structures Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 05/16] sched: Add sd energy procfs interface Morten Rasmussen
                   ` (11 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

The per sg struct sched_group_energy structure plus the related struct
capacity_state array are allocated like the other sd hierarchy data
structures (e.g. struct sched_group).  This includes the freeing of
struct sched_group_energy structures which are not used.

One problem is that the sd energy information consists of two
structures per sg, the actual struct sched_group_energy and the related
capacity_state array, and that the number of elements of this array is
configurable (see struct sched_group_energy.nr_cap_states).  That means
that the number of capacity states has to be figured out in
__sdt_alloc(), and since both data structures are allocated at the same
time, struct sched_group_energy.cap_states is initialized to point to
the start of the capacity state array memory.

The new function init_sched_energy() initializes the per sg struct
sched_group_energy and the struct capacity_state array in case the struct
sched_domain_topology_level contains sd energy information.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/core.c  |   86 ++++++++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h |   30 ++++++++++++++++++
 2 files changed, 116 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 851cbd8..785b61d 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5438,6 +5438,9 @@ static void free_sched_domain(struct rcu_head *rcu)
 		free_sched_groups(sd->groups, 1);
 	} else if (atomic_dec_and_test(&sd->groups->ref)) {
 		kfree(sd->groups->sgp);
+#ifdef CONFIG_SCHED_ENERGY
+		kfree(sd->groups->sge);
+#endif
 		kfree(sd->groups);
 	}
 	kfree(sd);
@@ -5698,6 +5701,10 @@ static int get_group(int cpu, struct sd_data *sdd, struct sched_group **sg)
 		*sg = *per_cpu_ptr(sdd->sg, cpu);
 		(*sg)->sgp = *per_cpu_ptr(sdd->sgp, cpu);
 		atomic_set(&(*sg)->sgp->ref, 1); /* for claim_allocations */
+#ifdef CONFIG_SCHED_ENERGY
+		(*sg)->sge = *per_cpu_ptr(sdd->sge, cpu);
+		atomic_set(&(*sg)->sge->ref, 1); /* for claim_allocations */
+#endif
 	}
 
 	return cpu;
@@ -5789,6 +5796,31 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 	atomic_set(&sg->sgp->nr_busy_cpus, sg->group_weight);
 }
 
+#ifdef CONFIG_SCHED_ENERGY
+static void init_sched_energy(int cpu, struct sched_domain *sd,
+			      struct sched_domain_topology_level *tl)
+{
+	struct sched_group *sg = sd->groups;
+	struct sched_energy *energy = &sg->sge->data;
+	sched_domain_energy_f fn = tl->energy;
+	struct cpumask *mask = sched_group_cpus(sg);
+
+	if (!fn || !fn(cpu))
+		return;
+
+	if (cpumask_weight(mask) > 1)
+		check_sched_energy_data(cpu, fn, mask);
+
+	energy->max_capacity = fn(cpu)->max_capacity;
+	energy->idle_power = fn(cpu)->idle_power;
+	energy->wakeup_energy = fn(cpu)->wakeup_energy;
+	energy->nr_cap_states = fn(cpu)->nr_cap_states;
+
+	memcpy(energy->cap_states, fn(cpu)->cap_states,
+	       energy->nr_cap_states*sizeof(struct capacity_state));
+}
+#endif
+
 /*
  * Initializers for schedule domains
  * Non-inlined to reduce accumulated stack pressure in build_sched_domains()
@@ -5879,6 +5911,11 @@ static void claim_allocations(int cpu, struct sched_domain *sd)
 
 	if (atomic_read(&(*per_cpu_ptr(sdd->sgp, cpu))->ref))
 		*per_cpu_ptr(sdd->sgp, cpu) = NULL;
+
+#ifdef CONFIG_SCHED_ENERGY
+	if (atomic_read(&(*per_cpu_ptr(sdd->sge, cpu))->ref))
+		*per_cpu_ptr(sdd->sge, cpu) = NULL;
+#endif
 }
 
 #ifdef CONFIG_NUMA
@@ -6284,10 +6321,29 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 		if (!sdd->sgp)
 			return -ENOMEM;
 
+#ifdef CONFIG_SCHED_ENERGY
+		sdd->sge = alloc_percpu(struct sched_group_energy *);
+		if (!sdd->sge)
+			return -ENOMEM;
+#endif
 		for_each_cpu(j, cpu_map) {
 			struct sched_domain *sd;
 			struct sched_group *sg;
 			struct sched_group_power *sgp;
+#ifdef CONFIG_SCHED_ENERGY
+			struct sched_group_energy *sge;
+			sched_domain_energy_f fn = tl->energy;
+
+			/*
+			 * Figure out how many elements the cap state array has
+			 * to contain.
+			 * In case tl->info.energy(j)->nr_cap_states is 0, we
+			 * still allocate struct sched_group_energy XXX which is
+			 * not used but will be freed later XXX.
+			 */
+			unsigned int nr_cap_states = !fn || !fn(j) ? 0 :
+					fn(j)->nr_cap_states;
+#endif
 
 		       	sd = kzalloc_node(sizeof(struct sched_domain) + cpumask_size(),
 					GFP_KERNEL, cpu_to_node(j));
@@ -6311,6 +6367,20 @@ static int __sdt_alloc(const struct cpumask *cpu_map)
 				return -ENOMEM;
 
 			*per_cpu_ptr(sdd->sgp, j) = sgp;
+
+#ifdef CONFIG_SCHED_ENERGY
+			sge = kzalloc_node(sizeof(struct sched_group_energy) +
+				nr_cap_states*sizeof(struct capacity_state),
+				GFP_KERNEL, cpu_to_node(j));
+
+			if (!sge)
+				return -ENOMEM;
+
+			sge->data.cap_states = (struct capacity_state *)((void *)sge +
+				 sizeof(struct sched_group_energy));
+
+			*per_cpu_ptr(sdd->sge, j) = sge;
+#endif
 		}
 	}
 
@@ -6339,6 +6409,10 @@ static void __sdt_free(const struct cpumask *cpu_map)
 				kfree(*per_cpu_ptr(sdd->sg, j));
 			if (sdd->sgp)
 				kfree(*per_cpu_ptr(sdd->sgp, j));
+#ifdef CONFIG_SCHED_ENERGY
+			if (sdd->sge)
+				kfree(*per_cpu_ptr(sdd->sge, j));
+#endif
 		}
 		free_percpu(sdd->sd);
 		sdd->sd = NULL;
@@ -6346,6 +6420,10 @@ static void __sdt_free(const struct cpumask *cpu_map)
 		sdd->sg = NULL;
 		free_percpu(sdd->sgp);
 		sdd->sgp = NULL;
+#ifdef CONFIG_SCHED_ENERGY
+		free_percpu(sdd->sge);
+		sdd->sge = NULL;
+#endif
 	}
 }
 
@@ -6417,10 +6495,18 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 
 	/* Calculate CPU power for physical packages and nodes */
 	for (i = nr_cpumask_bits-1; i >= 0; i--) {
+#ifdef CONFIG_SCHED_ENERGY
+		struct sched_domain_topology_level *tl = sched_domain_topology;
+#endif
 		if (!cpumask_test_cpu(i, cpu_map))
 			continue;
 
+#ifdef CONFIG_SCHED_ENERGY
+		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent, tl++) {
+			init_sched_energy(i, sd, tl);
+#else
 		for (sd = *per_cpu_ptr(d.sd, i); sd; sd = sd->parent) {
+#endif
 			claim_allocations(i, sd);
 			init_sched_groups_power(i, sd);
 		}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c566f5e..6726437 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -807,6 +807,36 @@ static inline unsigned int group_first_cpu(struct sched_group *group)
 
 extern int group_balance_cpu(struct sched_group *sg);
 
+/*
+ * Check that the per-cpu provided sd energy data is consistent for all cpus
+ * within the mask.
+ */
+#ifdef CONFIG_SCHED_ENERGY
+static inline void check_sched_energy_data(int cpu, sched_domain_energy_f fn,
+					   const struct cpumask *cpumask)
+{
+	struct cpumask mask;
+	int i;
+
+	cpumask_xor(&mask, cpumask, get_cpu_mask(cpu));
+
+	for_each_cpu(i, &mask) {
+		int y = 0;
+
+		BUG_ON(fn(i)->max_capacity != fn(cpu)->max_capacity);
+		BUG_ON(fn(i)->idle_power != fn(cpu)->idle_power);
+		BUG_ON(fn(i)->wakeup_energy != fn(cpu)->wakeup_energy);
+		BUG_ON(fn(i)->nr_cap_states != fn(cpu)->nr_cap_states);
+
+		for (; y < (fn(i)->nr_cap_states); y++) {
+			BUG_ON(fn(i)->cap_states[y].cap !=
+					fn(cpu)->cap_states[y].cap);
+			BUG_ON(fn(i)->cap_states[y].power !=
+					fn(cpu)->cap_states[y].power);
+		}
+	}
+}
+#endif
 #endif /* CONFIG_SMP */
 
 #include "stats.h"
-- 
1.7.9.5




* [RFC PATCH 05/16] sched: Add sd energy procfs interface
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (3 preceding siblings ...)
  2014-05-23 18:16 ` [RFC PATCH 04/16] sched: Allocate and initialize sched energy Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler Morten Rasmussen
                   ` (10 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

This patch makes the values of the sd energy data structure available via
procfs.  The related files are placed as sub-directory named 'energy'
inside the /proc/sys/kernel/sched_domain/cpuX/domainY/groupZ directory for
those cpu/domain/group tuples which have sd energy information.

The following example depicts the contents of
/proc/sys/kernel/sched_domain/cpu0/domain0/group[01] for a system which
has sd energy information attached to domain level 0.

├── cpu0
│   ├── domain0
│   │   ├── busy_factor
│   │   ├── busy_idx
│   │   ├── cache_nice_tries
│   │   ├── flags
│   │   ├── forkexec_idx
│   │   ├── group0
│   │   │   └── energy
│   │   │       ├── cap_states
│   │   │       ├── idle_power
│   │   │       ├── max_capacity
│   │   │       ├── nr_cap_states
│   │   │       └── wakeup_energy
│   │   ├── group1
│   │   │   └── energy
│   │   │       ├── cap_states
│   │   │       ├── idle_power
│   │   │       ├── max_capacity
│   │   │       ├── nr_cap_states
│   │   │       └── wakeup_energy
│   │   ├── idle_idx
│   │   ├── imbalance_pct
│   │   ├── max_interval
│   │   ├── max_newidle_lb_cost
│   │   ├── min_interval
│   │   ├── name
│   │   ├── newidle_idx
│   │   └── wake_idx
│   └── domain1
│       ├── busy_factor
│       ├── busy_idx
│       ├── cache_nice_tries
│       ├── flags
│       ├── forkexec_idx
│       ├── idle_idx
│       ├── imbalance_pct
│       ├── max_interval
│       ├── max_newidle_lb_cost
│       ├── min_interval
│       ├── name
│       ├── newidle_idx
│       └── wake_idx

The files 'idle_power', 'max_capacity', 'nr_cap_states' and 'wakeup_energy'
contain a scalar value whereas 'cap_states' contains a vector of
(compute capacity, power consumption @ this compute capacity) tuples.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/core.c |   73 +++++++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 71 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 785b61d..096fa55 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -4837,10 +4837,66 @@ set_table_entry(struct ctl_table *entry,
 	}
 }
 
+#ifdef CONFIG_SCHED_ENERGY
+static struct ctl_table *
+sd_alloc_ctl_energy_table(struct sched_group_energy *sge)
+{
+	struct ctl_table *table = sd_alloc_ctl_entry(6);
+
+	if (table == NULL)
+		return NULL;
+
+	set_table_entry(&table[0], "max_capacity", &sge->data.max_capacity,
+			sizeof(long), 0644, proc_doulongvec_minmax, false);
+	set_table_entry(&table[1], "idle_power", &sge->data.idle_power,
+			sizeof(int), 0644, proc_dointvec_minmax, false);
+	set_table_entry(&table[2], "wakeup_energy", &sge->data.wakeup_energy,
+			sizeof(int), 0644, proc_dointvec_minmax, false);
+	set_table_entry(&table[3], "nr_cap_states", &sge->data.nr_cap_states,
+			sizeof(int), 0644, proc_dointvec_minmax, false);
+	set_table_entry(&table[4], "cap_states", &sge->data.cap_states[0].cap,
+			sge->data.nr_cap_states*2*sizeof(int), 0644,
+			proc_dointvec_minmax, false);
+
+	return table;
+}
+
+static struct ctl_table *
+sd_alloc_ctl_group_table(struct sched_group *sg)
+{
+	struct ctl_table *table = sd_alloc_ctl_entry(2);
+
+	if (table == NULL)
+		return NULL;
+
+	table->procname = kstrdup("energy", GFP_KERNEL);
+	table->mode = 0555;
+	table->child = sd_alloc_ctl_energy_table(sg->sge);
+
+	return table;
+}
+#endif
+
 static struct ctl_table *
 sd_alloc_ctl_domain_table(struct sched_domain *sd)
 {
-	struct ctl_table *table = sd_alloc_ctl_entry(14);
+	struct ctl_table *table;
+	unsigned int nr_entries = 14;
+
+#ifdef CONFIG_SCHED_ENERGY
+	int i = 0;
+	struct sched_group *sg = sd->groups;
+
+	if (sg->sge) {
+		int nr_sgs = 0;
+
+		do {} while (nr_sgs++, sg = sg->next, sg != sd->groups);
+
+		nr_entries += nr_sgs;
+	}
+#endif
+
+	table = sd_alloc_ctl_entry(nr_entries);
 
 	if (table == NULL)
 		return NULL;
@@ -4873,7 +4929,20 @@ sd_alloc_ctl_domain_table(struct sched_domain *sd)
 		sizeof(long), 0644, proc_doulongvec_minmax, false);
 	set_table_entry(&table[12], "name", sd->name,
 		CORENAME_MAX_SIZE, 0444, proc_dostring, false);
-	/* &table[13] is terminator */
+#ifdef CONFIG_SCHED_ENERGY
+	sg = sd->groups;
+	if (sg->sge) {
+		char buf[32];
+		struct ctl_table *entry = &table[13];
+		do {
+			snprintf(buf, 32, "group%d", i);
+			entry->procname = kstrdup(buf, GFP_KERNEL);
+			entry->mode = 0555;
+			entry->child = sd_alloc_ctl_group_table(sg);
+		} while (entry++, i++, sg = sg->next, sg != sd->groups);
+	}
+#endif
+	/* &table[nr_entries-1] is terminator */
 
 	return table;
 }
-- 
1.7.9.5




* [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (4 preceding siblings ...)
  2014-05-23 18:16 ` [RFC PATCH 05/16] sched: Add sd energy procfs interface Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-05-30 12:04   ` Peter Zijlstra
                     ` (2 more replies)
  2014-05-23 18:16 ` [RFC PATCH 07/16] sched: Introduce system-wide sched_energy Morten Rasmussen
                   ` (9 subsequent siblings)
  15 siblings, 3 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

!!! This patch is only here to be able to test provisioning of sched
energy related data from an arch topology shim layer to the scheduler.
Since there is no code today which deals with extracting sched energy
related data from the dtb or acpi and processing it in the topology
shim layer, the struct sched_energy and the related struct
capacity_state arrays are hard-coded here !!!

This patch defines the struct sched_energy and the related struct
capacity_state array for the cluster (relates to sg's in DIE sd level)
and for the core (relates to sg's in MC sd level) for a Cortex A7 as
well as for a Cortex A15. It further provides related implementations of
the sched_domain_energy_f functions (cpu_cluster_energy() and
cpu_core_energy()).

To be able to propagate this information from the topology shim layer to
the scheduler, the elements of the arm_topology[] table have been
provisioned with the appropriate sched_domain_energy_f functions.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 arch/arm/kernel/topology.c |  109 ++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 109 insertions(+)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 71e1fec..4050348 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -275,6 +275,107 @@ void store_cpu_topology(unsigned int cpuid)
 		cpu_topology[cpuid].socket_id, mpidr);
 }
 
+#ifdef CONFIG_SCHED_ENERGY
+/*
+ * ARM TC2 specific energy cost model data. There are no unit requirements for
+ * the data. Data can be normalized to any reference point, but the
+ * normalization must be consistent. That is, one bogo-joule/watt must be the
+ * same quantity for all data, but we don't care what it is.
+ */
+static struct capacity_state cap_states_cluster_a7[] = {
+	/* Cluster only power */
+	 { .cap =  358, .power = 2967, }, /*  350 MHz */
+	 { .cap =  410, .power = 2792, }, /*  400 MHz */
+	 { .cap =  512, .power = 2810, }, /*  500 MHz */
+	 { .cap =  614, .power = 2815, }, /*  600 MHz */
+	 { .cap =  717, .power = 2919, }, /*  700 MHz */
+	 { .cap =  819, .power = 2847, }, /*  800 MHz */
+	 { .cap =  922, .power = 3917, }, /*  900 MHz */
+	 { .cap = 1024, .power = 4905, }, /* 1000 MHz */
+	};
+
+static struct capacity_state cap_states_cluster_a15[] = {
+	/* Cluster only power */
+	 { .cap =  840, .power =  7920, }, /*  500 MHz */
+	 { .cap = 1008, .power =  8165, }, /*  600 MHz */
+	 { .cap = 1176, .power =  8172, }, /*  700 MHz */
+	 { .cap = 1343, .power =  8195, }, /*  800 MHz */
+	 { .cap = 1511, .power =  8265, }, /*  900 MHz */
+	 { .cap = 1679, .power =  8446, }, /* 1000 MHz */
+	 { .cap = 1847, .power = 11426, }, /* 1100 MHz */
+	 { .cap = 2015, .power = 15200, }, /* 1200 MHz */
+	};
+
+static struct sched_energy energy_cluster_a7 = {
+	  .max_capacity   = 1024,
+	  .idle_power     =   10, /* Cluster power-down */
+	  .wakeup_energy  =    6, /* << 10 */
+	  .nr_cap_states  = ARRAY_SIZE(cap_states_cluster_a7),
+	  .cap_states     = cap_states_cluster_a7,
+};
+
+static struct sched_energy energy_cluster_a15 = {
+	  .max_capacity   = 2015,
+	  .idle_power     =   25, /* Cluster power-down */
+	  .wakeup_energy  =  210, /* << 10 */
+	  .nr_cap_states  = ARRAY_SIZE(cap_states_cluster_a15),
+	  .cap_states     = cap_states_cluster_a15,
+};
+
+static struct capacity_state cap_states_core_a7[] = {
+	/* Power per cpu */
+	 { .cap =  358, .power =  187, }, /*  350 MHz */
+	 { .cap =  410, .power =  275, }, /*  400 MHz */
+	 { .cap =  512, .power =  334, }, /*  500 MHz */
+	 { .cap =  614, .power =  407, }, /*  600 MHz */
+	 { .cap =  717, .power =  447, }, /*  700 MHz */
+	 { .cap =  819, .power =  549, }, /*  800 MHz */
+	 { .cap =  922, .power =  761, }, /*  900 MHz */
+	 { .cap = 1024, .power = 1024, }, /* 1000 MHz */
+	};
+
+static struct capacity_state cap_states_core_a15[] = {
+	/* Power per cpu */
+	 { .cap =  840, .power = 2021, }, /*  500 MHz */
+	 { .cap = 1008, .power = 2312, }, /*  600 MHz */
+	 { .cap = 1176, .power = 2756, }, /*  700 MHz */
+	 { .cap = 1343, .power = 3125, }, /*  800 MHz */
+	 { .cap = 1511, .power = 3524, }, /*  900 MHz */
+	 { .cap = 1679, .power = 3846, }, /* 1000 MHz */
+	 { .cap = 1847, .power = 5177, }, /* 1100 MHz */
+	 { .cap = 2015, .power = 6997, }, /* 1200 MHz */
+	};
+
+static struct sched_energy energy_core_a7 = {
+	  .max_capacity   = 1024,
+	  .idle_power     =    0, /* No power gating */
+	  .wakeup_energy  =    0, /* << 10 */
+	  .nr_cap_states  = ARRAY_SIZE(cap_states_core_a7),
+	  .cap_states     = cap_states_core_a7,
+};
+
+static struct sched_energy energy_core_a15 = {
+	  .max_capacity   = 2015,
+	  .idle_power     =    0, /* No power gating */
+	  .wakeup_energy  =    5, /* << 10 */
+	  .nr_cap_states  = ARRAY_SIZE(cap_states_core_a15),
+	  .cap_states     = cap_states_core_a15,
+};
+
+/* sd energy functions */
+static inline const struct sched_energy *cpu_cluster_energy(int cpu)
+{
+	return cpu_topology[cpu].socket_id ? &energy_cluster_a7 :
+			&energy_cluster_a15;
+}
+
+static inline const struct sched_energy *cpu_core_energy(int cpu)
+{
+	return cpu_topology[cpu].socket_id ? &energy_core_a7 :
+			&energy_core_a15;
+}
+#endif /* CONFIG_SCHED_ENERGY */
+
 static inline const int cpu_corepower_flags(void)
 {
 	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
@@ -282,10 +383,18 @@ static inline const int cpu_corepower_flags(void)
 
 static struct sched_domain_topology_level arm_topology[] = {
 #ifdef CONFIG_SCHED_MC
+#ifdef CONFIG_SCHED_ENERGY
+	{ cpu_coregroup_mask, cpu_corepower_flags, cpu_core_energy, SD_INIT_NAME(MC) },
+#else
 	{ cpu_corepower_mask, cpu_corepower_flags, SD_INIT_NAME(GMC) },
 	{ cpu_coregroup_mask, cpu_core_flags, SD_INIT_NAME(MC) },
 #endif
+#endif
+#ifdef CONFIG_SCHED_ENERGY
+	{ cpu_cpu_mask, 0, cpu_cluster_energy, SD_INIT_NAME(DIE) },
+#else
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+#endif
 	{ NULL, },
 };
 
-- 
1.7.9.5




* [RFC PATCH 07/16] sched: Introduce system-wide sched_energy
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (5 preceding siblings ...)
  2014-05-23 18:16 ` [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 08/16] sched: Introduce SD_SHARE_CAP_STATES sched_domain flag Morten Rasmussen
                   ` (8 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

From: Dietmar Eggemann <dietmar.eggemann@arm.com>

The Energy-aware algorithm needs system wide sched energy information on
certain platforms (e.g. a one socket system with multiple cpus).

In such a system, the sched energy data is only attached to the sched
groups for the individual cpus in the sched domain MC level.

For those systems, this patch adds a _hack_ to provide system-wide sched
energy data via the sched_domain_topology_level table.

The problem is that the sched_domain_topology_level table is not an
interface to provide system-wide data but we want to keep the
configuration of all sched energy related data in one place.

The sched_domain_energy_f of the last entry (the one which is
initialized with {NULL, }) of the sched_domain_topology_level table is
set to cpu_sys_energy(). Since the sched_domain_mask_f of this entry
stays NULL, it is still not considered by the existing scheduler set-up
code (see for_each_sd_topology()).

A second call to init_sched_energy() with a struct sched_domain pointer
equal to NULL as an argument will initialize the system-wide sched
energy structure sse.

For the example platform (ARM TC2 (MC and DIE sd level)), the
system-wide sched_domain_energy_f returns NULL, so struct sched_energy
*sse stays NULL.

Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 arch/arm/kernel/topology.c |    8 +++++++-
 kernel/sched/core.c        |   26 ++++++++++++++++++++++----
 kernel/sched/sched.h       |    2 ++
 3 files changed, 31 insertions(+), 5 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 4050348..0b9c1e0 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -374,6 +374,11 @@ static inline const struct sched_energy *cpu_core_energy(int cpu)
 	return cpu_topology[cpu].socket_id ? &energy_core_a7 :
 			&energy_core_a15;
 }
+
+static inline const struct sched_energy *cpu_sys_energy(int cpu)
+{
+	return NULL;
+}
 #endif /* CONFIG_SCHED_ENERGY */
 
 static inline const int cpu_corepower_flags(void)
@@ -392,10 +397,11 @@ static struct sched_domain_topology_level arm_topology[] = {
 #endif
 #ifdef CONFIG_SCHED_ENERGY
 	{ cpu_cpu_mask, 0, cpu_cluster_energy, SD_INIT_NAME(DIE) },
+	{ NULL,	0, cpu_sys_energy},
 #else
 	{ cpu_cpu_mask, SD_INIT_NAME(DIE) },
+	{ NULL,	},
 #endif
-	{ NULL, },
 };
 
 /*
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 096fa55..530a348 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5866,20 +5866,35 @@ static void init_sched_groups_power(int cpu, struct sched_domain *sd)
 }
 
 #ifdef CONFIG_SCHED_ENERGY
+/* System-wide energy information. */
+struct sched_energy *sse;
+
 static void init_sched_energy(int cpu, struct sched_domain *sd,
 			      struct sched_domain_topology_level *tl)
 {
-	struct sched_group *sg = sd->groups;
-	struct sched_energy *energy = &sg->sge->data;
+	struct sched_group *sg = sd ? sd->groups : NULL;
+	struct sched_energy *energy = sd ? &sg->sge->data : sse;
 	sched_domain_energy_f fn = tl->energy;
-	struct cpumask *mask = sched_group_cpus(sg);
+	const struct cpumask *mask = sd ? sched_group_cpus(sg) :
+					  cpu_cpu_mask(cpu);
 
-	if (!fn || !fn(cpu))
+	if (!fn || !fn(cpu) || (!sd && energy))
 		return;
 
 	if (cpumask_weight(mask) > 1)
 		check_sched_energy_data(cpu, fn, mask);
 
+	if (!sd) {
+		energy = sse = kzalloc_node(sizeof(struct sched_energy) +
+					    fn(cpu)->nr_cap_states*
+					    sizeof(struct capacity_state),
+					    GFP_KERNEL, cpu_to_node(cpu));
+		BUG_ON(!energy);
+
+		energy->cap_states = (struct capacity_state *)((void *)energy +
+				sizeof(struct sched_energy));
+	}
+
 	energy->max_capacity = fn(cpu)->max_capacity;
 	energy->idle_power = fn(cpu)->idle_power;
 	energy->wakeup_energy = fn(cpu)->wakeup_energy;
@@ -6579,6 +6594,9 @@ static int build_sched_domains(const struct cpumask *cpu_map,
 			claim_allocations(i, sd);
 			init_sched_groups_power(i, sd);
 		}
+#ifdef CONFIG_SCHED_ENERGY
+		init_sched_energy(i, NULL, tl);
+#endif
 	}
 
 	/* Attach the domains */
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6726437..9ff67a7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -760,6 +760,8 @@ struct sched_group_energy {
 	atomic_t ref;
 	struct sched_energy data;
 };
+
+extern struct sched_energy *sse;
 #endif
 
 struct sched_group {
-- 
1.7.9.5




* [RFC PATCH 08/16] sched: Introduce SD_SHARE_CAP_STATES sched_domain flag
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (6 preceding siblings ...)
  2014-05-23 18:16 ` [RFC PATCH 07/16] sched: Introduce system-wide sched_energy Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 09/16] sched, cpufreq: Introduce current cpu compute capacity into scheduler Morten Rasmussen
                   ` (7 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

cpufreq is currently keeping it a secret which cpus are sharing a
clock source. The scheduler needs to know about clock domains as well
to become more energy aware. The SD_SHARE_CAP_STATES domain indicates
whether cpus belonging to the domain share capacity states (P-states).

There is no connection with cpufreq (yet). The flag must be set by
the arch specific topology code.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 arch/arm/kernel/topology.c |    3 ++-
 include/linux/sched.h      |    1 +
 kernel/sched/core.c        |   10 +++++++---
 3 files changed, 10 insertions(+), 4 deletions(-)

diff --git a/arch/arm/kernel/topology.c b/arch/arm/kernel/topology.c
index 0b9c1e0..c78d497 100644
--- a/arch/arm/kernel/topology.c
+++ b/arch/arm/kernel/topology.c
@@ -383,7 +383,8 @@ static inline const struct sched_energy *cpu_sys_energy(int cpu)
 
 static inline const int cpu_corepower_flags(void)
 {
-	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN;
+	return SD_SHARE_PKG_RESOURCES  | SD_SHARE_POWERDOMAIN | \
+		SD_SHARE_CAP_STATES;
 }
 
 static struct sched_domain_topology_level arm_topology[] = {
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 4eb149b..62d61b5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -877,6 +877,7 @@ enum cpu_idle_type {
 #define SD_PREFER_SIBLING	0x1000	/* Prefer to place tasks in a sibling domain */
 #define SD_OVERLAP		0x2000	/* sched_domains of this level overlap */
 #define SD_NUMA			0x4000	/* cross-node balancing */
+#define SD_SHARE_CAP_STATES	0x8000  /* Domain members share capacity state */
 
 #ifdef CONFIG_SCHED_SMT
 static inline const int cpu_smt_flags(void)
diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 530a348..49b895a 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -5322,7 +5322,8 @@ static int sd_degenerate(struct sched_domain *sd)
 			 SD_BALANCE_EXEC |
 			 SD_SHARE_CPUPOWER |
 			 SD_SHARE_PKG_RESOURCES |
-			 SD_SHARE_POWERDOMAIN)) {
+			 SD_SHARE_POWERDOMAIN |
+			 SD_SHARE_CAP_STATES)) {
 		if (sd->groups != sd->groups->next)
 			return 0;
 	}
@@ -5354,7 +5355,8 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 				SD_SHARE_CPUPOWER |
 				SD_SHARE_PKG_RESOURCES |
 				SD_PREFER_SIBLING |
-				SD_SHARE_POWERDOMAIN);
+				SD_SHARE_POWERDOMAIN |
+				SD_SHARE_CAP_STATES);
 		if (nr_node_ids == 1)
 			pflags &= ~SD_SERIALIZE;
 	}
@@ -6016,6 +6018,7 @@ static int sched_domains_curr_level;
  * SD_SHARE_PKG_RESOURCES - describes shared caches
  * SD_NUMA                - describes NUMA topologies
  * SD_SHARE_POWERDOMAIN   - describes shared power domain
+ * SD_SHARE_CAP_STATES    - describes shared capacity states
  *
  * Odd one out:
  * SD_ASYM_PACKING        - describes SMT quirks
@@ -6025,7 +6028,8 @@ static int sched_domains_curr_level;
 	 SD_SHARE_PKG_RESOURCES |	\
 	 SD_NUMA |			\
 	 SD_ASYM_PACKING |		\
-	 SD_SHARE_POWERDOMAIN)
+	 SD_SHARE_POWERDOMAIN |		\
+	 SD_SHARE_CAP_STATES)
 
 static struct sched_domain *
 sd_init(struct sched_domain_topology_level *tl, int cpu)
-- 
1.7.9.5




* [RFC PATCH 09/16] sched, cpufreq: Introduce current cpu compute capacity into scheduler
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (7 preceding siblings ...)
  2014-05-23 18:16 ` [RFC PATCH 08/16] sched: Introduce SD_SHARE_CAP_STATES sched_domain flag Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 10/16] sched, cpufreq: Current compute capacity hack for ARM TC2 Morten Rasmussen
                   ` (6 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

The scheduler is currently unaware of frequency changes and the current
compute capacity offered by the cpus. This patch is not the solution.
It is a hack to give us something to experiment with for now.

A proper solution could be based on the frequency invariant load
tracking proposed in the past: https://lkml.org/lkml/2013/4/16/289

This patch should _not_ be considered safe.
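
For reference, the basic idea behind the frequency invariant load
tracking referred to above is to scale tracked busy time by the current
compute capacity. A rough sketch (illustrative only, not part of this
patch):

	/*
	 * Hypothetical sketch: scale a tracked runnable delta by the
	 * current capacity, with 1024 representing full capacity at the
	 * maximum frequency.
	 */
	delta_scaled = (delta * curr_capacity) / 1024;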

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 drivers/cpufreq/cpufreq.c |    2 ++
 include/linux/sched.h     |    2 ++
 kernel/sched/fair.c       |    6 ++++++
 kernel/sched/sched.h      |    2 ++
 4 files changed, 12 insertions(+)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index abda660..a2b788d 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -28,6 +28,7 @@
 #include <linux/slab.h>
 #include <linux/suspend.h>
 #include <linux/tick.h>
+#include <linux/sched.h>
 #include <trace/events/power.h>
 
 /**
@@ -315,6 +316,7 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
 		pr_debug("FREQ: %lu - CPU: %lu\n",
 			 (unsigned long)freqs->new, (unsigned long)freqs->cpu);
 		trace_cpu_frequency(freqs->new, freqs->cpu);
+		set_curr_capacity(freqs->cpu, (freqs->new*1024)/policy->max);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 62d61b5..727b936 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -3068,4 +3068,6 @@ static inline unsigned long rlimit_max(unsigned int limit)
 	return task_rlimit_max(current, limit);
 }
 
+void set_curr_capacity(int cpu, long capacity);
+
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7570dd9..3a2aeee 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7410,9 +7410,15 @@ void init_cfs_rq(struct cfs_rq *cfs_rq)
 #ifdef CONFIG_SMP
 	atomic64_set(&cfs_rq->decay_counter, 1);
 	atomic_long_set(&cfs_rq->removed_load, 0);
+	atomic_long_set(&cfs_rq->curr_capacity, 1024);
 #endif
 }
 
+void set_curr_capacity(int cpu, long capacity)
+{
+	atomic_long_set(&cpu_rq(cpu)->cfs.curr_capacity, capacity);
+}
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 static void task_move_group_fair(struct task_struct *p, int on_rq)
 {
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 9ff67a7..5a117b8 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -341,6 +341,8 @@ struct cfs_rq {
 	u64 last_decay;
 	atomic_long_t removed_load;
 
+	atomic_long_t curr_capacity;
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 	/* Required to track per-cpu representation of a task_group */
 	u32 tg_runnable_contrib;
-- 
1.7.9.5




* [RFC PATCH 10/16] sched, cpufreq: Current compute capacity hack for ARM TC2
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (8 preceding siblings ...)
  2014-05-23 18:16 ` [RFC PATCH 09/16] sched, cpufreq: Introduce current cpu compute capacity into scheduler Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 11/16] sched: Energy model functions Morten Rasmussen
                   ` (5 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

Hack to report different cpu capacities for big and little cpus.
This is for experimentation on ARM TC2 _only_. A proper solution
has to address this problem.
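
As a sanity check on the numbers below (using values from the TC2
capacity tables introduced earlier in this series): an A15 running at
500 MHz (freqs->new = 500000 kHz) reports (500000*2015)/1200000 = ~839,
which lines up with the 840 capacity of the 500 MHz entry in
cap_states_core_a15.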

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 drivers/cpufreq/cpufreq.c |    8 +++++++-
 1 file changed, 7 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index a2b788d..134d777 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -316,7 +316,13 @@ static void __cpufreq_notify_transition(struct cpufreq_policy *policy,
 		pr_debug("FREQ: %lu - CPU: %lu\n",
 			 (unsigned long)freqs->new, (unsigned long)freqs->cpu);
 		trace_cpu_frequency(freqs->new, freqs->cpu);
-		set_curr_capacity(freqs->cpu, (freqs->new*1024)/policy->max);
+		/* Massive TC2 hack */
+		if (freqs->cpu == 1 || freqs->cpu == 2)
+			/* A15 cpus (max_capacity = 2015) */
+			set_curr_capacity(freqs->cpu, (freqs->new*2015)/1200000);
+		else
+			/* A7 cpus (max_capacity = 1024) */
+			set_curr_capacity(freqs->cpu, (freqs->new*1024)/1000000);
 		srcu_notifier_call_chain(&cpufreq_transition_notifier_list,
 				CPUFREQ_POSTCHANGE, freqs);
 		if (likely(policy) && likely(policy->cpu == freqs->cpu))
-- 
1.7.9.5




* [RFC PATCH 11/16] sched: Energy model functions
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (9 preceding siblings ...)
  2014-05-23 18:16 ` [RFC PATCH 10/16] sched, cpufreq: Current compute capacity hack for ARM TC2 Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 12/16] sched: Task wakeup tracking Morten Rasmussen
                   ` (4 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

Introduces energy_diff_util() which finds the energy impacts of adding
or removing utilization from a specific cpu. The calculation is based on
the energy information provided by the platform through sched_energy
data in the sched_domain hierarchy.

Task and cpu utilization are currently based on load_avg_contrib and
weighted_cpuload(), which are actually load, not utilization.  We don't
have a solution for utilization yet. There are several other loose ends
that need to be addressed, such as load/utilization invariance and
proper representation of compute capacity. However, the energy model is
there.

The energy cost model only considers utilization (busy time) and idle
energy (remaining time) for now. The basic idea is to determine the
energy cost at each level in the sched_domain hierarchy.

	for_each_domain(cpu, sd) {
		sg = sched_group_of(cpu)
		energy_before = curr_util(sg) * busy_power(sg)
				+ (1 - curr_util(sg)) * idle_power(sg)
		energy_after = new_util(sg) * busy_power(sg)
				+ (1 - new_util(sg)) * idle_power(sg)
		energy_diff += energy_before - energy_after
	}
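
As a worked example with illustrative numbers only: if a group's
busy_power is 2000, idle_power is 200, curr_util is 0.25 and new_util
is 0.50, then energy_before = 0.25*2000 + 0.75*200 = 650 and
energy_after = 0.50*2000 + 0.50*200 = 1100, so the placement adds
1100 - 650 = 450 bogo-Watts of estimated cost at this level.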

The idle power estimate currently only supports a single idle state per
power (sub-)domain. Extending the support to multiple states requires a
way of predicting which state is going to be the most likely. This
prediction could be provided by cpuidle. Wake-up energy is added later
in this series.

Assumptions and the basic algorithm are described in the code comments.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c |  250 +++++++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 250 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 3a2aeee..09e5232 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4162,6 +4162,256 @@ static long effective_load(struct task_group *tg, int cpu, long wl, long wg)
 
 #endif
 
+#ifdef CONFIG_SCHED_ENERGY
+/*
+ * Energy model for energy-aware scheduling
+ *
+ * Assumptions:
+ *
+ * 1. Task and cpu load/utilization are assumed to be scale invariant. That is,
+ * task utilization is invariant to frequency scaling and cpu microarchitecture
+ * differences. For example, a task utilization of 256 means that a cpu with a
+ * capacity of 1024 will be 25% busy running the task, while another cpu with a
+ * capacity of 512 will be 50% busy.
+ *
+ * 2. The scheduler doesn't track utilization, task or cpu. Until that has been
+ * resolved, weighted_cpuload() is the closest thing we have. Note that it
+ * won't work properly with tasks with priorities other than nice=0.
+ *
+ * 3. When capacity states are shared (SD_SHARE_CAP_STATES) the capacity state
+ * tables are equivalent. That is, the same table index can be used across all
+ * tables.
+ *
+ * 4. Only the lowest level in sched_domain hierarchy has SD_SHARE_CAP_STATES
+ * set. This restriction will be removed later.
+ *
+ * 5. No independent higher level capacity states. Cluster/package power states
+ * are either linked with cpus (SD_SHARE_CAP_STATES) or they only have one.
+ * This restriction will be removed later.
+ *
+ * 6. The scheduler doesn't control capacity (frequency) scaling, but assumes
+ * that the controller will adjust the capacity to match the load.
+ */
+
+#define for_each_energy_state(state) \
+		for (; state->cap; state++)
+
+/*
+ * Find suitable capacity state for utilization.
+ * If over-utilized, return nr_cap_states.
+ */
+static int energy_match_cap(unsigned long util,
+		struct capacity_state *cap_table)
+{
+	struct capacity_state *state = cap_table;
+	int idx;
+
+	idx = 0;
+	for_each_energy_state(state) {
+		if (state->cap >= util)
+			return idx;
+		idx++;
+	}
+
+	return idx;
+}
+
+/*
+ * Find the max cpu utilization in a group of cpus before and after
+ * adding/removing tasks (util) from a specific cpu (cpu).
+ */
+static void find_max_util(const struct cpumask *mask, int cpu, int util,
+		unsigned long *max_util_bef, unsigned long *max_util_aft)
+{
+	int i;
+
+	*max_util_bef = 0;
+	*max_util_aft = 0;
+
+	for_each_cpu(i, mask) {
+		unsigned long cpu_util = weighted_cpuload(i);
+
+		*max_util_bef = max(*max_util_bef, cpu_util);
+
+		if (i == cpu)
+			cpu_util += util;
+
+		*max_util_aft = max(*max_util_aft, cpu_util);
+	}
+}
+
+/*
+ * Estimate the energy cost delta caused by adding/removing utilization (util)
+ * from a specific cpu (cpu).
+ *
+ * The basic idea is to determine the energy cost at each level in sched_domain
+ * hierarchy based on utilization:
+ *
+ * for_each_domain(cpu, sd) {
+ *	sg = sched_group_of(cpu)
+ *	energy_before = curr_util(sg) * busy_power(sg)
+ *				+ (1-curr_util(sg)) * idle_power(sg)
+ *	energy_after = new_util(sg) * busy_power(sg)
+ *				+ (1-new_util(sg)) * idle_power(sg)
+ *	energy_diff += energy_after - energy_before
+ * }
+ *
+ */
+static int energy_diff_util(int cpu, int util)
+{
+	struct sched_domain *sd;
+	int i;
+	int energy_diff = 0;
+	int curr_cap_idx = -1;
+	int new_cap_idx = -1;
+	unsigned long max_util_bef, max_util_aft, aff_util_bef, aff_util_aft;
+	unsigned long unused_util_bef, unused_util_aft;
+	unsigned long cpu_curr_capacity;
+
+	cpu_curr_capacity = atomic_long_read(&cpu_rq(cpu)->cfs.curr_capacity);
+
+	max_util_aft = weighted_cpuload(cpu) + util;
+
+	rcu_read_lock();
+	for_each_domain(cpu, sd) {
+		struct capacity_state *curr_state, *new_state, *cap_table;
+		struct sched_energy *sge;
+
+		if (!sd->groups->sge)
+			continue;
+
+		sge = &sd->groups->sge->data;
+		cap_table = sge->cap_states;
+
+		if (curr_cap_idx < 0 || !(sd->flags & SD_SHARE_CAP_STATES)) {
+
+			/* TODO: Fix assumptions 2 and 3. */
+			curr_cap_idx = energy_match_cap(cpu_curr_capacity,
+					cap_table);
+
+			/*
+			 * If we remove tasks, i.e. util < 0, we should find
+			 * out if the cap state changes as well, but that is
+			 * complicated and might not be worth it. It is assumed
+			 * that the state won't be lowered for now.
+			 *
+			 * Also, if the cap state is shared, new_cap_idx can't
+			 * be lower than curr_cap_idx as another cpu in the
+			 * domain might have a higher utilization than this
+			 * cpu.
+			 */
+
+			if (cap_table[curr_cap_idx].cap < max_util_aft) {
+				new_cap_idx = energy_match_cap(max_util_aft,
+						cap_table);
+				if (new_cap_idx >= sge->nr_cap_states) {
+					/* can't handle the additional load */
+					energy_diff = INT_MAX;
+					goto unlock;
+				}
+			} else {
+				new_cap_idx = curr_cap_idx;
+			}
+		}
+
+		curr_state = &cap_table[curr_cap_idx];
+		new_state = &cap_table[new_cap_idx];
+		find_max_util(sched_group_cpus(sd->groups), cpu, util,
+				&max_util_bef, &max_util_aft);
+
+		if (!sd->child) {
+			/* Lowest level - groups are individual cpus */
+			if (sd->flags & SD_SHARE_CAP_STATES) {
+				int sum_util = 0;
+				for_each_cpu(i, sched_domain_span(sd))
+					sum_util += weighted_cpuload(i);
+				aff_util_bef = sum_util;
+			} else {
+				aff_util_bef = weighted_cpuload(cpu);
+			}
+			aff_util_aft = aff_util_bef + util;
+
+			/* Estimate idle time based on unused utilization */
+			unused_util_bef = curr_state->cap
+						- weighted_cpuload(cpu);
+			unused_util_aft = new_state->cap - weighted_cpuload(cpu)
+						- util;
+		} else {
+			/* Higher level */
+			aff_util_bef = max_util_bef;
+			aff_util_aft = max_util_aft;
+
+			/* Estimate idle time based on unused utilization */
+			unused_util_bef = curr_state->cap - aff_util_bef;
+			unused_util_aft = new_state->cap - aff_util_aft;
+		}
+
+		/*
+		 * The utilization change has no impact at this level (or any
+		 * parent level).
+		 */
+		if (aff_util_bef == aff_util_aft && curr_cap_idx == new_cap_idx)
+			goto unlock;
+
+		/* Energy before */
+		energy_diff -= (aff_util_bef*curr_state->power)/curr_state->cap;
+		energy_diff -= (unused_util_bef * sge->idle_power)
+				/curr_state->cap;
+
+		/* Energy after */
+		energy_diff += (aff_util_aft*new_state->power)/new_state->cap;
+		energy_diff += (unused_util_aft * sge->idle_power)
+				/new_state->cap;
+	}
+
+	/*
+	 * We don't have any sched_group covering all cpus in the sched_domain
+	 * hierarchy to associate system wide energy with. Treat it specially
+	 * for now until it can be folded into the loop above.
+	 */
+	if (sse) {
+		struct capacity_state *cap_table = sse->cap_states;
+		struct capacity_state *curr_state, *new_state;
+
+		curr_state = &cap_table[curr_cap_idx];
+		new_state = &cap_table[new_cap_idx];
+
+		find_max_util(cpu_online_mask, cpu, util, &aff_util_bef,
+				&aff_util_aft);
+
+		/* Estimate idle time based on unused utilization */
+		unused_util_bef = curr_state->cap - aff_util_bef;
+		unused_util_aft = new_state->cap - aff_util_aft;
+
+		/* Energy before */
+		energy_diff -= (aff_util_bef*curr_state->power)/curr_state->cap;
+		energy_diff -= (unused_util_bef * sse->idle_power)
+				/curr_state->cap;
+
+		/* Energy after */
+		energy_diff += (aff_util_aft*new_state->power)/new_state->cap;
+		energy_diff += (unused_util_aft * sse->idle_power)
+				/new_state->cap;
+	}
+
+unlock:
+	rcu_read_unlock();
+
+	return energy_diff;
+}
+
+static int energy_diff_task(int cpu, struct task_struct *p)
+{
+	return energy_diff_util(cpu, p->se.avg.load_avg_contrib);
+}
+
+#else
+static int energy_diff_task(int cpu, struct task_struct *p)
+{
+	return INT_MAX;
+}
+#endif
+
 static int wake_wide(struct task_struct *p)
 {
 	int factor = this_cpu_read(sd_llc_size);
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 12/16] sched: Task wakeup tracking
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (10 preceding siblings ...)
  2014-05-23 18:16 ` [RFC PATCH 11/16] sched: Energy model functions Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 13/16] sched: Take task wakeups into account in energy estimates Morten Rasmussen
                   ` (3 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

Track the task wakeup rate in wakeup_avg_sum by counting wakeups. Note
that these are _not_ cpu wakeups (idle exits). Task wakeups only cause cpu
wakeups if the cpu is idle when the task wakeup occurs.

The wakeup rate decays over time at the same rate as used for the
existing entity load tracking. Unlike runnable_avg_sum, wakeup_avg_sum
is counting events, not time, and is therefore theoretically unbounded
and should be used with care.
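
As an illustration, assuming the same geometric decay as the entity
load tracking code (a contribution halves roughly every 32 ms) and 1024
added per wakeup: a task waking every 10 ms converges towards roughly

	wakeup_avg_sum ~= 1024 / (1 - 2^(-10/32)) ~= 5260

while a task waking every 100 ms settles around ~1160. The sum tracks
the wakeup rate rather than a bounded time ratio.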

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 include/linux/sched.h |    3 +++
 kernel/sched/fair.c   |   18 ++++++++++++++++++
 2 files changed, 21 insertions(+)

diff --git a/include/linux/sched.h b/include/linux/sched.h
index 727b936..d7b032f 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1109,6 +1109,9 @@ struct sched_avg {
 	u64 last_runnable_update;
 	s64 decay_count;
 	unsigned long load_avg_contrib;
+
+	unsigned long last_wakeup_update;
+	u32 wakeup_avg_sum;
 };
 
 #ifdef CONFIG_SCHEDSTATS
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 09e5232..39e9cd8 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -679,6 +679,8 @@ void init_task_runnable_average(struct task_struct *p)
 	p->se.avg.runnable_avg_sum = slice;
 	p->se.avg.runnable_avg_period = slice;
 	__update_task_entity_contrib(&p->se);
+
+	p->se.avg.last_wakeup_update = jiffies;
 }
 #else
 void init_task_runnable_average(struct task_struct *p)
@@ -4025,6 +4027,21 @@ static void record_wakee(struct task_struct *p)
 	}
 }
 
+static void update_wakeup_avg(struct task_struct *p)
+{
+	struct sched_entity *se = &p->se;
+	struct sched_avg *sa = &se->avg;
+	unsigned long now = ACCESS_ONCE(jiffies);
+
+	if (time_after(now, sa->last_wakeup_update)) {
+		sa->wakeup_avg_sum = decay_load(sa->wakeup_avg_sum,
+				jiffies_to_msecs(now - sa->last_wakeup_update));
+		sa->last_wakeup_update = now;
+	}
+
+	sa->wakeup_avg_sum += 1024;
+}
+
 static void task_waking_fair(struct task_struct *p)
 {
 	struct sched_entity *se = &p->se;
@@ -4045,6 +4062,7 @@ static void task_waking_fair(struct task_struct *p)
 
 	se->vruntime -= min_vruntime;
 	record_wakee(p);
+	update_wakeup_avg(p);
 }
 
 #ifdef CONFIG_FAIR_GROUP_SCHED
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 13/16] sched: Take task wakeups into account in energy estimates
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (11 preceding siblings ...)
  2014-05-23 18:16 ` [RFC PATCH 12/16] sched: Task wakeup tracking Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 14/16] sched: Use energy model in select_idle_sibling Morten Rasmussen
                   ` (2 subsequent siblings)
  15 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

The energy cost of waking a cpu and sending it back to sleep can be
quite significant for short-running, frequently waking tasks if placed on
an idle cpu in a deep sleep state. By factoring in task wakeups, such
tasks can be placed on cpus where the wakeup energy cost is lower. For
example, partly utilized cpus in a shallower idle state, or cpus in a
cluster/die that is already awake.

The current utilization of the target cpu is factored in to guess how
many task wakeups translate into cpu wakeups (idle exits). It is a very
naive approach, but getting an accurate estimate is virtually
impossible.

wake_energy(task) = unused_util(cpu) * wakeups(task) * wakeup_energy(cpu)
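
As an illustrative calculation using the scaling in the code below
(wakeup_avg_sum counts 1024 per wakeup, and unused utilization is
normalized by the capacity of the chosen capacity state): a task with
wakeup_avg_sum = 2048 placed on a cpu that is idle half the time
(unused_util = 512, cap = 1024) in a group with wakeup_energy = 500 adds

	(2048 * 500 >> 10) * 512/1024 = 1000 * 512/1024 = 500

to the energy estimate, while the same task on a fully utilized cpu adds
nothing, since its wakeups would never cause idle exits.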

There is no per cpu wakeup tracking, so we can't estimate the energy
savings when removing tasks from a cpu. It is also nearly impossible to
figure out which task is the cause of cpu wakeups if multiple tasks are
scheduled on the same cpu.

Support for multiple idle-states per sched_group (e.g. WFI and core
shutdown on ARM) is not implemented yet. wakeup_energy in struct
sched_energy needs to be a table instead, and cpuidle needs to tell us
which state is the most likely.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c |   19 ++++++++++++++++---
 1 file changed, 16 insertions(+), 3 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 39e9cd8..5a52467 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4271,11 +4271,13 @@ static void find_max_util(const struct cpumask *mask, int cpu, int util,
 *				+ (1-curr_util(sg)) * idle_power(sg)
 *	energy_after = new_util(sg) * busy_power(sg)
 *				+ (1-new_util(sg)) * idle_power(sg)
+ *				+ (1-new_util(sg)) * task_wakeups
+ *							* wakeup_energy(sg)
 *	energy_diff += energy_after - energy_before
  * }
  *
  */
-static int energy_diff_util(int cpu, int util)
+static int energy_diff_util(int cpu, int util, int wakeups)
 {
 	struct sched_domain *sd;
 	int i;
@@ -4368,7 +4370,8 @@ static int energy_diff_util(int cpu, int util)
 		 * The utilization change has no impact at this level (or any
 		 * parent level).
 		 */
-		if (aff_util_bef == aff_util_aft && curr_cap_idx == new_cap_idx)
+		if (aff_util_bef == aff_util_aft && curr_cap_idx == new_cap_idx
+				&& unused_util_aft < 100)
 			goto unlock;
 
 		/* Energy before */
@@ -4380,6 +4383,13 @@ static int energy_diff_util(int cpu, int util)
 		energy_diff += (aff_util_aft*new_state->power)/new_state->cap;
 		energy_diff += (unused_util_aft * sge->idle_power)
 				/new_state->cap;
+		/*
+		 * Estimate how many of the wakeups that happens while cpu is
+		 * idle assuming they are uniformly distributed. Ignoring
+		 * wakeups caused by other tasks.
+		 */
+		energy_diff += (wakeups * sge->wakeup_energy >> 10)
+				* unused_util_aft/new_state->cap;
 	}
 
 	/*
@@ -4410,6 +4420,8 @@ static int energy_diff_util(int cpu, int util)
 		energy_diff += (aff_util_aft*new_state->power)/new_state->cap;
 		energy_diff += (unused_util_aft * sse->idle_power)
 				/new_state->cap;
+		energy_diff += (wakeups * sse->wakeup_energy >> 10)
+				* unused_util_aft/new_state->cap;
 	}
 
 unlock:
@@ -4420,7 +4432,8 @@ unlock:
 
 static int energy_diff_task(int cpu, struct task_struct *p)
 {
-	return energy_diff_util(cpu, p->se.avg.load_avg_contrib);
+	return energy_diff_util(cpu, p->se.avg.load_avg_contrib,
+			p->se.avg.wakeup_avg_sum);
 }
 
 #else
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 14/16] sched: Use energy model in select_idle_sibling
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (12 preceding siblings ...)
  2014-05-23 18:16 ` [RFC PATCH 13/16] sched: Take task wakeups into account in energy estimates Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 15/16] sched: Use energy to guide wakeup task placement Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 16/16] sched: Disable wake_affine to broaden the scope of wakeup target cpus Morten Rasmussen
  15 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

Make select_idle_sibling() consider energy when picking an idle cpu.

Only idle cpus are still considered. A more aggressive energy conserving
approach could go further and consider partly utilized cpus.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c |    8 ++++++++
 1 file changed, 8 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 5a52467..542c2b2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4644,7 +4644,9 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	struct sched_domain *sd;
 	struct sched_group *sg;
 	int i = task_cpu(p);
+	int target_energy;
 
+#ifndef CONFIG_SCHED_ENERGY
 	if (idle_cpu(target))
 		return target;
 
@@ -4653,6 +4655,8 @@ static int select_idle_sibling(struct task_struct *p, int target)
 	 */
 	if (i != target && cpus_share_cache(i, target) && idle_cpu(i))
 		return i;
+#endif
+	target_energy = energy_diff_task(target, p);
 
 	/*
 	 * Otherwise, iterate the domains and find an elegible idle cpu.
@@ -4666,8 +4670,12 @@ static int select_idle_sibling(struct task_struct *p, int target)
 				goto next;
 
 			for_each_cpu(i, sched_group_cpus(sg)) {
+				int diff;
 				if (i == target || !idle_cpu(i))
 					goto next;
+				diff = energy_diff_task(i, p);
+				if (diff > target_energy)
+					goto next;
 			}
 
 			target = cpumask_first_and(sched_group_cpus(sg),
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 15/16] sched: Use energy to guide wakeup task placement
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (13 preceding siblings ...)
  2014-05-23 18:16 ` [RFC PATCH 14/16] sched: Use energy model in select_idle_sibling Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  2014-05-23 18:16 ` [RFC PATCH 16/16] sched: Disable wake_affine to broaden the scope of wakeup target cpus Morten Rasmussen
  15 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

Attempt to pick the most energy-efficient wakeup target in
find_idlest_{group, cpu}(). Finding the optimum target requires an
exhaustive search through all cpus in the groups. Instead, the target
group is determined based on load and on probing the energy cost of a
single cpu in each group. The target cpu within that group is then the
cpu with the lowest energy cost, as sketched below.
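
In pseudo code, mirroring the find_target_{group,cpu}() changes below:

	for_each_group(sd, group) {
		probe_cpu = least loaded cpu in group
		group_energy(group) = energy_diff_task(probe_cpu, p)
	}
	target_group = group with lowest group_energy
		       (if cheaper than the local group)
	target_cpu = cpu in target_group with lowest energy_diff_task(cpu, p)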

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/fair.c |   64 +++++++++++++++++++++++++++++++++++++++++----------
 1 file changed, 52 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 542c2b2..0d3334b 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4556,25 +4556,27 @@ static int wake_affine(struct sched_domain *sd, struct task_struct *p, int sync)
 }
 
 /*
- * find_idlest_group finds and returns the least busy CPU group within the
- * domain.
+ * find_target_group finds and returns the least busy/most energy-efficient
+ * CPU group within the domain.
  */
 static struct sched_group *
-find_idlest_group(struct sched_domain *sd, struct task_struct *p,
+find_target_group(struct sched_domain *sd, struct task_struct *p,
 		  int this_cpu, int sd_flag)
 {
-	struct sched_group *idlest = NULL, *group = sd->groups;
+	struct sched_group *idlest = NULL, *group = sd->groups, *energy = NULL;
 	unsigned long min_load = ULONG_MAX, this_load = 0;
 	int load_idx = sd->forkexec_idx;
 	int imbalance = 100 + (sd->imbalance_pct-100)/2;
+	int local_energy = 0, min_energy = INT_MAX;
 
 	if (sd_flag & SD_BALANCE_WAKE)
 		load_idx = sd->wake_idx;
 
 	do {
-		unsigned long load, avg_load;
+		unsigned long load, avg_load, probe_load = UINT_MAX;
 		int local_group;
 		int i;
+		int probe_cpu, energy_diff;
 
 		/* Skip over this group if it has no CPUs allowed */
 		if (!cpumask_intersects(sched_group_cpus(group),
@@ -4586,6 +4588,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 
 		/* Tally up the load of all CPUs in the group */
 		avg_load = 0;
+		probe_cpu = cpumask_first(sched_group_cpus(group));
 
 		for_each_cpu(i, sched_group_cpus(group)) {
 			/* Bias balancing toward cpus of our domain */
@@ -4595,44 +4598,81 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 				load = target_load(i, load_idx);
 
 			avg_load += load;
+
+			if (load < probe_load) {
+				probe_load = load;
+				probe_cpu = i;
+			}
 		}
 
 		/* Adjust by relative CPU power of the group */
 		avg_load = (avg_load * SCHED_POWER_SCALE) / group->sgp->power;
 
+		/*
+		 * Sample energy diff on probe_cpu.
+		 * Finding the optimum cpu requires testing all cpus which is
+		 * expensive.
+		 */
+
+		energy_diff = energy_diff_task(probe_cpu, p);
+
 		if (local_group) {
 			this_load = avg_load;
-		} else if (avg_load < min_load) {
-			min_load = avg_load;
-			idlest = group;
+			local_energy = energy_diff;
+		} else {
+			if (avg_load < min_load) {
+				min_load = avg_load;
+				idlest = group;
+			}
+
+			if (energy_diff < min_energy) {
+				min_energy = energy_diff;
+				energy = group;
+			}
 		}
 	} while (group = group->next, group != sd->groups);
 
+#ifdef CONFIG_SCHED_ENERGY
+	if (energy && min_energy < local_energy)
+		return energy;
+	return NULL;
+#else
 	if (!idlest || 100*this_load < imbalance*min_load)
 		return NULL;
 	return idlest;
+#endif
 }
 
 /*
- * find_idlest_cpu - find the idlest cpu among the cpus in group.
+ * find_target_cpu - find the target cpu among the cpus in group.
  */
 static int
-find_idlest_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
+find_target_cpu(struct sched_group *group, struct task_struct *p, int this_cpu)
 {
 	unsigned long load, min_load = ULONG_MAX;
+	int min_energy = INT_MAX, energy, least_energy = -1;
 	int idlest = -1;
 	int i;
 
 	/* Traverse only the allowed CPUs */
 	for_each_cpu_and(i, sched_group_cpus(group), tsk_cpus_allowed(p)) {
 		load = weighted_cpuload(i);
+		energy = energy_diff_task(i, p);
 
 		if (load < min_load || (load == min_load && i == this_cpu)) {
 			min_load = load;
 			idlest = i;
 		}
+
+		if (energy < min_energy) {
+			min_energy = energy;
+			least_energy = i;
+		}
 	}
 
+	if (least_energy >= 0)
+		return least_energy;
+
 	return idlest;
 }
 
@@ -4755,13 +4795,13 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			continue;
 		}
 
-		group = find_idlest_group(sd, p, cpu, sd_flag);
+		group = find_target_group(sd, p, cpu, sd_flag);
 		if (!group) {
 			sd = sd->child;
 			continue;
 		}
 
-		new_cpu = find_idlest_cpu(group, p, cpu);
+		new_cpu = find_target_cpu(group, p, cpu);
 		if (new_cpu == -1 || new_cpu == cpu) {
 			/* Now try balancing at a lower domain level of cpu */
 			sd = sd->child;
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 71+ messages in thread

* [RFC PATCH 16/16] sched: Disable wake_affine to broaden the scope of wakeup target cpus
  2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
                   ` (14 preceding siblings ...)
  2014-05-23 18:16 ` [RFC PATCH 15/16] sched: Use energy to guide wakeup task placement Morten Rasmussen
@ 2014-05-23 18:16 ` Morten Rasmussen
  15 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-05-23 18:16 UTC (permalink / raw)
  To: linux-kernel, linux-pm, peterz, mingo
  Cc: rjw, vincent.guittot, daniel.lezcano, preeti, dietmar.eggemann

SD_WAKE_AFFINE is currently set by default on all levels which means
that wakeups are always handled inside the lowest level sched_domain.
That means a tiny periodic task is very likely to stay on the cpu it was
forked on forever. To save energy we need to revisit the task placement
decision every now and again to ensure that we don't keep waking the
same cpu if there are cheaper alternatives.

One way is to simply disable wake_affine and rely on the fork/exec
balancing mechanism (find_idlest_{group, cpu}). This is what this patch
does.

An alternative is to let the platform remove the SD_WAKE_AFFINE flag
from lower levels to increase the search space for
select_idle_sibling().

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 kernel/sched/core.c |    5 +++++
 1 file changed, 5 insertions(+)

diff --git a/kernel/sched/core.c b/kernel/sched/core.c
index 49b895a..eeb0508 100644
--- a/kernel/sched/core.c
+++ b/kernel/sched/core.c
@@ -6069,8 +6069,13 @@ sd_init(struct sched_domain_topology_level *tl, int cpu)
 					| 1*SD_BALANCE_NEWIDLE
 					| 1*SD_BALANCE_EXEC
 					| 1*SD_BALANCE_FORK
+#ifdef CONFIG_SCHED_ENERGY
+					| 1*SD_BALANCE_WAKE
+					| 0*SD_WAKE_AFFINE
+#else
 					| 0*SD_BALANCE_WAKE
 					| 1*SD_WAKE_AFFINE
+#endif
 					| 0*SD_SHARE_CPUPOWER
 					| 0*SD_SHARE_PKG_RESOURCES
 					| 0*SD_SERIALIZE
-- 
1.7.9.5



^ permalink raw reply related	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-05-23 18:16 ` [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler Morten Rasmussen
@ 2014-05-30 12:04   ` Peter Zijlstra
  2014-06-02 14:15     ` Morten Rasmussen
  2014-06-03 11:44   ` Peter Zijlstra
  2014-06-03 11:50   ` Peter Zijlstra
  2 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-05-30 12:04 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, dietmar.eggemann

[-- Attachment #1: Type: text/plain, Size: 1505 bytes --]

On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> +static struct capacity_state cap_states_cluster_a7[] = {
> +	/* Cluster only power */
> +	 { .cap =  358, .power = 2967, }, /*  350 MHz */
> +	 { .cap =  410, .power = 2792, }, /*  400 MHz */
> +	 { .cap =  512, .power = 2810, }, /*  500 MHz */
> +	 { .cap =  614, .power = 2815, }, /*  600 MHz */
> +	 { .cap =  717, .power = 2919, }, /*  700 MHz */
> +	 { .cap =  819, .power = 2847, }, /*  800 MHz */
> +	 { .cap =  922, .power = 3917, }, /*  900 MHz */
> +	 { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> +	};

> +static struct capacity_state cap_states_core_a7[] = {
> +	/* Power per cpu */
> +	 { .cap =  358, .power =  187, }, /*  350 MHz */
> +	 { .cap =  410, .power =  275, }, /*  400 MHz */
> +	 { .cap =  512, .power =  334, }, /*  500 MHz */
> +	 { .cap =  614, .power =  407, }, /*  600 MHz */
> +	 { .cap =  717, .power =  447, }, /*  700 MHz */
> +	 { .cap =  819, .power =  549, }, /*  800 MHz */
> +	 { .cap =  922, .power =  761, }, /*  900 MHz */
> +	 { .cap = 1024, .power = 1024, }, /* 1000 MHz */
> +	};

Talk to me about this core vs cluster thing.

Why would an architecture have multiple energy domains like this?

That is, if a cpu can set P states per core, why does it need a cluster
wide thing.

Also, in general, why would we need to walk the domain tree all the way
up, typically I would expect to stop walking once we've covered the two
cpu's we're interested in, because above that nothing changes.

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-05-30 12:04   ` Peter Zijlstra
@ 2014-06-02 14:15     ` Morten Rasmussen
  2014-06-03 11:41       ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-02 14:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann

On Fri, May 30, 2014 at 01:04:24PM +0100, Peter Zijlstra wrote:
> On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > +static struct capacity_state cap_states_cluster_a7[] = {
> > +	/* Cluster only power */
> > +	 { .cap =  358, .power = 2967, }, /*  350 MHz */
> > +	 { .cap =  410, .power = 2792, }, /*  400 MHz */
> > +	 { .cap =  512, .power = 2810, }, /*  500 MHz */
> > +	 { .cap =  614, .power = 2815, }, /*  600 MHz */
> > +	 { .cap =  717, .power = 2919, }, /*  700 MHz */
> > +	 { .cap =  819, .power = 2847, }, /*  800 MHz */
> > +	 { .cap =  922, .power = 3917, }, /*  900 MHz */
> > +	 { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > +	};
> 
> > +static struct capacity_state cap_states_core_a7[] = {
> > +	/* Power per cpu */
> > +	 { .cap =  358, .power =  187, }, /*  350 MHz */
> > +	 { .cap =  410, .power =  275, }, /*  400 MHz */
> > +	 { .cap =  512, .power =  334, }, /*  500 MHz */
> > +	 { .cap =  614, .power =  407, }, /*  600 MHz */
> > +	 { .cap =  717, .power =  447, }, /*  700 MHz */
> > +	 { .cap =  819, .power =  549, }, /*  800 MHz */
> > +	 { .cap =  922, .power =  761, }, /*  900 MHz */
> > +	 { .cap = 1024, .power = 1024, }, /* 1000 MHz */
> > +	};
> 
> Talk to me about this core vs cluster thing.
> 
> Why would an architecture have multiple energy domains like this?
> 
> That is, if a cpu can set P states per core, why does it need a cluster
> wide thing.

The reason is that power domains are often organized in a hierarchy
where you may be able to power down just a cpu or the entire cluster
along with cluster wide shared resources. This is quite typical for ARM
systems. Frequency domains (P-states) typically cover the same hardware
as one of the power domain levels. That is, there might be several
smaller power domains sharing the same frequency (P-state) or there
might be a power domain spanning multiple frequency domains.

The main reason why we need to worry about all this is that it typically
costs a lot more energy to use the first cpu in a cluster, since you
also need to power up all the shared hardware resources, than it costs
to wake and use additional cpus in the same cluster.

IMHO, the most natural way to model the energy is therefore something
like:

    energy = energy_cluster + n * energy_cpu

Where 'n' is the number of cpus powered up and energy_cluster is the
cost paid as soon as any cpu in the cluster is powered up.

If we take TC2 as an example, we have per-cluster frequency domains
(P-states) and idle-states for both the individual cpus and the
clusters. WFI for individual cpus and cluster power down for the
cluster, which takes down the per-cluster L2 cache and other cluster
resources. When we wake the first cpu in a cluster, the cluster will
exit cluster power down and put all other into WFI. Powering on the
first cpu (A7) and fully utilizing it at 1000 MHz will cost:

    power_one = 4905 + 1024

Waking up an additional cpu and fully utilizing it we get:

    power_two = 4905 + 2*1024

So if we need two cpu's worth of compute capacity (at max capacity) we
can save quite a lot of energy by picking two in the same cluster rather
than paying the cluster power twice.

Now if one of the cpus is only 50% utilized, it will be in WFI half the
time:

    power = power_cluster + \sum{n}^{cpus} util(n) * power_cpu +
					(1-util(n)) * idle_power_cpu

    power_100_50 = 4905 + (1.0*1024 + 0.0*0) + (0.5*1024 + 0.5*0)

I have normalized the utilization factor to 1.0 for simplicity. We also
need to factor in the cost of the wakeups on the 50% loaded cpu, but I
will leave that out here to keep it simpler.

If we now consider a slightly different scenario where one cpu is 50%
utilized and the other is 25% utilized. We assume that the busy period
starts at the same time on both cpus (overlapped). In this case, we can
power down the whole cluster 50% of the time (assuming that the idle
period is long enough to allow it). We can expand power_cluster to
factor that in:

    power_cluster' = util(cluster) * power_cluster + 
				(1-util(cluster)) * idle_power_cluster

    power_50_25 = 0.5*4905 + 0.5*10 + (0.5*1024 + 0.0*0) +
    						(0.25*1024 + 0.75*0)

> Also, in general, why would we need to walk the domain tree all the way
> up, typically I would expect to stop walking once we've covered the two
> cpu's we're interested in, because above that nothing changes.

True. In some cases we don't have to go all the way up. There is a
condition in energy_diff_load() that bails out if the energy doesn't
change further up the hierarchy. There might be scope for improving that
condition though.

We can basically stop going up if the utilization of the domain is
unchanged by the change we want to do. For example, we can ignore the
next level above if a third cpu is keeping the domain up all the time
anyway. In the 100% + 50% case above, putting another 50% task on the
50% cpu wouldn't affect the cluster according the proposed model, so it
can be ignored. However, if we did the same on any of the two cpus in
the 50% + 25% example we affect the cluster utilization and have to do
the cluster level maths.
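
To put numbers on that (group utilization at the cluster level is the
max over the cpus in the group, as in find_max_util() in this series):

	100% + 50%, adding 50% to the 50% cpu:
		before: max(1.00, 0.50) = 1.00
		after:  max(1.00, 1.00) = 1.00  -> unchanged, bail out
	50% + 25%, adding 50% to the 25% cpu:
		before: max(0.50, 0.25) = 0.50
		after:  max(0.50, 0.75) = 0.75  -> cluster maths needed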

So we do sometimes have to go all the way up even if we are balancing
two sibling cpus to determine the energy implications. At least if we
want an energy score like energy_diff_load() produces. However, we might
be able to take some other shortcuts if we are balancing load between
two specific cpus (not wakeup/fork/exec balancing) as you point out. But
there are cases where we need to continue up until the domain
utilization is unchanged.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-02 14:15     ` Morten Rasmussen
@ 2014-06-03 11:41       ` Peter Zijlstra
  2014-06-04 13:49         ` Morten Rasmussen
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-03 11:41 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann

[-- Attachment #1: Type: text/plain, Size: 3303 bytes --]

On Mon, Jun 02, 2014 at 03:15:36PM +0100, Morten Rasmussen wrote:
> > 
> > Talk to me about this core vs cluster thing.
> > 
> > Why would an architecture have multiple energy domains like this?

> The reason is that power domains are often organized in a hierarchy
> where you may be able to power down just a cpu or the entire cluster
> along with cluster wide shared resources. This is quite typical for ARM
> systems. Frequency domains (P-states) typically cover the same hardware
> as one of the power domain levels. That is, there might be several
> smaller power domains sharing the same frequency (P-state) or there
> might be a power domain spanning multiple frequency domains.
> 
> The main reason why we need to worry about all this is that it typically
> costs a lot more energy to use the first cpu in a cluster, since you
> also need to power up all the shared hardware resources, than it costs
> to wake and use additional cpus in the same cluster.
> 
> IMHO, the most natural way to model the energy is therefore something
> like:
> 
>     energy = energy_cluster + n * energy_cpu
> 
> Where 'n' is the number of cpus powered up and energy_cluster is the
> cost paid as soon as any cpu in the cluster is powered up.

OK, that makes sense, thanks! Maybe expand the doc/changelogs with this
because it wasn't immediately clear to me.

> > Also, in general, why would we need to walk the domain tree all the way
> > up, typically I would expect to stop walking once we've covered the two
> > cpu's we're interested in, because above that nothing changes.
> 
> True. In some cases we don't have to go all the way up. There is a
> condition in energy_diff_load() that bails out if the energy doesn't
> change further up the hierarchy. There might be scope for improving that
> condition though.
> 
> We can basically stop going up if the utilization of the domain is
> unchanged by the change we want to do. For example, we can ignore the
> next level above if a third cpu is keeping the domain up all the time
> anyway. In the 100% + 50% case above, putting another 50% task on the
> 50% cpu wouldn't affect the cluster according to the proposed model, so it
> can be ignored. However, if we did the same on any of the two cpus in
> the 50% + 25% example we affect the cluster utilization and have to do
> the cluster level maths.
> 
> So we do sometimes have to go all the way up even if we are balancing
> two sibling cpus to determine the energy implications. At least if we
> want an energy score like energy_diff_load() produces. However, we might
> be able to take some other shortcuts if we are balancing load between
> two specific cpus (not wakeup/fork/exec balancing) as you point out. But
> there are cases where we need to continue up until the domain
> utilization is unchanged.

Right.. so my worry with this is scalability. We typically want to avoid
having to scan the entire machine, even for power aware balancing.

That said, I don't think we have a 'sane' model for really big hardware
(yet). Intel still hasn't really said anything much on that iirc, as
long as a single core is up, all the memory controllers in the numa
fabric need to be awake, not to mention the cost of keeping the dram
alive.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-05-23 18:16 ` [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler Morten Rasmussen
  2014-05-30 12:04   ` Peter Zijlstra
@ 2014-06-03 11:44   ` Peter Zijlstra
  2014-06-04 15:42     ` Morten Rasmussen
  2014-06-03 11:50   ` Peter Zijlstra
  2 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-03 11:44 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, dietmar.eggemann

[-- Attachment #1: Type: text/plain, Size: 1528 bytes --]

On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> +static struct capacity_state cap_states_cluster_a7[] = {
> +	/* Cluster only power */
> +	 { .cap =  358, .power = 2967, }, /*  350 MHz */
> +	 { .cap =  410, .power = 2792, }, /*  400 MHz */
> +	 { .cap =  512, .power = 2810, }, /*  500 MHz */
> +	 { .cap =  614, .power = 2815, }, /*  600 MHz */
> +	 { .cap =  717, .power = 2919, }, /*  700 MHz */
> +	 { .cap =  819, .power = 2847, }, /*  800 MHz */
> +	 { .cap =  922, .power = 3917, }, /*  900 MHz */
> +	 { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> +	};
> +
> +static struct capacity_state cap_states_cluster_a15[] = {
> +	/* Cluster only power */
> +	 { .cap =  840, .power =  7920, }, /*  500 MHz */
> +	 { .cap = 1008, .power =  8165, }, /*  600 MHz */
> +	 { .cap = 1176, .power =  8172, }, /*  700 MHz */
> +	 { .cap = 1343, .power =  8195, }, /*  800 MHz */
> +	 { .cap = 1511, .power =  8265, }, /*  900 MHz */
> +	 { .cap = 1679, .power =  8446, }, /* 1000 MHz */
> +	 { .cap = 1847, .power = 11426, }, /* 1100 MHz */
> +	 { .cap = 2015, .power = 15200, }, /* 1200 MHz */
> +	};


So how did you obtain these numbers? Did you use numbers provided by the
hardware people, or did you run a particular benchmark and record the
power usage?

Does that benchmark do some actual work (as opposed to a while(1) loop)
to keep more silicon lit up?

If you have a setup for measuring these, should we try and publish that
too so that people can run it on their platform and provide these
numbers?


[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-05-23 18:16 ` [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler Morten Rasmussen
  2014-05-30 12:04   ` Peter Zijlstra
  2014-06-03 11:44   ` Peter Zijlstra
@ 2014-06-03 11:50   ` Peter Zijlstra
  2014-06-04 16:02     ` Morten Rasmussen
  2 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-03 11:50 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, dietmar.eggemann

[-- Attachment #1: Type: text/plain, Size: 782 bytes --]

On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> +static struct capacity_state cap_states_cluster_a7[] = {
> +	/* Cluster only power */
> +	 { .cap =  358, .power = 2967, }, /*  350 MHz */
> +	 { .cap =  410, .power = 2792, }, /*  400 MHz */
> +	 { .cap =  512, .power = 2810, }, /*  500 MHz */
> +	 { .cap =  614, .power = 2815, }, /*  600 MHz */
> +	 { .cap =  717, .power = 2919, }, /*  700 MHz */
> +	 { .cap =  819, .power = 2847, }, /*  800 MHz */
> +	 { .cap =  922, .power = 3917, }, /*  900 MHz */
> +	 { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> +	};

So one thing I remember was that we spoke about restricting this to
frequency levels where the voltage changed.

Because voltage jumps were the biggest factor to energy usage.

Any word on that?

[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-03 11:41       ` Peter Zijlstra
@ 2014-06-04 13:49         ` Morten Rasmussen
  0 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-04 13:49 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann

On Tue, Jun 03, 2014 at 12:41:45PM +0100, Peter Zijlstra wrote:
> On Mon, Jun 02, 2014 at 03:15:36PM +0100, Morten Rasmussen wrote:
> > > 
> > > Talk to me about this core vs cluster thing.
> > > 
> > > Why would an architecture have multiple energy domains like this?
> 
> > The reason is that power domains are often organized in a hierarchy
> > where you may be able to power down just a cpu or the entire cluster
> > along with cluster wide shared resources. This is quite typical for ARM
> > systems. Frequency domains (P-states) typically cover the same hardware
> > as one of the power domain levels. That is, there might be several
> > smaller power domains sharing the same frequency (P-state) or there
> > might be a power domain spanning multiple frequency domains.
> > 
> > The main reason why we need to worry about all this is that it typically
> > costs a lot more energy to use the first cpu in a cluster, since you
> > also need to power up all the shared hardware resources, than it costs
> > to wake and use additional cpus in the same cluster.
> > 
> > IMHO, the most natural way to model the energy is therefore something
> > like:
> > 
> >     energy = energy_cluster + n * energy_cpu
> > 
> > Where 'n' is the number of cpus powered up and energy_cluster is the
> > cost paid as soon as any cpu in the cluster is powered up.
> 
> OK, that makes sense, thanks! Maybe expand the doc/changelogs with this
> because it wasn't immediately clear to me.

I will add more documention to the next round, it is indeed needed.

> 
> > > Also, in general, why would we need to walk the domain tree all the way
> > > up, typically I would expect to stop walking once we've covered the two
> > > cpu's we're interested in, because above that nothing changes.
> > 
> > True. In some cases we don't have to go all the way up. There is a
> > condition in energy_diff_load() that bails out if the energy doesn't
> > change further up the hierarchy. There might be scope for improving that
> > condition though.
> > 
> > We can basically stop going up if the utilization of the domain is
> > unchanged by the change we want to do. For example, we can ignore the
> > next level above if a third cpu is keeping the domain up all the time
> > anyway. In the 100% + 50% case above, putting another 50% task on the
> > 50% cpu wouldn't affect the cluster according to the proposed model, so it
> > can be ignored. However, if we did the same on any of the two cpus in
> > the 50% + 25% example we affect the cluster utilization and have to do
> > the cluster level maths.
> > 
> > So we do sometimes have to go all the way up even if we are balancing
> > two sibling cpus to determine the energy implications. At least if we
> > want an energy score like energy_diff_load() produces. However, we might
> > be able to take some other shortcuts if we are balancing load between
> > two specific cpus (not wakeup/fork/exec balancing) as you point out. But
> > there are cases where we need to continue up until the domain
> > utilization is unchanged.
> 
> Right.. so my worry with this is scalability. We typically want to avoid
> having to scan the entire machine, even for power aware balancing.

I haven't looked at power management for really big machines, but I hope
that we can stop at socket level or wherever utilization changes won't
affect the energy of the rest of the system. If we can power off groups
of sockets or something like that, we could scan at that level less
frequently (like we do now). The cost and latency of powering off
multiple sockets is probably high and not something we want to do often.

> That said, I don't think we have a 'sane' model for really big hardware
> (yet). Intel still hasn't really said anything much on that iirc, as
> long as a single core is up, all the memory controllers in the numa
> fabric need to be awake, not to mention the cost of keeping the dram
> alive.

Right. I'm hoping that we can roll that in once we know more about power
management on big hardware.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-03 11:44   ` Peter Zijlstra
@ 2014-06-04 15:42     ` Morten Rasmussen
  2014-06-04 16:16       ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-04 15:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann

On Tue, Jun 03, 2014 at 12:44:28PM +0100, Peter Zijlstra wrote:
> On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > +static struct capacity_state cap_states_cluster_a7[] = {
> > +	/* Cluster only power */
> > +	 { .cap =  358, .power = 2967, }, /*  350 MHz */
> > +	 { .cap =  410, .power = 2792, }, /*  400 MHz */
> > +	 { .cap =  512, .power = 2810, }, /*  500 MHz */
> > +	 { .cap =  614, .power = 2815, }, /*  600 MHz */
> > +	 { .cap =  717, .power = 2919, }, /*  700 MHz */
> > +	 { .cap =  819, .power = 2847, }, /*  800 MHz */
> > +	 { .cap =  922, .power = 3917, }, /*  900 MHz */
> > +	 { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > +	};
> > +
> > +static struct capacity_state cap_states_cluster_a15[] = {
> > +	/* Cluster only power */
> > +	 { .cap =  840, .power =  7920, }, /*  500 MHz */
> > +	 { .cap = 1008, .power =  8165, }, /*  600 MHz */
> > +	 { .cap = 1176, .power =  8172, }, /*  700 MHz */
> > +	 { .cap = 1343, .power =  8195, }, /*  800 MHz */
> > +	 { .cap = 1511, .power =  8265, }, /*  900 MHz */
> > +	 { .cap = 1679, .power =  8446, }, /* 1000 MHz */
> > +	 { .cap = 1847, .power = 11426, }, /* 1100 MHz */
> > +	 { .cap = 2015, .power = 15200, }, /* 1200 MHz */
> > +	};
> 
> 
> So how did you obtain these numbers? Did you use numbers provided by the
> hardware people, or did you run a particular benchmark and record the
> power usage?
>
> Does that benchmark do some actual work (as opposed to a while(1) loop)
> to keep more silicon lit up?

Hardware people don't like sharing data, so I did my own measurements
and calculations to get the numbers above.

ARM TC2 has on-chip energy counters for counting energy consumed by the
A7 and A15 clusters. They are fairly accurate. I used sysbench cpu
benchmark as test workload for the above numbers. sysbench might not be
a representative workload, but it is easy to use. I think, ideally,
vendors would run their own mix of workloads they care about and derrive
their numbers for their platform based on that.

> If you have a setup for measuring these, should we try and publish that
> too so that people can run it on their platform and provide these
> numbers?

The workload setup I used is quite simple. I ran sysbench under taskset with
different numbers of threads to extrapolate power consumed by each
individual cpu and how much comes from just powering on the domain.

Measuring the actual power is very platform specific. Developing a fully
automated tool to do it for any given platform isn't straightforward, but
I'm happy to share how I did it. I can add a description of the method I
used on TC2 to the documentation so others can use it as reference.
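
The extrapolation itself is just linear fitting. Assuming the
cluster + n*cpu model above, and measuring cluster energy E(n) for n
busy cpus at a fixed P-state:

	P_cpu     = E(2) - E(1)
	P_cluster = E(1) - P_cpu

so in principle two runs per P-state are enough; more thread counts
help average out measurement noise.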

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-03 11:50   ` Peter Zijlstra
@ 2014-06-04 16:02     ` Morten Rasmussen
  2014-06-04 17:27       ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-04 16:02 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann

On Tue, Jun 03, 2014 at 12:50:15PM +0100, Peter Zijlstra wrote:
> On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > +static struct capacity_state cap_states_cluster_a7[] = {
> > +	/* Cluster only power */
> > +	 { .cap =  358, .power = 2967, }, /*  350 MHz */
> > +	 { .cap =  410, .power = 2792, }, /*  400 MHz */
> > +	 { .cap =  512, .power = 2810, }, /*  500 MHz */
> > +	 { .cap =  614, .power = 2815, }, /*  600 MHz */
> > +	 { .cap =  717, .power = 2919, }, /*  700 MHz */
> > +	 { .cap =  819, .power = 2847, }, /*  800 MHz */
> > +	 { .cap =  922, .power = 3917, }, /*  900 MHz */
> > +	 { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > +	};
> 
> So one thing I remember was that we spoke about restricting this to
> frequency levels where the voltage changed.
> 
> Because voltage jumps were the biggest factor to energy usage.
> 
> Any word on that?

Since we don't drive P-state changes from the scheduler, I think we
could leave out P-states from the table without too much trouble. Good
point.

TC2 is an early development platform and somewhat different from what
you find in end user products. TC2 actually uses the same voltage for
all states except the highest 2-3 states. That is not typical. The
voltage is typically slightly different for each state, however, the
difference gets bigger for higher P-states. We could probably get away
with representing multiple states as one in the energy model if the
voltage change is minimal.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-04 15:42     ` Morten Rasmussen
@ 2014-06-04 16:16       ` Peter Zijlstra
  2014-06-06 13:15         ` Morten Rasmussen
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-04 16:16 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann

[-- Attachment #1: Type: text/plain, Size: 3424 bytes --]

On Wed, Jun 04, 2014 at 04:42:27PM +0100, Morten Rasmussen wrote:
> On Tue, Jun 03, 2014 at 12:44:28PM +0100, Peter Zijlstra wrote:
> > On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > > +static struct capacity_state cap_states_cluster_a7[] = {
> > > +	/* Cluster only power */
> > > +	 { .cap =  358, .power = 2967, }, /*  350 MHz */
> > > +	 { .cap =  410, .power = 2792, }, /*  400 MHz */
> > > +	 { .cap =  512, .power = 2810, }, /*  500 MHz */
> > > +	 { .cap =  614, .power = 2815, }, /*  600 MHz */
> > > +	 { .cap =  717, .power = 2919, }, /*  700 MHz */
> > > +	 { .cap =  819, .power = 2847, }, /*  800 MHz */
> > > +	 { .cap =  922, .power = 3917, }, /*  900 MHz */
> > > +	 { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > > +	};
> > > +
> > > +static struct capacity_state cap_states_cluster_a15[] = {
> > > +	/* Cluster only power */
> > > +	 { .cap =  840, .power =  7920, }, /*  500 MHz */
> > > +	 { .cap = 1008, .power =  8165, }, /*  600 MHz */
> > > +	 { .cap = 1176, .power =  8172, }, /*  700 MHz */
> > > +	 { .cap = 1343, .power =  8195, }, /*  800 MHz */
> > > +	 { .cap = 1511, .power =  8265, }, /*  900 MHz */
> > > +	 { .cap = 1679, .power =  8446, }, /* 1000 MHz */
> > > +	 { .cap = 1847, .power = 11426, }, /* 1100 MHz */
> > > +	 { .cap = 2015, .power = 15200, }, /* 1200 MHz */
> > > +	};
> > 
> > 
> > So how did you obtain these numbers? Did you use numbers provided by the
> > hardware people, or did you run a particular benchmark and record the
> > power usage?
> >
> > Does that benchmark do some actual work (as opposed to a while(1) loop)
> > to keep more silicon lit up?
> 
> Hardware people don't like sharing data, so I did my own measurements
> and calculations to get the numbers above.
> 
> ARM TC2 has on-chip energy counters for counting energy consumed by the
> A7 and A15 clusters. They are fairly accurate. 

Recent Intel chips have that too; they come packaged as:

  perf stat -a -e "power/energy-cores/" -- cmd

(through the perf_event_intel_rapl.c driver), It would be ideal if the
ARM equivalent was available through a similar interface.

http://lwn.net/Articles/573602/

> I used sysbench cpu
> benchmark as test workload for the above numbers. sysbench might not be
> a representative workload, but it is easy to use. I think, ideally,
> vendors would run their own mix of workloads they care about and derive
> their numbers for their platform based on that.
> 
> > If you have a setup for measuring these, should we try and publish that
> > too so that people can run it on their platform and provide these
> > numbers?
> 
> The workload setup I used is quite simple. I ran sysbench under taskset with
> different numbers of threads to extrapolate power consumed by each
> individual cpu and how much comes from just powering on the domain.
> 
> Measuring the actual power is very platform specific. Developing a fully
> automated tool to do it for any given platform isn't straightforward, but
> I'm happy to share how I did it. I can add a description of the method I
> used on TC2 to the documentation so others can use it as reference.

That would be good I think, esp. if we can get similar perf based energy
measurement things sorted. And if we make the tool consume the machine
topology present in sysfs we can get a long way towards automating this
I think.



[-- Attachment #2: Type: application/pgp-signature, Size: 836 bytes --]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-04 16:02     ` Morten Rasmussen
@ 2014-06-04 17:27       ` Peter Zijlstra
  2014-06-04 21:56         ` Rafael J. Wysocki
                           ` (2 more replies)
  0 siblings, 3 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-04 17:27 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann

On Wed, Jun 04, 2014 at 05:02:30PM +0100, Morten Rasmussen wrote:
> On Tue, Jun 03, 2014 at 12:50:15PM +0100, Peter Zijlstra wrote:
> > On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > > +static struct capacity_state cap_states_cluster_a7[] = {
> > > +	/* Cluster only power */
> > > +	 { .cap =  358, .power = 2967, }, /*  350 MHz */
> > > +	 { .cap =  410, .power = 2792, }, /*  400 MHz */
> > > +	 { .cap =  512, .power = 2810, }, /*  500 MHz */
> > > +	 { .cap =  614, .power = 2815, }, /*  600 MHz */
> > > +	 { .cap =  717, .power = 2919, }, /*  700 MHz */
> > > +	 { .cap =  819, .power = 2847, }, /*  800 MHz */
> > > +	 { .cap =  922, .power = 3917, }, /*  900 MHz */
> > > +	 { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > > +	};
> > 
> > So one thing I remember was that we spoke about restricting this to
> > frequency levels where the voltage changed.
> > 
> > Because voltage jumps were the biggest factor to energy usage.
> > 
> > Any word on that?
> 
> Since we don't drive P-state changes from the scheduler, I think we
> could leave out P-states from the table without too much trouble. Good
> point.

Well, we eventually want to go there I think. Although we still needed
to come up with something for Intel, because I'm not at all sure how all
that works.

> TC2 is an early development platform and somewhat different from what
> you find in end user products. TC2 actually uses the same voltage for
> all states except the highest 2-3 states. That is not typical. The
> voltage is typically slightly different for each state, however, the
> difference get bigger for higher P-states. We could probably get away
> with representing multiple states as one in the energy model if the
> voltage change is minimal.

So while I don't mind the full table, esp. if it's fairly easy to
generate using that tool you spoke about, I just wondered if it made
sense to somewhat reduce it.

Now that I look at the actual .power values, you can indeed see that all
except the last two are pretty much similar in power usage.

On that, is that fluctuation measurement noise, or is that stable?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-04 17:27       ` Peter Zijlstra
@ 2014-06-04 21:56         ` Rafael J. Wysocki
  2014-06-05  6:52           ` Peter Zijlstra
  2014-06-06 13:03         ` Morten Rasmussen
  2014-06-07  2:52         ` Nicolas Pitre
  2 siblings, 1 reply; 71+ messages in thread
From: Rafael J. Wysocki @ 2014-06-04 21:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, linux-kernel, linux-pm, mingo, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann

On Wednesday, June 04, 2014 07:27:12 PM Peter Zijlstra wrote:
> On Wed, Jun 04, 2014 at 05:02:30PM +0100, Morten Rasmussen wrote:
> > On Tue, Jun 03, 2014 at 12:50:15PM +0100, Peter Zijlstra wrote:
> > > On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > > > +static struct capacity_state cap_states_cluster_a7[] = {
> > > > +	/* Cluster only power */
> > > > +	 { .cap =  358, .power = 2967, }, /*  350 MHz */
> > > > +	 { .cap =  410, .power = 2792, }, /*  400 MHz */
> > > > +	 { .cap =  512, .power = 2810, }, /*  500 MHz */
> > > > +	 { .cap =  614, .power = 2815, }, /*  600 MHz */
> > > > +	 { .cap =  717, .power = 2919, }, /*  700 MHz */
> > > > +	 { .cap =  819, .power = 2847, }, /*  800 MHz */
> > > > +	 { .cap =  922, .power = 3917, }, /*  900 MHz */
> > > > +	 { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > > > +	};
> > > 
> > > So one thing I remember was that we spoke about restricting this to
> > > frequency levels where the voltage changed.
> > > 
> > > Because voltage jumps were the biggest factor in energy usage.
> > > 
> > > Any word on that?
> > 
> > Since we don't drive P-state changes from the scheduler, I think we
> > could leave out P-states from the table without too much trouble. Good
> > point.
> 
> Well, we eventually want to go there I think. Although we still needed
> to come up with something for Intel, because I'm not at all sure how all
> that works.

Do you mean power numbers or how P-states work on Intel in general?

Rafael



* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-04 21:56         ` Rafael J. Wysocki
@ 2014-06-05  6:52           ` Peter Zijlstra
  2014-06-05 15:03             ` Dirk Brandewie
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-05  6:52 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Morten Rasmussen, linux-kernel, linux-pm, mingo, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann, len.brown


On Wed, Jun 04, 2014 at 11:56:55PM +0200, Rafael J. Wysocki wrote:
> On Wednesday, June 04, 2014 07:27:12 PM Peter Zijlstra wrote:

> > Well, we eventually want to go there I think. Although we still needed
> > to come up with something for Intel, because I'm not at all sure how all
> > that works.
> 
> Do you mean power numbers or how P-states work on Intel in general?

P-states, I'm still not at all sure how all that works on Intel and what
we can sanely do with them.

Supposedly Intel has a means of setting P-states (there's a driver after
all), but then is completely free to totally ignore it and do something
entirely different anyhow.

And while APERF/MPERF allows observing what it did, it's, afaik, nigh on
impossible to predict wtf it's going to do, and therefore any such energy
computation is going to be a PRNG at best.
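
For reference, the observation side is simple enough; a kernel-side
sketch, assuming the usual rdmsrl() accessors (per-cpu bookkeeping and
zero-delta handling omitted):

	/*
	 * Sketch: ratio of delivered to base frequency since the last
	 * call, in percent. IA32_APERF/IA32_MPERF only count in C0, so
	 * the ratio of the deltas covers busy time only. Observational
	 * only; it says nothing about future P-state decisions.
	 */
	#include <linux/math64.h>
	#include <asm/msr.h>

	static u64 prev_aperf, prev_mperf;

	static u64 aperf_mperf_ratio_pct(void)
	{
		u64 aperf, mperf, pct;

		rdmsrl(MSR_IA32_APERF, aperf);
		rdmsrl(MSR_IA32_MPERF, mperf);

		pct = div64_u64(100 * (aperf - prev_aperf),
				mperf - prev_mperf);

		prev_aperf = aperf;
		prev_mperf = mperf;

		return pct;	/* 100 == running at base frequency */
	}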

Now, given all that I'm not sure what we need that P-state driver for,
so supposedly I'm missing something.

Ideally Len (or someone equally in-the-know) would explain to me how
exactly all that works and what we can rely upon. All I've gotten so far
is, you can't rely on anything, and magik. Which is entirely useless.



* Re: [RFC PATCH 01/16] sched: Documentation for scheduler energy cost model
  2014-05-23 18:16 ` [RFC PATCH 01/16] sched: Documentation for scheduler energy cost model Morten Rasmussen
@ 2014-06-05  8:49   ` Vincent Guittot
  2014-06-05 11:35     ` Morten Rasmussen
  0 siblings, 1 reply; 71+ messages in thread
From: Vincent Guittot @ 2014-06-05  8:49 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: linux-kernel, linux-pm, Peter Zijlstra, Ingo Molnar, rjw,
	Daniel Lezcano, Preeti U Murthy, Dietmar Eggemann

Hi Morten,

On 23 May 2014 20:16, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> This documentation patch provides a brief overview of the experimental
> scheduler energy costing model and associated data structures.
>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  Documentation/scheduler/sched-energy.txt |   66 ++++++++++++++++++++++++++++++
>  1 file changed, 66 insertions(+)
>  create mode 100644 Documentation/scheduler/sched-energy.txt
>
> diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
> new file mode 100644
> index 0000000..c6896c0
> --- /dev/null
> +++ b/Documentation/scheduler/sched-energy.txt
> @@ -0,0 +1,66 @@
> +Energy cost model for energy-aware scheduling (EXPERIMENTAL)
> +
> +Introduction
> +=============
> +The basic energy model uses platform energy data stored in sched_energy data
> +structures attached to the sched_groups in the sched_domain hierarchy. The
> +energy cost model offers two functions that can be used to guide scheduling
> +decisions:
> +
> +1.     energy_diff_util(cpu, util, wakeups)

Could you give us more details of what util and wakeups are?
Is util an absolute value or a delta?
Is wakeups a boolean or does it define the number of tasks/cpus
that wake up?

> +2.     energy_diff_task(cpu, task)
> +
> +Both return the energy cost delta caused by adding/removing utilization or a
> +task to/from a specific cpu.
> +
> +CONFIG_SCHED_ENERGY needs to be defined in Kconfig to enable the energy cost
> +model and associated data structures.
> +
> +The basic algorithm
> +====================
> +The basic idea is to determine the energy cost at each level in sched_domain
> +hierarchy based on utilization:
> +
> +       for_each_domain(cpu, sd) {
> +               sg = sched_group_of(cpu)
> +               energy_before = curr_util(sg) * busy_power(sg)
> +                               + (1 - curr_util(sg)) * idle_power(sg)
> +               energy_after = new_util(sg) * busy_power(sg)
> +                               + (1 - new_util(sg)) * idle_power(sg)
> +                               + new_util(sg) * task_wakeups
> +                                                       * wakeup_energy(sg)
> +               energy_diff += energy_before - energy_after
> +       }
> +
> +       return energy_diff

So this is the algorithm used in energy_diff_util and energy_diff_task?

It's not straightforward for me to map the algorithm variables to the
function arguments.

> +
> +Platform energy data
> +=====================
> +struct sched_energy has the following members:
> +
> +cap_states:
> +       List of struct capacity_state representing the supported capacity states
> +       (P-states). struct capacity_state has two members: cap and power, which
> +       represent the compute capacity and the busy power of the state. The
> +       list must be ordered by capacity low->high.
> +
> +nr_cap_states:
> +       Number of capacity states in cap_states.
> +
> +max_capacity:
> +       The highest capacity supported by any of the capacity states in
> +       cap_states.

> can't you directly use cap_states[nr_cap_states-1].cap as the array is ordered?

Vincent
> +
> +idle_power:
> +       Idle power consumption. Will be extended to support multiple C-states
> +       later.
> +
> +wakeup_energy:
> +       Energy cost of wakeup/power-down cycle for the sched_group which this is
> +       attached to. Will be extended to support different costs for different
> +       C-states later.
> +
> +There are no unit requirements for the energy cost data. Data can be normalized
> +with any reference, however, the normalization must be consistent across all
> +energy cost data. That is, one bogo-joule/watt must be the same quantity for all data,
> +but we don't care what it is.
> --
> 1.7.9.5
>
>


* Re: [RFC PATCH 01/16] sched: Documentation for scheduler energy cost model
  2014-06-05  8:49   ` Vincent Guittot
@ 2014-06-05 11:35     ` Morten Rasmussen
  2014-06-05 15:02       ` Vincent Guittot
  0 siblings, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-05 11:35 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: linux-kernel, linux-pm, Peter Zijlstra, Ingo Molnar, rjw,
	Daniel Lezcano, Preeti U Murthy, Dietmar Eggemann

On Thu, Jun 05, 2014 at 09:49:35AM +0100, Vincent Guittot wrote:
> Hi Morten,
> 
> On 23 May 2014 20:16, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> > This documentation patch provides a brief overview of the experimental
> > scheduler energy costing model and associated data structures.
> >
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  Documentation/scheduler/sched-energy.txt |   66 ++++++++++++++++++++++++++++++
> >  1 file changed, 66 insertions(+)
> >  create mode 100644 Documentation/scheduler/sched-energy.txt
> >
> > diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
> > new file mode 100644
> > index 0000000..c6896c0
> > --- /dev/null
> > +++ b/Documentation/scheduler/sched-energy.txt
> > @@ -0,0 +1,66 @@
> > +Energy cost model for energy-aware scheduling (EXPERIMENTAL)
> > +
> > +Introduction
> > +=============
> > +The basic energy model uses platform energy data stored in sched_energy data
> > +structures attached to the sched_groups in the sched_domain hierarchy. The
> > +energy cost model offers two functions that can be used to guide scheduling
> > +decisions:
> > +
> > +1.     energy_diff_util(cpu, util, wakeups)
> 
> Could you give us more details of what util and wakeups are?
> Is util an absolute value or a delta?
> Is wakeups a boolean or does it define the number of tasks/cpus
> that wake up?

Good point... It is not clear at all. Improving the documentation is at
the top of my todo list.

cpu: The cpu in question.

util: Is a signed utilization delta. That is, the amount of utilization
we want to add or remove from the cpu. We don't have a good metric for
utilization yet (I assume you have followed the thread on that topic
that started from your recent patch posting), so for now I have used
load_avg_contrib. energy_diff_task() just passes the task
load_avg_contrib as the utilization to energy_diff_load().

wakeups: Is the number of wakeups (task enqueues, not idle exits) caused
by the utilization we are about to add or remove from the cpu. We need
to pick some period to measure the wakeups over. For that I have
introduced task wakeup tracking, very similar to the existing load tracking.
The wakeup tracking gives us an indication of how often a task will
cause an idle exit if it ran alone on a cpu. For short but frequently
running tasks, the wakeup cost may be where the majority of the energy
is spent.
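
In signature terms, that amounts to roughly the following; the types are
assumed for illustration and are not the exact RFC prototypes:

	/*
	 * Hypothetical prototypes matching the description above.
	 * util: signed utilization delta (load_avg_contrib for now).
	 * wakeups: tracked wakeup rate of the utilization being moved.
	 */
	int energy_diff_util(int cpu, int util, unsigned int wakeups);
	int energy_diff_task(int cpu, struct task_struct *p);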

> 
> > +2.     energy_diff_task(cpu, task)
> > +
> > +Both return the energy cost delta caused by adding/removing utilization or a
> > +task to/from a specific cpu.
> > +
> > +CONFIG_SCHED_ENERGY needs to be defined in Kconfig to enable the energy cost
> > +model and associated data structures.
> > +
> > +The basic algorithm
> > +====================
> > +The basic idea is to determine the energy cost at each level in sched_domain
> > +hierarchy based on utilization:
> > +
> > +       for_each_domain(cpu, sd) {
> > +               sg = sched_group_of(cpu)
> > +               energy_before = curr_util(sg) * busy_power(sg)
> > +                               + (1 - curr_util(sg)) * idle_power(sg)
> > +               energy_after = new_util(sg) * busy_power(sg)
> > +                               + (1 - new_util(sg)) * idle_power(sg)
> > +                               + new_util(sg) * task_wakeups
> > +                                                       * wakeup_energy(sg)
> > +               energy_diff += energy_before - energy_after
> > +       }
> > +
> > +       return energy_diff
> 
> So this is the algorithm used in energy_diff_util and energy_diff_task?

It is. energy_diff_task() is basically just a wrapper for
energy_diff_util().
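
I.e., something along these lines, where the field names are assumed for
illustration ('wakeup_avg' stands in for the new wakeup-tracking metric):

	/* Sketch of the wrapper; field names assumed, not the RFC code. */
	static int energy_diff_task(int cpu, struct task_struct *p)
	{
		return energy_diff_util(cpu, p->se.avg.load_avg_contrib,
					p->se.avg.wakeup_avg);
	}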

> It's not straightforward for me to map the algorithm variables to the
> function arguments.

The pseudo-code above is very simplified. It is an attempt to show that
the algorithm goes up the sched_domain hierarchy and estimates the
energy impact of adding/removing 'util' amount of utilization to/from
the cpu.

{curr, new}_util is the cpu utilization at the lowest level and
the overall non-idle time for the entire group for higher levels.
Utilization is in the range 0.0 to 1.0.

busy_power is the power consumption of the group (for TC2, cpu at the
lowest level, cluster at the next).

idle_power is the power consumption of the group while idle (for TC2,
WFI at the lowest level, cluster power down at cluster level).

task_wakeups (should have been just 'wakeups' in the general case) is the
number of wakeups caused by the utilization we are adding/removing. To
predict how many of the wakeups cause idle exits, we scale the
number by the utilization (assuming that wakeups are uniformly
distributed). wakeup_energy is the energy consumed for an idle
exit/entry cycle for the group (for TC2, WFI at lowest level, cluster
power down at cluster level).

At each level we need to compute the energy before and after the change
to find the energy delta.
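
In slightly more concrete C-like form, still heavily simplified, with
utilization as a 1024-based fixed-point fraction (1024 == 1.0) and
group_util()/busy_power()/idle_power()/wakeup_energy() as assumed
shorthand helpers:

	/*
	 * Sketch of the hierarchy walk. Sign convention follows the
	 * pseudo-code above (energy_before - energy_after). Every term
	 * carries the same 1024 scale factor, so only the relative
	 * comparison of deltas matters; overflow handling is ignored.
	 */
	static int energy_diff_util(int cpu, int util, unsigned int wakeups)
	{
		struct sched_domain *sd;
		int energy_diff = 0;

		for_each_domain(cpu, sd) {
			struct sched_group *sg = sd->groups;
			int curr_util = group_util(sg);	/* 0..1024 */
			int new_util = curr_util + util;

			int before = curr_util * busy_power(sg) +
				     (1024 - curr_util) * idle_power(sg);
			int after  = new_util * busy_power(sg) +
				     (1024 - new_util) * idle_power(sg) +
				     new_util * wakeups * wakeup_energy(sg);

			energy_diff += before - after;
		}

		return energy_diff;
	}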

Does that answer your question?

> 
> > +
> > +Platform energy data
> > +=====================
> > +struct sched_energy has the following members:
> > +
> > +cap_states:
> > +       List of struct capacity_state representing the supported capacity states
> > +       (P-states). struct capacity_state has two members: cap and power, which
> > +       represent the compute capacity and the busy power of the state. The
> > +       list must be ordered by capacity low->high.
> > +
> > +nr_cap_states:
> > +       Number of capacity states in cap_states.
> > +
> > +max_capacity:
> > +       The highest capacity supported by any of the capacity states in
> > +       cap_states.
> 
> can't you directly use cap_states[nr_cap_states-1].cap as the array is ordered?

Yes, indeed. max_capacity can be removed.
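
That would leave something like the below, with field types assumed from
the member descriptions in the patch:

	/*
	 * Sketch of struct sched_energy without max_capacity; since
	 * cap_states is ordered low->high, the maximum capacity is simply
	 * cap_states[nr_cap_states - 1].cap.
	 */
	struct capacity_state {
		unsigned long cap;		/* compute capacity */
		unsigned long power;		/* busy power at this state */
	};

	struct sched_energy {
		struct capacity_state *cap_states;	/* ordered low->high */
		int nr_cap_states;
		unsigned long idle_power;
		unsigned long wakeup_energy;
	};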

Morten



* Re: [RFC PATCH 01/16] sched: Documentation for scheduler energy cost model
  2014-06-05 11:35     ` Morten Rasmussen
@ 2014-06-05 15:02       ` Vincent Guittot
  0 siblings, 0 replies; 71+ messages in thread
From: Vincent Guittot @ 2014-06-05 15:02 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: linux-kernel, linux-pm, Peter Zijlstra, Ingo Molnar, rjw,
	Daniel Lezcano, Preeti U Murthy, Dietmar Eggemann

On 5 June 2014 13:35, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
> On Thu, Jun 05, 2014 at 09:49:35AM +0100, Vincent Guittot wrote:
>> Hi Morten,
>>
>> On 23 May 2014 20:16, Morten Rasmussen <morten.rasmussen@arm.com> wrote:
>> > This documentation patch provides a brief overview of the experimental
>> > scheduler energy costing model and associated data structures.
>> >
>> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
>> > ---
>> >  Documentation/scheduler/sched-energy.txt |   66 ++++++++++++++++++++++++++++++
>> >  1 file changed, 66 insertions(+)
>> >  create mode 100644 Documentation/scheduler/sched-energy.txt
>> >
>> > diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
>> > new file mode 100644
>> > index 0000000..c6896c0
>> > --- /dev/null
>> > +++ b/Documentation/scheduler/sched-energy.txt
>> > @@ -0,0 +1,66 @@
>> > +Energy cost model for energy-aware scheduling (EXPERIMENTAL)
>> > +
>> > +Introduction
>> > +=============
>> > +The basic energy model uses platform energy data stored in sched_energy data
>> > +structures attached to the sched_groups in the sched_domain hierarchy. The
>> > +energy cost model offers two functions that can be used to guide scheduling
>> > +decisions:
>> > +
>> > +1.     energy_diff_util(cpu, util, wakeups)
>>
>> Could you give us more details of what util and wakeups are?
>> Is util an absolute value or a delta?
>> Is wakeups a boolean or does it define the number of tasks/cpus
>> that wake up?
>
> Good point... It is not clear at all. Improving the documentation is at
> the top of my todo list.
>
> cpu: The cpu in question.
>
> util: Is a signed utilization delta. That is, the amount of utilization
> we want to add or remove from the cpu. We don't have a good metric for
> utilization yet (I assume you have followed the thread on that topic
> that started from your recent patch posting), so for now I have used
> load_avg_contrib. energy_diff_task() just passes the task
> load_avg_contrib as the utilization to energy_diff_load().
>
> wakeups: Is the number of wakeups (task enqueues, not idle exits) caused
> by the utilization we are about to add or remove from the cpu. We need
> to pick some period to measure the wakeups over. For that I have
> introduced task wakeup tracking, very similar to the existing load tracking.
> The wakeup tracking gives us an indication of how often a task will
> cause an idle exit if it ran alone on a cpu. For short but frequently
> running tasks, the wakeup cost may be where the majority of the energy
> is spent.
>
>>
>> > +2.     energy_diff_task(cpu, task)
>> > +
>> > +Both return the energy cost delta caused by adding/removing utilization or a
>> > +task to/from a specific cpu.
>> > +
>> > +CONFIG_SCHED_ENERGY needs to be defined in Kconfig to enable the energy cost
>> > +model and associated data structures.
>> > +
>> > +The basic algorithm
>> > +====================
>> > +The basic idea is to determine the energy cost at each level in sched_domain
>> > +hierarchy based on utilization:
>> > +
>> > +       for_each_domain(cpu, sd) {
>> > +               sg = sched_group_of(cpu)
>> > +               energy_before = curr_util(sg) * busy_power(sg)
>> > +                               + (1 - curr_util(sg)) * idle_power(sg)
>> > +               energy_after = new_util(sg) * busy_power(sg)
>> > +                               + (1 - new_util(sg)) * idle_power(sg)
>> > +                               + new_util(sg) * task_wakeups
>> > +                                                       * wakeup_energy(sg)
>> > +               energy_diff += energy_before - energy_after
>> > +       }
>> > +
>> > +       return energy_diff
>>
>> So this is the algorithm used in energy_diff_util and energy_diff_task?
>
> It is. energy_diff_task() is basically just a wrapper for
> energy_diff_util().
>
>> It's not straightforward for me to map the algorithm variables to the
>> function arguments.
>
> The pseudo-code above is very simplified. It is an attempt to show that
> the algorithm goes up the sched_domain hierarchy and estimates the
> energy impact of adding/removing 'util' amount of utilization to/from
> the cpu.
>
> {curr, new}_util is the cpu utilization at the lowest level and
> the overall non-idle time for the entire group for higher levels.
> Utilization is in the range 0.0 to 1.0.
>
> busy_power is the power consumption of the group (for TC2, cpu at the
> lowest level, cluster at the next).
>
> idle_power is the power consumption of the group while idle (for TC2,
> WFI at the lowest level, cluster power down at cluster level).
>
> task_wakeups (should have been just 'wakeups' in the general case) is the
> number of wakeups caused by the utilization we are adding/removing. To
> predict how many of the wakeups cause idle exits, we scale the
> number by the utilization (assuming that wakeups are uniformly
> distributed). wakeup_energy is the energy consumed for an idle
> exit/entry cycle for the group (for TC2, WFI at lowest level, cluster
> power down at cluster level).
>
> At each level we need to compute the energy before and after the change
> to find the energy delta.
>
> Does that answer your question?

yes, thanks

>
>>
>> > +
>> > +Platform energy data
>> > +=====================
>> > +struct sched_energy has the following members:
>> > +
>> > +cap_states:
>> > +       List of struct capacity_state representing the supported capacity states
>> > +       (P-states). struct capacity_state has two members: cap and power, which
>> > +       represent the compute capacity and the busy power of the state. The
>> > +       list must be ordered by capacity low->high.
>> > +
>> > +nr_cap_states:
>> > +       Number of capacity states in cap_states.
>> > +
>> > +max_capacity:
>> > +       The highest capacity supported by any of the capacity states in
>> > +       cap_states.
>>
>> can't you directly use cap_states[nr_cap_states-1].cap as the array is ordered?
>
> Yes, indeed. max_capacity can be removed.
>
> Morten
>


* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-05  6:52           ` Peter Zijlstra
@ 2014-06-05 15:03             ` Dirk Brandewie
  2014-06-05 20:29               ` Yuyang Du
  0 siblings, 1 reply; 71+ messages in thread
From: Dirk Brandewie @ 2014-06-05 15:03 UTC (permalink / raw)
  To: Peter Zijlstra, Rafael J. Wysocki
  Cc: dirk.brandewie, Morten Rasmussen, linux-kernel, linux-pm, mingo,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann,
	len.brown

On 06/04/2014 11:52 PM, Peter Zijlstra wrote:
> On Wed, Jun 04, 2014 at 11:56:55PM +0200, Rafael J. Wysocki wrote:
>> On Wednesday, June 04, 2014 07:27:12 PM Peter Zijlstra wrote:
>
>>> Well, we eventually want to go there I think. Although we still needed
>>> to come up with something for Intel, because I'm not at all sure how all
>>> that works.
>>
>> Do you mean power numbers or how P-states work on Intel in general?
>
> P-states, I'm still not at all sure how all that works on Intel and what
> we can sanely do with them.
>
> Supposedly Intel has a means of setting P-states (there's a driver after
> all), but then is completely free to totally ignore it and do something
> entirely different anyhow.

You can request a P state per core but the package does coordination at
a package level for the P state that will be used based on all requests.
This is due to the fact that most SKUs have a single VR and PLL. So
the highest P state wins.  When a core goes idle it loses its vote
for the current package P state and that core's clock is turned off.
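
To a first approximation, the arbitration is just (a toy model of the
behaviour described above, not actual firmware):

	/*
	 * Toy model: the package runs at the highest P-state requested by
	 * any non-idle core; an idle core's request simply drops out.
	 * (bool as in <linux/types.h> / <stdbool.h>.)
	 */
	static int package_pstate(const int *request, const bool *idle, int nr)
	{
		int i, pkg = 0;	/* 0 == lowest P-state */

		for (i = 0; i < nr; i++)
			if (!idle[i] && request[i] > pkg)
				pkg = request[i];

		return pkg;
	}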

>
> >And while APERF/MPERF allows observing what it did, it's, afaik, nigh on
> >impossible to predict wtf it's going to do, and therefore any such energy
> computation is going to be a PRNG at best.
>
> Now, given all that I'm not sure what we need that P-state driver for,
> so supposedly I'm missing something.

intel_pstate tries to keep the core P state as low as possible to satisfy
the given load, so when various cores go idle the package P state can be
as low as possible.  The big power win is a core going idle.

>
> Ideally Len (or someone equally in-the-know) would explain to me how
> exactly all that works and what we can rely upon. All I've gotten so far
> is, you can't rely on anything, and magik. Which is entirely useless.
>
The only thing you can rely on is that you will get "at least" the P state
requested in the presence of hardware coordination.


* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-05 15:03             ` Dirk Brandewie
@ 2014-06-05 20:29               ` Yuyang Du
  2014-06-06  8:05                 ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Yuyang Du @ 2014-06-05 20:29 UTC (permalink / raw)
  To: Dirk Brandewie
  Cc: Peter Zijlstra, Rafael J. Wysocki, Morten Rasmussen,
	linux-kernel, linux-pm, mingo, vincent.guittot, daniel.lezcano,
	preeti, Dietmar Eggemann, len.brown

On Thu, Jun 05, 2014 at 08:03:15AM -0700, Dirk Brandewie wrote:
> 
> You can request a P state per core but the package does coordination at
> a package level for the P state that will be used based on all requests.
> This is due to the fact that most SKUs have a single VR and PLL. So
> the highest P state wins.  When a core goes idle it loses its vote
> for the current package P state and that core's clock is turned off.
> 

You need to differentiate Turbo and non-Turbo. The highest P state wins? Not
really. Actually, the silicon supports independent non-Turbo P-states, it is just not enabled.
For Turbo, which Turbo point each core gets basically depends on the power
budget of both core and gfx (because they share).

> >
> >And while APERF/MPERF allows observing what it did, it's, afaik, nigh on
> >impossible to predict wtf it's going to do, and therefore any such energy
> >computation is going to be a PRNG at best.
> >
> >Now, given all that I'm not sure what we need that P-state driver for,
> >so supposedly I'm missing something.
> 
> intel_pstate tries to keep the core P state as low as possible to satisfy
> the given load, so when various cores go idle the package P state can be
> as low as possible.  The big power win is a core going idle.
> 

In terms of prediction, it definitely can't be 100% right. But the
performance of most workloads does scale with P-state (frequency), though maybe
not linearly. So it is to some extent predictable, FWIW. And this is the basic
assumption of all governors and intel_pstate.

Thanks,
Yuyang


* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-06  8:05                 ` Peter Zijlstra
@ 2014-06-06  0:35                   ` Yuyang Du
  2014-06-06 10:50                     ` Peter Zijlstra
  2014-06-06 16:27                     ` Jacob Pan
  0 siblings, 2 replies; 71+ messages in thread
From: Yuyang Du @ 2014-06-06  0:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dirk Brandewie, Rafael J. Wysocki, Morten Rasmussen,
	linux-kernel, linux-pm, mingo, vincent.guittot, daniel.lezcano,
	preeti, Dietmar Eggemann, len.brown, jacob.jun.pan

On Fri, Jun 06, 2014 at 10:05:43AM +0200, Peter Zijlstra wrote:
> On Fri, Jun 06, 2014 at 04:29:30AM +0800, Yuyang Du wrote:
> > On Thu, Jun 05, 2014 at 08:03:15AM -0700, Dirk Brandewie wrote:
> > > 
> > > You can request a P state per core but the package does coordination at
> > > a package level for the P state that will be used based on all requests.
> > > This is due to the fact that most SKUs have a single VR and PLL. So
> > > the highest P state wins.  When a core goes idle it loses its vote
> > > for the current package P state and that core's clock is turned off.
> > > 
> > 
> > You need to differentiate Turbo and non-Turbo. The highest P state wins? Not
> > really.
> 
> *sigh* and here we go again.. someone please, write something coherent
> and have all intel people sign off on it and stop saying different
> things.
> 
> > Actually, the silicon supports independent non-Turbo P-states, it is just not enabled.
> 
> Then it doesn't exist, so no point in mentioning it.
> 

Well, things actually get more complicated. Not-enabled is for Core. For Atom
Baytrail, each core indeed can operate on a different frequency. I am not sure
about Xeon, :)

> > For Turbo, which Turbo point each core gets basically depends on the power
> > budget of both core and gfx (because they share).
> 
> And RAPL controls can give preference of which gfx/core gets most,
> right?
>

Maybe Jacob knows that.

> > > intel_pstate tries to keep the core P state as low as possible to satisfy
> > > the given load, so when various cores go idle the package P state can be
> > > as low as possible.  The big power win is a core going idle.
> > > 
> > 
> > In terms of prediction, it definitely can't be 100% right. But the
> > performance of most workloads does scale with P-state (frequency), though maybe
> > not linearly. So it is to some extent predictable, FWIW. And this is the basic
> > assumption of all governors and intel_pstate.
> 
> So frequency isn't _that_ interesting, voltage is. And while
> > predictability might be their assumption, is it actually true? I
> > mean, there's really nothing else except to assume that, if it's not you
> can't do anything at all, so you _have_ to assume this.
> 
> But again, is the assumption true? Or just happy thoughts in an attempt
> to do something.

Voltage is combined with frequency: roughly, voltage is proportional to frequency, so,
roughly, power is proportional to voltage^3. You can't say which is more important,
or there is no reason to raise voltage without raising frequency.

If I had only one word to say, true or false: it is true. Because given any fixed
workload, I can't see why performance would be worse if frequency is higher.

The reality as opposed to the assumption is two-fold:
1) If the workload is CPU bound, performance scales with frequency absolutely. If the
   workload is memory bound, it does not scale. But from the kernel, we don't know
   whether it is CPU bound or not (or it is hard to know). uArch statistics can model that.
2) The workload is not fixed in real time; it is changing all the time.

But still, the assumption is a must, and it does no harm, because we adjust frequency
continuously: for example, if the workload is fixed and the performance does not scale
with freq, we stop increasing the frequency. So a good frequency governor or driver can
and should continuously pursue a "good" frequency for the changing workload. Therefore,
in the long term, we will be better off.



* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-05 20:29               ` Yuyang Du
@ 2014-06-06  8:05                 ` Peter Zijlstra
  2014-06-06  0:35                   ` Yuyang Du
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-06  8:05 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Dirk Brandewie, Rafael J. Wysocki, Morten Rasmussen,
	linux-kernel, linux-pm, mingo, vincent.guittot, daniel.lezcano,
	preeti, Dietmar Eggemann, len.brown


On Fri, Jun 06, 2014 at 04:29:30AM +0800, Yuyang Du wrote:
> On Thu, Jun 05, 2014 at 08:03:15AM -0700, Dirk Brandewie wrote:
> > 
> > You can request a P state per core but the package does coordination at
> > a package level for the P state that will be used based on all requests.
> > This is due to the fact that most SKUs have a single VR and PLL. So
> > the highest P state wins.  When a core goes idle it loses its vote
> > for the current package P state and that core's clock is turned off.
> > 
> 
> You need to differentiate Turbo and non-Turbo. The highest P state wins? Not
> really.

*sigh* and here we go again.. someone please, write something coherent
and have all intel people sign off on it and stop saying different
things.

> Actually, the silicon supports independent non-Turbo P-states, it is just not enabled.

Then it doesn't exist, so no point in mentioning it.

> For Turbo, which Turbo point each core gets basically depends on the power
> budget of both core and gfx (because they share).

And RAPL controls can give preference of which gfx/core gets most,
right?

> > intel_pstate tries to keep the core P state as low as possible to satisfy
> > the given load, so when various cores go idle the package P state can be
> > as low as possible.  The big power win is a core going idle.
> > 
> 
> In terms of prediction, it definitely can't be 100% right. But the
> performance of most workloads does scale with P-state (frequency), though maybe
> not linearly. So it is to some extent predictable, FWIW. And this is the basic
> assumption of all governors and intel_pstate.

So frequency isn't _that_ interesting, voltage is. And while
predictability might be their assumption, is it actually true? I
mean, there's really nothing else except to assume that, if it's not you
can't do anything at all, so you _have_ to assume this.

But again, is the assumption true? Or just happy thoughts in an attempt
to do something.



* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-06  0:35                   ` Yuyang Du
@ 2014-06-06 10:50                     ` Peter Zijlstra
  2014-06-06 12:13                       ` Ingo Molnar
  2014-06-07 23:26                       ` Yuyang Du
  2014-06-06 16:27                     ` Jacob Pan
  1 sibling, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-06 10:50 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Dirk Brandewie, Rafael J. Wysocki, Morten Rasmussen,
	linux-kernel, linux-pm, mingo, vincent.guittot, daniel.lezcano,
	preeti, Dietmar Eggemann, len.brown, jacob.jun.pan


On Fri, Jun 06, 2014 at 08:35:21AM +0800, Yuyang Du wrote:

> > > Actually, the silicon supports independent non-Turbo P-states, it is just not enabled.
> > 
> > Then it doesn't exist, so no point in mentioning it.
> > 
> 
> Well, things actually get more complicated. Not-enabled is for Core. For Atom
> Baytrail, each core indeed can operate on a different frequency. I am not sure
> about Xeon, :)

Yes, I understand Atom is an entirely different thing.

> > So frequency isn't _that_ interesting, voltage is. And while
> > predictability might be their assumption, is it actually true? I
> > mean, there's really nothing else except to assume that, if it's not you
> > can't do anything at all, so you _have_ to assume this.
> > 
> > But again, is the assumption true? Or just happy thoughts in an attempt
> > to do something.
> 
> Voltage is combined with frequency: roughly, voltage is proportional
> to frequency, so, roughly, power is proportional to voltage^3. You

P ~ V^2, last time I checked.

> can't say which is more important, or there is no reason to raise
> voltage without raising frequency.

Well, some chips have far fewer voltage steps than freq steps; or,
differently put, they have multiple freq steps for a single voltage
level.

And since the power (Watts) is proportional to voltage squared, it's the
biggest term.

If you have a distinct voltage level for each freq, it all doesn't
matter.

> If I had only one word to say, true or false: it is true. Because given any
> fixed workload, I can't see why performance would be worse if
> frequency is higher.

Well, our work here is to redefine performance as performance/watt. So
running at higher frequency (and thus likely higher voltage) is a
definite performance decrease in that sense.

> The reality as opposed to the assumption is two-fold:
> 1) If the workload is CPU bound, performance scales with frequency absolutely. If the
>    workload is memory bound, it does not scale. But from the kernel, we don't know
>    whether it is CPU bound or not (or it is hard to know). uArch statistics can model that.

Well, we could know for a number of archs, it's just that these
statistics are expensive to track.

Also, lowering P-state is 'fine', as long as you can 'guarantee' you
don't lose IPC performance, since running at lower voltage for the same
IPC is actually better IPC/watt than estimated.

But what was said earlier is that P-state is a lower limit, not a higher
limit. In that case the core can run at higher voltage and the estimate
is just plain wrong.

> But still, the assumption is a must, and it does no harm, because we
> adjust frequency continuously: for example, if the workload is fixed and
> the performance does not scale with freq, we stop increasing the
> frequency. So a good frequency governor or driver can and should
> continuously pursue a "good" frequency for the changing workload.
> Therefore, in the long term, we will be better off.

Sure, but realize that we must fully understand this governor and
integrate it in the scheduler if we're to attain the goal of IPC/watt
optimized scheduling behaviour.

So you (or rather Intel in general) will have to be very explicit on how
their stuff works and can no longer hide in some driver and do magic.
The same is true for all other vendors for that matter.

If you (vendors, not Yuyang in specific) do not want to play (and be
explicit and expose how your hardware functions) then you simply will
not get power efficient scheduling full stop.

There's no rocks to hide under, no magic veils to hide behind. You tell
_in_public_ or you get nothing.



* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-06 10:50                     ` Peter Zijlstra
@ 2014-06-06 12:13                       ` Ingo Molnar
  2014-06-06 12:27                         ` Ingo Molnar
  2014-06-07 23:53                         ` Yuyang Du
  2014-06-07 23:26                       ` Yuyang Du
  1 sibling, 2 replies; 71+ messages in thread
From: Ingo Molnar @ 2014-06-06 12:13 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Dirk Brandewie, Rafael J. Wysocki, Morten Rasmussen,
	linux-kernel, linux-pm, vincent.guittot, daniel.lezcano, preeti,
	Dietmar Eggemann, len.brown, jacob.jun.pan


* Peter Zijlstra <peterz@infradead.org> wrote:

> > Voltage is combined with frequency: roughly, voltage is
> > proportional to frequency, so, roughly, power is proportional to
> > voltage^3. You
> 
> P ~ V^2, last time I checked.

Yes, that's a good approximation for CMOS gates:

  The switching power dissipated by a chip using static CMOS gates is 
  C·V^2·f, where C is the capacitance being switched per clock cycle, 
  V is the supply voltage, and f is the switching frequency,[1] so 
  this part of the power consumption decreases quadratically with 
  voltage. The formula is not exact however, as many modern chips are 
  not implemented using 100% CMOS, but also use special memory 
  circuits, dynamic logic such as domino logic, etc. Moreover, there 
  is also a static leakage current, which has become more and more 
  accentuated as feature sizes have become smaller (below 90 
  nanometres) and threshold levels lower.

  Accordingly, dynamic voltage scaling is widely used as part of 
  strategies to manage switching power consumption in battery powered 
  devices such as cell phones and laptop computers. Low voltage modes 
  are used in conjunction with lowered clock frequencies to minimize 
  power consumption associated with components such as CPUs and DSPs; 
  only when significant computational power is needed will the voltage 
  and frequency be raised.

  Some peripherals also support low voltage operational modes. For 
  example, low power MMC and SD cards can run at 1.8 V as well as at 
  3.3 V, and driver stacks may conserve power by switching to the 
  lower voltage after detecting a card which supports it.

  When leakage current is a significant factor in terms of power 
  consumption, chips are often designed so that portions of them can 
  be powered completely off. This is not usually viewed as being 
  dynamic voltage scaling, because it is not transparent to software. 
  When sections of chips can be turned off, as for example on TI OMAP3 
  processors, drivers and other support software need to support that.

  http://en.wikipedia.org/wiki/Dynamic_voltage_scaling

Leakage current typically gets higher with higher frequencies, but 
it's also highly process dependent AFAIK.

If switching power dissipation is the main factor in power use, then 
we can essentially assume that P ~ V^2, at the same frequency - and 
scales linearly with frequency - but real work performed also scales 
semi-linearly with frequency for many workloads, so that's an 
invariant for everything except highly memory bound workloads.
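
As a quick reconciliation of the V^2 and voltage^3 statements above,
assuming the idealized DVFS relation V ~ f:

  P = C * V^2 * f                (switching power, from above)
  at fixed f:         P ~ V^2    (the V^2 form)
  with V ~ f (DVFS):  P ~ f^3    (the "voltage^3" form)

  Example: raising f and V by 25% each gives
  P_new/P_old = 1.25^2 * 1.25 ~ 1.95, i.e. roughly 2x the power for at
  best 1.25x the work.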

Thanks,

	Ingo


* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-06 12:13                       ` Ingo Molnar
@ 2014-06-06 12:27                         ` Ingo Molnar
  2014-06-06 14:11                           ` Morten Rasmussen
  2014-06-07  2:33                           ` Nicolas Pitre
  2014-06-07 23:53                         ` Yuyang Du
  1 sibling, 2 replies; 71+ messages in thread
From: Ingo Molnar @ 2014-06-06 12:27 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Dirk Brandewie, Rafael J. Wysocki, Morten Rasmussen,
	linux-kernel, linux-pm, vincent.guittot, daniel.lezcano, preeti,
	Dietmar Eggemann, len.brown, jacob.jun.pan


* Ingo Molnar <mingo@kernel.org> wrote:

> * Peter Zijlstra <peterz@infradead.org> wrote:
> 
> > > Voltage is combined with frequency: roughly, voltage is
> > > proportional to frequency, so, roughly, power is proportional to
> > > voltage^3. You
> > 
> > P ~ V^2, last time I checked.
> 
> Yes, that's a good approximation for CMOS gates:
> 
>   The switching power dissipated by a chip using static CMOS gates is 
>   C·V^2·f, where C is the capacitance being switched per clock cycle, 
>   V is the supply voltage, and f is the switching frequency,[1] so 
>   this part of the power consumption decreases quadratically with 
>   voltage. The formula is not exact however, as many modern chips are 
>   not implemented using 100% CMOS, but also use special memory 
>   circuits, dynamic logic such as domino logic, etc. Moreover, there 
>   is also a static leakage current, which has become more and more 
>   accentuated as feature sizes have become smaller (below 90 
>   nanometres) and threshold levels lower.
> 
>   Accordingly, dynamic voltage scaling is widely used as part of 
>   strategies to manage switching power consumption in battery powered 
>   devices such as cell phones and laptop computers. Low voltage modes 
>   are used in conjunction with lowered clock frequencies to minimize 
>   power consumption associated with components such as CPUs and DSPs; 
>   only when significant computational power is needed will the voltage 
>   and frequency be raised.
> 
>   Some peripherals also support low voltage operational modes. For 
>   example, low power MMC and SD cards can run at 1.8 V as well as at 
>   3.3 V, and driver stacks may conserve power by switching to the 
>   lower voltage after detecting a card which supports it.
> 
>   When leakage current is a significant factor in terms of power 
>   consumption, chips are often designed so that portions of them can 
>   be powered completely off. This is not usually viewed as being 
>   dynamic voltage scaling, because it is not transparent to software. 
>   When sections of chips can be turned off, as for example on TI OMAP3 
>   processors, drivers and other support software need to support that.
> 
>   http://en.wikipedia.org/wiki/Dynamic_voltage_scaling
> 
> Leakage current typically gets higher with higher frequencies, but 
> it's also highly process dependent AFAIK.
> 
> If switching power dissipation is the main factor in power use, then 
> we can essentially assume that P ~ V^2, at the same frequency - and 
> scales linearly with frequency - but real work performed also scales 
> semi-linearly with frequency for many workloads, so that's an 
> invariant for everything except highly memory bound workloads.

So in practice this means that Turbo probably has a somewhat 
super-linear power use factor.

At lower frequencies the leakage current difference is probably 
negligible.

In any case, even with turbo frequencies, switching power use is 
probably an order of magnitude higher than leakage current power use, 
on any marketable chip, so we should concentrate on being able to 
cover this first order effect (P/work ~ V^2), before considering any 
second order effects (leakage current).

Thanks,

	Ingo


* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-04 17:27       ` Peter Zijlstra
  2014-06-04 21:56         ` Rafael J. Wysocki
@ 2014-06-06 13:03         ` Morten Rasmussen
  2014-06-07  2:52         ` Nicolas Pitre
  2 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-06 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann

On Wed, Jun 04, 2014 at 06:27:12PM +0100, Peter Zijlstra wrote:
> On Wed, Jun 04, 2014 at 05:02:30PM +0100, Morten Rasmussen wrote:
> > On Tue, Jun 03, 2014 at 12:50:15PM +0100, Peter Zijlstra wrote:
> > > On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > > > +static struct capacity_state cap_states_cluster_a7[] = {
> > > > +	/* Cluster only power */
> > > > +	 { .cap =  358, .power = 2967, }, /*  350 MHz */
> > > > +	 { .cap =  410, .power = 2792, }, /*  400 MHz */
> > > > +	 { .cap =  512, .power = 2810, }, /*  500 MHz */
> > > > +	 { .cap =  614, .power = 2815, }, /*  600 MHz */
> > > > +	 { .cap =  717, .power = 2919, }, /*  700 MHz */
> > > > +	 { .cap =  819, .power = 2847, }, /*  800 MHz */
> > > > +	 { .cap =  922, .power = 3917, }, /*  900 MHz */
> > > > +	 { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > > > +	};

[...]

> > TC2 is an early development platform and somewhat different from what
> > you find in end user products. TC2 actually uses the same voltage for
> > all states except the highest 2-3 states. That is not typical. The
> > voltage is typically slightly different for each state, however, the
> > difference gets bigger for higher P-states. We could probably get away
> > with representing multiple states as one in the energy model if the
> > voltage change is minimal.
> 
> So while I don't mind the full table, esp. if it's fairly easy to
> generate using that tool you spoke about, I just wondered if it made
> sense to somewhat reduce it.
> 
> Now that I look at the actual .power values, you can indeed see that all
> except the last two are pretty much similar in power usage.
> 
> On that, is that fluctuation measurement noise, or is that stable?

It would make sense to reduce it for this particular platform. In fact
it is questionable whether we should use frequencies below 800 MHz at
all. On TC2 the voltage is the same for 800 MHz and below and it seems that
leakage (static) power is dominating the power consumption. Since the
power is almost constant in the range 350 to 800 MHz energy-efficiency
(performance/watt ~ cap/power) is actually getting *better* as we run
faster until we get to 800 MHz. Beyond 800 MHz energy-efficiency goes
down due to increased voltages.
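
To make that concrete with the A7 cluster numbers from the patch
(cap/power as a crude performance-per-watt figure):

   350 MHz:   358/2967 ~ 0.12
   800 MHz:   819/2847 ~ 0.29	(best)
  1000 MHz:  1024/4905 ~ 0.21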

The proposed platform energy model is an extremely simplified view of
the platform. The numbers are pretty much the raw data normalized and
averaged as appropriate. I haven't tweaked them in any way to make them
look more perfect. So, the small variations (within 4%) may be
measurement noise, plus the fact that I am modelling something complex
with a simple model.


* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-04 16:16       ` Peter Zijlstra
@ 2014-06-06 13:15         ` Morten Rasmussen
  2014-06-06 13:43           ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-06 13:15 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann

On Wed, Jun 04, 2014 at 05:16:18PM +0100, Peter Zijlstra wrote:
> On Wed, Jun 04, 2014 at 04:42:27PM +0100, Morten Rasmussen wrote:
> > On Tue, Jun 03, 2014 at 12:44:28PM +0100, Peter Zijlstra wrote:
> > > On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > > > +static struct capacity_state cap_states_cluster_a7[] = {
> > > > +	/* Cluster only power */
> > > > +	 { .cap =  358, .power = 2967, }, /*  350 MHz */
> > > > +	 { .cap =  410, .power = 2792, }, /*  400 MHz */
> > > > +	 { .cap =  512, .power = 2810, }, /*  500 MHz */
> > > > +	 { .cap =  614, .power = 2815, }, /*  600 MHz */
> > > > +	 { .cap =  717, .power = 2919, }, /*  700 MHz */
> > > > +	 { .cap =  819, .power = 2847, }, /*  800 MHz */
> > > > +	 { .cap =  922, .power = 3917, }, /*  900 MHz */
> > > > +	 { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > > > +	};
> > > > +
> > > > +static struct capacity_state cap_states_cluster_a15[] = {
> > > > +	/* Cluster only power */
> > > > +	 { .cap =  840, .power =  7920, }, /*  500 MHz */
> > > > +	 { .cap = 1008, .power =  8165, }, /*  600 MHz */
> > > > +	 { .cap = 1176, .power =  8172, }, /*  700 MHz */
> > > > +	 { .cap = 1343, .power =  8195, }, /*  800 MHz */
> > > > +	 { .cap = 1511, .power =  8265, }, /*  900 MHz */
> > > > +	 { .cap = 1679, .power =  8446, }, /* 1000 MHz */
> > > > +	 { .cap = 1847, .power = 11426, }, /* 1100 MHz */
> > > > +	 { .cap = 2015, .power = 15200, }, /* 1200 MHz */
> > > > +	};
> > > 
> > > 
> > > So how did you obtain these numbers? Did you use numbers provided by the
> > > hardware people, or did you run a particular benchmark and record the
> > > power usage?
> > >
> > > Does that benchmark do some actual work (as opposed to a while(1) loop)
> > > to keep more silicon lit up?
> > 
> > Hardware people don't like sharing data, so I did my own measurements
> > and calculations to get the numbers above.
> > 
> > ARM TC2 has on-chip energy counters for counting energy consumed by the
> > A7 and A15 clusters. They are fairly accurate. 
> 
> Recent Intel chips have that too; they come packaged as:
> 
>   perf stat -a -e "power/energy-cores/" -- cmd
> 
> (through the perf_event_intel_rapl.c driver), It would be ideal if the
> ARM equivalent was available through a similar interface.
> 
> http://lwn.net/Articles/573602/

Nice. On ARM it is not mandatory to have energy counters, and what they
actually measure, if they are implemented, is implementation dependent.
However, each vendor does extensive evaluation and characterization of
their implementation already, so I don't think it would be a problem for
them to provide the numbers.

> > I used sysbench cpu
> > benchmark as test workload for the above numbers. sysbench might not be
> > a representative workload, but it is easy to use. I think, ideally,
> > vendors would run their own mix of workloads they care about and derive
> > their numbers for their platform based on that.
> > 
> > > If you have a setup for measuring these, should we try and publish that
> > > too so that people can run it on their platform and provide these
> > > numbers?
> > 
> > The workload setup I used is quite simple. I ran sysbench with taskset with
> > different numbers of threads to extrapolate power consumed by each
> > individual cpu and how much comes from just powering on the domain.
> > 
> > Measuring the actual power is very platform specific. Developing a fully
> > automated tool to do it for any given platform isn't straightforward, but
> > I'm happy to share how I did it. I can add a description of the method I
> > used on TC2 to the documentation so others can use it as reference.
> 
> That would be good I think, esp. if we can get similar perf based energy
> measurement things sorted. And if we make the tool consume the machine
> topology present in sysfs we can get a long way towards automating this
> I think.

Some of the measurements could be automated. Others are hard to
automate as they require extensive knowledge about the platform. Wakeup
energy, for example. You may need to do various tricks and hacks to
force the platform to use a specific idle-state so you know what you are
measuring.

I will add the TC2 recipe as a start and then see if my ugly scripts can
be turned into something generally useful.
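
For reference, the kind of run I mean is along the lines of the below
(illustrative sysbench 0.4-style flags, not the exact commands), stepping
up the number of pinned threads and subtracting successive measurements to
separate per-cpu power from the fixed cost of having the domain up:

  taskset -c 0 sysbench --test=cpu --num-threads=1 run
  taskset -c 0,1 sysbench --test=cpu --num-threads=2 run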


* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-06 13:15         ` Morten Rasmussen
@ 2014-06-06 13:43           ` Peter Zijlstra
  2014-06-06 14:29             ` Morten Rasmussen
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-06 13:43 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann

On Fri, Jun 06, 2014 at 02:15:10PM +0100, Morten Rasmussen wrote:
> > > ARM TC2 has on-chip energy counters for counting energy consumed by the
> > > A7 and A15 clusters. They are fairly accurate. 
> > 
> > Recent Intel chips have that too; they come packaged as:
> > 
> >   perf stat -a -e "power/energy-cores/" -- cmd
> > 
> > (through the perf_event_intel_rapl.c driver), It would be ideal if the
> > ARM equivalent was available through a similar interface.
> > 
> > http://lwn.net/Articles/573602/
> 
> > Nice. On ARM it is not mandatory to have energy counters, and what they
> > actually measure, if they are implemented, is implementation dependent.
> > However, each vendor does extensive evaluation and characterization of
> > their implementation already, so I don't think it would be a problem for
> > them to provide the numbers.

How is the ARM energy thing exposed? Through the regular PMU but with
vendor specific events, or through a separate interface, entirely vendor
specific?

In any case, would it be at all possible to nudge them to provide a
'driver' for this so that they can be more easily used?

> Some of the measurements could be automated. Others are hard to
> automate as they require extensive knowledge about the platform. Wakeup
> energy, for example. You may need to do various tricks and hacks to
> force the platform to use a specific idle-state so you know what you are
> measuring.
> 
> I will add the TC2 recipe as a start and then see if my ugly scripts can
> be turned into something generally useful.

Fair enough; I would prefer to have a situation where 'we' can validate
whatever magic numbers the vendors provide for their hardware, or can
generate numbers for hardware where the vendor is not interested.

But yes, publishing your hacks is a good first step at getting such a
thing going; if we then further require everybody to use this 'tool' and
improve it if not suitable, we might end up with something useful ;-)


* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-06 12:27                         ` Ingo Molnar
@ 2014-06-06 14:11                           ` Morten Rasmussen
  2014-06-07  2:33                           ` Nicolas Pitre
  1 sibling, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-06 14:11 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Yuyang Du, Dirk Brandewie, Rafael J. Wysocki,
	linux-kernel, linux-pm, vincent.guittot, daniel.lezcano, preeti,
	Dietmar Eggemann, len.brown, jacob.jun.pan

On Fri, Jun 06, 2014 at 01:27:40PM +0100, Ingo Molnar wrote:
> 
> * Ingo Molnar <mingo@kernel.org> wrote:
> 
> > * Peter Zijlstra <peterz@infradead.org> wrote:
> > 
> > > > Voltage is combined with frequency, roughly, voltage is 
> > > > proportional to freuquecy, so roughly, power is proportionaly to 
> > > > voltage^3. You
> > > 
> > > P ~ V^2, last time I checked.
> > 
> > Yes, that's a good approximation for CMOS gates:
> > 
> >   The switching power dissipated by a chip using static CMOS gates is 
> >   C·V^2·f, where C is the capacitance being switched per clock cycle, 
> >   V is the supply voltage, and f is the switching frequency,[1] so 
> >   this part of the power consumption decreases quadratically with 
> >   voltage. The formula is not exact however, as many modern chips are 
> >   not implemented using 100% CMOS, but also use special memory 
> >   circuits, dynamic logic such as domino logic, etc. Moreover, there 
> >   is also a static leakage current, which has become more and more 
> >   accentuated as feature sizes have become smaller (below 90 
> >   nanometres) and threshold levels lower.
> > 
> >   Accordingly, dynamic voltage scaling is widely used as part of 
> >   strategies to manage switching power consumption in battery powered 
> >   devices such as cell phones and laptop computers. Low voltage modes 
> >   are used in conjunction with lowered clock frequencies to minimize 
> >   power consumption associated with components such as CPUs and DSPs; 
> >   only when significant computational power is needed will the voltage 
> >   and frequency be raised.
> > 
> >   Some peripherals also support low voltage operational modes. For 
> >   example, low power MMC and SD cards can run at 1.8 V as well as at 
> >   3.3 V, and driver stacks may conserve power by switching to the 
> >   lower voltage after detecting a card which supports it.
> > 
> >   When leakage current is a significant factor in terms of power 
> >   consumption, chips are often designed so that portions of them can 
> >   be powered completely off. This is not usually viewed as being 
> >   dynamic voltage scaling, because it is not transparent to software. 
> >   When sections of chips can be turned off, as for example on TI OMAP3 
> >   processors, drivers and other support software need to support that.
> > 
> >   http://en.wikipedia.org/wiki/Dynamic_voltage_scaling
> > 
> > Leakage current typically gets higher with higher frequencies, but 
> > it's also highly process dependent AFAIK.

Strictly speaking leakage current gets higher with voltage, not
frequency (well, not to an extent where we should care). However,
frequency increase typically implies a voltage increase, so in that
sense I agree.

> > 
> > If switching power dissipation is the main factor in power use, then 
> > we can essentially assume that P ~ V^2, at the same frequency - and 
> > scales linearly with frequency - but real work performed also scales 
> > semi-linearly with frequency for many workloads, so that's an 
> > invariant for everything except highly memory bound workloads.

AFAIK, there isn't much sense in running a slower frequency than the
highest one supported at a given voltage unless there are specific
reasons not to (peripherals that keep the system up anyway and such).
In the general case, I think it is safe to assume that energy-efficiency
goes down for every increase in frequency. Modern ARM platforms
typically have different voltages for more or less all frequencies (TC2
is quite atypical). The voltage increases more rapidly than the
frequency which makes the higher frequencies extremely expensive in
terms of energy-efficiency.

All of this is of course without considering power gating, which allows us
to eliminate the leakage power (or at least partially eliminate it)
when idle. So, while energy-efficiency is bad at high frequencies, it
might pay off overall to use them anyway if we can save more leakage
energy while idle than we burn extra to race to idle. This is where the
platform energy model becomes useful.
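
To make that concrete, here is a minimal back-of-the-envelope sketch (not
from the patch set; the helper and all names are illustrative) of how a
platform energy model lets you compare finishing a fixed amount of work
at two different P-states over one activity period:

/*
 * Illustrative only: 'work' is the busy time in ms the job would need
 * at max capacity (1024); busy power is in mW, so the result is in uJ.
 * The cap/power pairs mirror the style of the TC2 tables in this thread.
 */
struct pstate_cost {
	unsigned long cap;		/* compute capacity, 1024 = max */
	unsigned long busy_power;	/* busy power at this P-state, mW */
};

static unsigned long period_energy(const struct pstate_cost *ps,
				   unsigned long work, unsigned long period,
				   unsigned long idle_power)
{
	unsigned long busy_time = work * 1024 / ps->cap;

	if (busy_time > period)
		busy_time = period;	/* work doesn't fit; saturate */

	return ps->busy_power * busy_time +
	       idle_power * (period - busy_time);
}

Whichever P-state yields the lower period_energy() wins: with effective
power gating idle_power is small and racing to idle can pay off despite
the worse energy-efficiency while busy, while with shallow idle states
the slower, more efficient P-state tends to win.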

> So in practice this means that Turbo probably has a somewhat 
> super-linear power use factor.

I'm not familiar with the voltage scaling on Intel platforms, but as
said above, I think power always scales up faster than performance. It
can probably be ignored for lower frequencies, but for the higher ones,
the extra energy per instruction executed is significant.

> At lower frequencies the leakage current difference is probably 
> negligible.

It is still there, but it is smaller due to the reduced voltage and so
is the dynamic power.

> In any case, even with turbo frequencies, switching power use is 
> probably an order of magnitude higher than leakage current power use, 
> on any marketable chip, 

That strongly depends on the process and the gate library used, but I
agree that dynamic power should be our primary focus.

> so we should concentrate on being able to 
> cover this first order effect (P/work ~ V^2), before considering any 
> second order effects (leakage current).

I think we should be fine as long as we include the leakage power in the
'busy' power consumption and know the idle-state power consumption in
the idle-states. I already do this in the TC2 model. That way we don't
have to distinguish between leakage and dynamic power.

Morten

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-06 13:43           ` Peter Zijlstra
@ 2014-06-06 14:29             ` Morten Rasmussen
  2014-06-12 15:05               ` Vince Weaver
  0 siblings, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-06 14:29 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, linux-pm, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann

On Fri, Jun 06, 2014 at 02:43:03PM +0100, Peter Zijlstra wrote:
> On Fri, Jun 06, 2014 at 02:15:10PM +0100, Morten Rasmussen wrote:
> > > > ARM TC2 has on-chip energy counters for counting energy consumed by the
> > > > A7 and A15 clusters. They are fairly accurate. 
> > > 
> > > Recent Intel chips have that too; they come packaged as:
> > > 
> > >   perf stat -a -e "power/energy-cores/" -- cmd
> > > 
> > > (through the perf_event_intel_rapl.c driver), It would be ideal if the
> > > ARM equivalent was available through a similar interface.
> > > 
> > > http://lwn.net/Articles/573602/
> > 
> > Nice. On ARM it is not mandatory to have energy counters, and what they
> > actually measure, if they are implemented, is implementation dependent.
> > However, each vendor does extensive evaluation and characterization of
> > their implementation already, so I don't think it would be a problem for
> > them to provide the numbers.
> 
> How is the ARM energy thing exposed? Through the regular PMU but with
> vendor specific events, or through a separate interface, entirely vendor
> specific?

There is an upstream hwmon driver for TC2 already with an easy to use
sysfs interface for all the energy counters. So it is somewhat vendor
specific at the moment unfortunately.
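
For illustration, sampling one of those counters through the generic hwmon
sysfs ABI could look like the sketch below. The exact hwmon node and
channel for TC2 are assumptions here, but energyN_input is defined by the
hwmon sysfs ABI to report accumulated energy in microjoules:

#include <stdio.h>

/* read an accumulated energy value (uJ) from a hwmon attribute */
static long long read_energy_uj(const char *path)
{
	long long uj = -1;
	FILE *f = fopen(path, "r");

	if (!f)
		return -1;
	if (fscanf(f, "%lld", &uj) != 1)
		uj = -1;
	fclose(f);
	return uj;
}

int main(void)
{
	/* hypothetical node; find the real one under /sys/class/hwmon */
	long long uj = read_energy_uj("/sys/class/hwmon/hwmon0/energy1_input");

	if (uj >= 0)
		printf("cluster energy: %lld uJ\n", uj);
	return 0;
}

Sampling before and after a benchmark run and taking the difference gives
the energy consumed by that run.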

> In any case, would it be at all possible to nudge them to provide a
> 'driver' for this so that they can be more easily used?

I have raised internally that unification on this front is needed. 

> > Some of the measurements could be automated. Others are hard to
> > automate as they require extensive knowledge about the platform. Wakeup
> > energy, for example. You may need to do various tricks and hacks to
> > force the platform to use a specific idle-state so you know what you are
> > measuring.
> > 
> > I will add the TC2 recipe as a start and then see if my ugly scripts can
> > be turned into something generally useful.
> 
> Fair enough; I would prefer to have a situation where 'we' can validate
> whatever magic numbers the vendors provide for their hardware, or can
> generate numbers for hardware where the vendor is not interested.
> 
> But yes, publishing your hacks is a good first step at getting such a
> thing going, if we then further require everybody to use this 'tool' and
> improve it if not suitable, we might end up with something useful ;-)

Fair plan ;-)

That said, vendors may want to provide slightly different numbers if
they do characterization based on workloads they care about rather than
sysbench or whatever 'we' end up using. The numbers will vary depending
on which workload(s) you use.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-06  0:35                   ` Yuyang Du
  2014-06-06 10:50                     ` Peter Zijlstra
@ 2014-06-06 16:27                     ` Jacob Pan
  1 sibling, 0 replies; 71+ messages in thread
From: Jacob Pan @ 2014-06-06 16:27 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Dirk Brandewie, Rafael J. Wysocki,
	Morten Rasmussen, linux-kernel, linux-pm, mingo, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann, len.brown

On Fri, 6 Jun 2014 08:35:21 +0800
Yuyang Du <yuyang.du@intel.com> wrote:

> On Fri, Jun 06, 2014 at 10:05:43AM +0200, Peter Zijlstra wrote:
> > On Fri, Jun 06, 2014 at 04:29:30AM +0800, Yuyang Du wrote:
> > > On Thu, Jun 05, 2014 at 08:03:15AM -0700, Dirk Brandewie wrote:
> > > > 
> > > > You can request a P state per core but the package does
> > > > coordination at a package level for the P state that will be
> > > > used based on all requests. This is due to the fact that most
> > > > SKUs have a single VR and PLL. So the highest P state wins.
> > > > When a core goes idle it loses its vote for the current
> > > > package P state and that core's clock is turned off.
> > > > 
> > > 
> > > You need to differentiate Turbo and non-Turbo. The highest P
> > > state wins? Not really.
> > 
> > *sigh* and here we go again.. someone please, write something
> > coherent and have all intel people sign off on it and stop saying
> > different things.
> > 
> > > Actually, silicon supports independent non-Turbo P-states, they are
> > > just not enabled.
> > 
> > Then it doesn't exist, so no point in mentioning it.
> > 
> 
> Well, things actually get more complicated. Not-enabled is for Core.
> For Atom Baytrail, each core indeed can operate at a different
> frequency. I am not sure about Xeon, :)
> 
> > > For Turbo, it basically depends on power budget of both core and
> > > gfx (because they share) for each core to get which Turbo point.
> > 
> > And RAPL controls can give preference of which gfx/core gets most,
> > right?
> >
> 
There are two controls that can influence gfx and core power budget
sharing (a sketch of #1 via sysfs follows below):
1. set a power limit on each RAPL domain
2. turbo power budget sharing
#2 is not implemented yet; the default is that the CPU takes all of it.
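
For #1, here is a hedged sketch of setting such a limit through the
powercap sysfs interface exposed by the RAPL driver; the exact domain
path varies per system and is an assumption here:

#include <stdio.h>

int main(void)
{
	/* example powercap domain path; verify on the target system */
	const char *limit =
		"/sys/class/powercap/intel-rapl/intel-rapl:0/constraint_0_power_limit_uw";
	FILE *f = fopen(limit, "w");

	if (!f)
		return 1;
	fprintf(f, "%d", 15000000);	/* 15 W, expressed in microwatts */
	fclose(f);
	return 0;
}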

> 
> > > > intel_pstate tries to keep the core P state as low as possible
> > > > to satisfy the given load, so when various cores go idle the
> > > > package P state can be as low as possible.  The big power win
> > > > is a core going idle.
> > > > 
> > > 
> > > In terms of prediction, it definitely can't be 100% right. But
> > > the performance of most workloads does scale with pstate
> > > (frequency), may not be linearly. So it is to some point
> > > predictable FWIW. And this is all governors and Intel_pstate's
> > > basic assumption.
> > 
> > So frequency isn't _that_ interesting, voltage is. And while
> > predictability it might be their assumption, is it actually true? I
> > mean, there's really nothing else except to assume that, if its not
> > you can't do anything at all, so you _have_ to assume this.
> > 
> > But again, is the assumption true? Or just happy thoughts in an
> > attempt to do something.
> 
> Voltage is combined with frequency, roughly, voltage is proportional
> to frequency, so roughly, power is proportional to voltage^3. You
> can't say which is more important, or there is no reason to raise
> voltage without raising frequency.
> 
> If only one word to say, true or false: it is true. Because given any
> fixed workload, I can't see why performance would be worse if
> frequency is higher.
> 
> The reality, as opposed to the assumption, is two-fold:
> 1) if the workload is CPU bound, performance scales with frequency
> absolutely; if the workload is memory bound, it does not scale. But from
> the kernel, we don't know whether it is CPU bound or not (or it is hard
> to know). uArch statistics can model that. 2) the workload is not
> fixed in real time; it is changing all the time.
> 
> But still, the assumption is a must (and does no harm), because we adjust
> frequency continuously; for example, if the workload is fixed and
> the performance does not scale with freq, we stop increasing the
> frequency. So a good frequency governor or driver should and can
> continuously pursue a "good" frequency for the changing workload.
> Therefore, in the long term, we will be better off.
> 

[Jacob Pan]

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-06 12:27                         ` Ingo Molnar
  2014-06-06 14:11                           ` Morten Rasmussen
@ 2014-06-07  2:33                           ` Nicolas Pitre
  2014-06-09  8:27                             ` Morten Rasmussen
  1 sibling, 1 reply; 71+ messages in thread
From: Nicolas Pitre @ 2014-06-07  2:33 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Yuyang Du, Dirk Brandewie, Rafael J. Wysocki,
	Morten Rasmussen, linux-kernel, linux-pm, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann, len.brown,
	jacob.jun.pan

On Fri, 6 Jun 2014, Ingo Molnar wrote:

> In any case, even with turbo frequencies, switching power use is 
> probably an order of magnitude higher than leakage current power use, 
> on any marketable chip, so we should concentrate on being able to 
> cover this first order effect (P/work ~ V^2), before considering any 
> second order effects (leakage current).

Just so that people are aware... We'll have to introduce thermal 
constraint management into the scheduler mix as well at some point.  
Right now what we have is an ad hoc subsystem that simply monitors 
temperature and applies crude cooling strategies when some thresholds are 
met. But a better strategy would imply thermal "provisioning".


Nicolas

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-04 17:27       ` Peter Zijlstra
  2014-06-04 21:56         ` Rafael J. Wysocki
  2014-06-06 13:03         ` Morten Rasmussen
@ 2014-06-07  2:52         ` Nicolas Pitre
  2 siblings, 0 replies; 71+ messages in thread
From: Nicolas Pitre @ 2014-06-07  2:52 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, linux-kernel, linux-pm, mingo, rjw,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann

On Wed, 4 Jun 2014, Peter Zijlstra wrote:

> On Wed, Jun 04, 2014 at 05:02:30PM +0100, Morten Rasmussen wrote:
> > On Tue, Jun 03, 2014 at 12:50:15PM +0100, Peter Zijlstra wrote:
> > > On Fri, May 23, 2014 at 07:16:33PM +0100, Morten Rasmussen wrote:
> > > > +static struct capacity_state cap_states_cluster_a7[] = {
> > > > +	/* Cluster only power */
> > > > +	 { .cap =  358, .power = 2967, }, /*  350 MHz */
> > > > +	 { .cap =  410, .power = 2792, }, /*  400 MHz */
> > > > +	 { .cap =  512, .power = 2810, }, /*  500 MHz */
> > > > +	 { .cap =  614, .power = 2815, }, /*  600 MHz */
> > > > +	 { .cap =  717, .power = 2919, }, /*  700 MHz */
> > > > +	 { .cap =  819, .power = 2847, }, /*  800 MHz */
> > > > +	 { .cap =  922, .power = 3917, }, /*  900 MHz */
> > > > +	 { .cap = 1024, .power = 4905, }, /* 1000 MHz */
> > > > +	};
> > > 
> > > So one thing I remember was that we spoke about restricting this to
> > > frequency levels where the voltage changed.
> > > 
> > > Because voltage jumps were the biggest factor in energy usage.
> > > 
> > > Any word on that?
> > 
> > Since we don't drive P-state changes from the scheduler, I think we
> > could leave out P-states from the table without too much trouble. Good
> > point.
> 
> Well, we eventually want to go there I think.

People within Linaro have initial code for this.  Should be posted as an 
RFC soon.

> Although we still needed
> to come up with something for Intel, because I'm not at all sure how all
> that works.

Our initial code reuses whatever existing platform specific cpufreq 
drivers.  The idea is to bypass the cpufreq governors.

If Intel hardware doesn't provide/allow much control here then the 
platform driver should already tell the cpufreq core (and by extension 
the scheduler) about the extent of what can be done.


Nicolas

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-06 10:50                     ` Peter Zijlstra
  2014-06-06 12:13                       ` Ingo Molnar
@ 2014-06-07 23:26                       ` Yuyang Du
  2014-06-09  8:59                         ` Morten Rasmussen
  2014-06-10 10:16                         ` Peter Zijlstra
  1 sibling, 2 replies; 71+ messages in thread
From: Yuyang Du @ 2014-06-07 23:26 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dirk Brandewie, Rafael J. Wysocki, Morten Rasmussen,
	linux-kernel, linux-pm, mingo, vincent.guittot, daniel.lezcano,
	preeti, Dietmar Eggemann, len.brown, jacob.jun.pan

On Fri, Jun 06, 2014 at 12:50:36PM +0200, Peter Zijlstra wrote:
> > Voltage is combined with frequency, roughly, voltage is proportional
> > to frequency, so roughly, power is proportional to voltage^3. You
> 
> P ~ V^2, last time I checked.
> 
> > can't say which is more important, or there is no reason to raise
> > voltage without raising frequency.
> 
> Well, some chips have far fewer voltage steps than freq steps; or,
> differently put, they have multiple freq steps for a single voltage
> level.
> 
> And since the power (Watts) is proportional to Voltage squared, it's the
> biggest term.
> 
> If you have a distinct voltage level for each freq, it all doesn't
> matter.
> 

Ok. I think we understand each other. But one more thing, I said P ~ V^3,
because P ~ V^2*f and f ~ V, so P ~ V^3. Maybe some frequencies share the same
voltage, but you can still safely assume V changes with f in general, and it
will be more and more so, since we do need finer control over power consumption.
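
For reference, spelling the cubic claim out in the notation used elsewhere
in this thread, and assuming f ~ V holds:

  P_dyn = C * V^2 * f,  and with  f ~ V:  P_dyn ~ V^3 ~ f^3

so, under that assumption, doubling the frequency (and hence the voltage)
costs roughly 8x the dynamic power for only 2x the work.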

> Sure, but realize that we must fully understand this governor and
> integrate it in the scheduler if we're to attain the goal of IPC/watt
> optimized scheduling behaviour.
> 

Attain the goal of IPC/watt optimized?

I don't see how it can be done like this. As I said, what is unknown for
prediction is perf scaling *and* changing workload. So the challenge for pstate
control is in both. But I see more challenge in the changing workload than
in the performance scaling or the resulting IPC impact (if workload is
fixed).

Currently, all freq governors take CPU utilization (load%) as the indicator
(target), which can serve both: workload and perf scaling.

As for IPC/watt optimized, I don't see how it can be practical. Too micro to
be used for the general well-being?

> So you (or rather Intel in general) will have to be very explicit on how
> their stuff works and can no longer hide in some driver and do magic.
> The same is true for all other vendors for that matter.
> 
> If you (vendors, not Yuyang in specific) do not want to play (and be
> explicit and expose how your hardware functions) then you simply will
> not get power efficient scheduling full stop.
> 
> There's no rocks to hide under, no magic veils to hide behind. You tell
> _in_public_ or you get nothing.

Better communication is good, especially for our rapidly iterating
products, because the frequent changes do introduce noise and
inconsistency in the details.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-06 12:13                       ` Ingo Molnar
  2014-06-06 12:27                         ` Ingo Molnar
@ 2014-06-07 23:53                         ` Yuyang Du
  1 sibling, 0 replies; 71+ messages in thread
From: Yuyang Du @ 2014-06-07 23:53 UTC (permalink / raw)
  To: Ingo Molnar
  Cc: Peter Zijlstra, Dirk Brandewie, Rafael J. Wysocki,
	Morten Rasmussen, linux-kernel, linux-pm, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann, len.brown,
	jacob.jun.pan

On Fri, Jun 06, 2014 at 02:13:05PM +0200, Ingo Molnar wrote:
> 
> Leakage current typically gets higher with higher frequencies, but 
> it's also highly process dependent AFAIK.
> 

In general, you can assume leakage power ~ V^2.

> If switching power dissipation is the main factor in power use, then 
> we can essentially assume that P ~ V^2, at the same frequency - and 
> scales linearly with frequency - but real work performed also scales 
> semi-linearly with frequency for many workloads, so that's an 
> invariant for everything except highly memory bound workloads.
> 

Agreed. Strictly, Energy ~ V^2.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY
  2014-05-23 18:16 ` [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY Morten Rasmussen
@ 2014-06-08  6:03   ` Henrik Austad
  2014-06-09 10:20     ` Morten Rasmussen
  0 siblings, 1 reply; 71+ messages in thread
From: Henrik Austad @ 2014-06-08  6:03 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: linux-kernel, linux-pm, peterz, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, dietmar.eggemann

On Fri, May 23, 2014 at 07:16:29PM +0100, Morten Rasmussen wrote:
> The Energy-aware scheduler implementation is guarded by
> CONFIG_SCHED_ENERGY.
> 
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> ---
>  arch/arm/Kconfig |    5 +++++
>  1 file changed, 5 insertions(+)
> 
> diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> index ab438cb..bfc3a85 100644
> --- a/arch/arm/Kconfig
> +++ b/arch/arm/Kconfig

Is this going to be duplicated for each architecture enabling this? Why
not make a kernel/Kconfig.energy and link to that from those
architectures using it?

> @@ -1926,6 +1926,11 @@ config XEN
>  	help
>  	  Say Y if you want to run Linux in a Virtual Machine on Xen on ARM.
>  
> +config SCHED_ENERGY
> +	bool "Energy-aware scheduling (EXPERIMENTAL)"
> +	help
> +	  Highly experimental energy aware task scheduling.
> +

how about adding *slightly* more info here? :) (yes, yes, I know it's an RFC)

"""
Highly experimental energy aware task scheduling.

This will allow the kernel to keep track of energy required for
different capacity levels for a given CPU. That way, the scheduler can
make more informed decisions as to where a newly woken task should be
placed. Heterogeneous platforms will benefit the most from this option.

Enabling this will add a significant overhead for a task-switch.

If unsure, say N here.
"""

>  endmenu
>  
>  menu "Boot options"
> -- 
> 1.7.9.5
> 
> 

-- 
Henrik Austad

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-09  8:59                         ` Morten Rasmussen
@ 2014-06-09  2:15                           ` Yuyang Du
  0 siblings, 0 replies; 71+ messages in thread
From: Yuyang Du @ 2014-06-09  2:15 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Dirk Brandewie, Rafael J. Wysocki, linux-kernel,
	linux-pm, mingo, vincent.guittot, daniel.lezcano, preeti,
	Dietmar Eggemann, len.brown, jacob.jun.pan

On Mon, Jun 09, 2014 at 09:59:52AM +0100, Morten Rasmussen wrote:
> IMHO, the per-entity load-tracking does a fair job representing the task
> compute capacity requirements. Sure it isn't perfect, particularly not
> for memory bound tasks, but it is way better than not having any task
> history at all, which was the case before.
> 
> The story is more or less the same for performance scaling. It is not
> taken into account at all in the scheduler at the moment. cpufreq is
> actually messing up load-balancing decisions after task load-tracking
> was introduced. Adding performance scaling awareness should only make
> things better even if predictions are not accurate for all workloads. I
> don't see why it shouldn't given the current state of energy-awareness
> in the scheduler.
> 

Optimized IPC is good for sure (with regard to P-state adjustment). My point is
how practical it is to correctly correlate it with scheduler and P-state
power-efficiency. Put another way, with a fixed workload, you really can do such
a thing by running the workload offline several times to arrive at a very
power-efficient solution which takes IPC into account. Actually, lots of
people have done that in papers/reports (for SPECXXX or TPC-X, for example). But
I can't see how the same can be done online with a live workload.

> > Currently, all freq governors take CPU utilization (load%) as the indicator
> > (target), which can serve both: workload and perf scaling.
> 
> With a bunch of hacks on top to make it more reactive because the
> current cpu utilization metric is not responsive enough to deal with
> workload changes. That is at least the case for ondemand and interactive
> (in Android).
> 

In what way is it not responsive enough? And how is it related here?

Thanks,
Yuyang

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-07  2:33                           ` Nicolas Pitre
@ 2014-06-09  8:27                             ` Morten Rasmussen
  2014-06-09 13:22                               ` Nicolas Pitre
  0 siblings, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-09  8:27 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Ingo Molnar, Peter Zijlstra, Yuyang Du, Dirk Brandewie,
	Rafael J. Wysocki, linux-kernel, linux-pm, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann, len.brown,
	jacob.jun.pan

On Sat, Jun 07, 2014 at 03:33:58AM +0100, Nicolas Pitre wrote:
> On Fri, 6 Jun 2014, Ingo Molnar wrote:
> 
> > In any case, even with turbo frequencies, switching power use is 
> > probably an order of magnitude higher than leakage current power use, 
> > on any marketable chip, so we should concentrate on being able to 
> > cover this first order effect (P/work ~ V^2), before considering any 
> > second order effects (leakage current).
> 
> Just so that people are aware... We'll have to introduce thermal 
> constraint management into the scheduler mix as well at some point.  
> Right now what we have is an ad hoc subsystem that simply monitors 
> temperature and apply crude cooling strategies when some thresholds are 
> met. But a better strategy would imply thermal "provisioning".

There is already work going on to improve thermal management:

http://lwn.net/Articles/599598/

The proposal is based on power/energy models (too). The goal is to
allocate power intelligently based on performance requirements.

While it is related to energy-aware scheduling and I fully agree that it
is something we need to consider, I think it is worth developing the two
ideas in parallel and looking at sharing things like the power model later
once things mature. Energy-aware scheduling is complex enough on its
own to keep us entertained for a while :-)

Morten

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-07 23:26                       ` Yuyang Du
@ 2014-06-09  8:59                         ` Morten Rasmussen
  2014-06-09  2:15                           ` Yuyang Du
  2014-06-10 10:16                         ` Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-09  8:59 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Peter Zijlstra, Dirk Brandewie, Rafael J. Wysocki, linux-kernel,
	linux-pm, mingo, vincent.guittot, daniel.lezcano, preeti,
	Dietmar Eggemann, len.brown, jacob.jun.pan

On Sun, Jun 08, 2014 at 12:26:29AM +0100, Yuyang Du wrote:
> On Fri, Jun 06, 2014 at 12:50:36PM +0200, Peter Zijlstra wrote:
> > > Voltage is combined with frequency, roughly, voltage is proportional
> > > to frequency, so roughly, power is proportional to voltage^3. You
> > 
> > P ~ V^2, last time I checked.
> > 
> > > can't say which is more important, or there is no reason to raise
> > > voltage without raising frequency.
> > 
> > Well, some chips have far fewer voltage steps than freq steps; or,
> > differently put, they have multiple freq steps for a single voltage
> > level.
> > 
> > And since the power (Watts) is proportional to Voltage squared, it's the
> > biggest term.
> > 
> > If you have a distinct voltage level for each freq, it all doesn't
> > matter.
> > 
> 
> Ok. I think we understand each other. But one more thing, I said P ~ V^3,
> because P ~ V^2*f and f ~ V, so P ~ V^3. Maybe some frequencies share the same
> voltage, but you can still safely assume V changes with f in general, and it
> will be more and more so, since we do need finer control over power consumption.

Agreed. Voltage typically changes with frequency.

> 
> > Sure, but realize that we must fully understand this governor and
> > integrate it in the scheduler if we're to attain the goal of IPC/watt
> > optimized scheduling behaviour.
> > 
> 
> Attain the goal of IPC/watt optimized?
> 
> I don't see how it can be done like this. As I said, what is unknown for
> prediction is perf scaling *and* changing workload. So the challenge for pstate
> control is in both. But I see more challenge in the changing workload than
> in the performance scaling or the resulting IPC impact (if workload is
> fixed).

IMHO, the per-entity load-tracking does a fair job representing the task
compute capacity requirements. Sure it isn't perfect, particularly not
for memory bound tasks, but it is way better than not having any task
history at all, which was the case before.

The story is more or less the same for performance scaling. It is not
taken into account at all in the scheduler at the moment. cpufreq is
actually messing up load-balancing decisions after task load-tracking
was introduced. Adding performance scaling awareness should only make
things better even if predictions are not accurate for all workloads. I
don't see why it shouldn't given the current state of energy-awareness
in the scheduler.

> Currently, all freq governors take CPU utilization (load%) as the indicator
> (target), which can serve both: workload and perf scaling.

With a bunch of hacks on top to make it more reactive because the
current cpu utilization metric is not responsive enough to deal with
workload changes. That is at least the case for ondemand and interactive
(in Android).

> As for IPC/watt optimized, I don't see how it can be practical. Too micro to
> be used for the general well-being?

That is why I propose to have a platform specific energy model. You tell
the scheduler enough about your platform that it understands the most
basic power/performance trade-offs of your platform, thereby enabling
the scheduler to make better decisions.
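
As a sketch of the kind of per-group platform data meant here (cap/power
are the fields from the TC2 tables quoted earlier in this thread; the
idle-state and wakeup parts are illustrative, not the RFC's exact
structures):

struct capacity_state {
	unsigned long cap;	/* compute capacity at this P-state */
	unsigned long power;	/* busy power, leakage included */
};

struct idle_state_cost {
	unsigned long power;	/* power while in this idle state */
};

struct energy_model_example {
	int nr_cap_states;
	struct capacity_state *cap_states;
	int nr_idle_states;
	struct idle_state_cost *idle_states;
	unsigned long wakeup_energy;	/* cost of one idle-state exit */
};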

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY
  2014-06-08  6:03   ` Henrik Austad
@ 2014-06-09 10:20     ` Morten Rasmussen
  2014-06-10  9:39       ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-09 10:20 UTC (permalink / raw)
  To: Henrik Austad
  Cc: linux-kernel, linux-pm, peterz, mingo, rjw, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann

On Sun, Jun 08, 2014 at 07:03:16AM +0100, Henrik Austad wrote:
> On Fri, May 23, 2014 at 07:16:29PM +0100, Morten Rasmussen wrote:
> > The Energy-aware scheduler implementation is guarded by
> > CONFIG_SCHED_ENERGY.
> > 
> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > ---
> >  arch/arm/Kconfig |    5 +++++
> >  1 file changed, 5 insertions(+)
> > 
> > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > index ab438cb..bfc3a85 100644
> > --- a/arch/arm/Kconfig
> > +++ b/arch/arm/Kconfig
> 
> Is this going to be duplicated for each architecture enabling this? Why
> not make a kernel/Kconfig.energy and link to that from those
> architectures using it?

kernel/Kconfig.energy is better I think.

> 
> > @@ -1926,6 +1926,11 @@ config XEN
> >  	help
> >  	  Say Y if you want to run Linux in a Virtual Machine on Xen on ARM.
> >  
> > +config SCHED_ENERGY
> > +	bool "Energy-aware scheduling (EXPERIMENTAL)"
> > +	help
> > +	  Highly experimental energy aware task scheduling.
> > +
> 
> how about adding *slightly* more info here? :) (yes, yes, I know it's an RFC)

Fair point.

> 
> """
> Highly experimental energy aware task scheduling.
> 
> This will allow the kernel to keep track of energy required for
> different capacity levels for a given CPU. That way, the scheduler can
> make more informed decisions as to where a newly woken task should be
> placed. Heterogeneous platforms will benefit the most from this option.

Platforms with hierarchical power domains (for example, having the ability
to power off groups of cpus and their caches) should see some benefit as
well.

> Enabling this will add a significant overhead for a task-switch.

The overhead is at task wakeup; task switch (as in task preemption)
should not be affected.

Thanks for the text. I will roll into v2.

Morten

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-09  8:27                             ` Morten Rasmussen
@ 2014-06-09 13:22                               ` Nicolas Pitre
  2014-06-11 11:02                                 ` Eduardo Valentin
  0 siblings, 1 reply; 71+ messages in thread
From: Nicolas Pitre @ 2014-06-09 13:22 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Ingo Molnar, Peter Zijlstra, Yuyang Du, Dirk Brandewie,
	Rafael J. Wysocki, linux-kernel, linux-pm, vincent.guittot,
	daniel.lezcano, preeti, Dietmar Eggemann, len.brown,
	jacob.jun.pan

On Mon, 9 Jun 2014, Morten Rasmussen wrote:

> On Sat, Jun 07, 2014 at 03:33:58AM +0100, Nicolas Pitre wrote:
> > On Fri, 6 Jun 2014, Ingo Molnar wrote:
> > 
> > > In any case, even with turbo frequencies, switching power use is 
> > > probably an order of magnitude higher than leakage current power use, 
> > > on any marketable chip, so we should concentrate on being able to 
> > > cover this first order effect (P/work ~ V^2), before considering any 
> > > second order effects (leakage current).
> > 
> > Just so that people are aware... We'll have to introduce thermal 
> > constraint management into the scheduler mix as well at some point.  
> > Right now what we have is an ad hoc subsystem that simply monitors 
> > temperature and applies crude cooling strategies when some thresholds are 
> > met. But a better strategy would imply thermal "provisioning".
> 
> There is already work going on to improve thermal management:
> 
> http://lwn.net/Articles/599598/
> 
> The proposal is based on power/energy models (too). The goal is to
> allocate power intelligently based on performance requirements.

Ah, great!  I missed that.

> While it is related to energy-aware scheduling and I fully agree that it
> is something we need to consider, I think it is worth developing the two
> ideas in parallel and looking at sharing things like the power model later
> once things mature. Energy-aware scheduling is complex enough on its
> own to keep us entertained for a while :-)

Absolutely.  This is why I said "at some point".


Nicolas

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY
  2014-06-09 10:20     ` Morten Rasmussen
@ 2014-06-10  9:39       ` Peter Zijlstra
  2014-06-10 10:06         ` Morten Rasmussen
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-10  9:39 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Henrik Austad, linux-kernel, linux-pm, mingo, rjw,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann

On Mon, Jun 09, 2014 at 11:20:27AM +0100, Morten Rasmussen wrote:
> On Sun, Jun 08, 2014 at 07:03:16AM +0100, Henrik Austad wrote:
> > On Fri, May 23, 2014 at 07:16:29PM +0100, Morten Rasmussen wrote:
> > > The Energy-aware scheduler implementation is guarded by
> > > CONFIG_SCHED_ENERGY.
> > > 
> > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > > ---
> > >  arch/arm/Kconfig |    5 +++++
> > >  1 file changed, 5 insertions(+)
> > > 
> > > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > > index ab438cb..bfc3a85 100644
> > > --- a/arch/arm/Kconfig
> > > +++ b/arch/arm/Kconfig
> > 
> > Is this going to be duplicated for each architecture enabling this? Why
> > not make a kernel/Kconfig.energy and link to that from those
> > architectures using it?
> 
> kernel/Kconfig.energy is better I think.

Well, strictly speaking I'd prefer to not have more sched CONFIG knobs.

Do we really need to have this CONFIG guarded?

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY
  2014-06-10  9:39       ` Peter Zijlstra
@ 2014-06-10 10:06         ` Morten Rasmussen
  2014-06-10 10:23           ` Peter Zijlstra
  0 siblings, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-10 10:06 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Henrik Austad, linux-kernel, linux-pm, mingo, rjw,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann

On Tue, Jun 10, 2014 at 10:39:43AM +0100, Peter Zijlstra wrote:
> On Mon, Jun 09, 2014 at 11:20:27AM +0100, Morten Rasmussen wrote:
> > On Sun, Jun 08, 2014 at 07:03:16AM +0100, Henrik Austad wrote:
> > > On Fri, May 23, 2014 at 07:16:29PM +0100, Morten Rasmussen wrote:
> > > > The Energy-aware scheduler implementation is guarded by
> > > > CONFIG_SCHED_ENERGY.
> > > > 
> > > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > > Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
> > > > ---
> > > >  arch/arm/Kconfig |    5 +++++
> > > >  1 file changed, 5 insertions(+)
> > > > 
> > > > diff --git a/arch/arm/Kconfig b/arch/arm/Kconfig
> > > > index ab438cb..bfc3a85 100644
> > > > --- a/arch/arm/Kconfig
> > > > +++ b/arch/arm/Kconfig
> > > 
> > > Is this going to be duplicated for each architecture enabling this? Why
> > > not make a kernel/Kconfig.energy and link to that from those
> > > architectures using it?
> > 
> > kernel/Kconfig.energy is better I think.
> 
> Well, strictly speaking I'd prefer to not have more sched CONFIG knobs.
> 
> Do we really need to have this CONFIG guarded?

How would you like to disable the energy stuff for users for whom
latency is everything?

I mean, we are adding some extra load/utilization tracking. While I
think we should do everything possible to minimize the overhead, I think
it is unrealistic to assume that it will be zero. Is some extra 'if
(energy_enabled)' acceptable?

I'm open for other suggestions.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-07 23:26                       ` Yuyang Du
  2014-06-09  8:59                         ` Morten Rasmussen
@ 2014-06-10 10:16                         ` Peter Zijlstra
  2014-06-10 17:01                           ` Nicolas Pitre
  2014-06-10 18:35                           ` Yuyang Du
  1 sibling, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-10 10:16 UTC (permalink / raw)
  To: Yuyang Du
  Cc: Dirk Brandewie, Rafael J. Wysocki, Morten Rasmussen,
	linux-kernel, linux-pm, mingo, vincent.guittot, daniel.lezcano,
	preeti, Dietmar Eggemann, len.brown, jacob.jun.pan

On Sun, Jun 08, 2014 at 07:26:29AM +0800, Yuyang Du wrote:
> Ok. I think we understand each other. But one more thing, I said P ~ V^3,
> because P ~ V^2*f and f ~ V, so P ~ V^3. Maybe some frequencies share the same
> voltage, but you can still safely assume V changes with f in general, and it
> will be more and more so, since we do need finer control over power consumption.

I didn't know the frequency part was proportionate to another voltage
term, ok, then the cubic term makes sense.

> > Sure, but realize that we must fully understand this governor and
> > integrate it in the scheduler if we're to attain the goal of IPC/watt
> > optimized scheduling behaviour.
> > 
> 
> Attain the goal of IPC/watt optimized?
> 
> I don't see how it can be done like this. As I said, what is unknown for
> prediction is perf scaling *and* changing workload. So the challenge for pstate
> control is in both. But I see more challenge in the changing workload than
> in the performance scaling or the resulting IPC impact (if workload is
> fixed).

But for the scheduler the workload change isn't that big a problem; we
know the history of each task, we know when tasks wake up and when we
move them around. Therefore we can fairly accurately predict this.

And given a simple P state model (like ARM) where the CPU simply does
what you tell it to, that all works out. We can change P-state at task
wakeup/sleep/migration and compute the most efficient P-state, and task
distribution, for the new task-set.
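
As illustrative pseudocode only (estimate_energy() is a hypothetical
stand-in for an energy-model lookup, not a function from this patch set),
the wakeup-time decision could look like:

#include <linux/cpumask.h>
#include <linux/kernel.h>
#include <linux/sched.h>

/* hypothetical: system energy estimate if p's utilization lands on cpu */
extern unsigned long estimate_energy(int cpu, struct task_struct *p);

static int select_energy_cpu(struct task_struct *p)
{
	unsigned long energy, best_energy = ULONG_MAX;
	int cpu, best_cpu = -1;

	for_each_online_cpu(cpu) {
		/* energy at the cheapest P-state that still fits the
		 * cpu's new utilization with p placed there */
		energy = estimate_energy(cpu, p);
		if (energy < best_energy) {
			best_energy = energy;
			best_cpu = cpu;
		}
	}

	return best_cpu;	/* the P-state is then programmed to match */
}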

> Currently, all freq governors take CPU utilization (load%) as the indicator
> (target), which can serve both: workload and perf scaling.

So the current cpufreq stuff is terminally broken in too many ways; it's
sampling, so it misses a lot of changes, and it's strictly cpu local, so it
completely misses SMP information (like the migrations etc..)

If we move a 50% task from CPU1 to CPU0, a sampling thing takes time to
adjust on both CPUs, whereas if its scheduler driven, we can instantly
adjust and be done, because we _know_ what we moved.

Now some of that is due to hysterical raisins, and some of that due to
broken hardware (hardware that needs to schedule in order to change its
state because it's behind some broken bus or other). But we should
basically kill off cpufreq for anything recent and sane.

> As for IPC/watt optimized, I don't see how it can be practical. Too micro to
> be used for the general well-being?

What other target would you optimize for? The purpose here is to build
an energy aware scheduler, one that schedules tasks so that the total
amount of energy, for the given amount of work, is minimal.

So we can't measure in Watt, since if we forced the CPU into the lowest
P-state (or even C-state for that matter) work would simply not
complete. So we need a complete energy term.

Now. IPC is instructions/cycle, Watt is Joule/second, so IPC/Watt is

instructions   second
------------ * ------ ~ instructions / joule
  cycle        joule

Seeing how both cycles and seconds are time units.

So for any given amount of instructions (the work that needs to be done), we
want the minimal amount of energy consumed, and IPC/Watt is the natural
metric to measure this over an entire workload.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY
  2014-06-10 10:06         ` Morten Rasmussen
@ 2014-06-10 10:23           ` Peter Zijlstra
  2014-06-10 11:17             ` Henrik Austad
  2014-06-10 11:24             ` Morten Rasmussen
  0 siblings, 2 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-10 10:23 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Henrik Austad, linux-kernel, linux-pm, mingo, rjw,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann

On Tue, Jun 10, 2014 at 11:06:41AM +0100, Morten Rasmussen wrote:
> How would you like to disable the energy stuff for users for whom
> latency is everything?
> 
> I mean, we are adding some extra load/utilization tracking. While I
> think we should do everything possible to minimize the overhead, I think
> it is unrealistic to assume that it will be zero. Is some extra 'if
> (energy_enabled)' acceptable?
> 
> I'm open for other suggestions.

We have the jump-label stuff to do self modifying code ;-) The only
thing we need to be careful with is data-layout.
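
A sketch of what that could look like with the static_key API (the key
name and the enable hook are made up for illustration):

#include <linux/jump_label.h>

struct static_key sched_energy_key = STATIC_KEY_INIT_FALSE;

static inline bool energy_aware(void)
{
	/* compiles to a patched no-op branch while the key is false */
	return static_key_false(&sched_energy_key);
}

/* flipped at runtime, e.g. from a sysctl or sched_feat handler */
void sched_energy_enable(void)
{
	static_key_slow_inc(&sched_energy_key);
}

With the key disabled, the energy path costs (nearly) nothing, which is
the point of avoiding the CONFIG knob.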

So I'm _hoping_ we can do all this without more CONFIG knobs, because
{PREEMPT*SMP*CGROUP^3*NUMA^2} is already entirely annoying to
build and run test, not to mention that distro builds will have no other
option than to enable everything anyhow.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY
  2014-06-10 10:23           ` Peter Zijlstra
@ 2014-06-10 11:17             ` Henrik Austad
  2014-06-10 12:19               ` Peter Zijlstra
  2014-06-10 11:24             ` Morten Rasmussen
  1 sibling, 1 reply; 71+ messages in thread
From: Henrik Austad @ 2014-06-10 11:17 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Morten Rasmussen, linux-kernel, linux-pm, mingo, rjw,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann

On Tue, Jun 10, 2014 at 12:23:53PM +0200, Peter Zijlstra wrote:
> On Tue, Jun 10, 2014 at 11:06:41AM +0100, Morten Rasmussen wrote:
> > How would you like to disable the energy stuff for users for whom
> > latency is everything?
> > 
> > I mean, we are adding some extra load/utilization tracking. While I
> > think we should do everything possible to minimize the overhead, I think
> > it is unrealistic to assume that it will be zero. Is some extra 'if
> > (energy_enabled)' acceptable?
> > 
> > I'm open for other suggestions.
> 
> We have the jump-label stuff to do self modifying code ;-) The only
> thing we need to be careful with is data-layout.

Isn't this asking for trouble?

I do get the point of not introducing more make-ifdeffery, but I'm not
so sure the alternative is much better. Do we really want to spend time
tracking down bugs introduced via a self-modifying process in something
as central as the scheduler?

> So I'm _hoping_ we can do all this without more CONFIG knobs, because
> {PREEMPT*SMP*CGROUP^3*NUMA^2} is already entirely annoying to
> build and run test, not to mention that distro builds will have no other
> option than to enable everything anyhow.

True, but if that is the argument, how is adding this as a dynamic thing
any better, you still end up with a test-matrix of the same size?

Building a kernel isn't _that_ much work and it would make the
test-scripts that much simpler to maintain if we don't have to rely
on some dynamic tweaking of the core.

Just sayin'

-- 
Henrik Austad

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY
  2014-06-10 10:23           ` Peter Zijlstra
  2014-06-10 11:17             ` Henrik Austad
@ 2014-06-10 11:24             ` Morten Rasmussen
  2014-06-10 12:24               ` Peter Zijlstra
  1 sibling, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-10 11:24 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Henrik Austad, linux-kernel, linux-pm, mingo, rjw,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann

On Tue, Jun 10, 2014 at 11:23:53AM +0100, Peter Zijlstra wrote:
> On Tue, Jun 10, 2014 at 11:06:41AM +0100, Morten Rasmussen wrote:
> > How would you like to disable the energy stuff for users for whom
> > latency is everything?
> > 
> > I mean, we are adding some extra load/utilization tracking. While I
> > think we should do everything possible to minimize the overhead, I think
> > it is unrealistic to assume that it will be zero. Is some extra 'if
> > (energy_enabled)' acceptable?
> > 
> > I'm open for other suggestions.
> 
> We have the jump-label stuff to do self modifying code ;-) The only
> thing we need to be careful with is data-layout.

Thanks. I can see that it is already used for various bits in
kernel/sched/*. I didn't catch anything in Documentation/static-keys.txt
related to data-layout caveats. Is there some other
documentation/patches I should read before messing everything up? ;-)

> So I'm _hoping_ we can do all this without more CONFIG knobs, because
> {PREEMPT*SMP*CGROUP^3*NUMA^2} is already entirely annoying to
> build and run test, not to mention that distro builds will have no other
> option than to enable everything anyhow.

Fair enough.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY
  2014-06-10 11:17             ` Henrik Austad
@ 2014-06-10 12:19               ` Peter Zijlstra
  0 siblings, 0 replies; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-10 12:19 UTC (permalink / raw)
  To: Henrik Austad
  Cc: Morten Rasmussen, linux-kernel, linux-pm, mingo, rjw,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann

On Tue, Jun 10, 2014 at 01:17:32PM +0200, Henrik Austad wrote:
> On Tue, Jun 10, 2014 at 12:23:53PM +0200, Peter Zijlstra wrote:
> > On Tue, Jun 10, 2014 at 11:06:41AM +0100, Morten Rasmussen wrote:
> > > How would you like to disable the energy stuff for users for whom
> > > latency is everything?
> > > 
> > > I mean, we are adding some extra load/utilization tracking. While I
> > > think we should do everything possible to minimize the overhead, I think
> > > it is unrealistic to assume that it will be zero. Is some extra 'if
> > > (energy_enabled)' acceptable?
> > > 
> > > I'm open for other suggestions.
> > 
> > We have the jump-label stuff to do self modifying code ;-) The only
> > thing we need to be careful with is data-layout.
> 
> Isn't this asking for trouble?
> 
> I do get the point of not introducing more make-ifdeffery, but I'm not
> so sure the alternative is much better. Do we really want to spend time
> tracking down bugs introduced via a self-modifying process in something
> as central as the scheduler?

It's already chock full of that stuff ;-)

> > So I'm _hoping_ we can do all this without more CONFIG knobs, because
> > {PREEMPT*SMP*CGROUP^3*NUMA^2} is already entirely annoying to
> > build and run test, not to mention that distro builds will have no other
> > option than to enable everything anyhow.
> 
> True, but if that is the argument, how is adding this as a dynamic thing
> any better, you still end up with a test-matrix of the same size?

Test-matrix yes, sadly so, and there's nothing we can really do about
that, so that sucks.

But it does reduce the coverage of the tests; everything that is not
uber critical fast path we can do unconditionally. So all the
sched_domain wankery gets tested on every boot / hotplug event, which is
tons better than only when that particular option is built in.

So while the total test matrix does suck rocks, the actual code that
needs testing per option can be significantly reduced.

> Building a kernel isn't _that_ much work and it would make the
> test-scripts that much simpler to maintain if we don't have to rely 
> on some dynamic tweaking of the core.

It's exponential; given that I now already have to build
PREEMPT*SMP*CGROUP^3*NUMA^2 = 2^7 = 128 kernels to cover all options,
adding one more option means I'll have to build another 128 kernels.
Building 128 kernels does take a lot of time, no matter how far you
strip that .config and no matter that I can build a kernel in <50 seconds.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY
  2014-06-10 11:24             ` Morten Rasmussen
@ 2014-06-10 12:24               ` Peter Zijlstra
  2014-06-10 14:41                 ` Morten Rasmussen
  0 siblings, 1 reply; 71+ messages in thread
From: Peter Zijlstra @ 2014-06-10 12:24 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Henrik Austad, linux-kernel, linux-pm, mingo, rjw,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann

On Tue, Jun 10, 2014 at 12:24:03PM +0100, Morten Rasmussen wrote:
> On Tue, Jun 10, 2014 at 11:23:53AM +0100, Peter Zijlstra wrote:
> > On Tue, Jun 10, 2014 at 11:06:41AM +0100, Morten Rasmussen wrote:
> > > How would you like to disable the energy stuff for users for whom
> > > latency is everything?
> > > 
> > > I mean, we are adding some extra load/utilization tracking. While I
> > > think we should do everything possible to minimize the overhead, I think
> > > it is unrealistic to assume that it will be zero. Is some extra 'if
> > > (energy_enabled)' acceptable?
> > > 
> > > I'm open for other suggestions.
> > 
> > We have the jump-label stuff to do self modifying code ;-) The only
> > thing we need to be careful with is data-layout.
> 
> Thanks. I can see that it is already used for various bits in
> kernel/sched/*. I didn't catch anything in Documentation/static-keys.txt
> related to data-layout caveats. Is there some other
> documentation/patches I should read before messing everything up? ;-)

So the data-layout was mostly referring to things like making sure that
struct sched_avg doesn't end up straddling a cacheline somewhere by
accident.

The most expensive part of the per-task accounting nonsense is the
amount of memory we need to touch to do so; the actual instructions come
second, unless of course we go put tons of divisions in there :-)
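
An illustration of the data-layout point (example fields, not the actual
struct sched_avg definition): keep the hot accounting members packed and
aligned so one update touches a single cache line.

#include <linux/cache.h>
#include <linux/types.h>

struct sched_avg_example {
	u64		last_runnable_update;
	u32		runnable_avg_sum;
	u32		runnable_avg_period;
	unsigned long	load_avg_contrib;
} ____cacheline_aligned;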

BTW, are cachelines 64 bytes for you ARM people too?


^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY
  2014-06-10 12:24               ` Peter Zijlstra
@ 2014-06-10 14:41                 ` Morten Rasmussen
  0 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-10 14:41 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Henrik Austad, linux-kernel, linux-pm, mingo, rjw,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann,
	Catalin Marinas, Will Deacon

On Tue, Jun 10, 2014 at 01:24:35PM +0100, Peter Zijlstra wrote:
> On Tue, Jun 10, 2014 at 12:24:03PM +0100, Morten Rasmussen wrote:
> > On Tue, Jun 10, 2014 at 11:23:53AM +0100, Peter Zijlstra wrote:
> > > On Tue, Jun 10, 2014 at 11:06:41AM +0100, Morten Rasmussen wrote:
> > > > How would you like to disable the energy stuff for users for whom
> > > > latency is everything?
> > > > 
> > > > I mean, we are adding some extra load/utilization tracking. While I
> > > > think we should do everything possible to minimize the overhead, I think
> > > > it is unrealistic to assume that it will be zero. Is some extra 'if
> > > > (energy_enabled)' acceptable?
> > > > 
> > > > I'm open for other suggestions.
> > > 
> > > We have the jump-label stuff to do self modifying code ;-) The only
> > > thing we need to be careful with is data-layout.
> > 
> > Thanks. I can see that it is already used for various bits in
> > kernel/sched/*. I didn't catch anything in Documentation/static-keys.txt
> > related to data-layout caveats. Is there some other
> > documentation/patches I should read before messing everything up? ;-)
> 
> So the data-layout was mostly referring to things like making sure that
> struct sched_avg doesn't end up straddling a cacheline somewhere by
> accident.
> 
> The most expensive part of the per-task accounting nonsense is the
> amount of memory we need to touch to do so; the actual instructions come
> second, unless of course we go put tons of divisions in there :-)

Makes sense.

> BTW, are cachelines 64 bytes for you ARM people too?

Mostly yes, but as with a lot of other things on ARM it is
implementation defined. The cacheline sizes can be probed at runtime,
but for things where we don't know I think 64 bytes is the current
assumption.

Catalin or Will would be able to provide a more detailed answer.

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-10 10:16                         ` Peter Zijlstra
@ 2014-06-10 17:01                           ` Nicolas Pitre
  2014-06-10 18:35                           ` Yuyang Du
  1 sibling, 0 replies; 71+ messages in thread
From: Nicolas Pitre @ 2014-06-10 17:01 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Yuyang Du, Dirk Brandewie, Rafael J. Wysocki, Morten Rasmussen,
	linux-kernel, linux-pm, mingo, vincent.guittot, daniel.lezcano,
	preeti, Dietmar Eggemann, len.brown, jacob.jun.pan

On Tue, 10 Jun 2014, Peter Zijlstra wrote:

> So the current cpufreq stuff is terminally broken in too many ways; it's
> sampling, so it misses a lot of changes, and it's strictly cpu local, so it
> completely misses SMP information (like the migrations etc..)
> 
> If we move a 50% task from CPU1 to CPU0, a sampling thing takes time to
> adjust on both CPUs, whereas if its scheduler driven, we can instantly
> adjust and be done, because we _know_ what we moved.

Incidentally I submitted a LWN article highlighting those very issues 
and the planned remedies.  No confirmation of a publication date though.

> Now some of that is due to hysterical raisins, and some of that due to
> broken hardware (hardware that needs to schedule in order to change its
> state because it's behind some broken bus or other). But we should
> basically kill off cpufreq for anything recent and sane.

Even if some change has to happen through a kernel thread, you're still 
far better off with the scheduler requesting this change proactively than 
waiting first for the cpufreq governor to catch up with the load and then 
for the freq change thread to be scheduled.


Nicolas

^ permalink raw reply	[flat|nested] 71+ messages in thread

* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-10 10:16                         ` Peter Zijlstra
  2014-06-10 17:01                           ` Nicolas Pitre
@ 2014-06-10 18:35                           ` Yuyang Du
  1 sibling, 0 replies; 71+ messages in thread
From: Yuyang Du @ 2014-06-10 18:35 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dirk Brandewie, Rafael J. Wysocki, Morten Rasmussen,
	linux-kernel, linux-pm, mingo, vincent.guittot, daniel.lezcano,
	preeti, Dietmar Eggemann, len.brown, jacob.jun.pan

On Tue, Jun 10, 2014 at 12:16:22PM +0200, Peter Zijlstra wrote:
> What other target would you optimize for? The purpose here is to build
> an energy aware scheduler, one that schedules tasks so that the total
> amount of energy, for the given amount of work, is minimal.
> 
> So we can't measure in Watt, since if we forced the CPU into the lowest
> P-state (or even C-state for that matter) work would simply not
> complete. So we need a complete energy term.
> 
> Now. IPC is instructions/cycle, Watt is Joule/second, so IPC/Watt is
> 
> instructions   second
> ------------ * ------ ~ instructions / joule
>   cycle        joule
> 
> Seeing how both cycles and seconds are time units.
> 
> So for any given amount of instructions (the work that needs to be done), we
> want the minimal amount of energy consumed, and IPC/Watt is the natural
> metric to measure this over an entire workload.

Ok, I understand. Whether we take IPC/Watt as an input metric in the
scheduler or as a goal for the scheduler, we definitely need to try both.

Thanks, Peter.

Yuyang
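
For a concrete feel of the unit algebra above, a small worked example
with made-up counter values:

/* IPC/Watt scaled by the clock rate equals instructions per joule,
 * since the cycle and second units cancel.  Numbers are illustrative. */
#include <stdio.h>

int main(void)
{
	double instructions = 4.0e9;	/* retired instructions        */
	double cycles       = 2.0e9;	/* CPU cycles                  */
	double seconds      = 1.0;	/* wall-clock time             */
	double joules       = 5.0;	/* energy over the same window */

	double ipc  = instructions / cycles;	/* 2.0 insn/cycle */
	double watt = joules / seconds;		/* 5.0 J/s        */
	double freq = cycles / seconds;		/* 2e9 cycle/s    */

	printf("insn/J via IPC/Watt: %.3g\n", ipc / watt * freq);	/* 8e8 */
	printf("insn/J directly:     %.3g\n", instructions / joules);	/* 8e8 */
	return 0;
}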


* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-09 13:22                               ` Nicolas Pitre
@ 2014-06-11 11:02                                 ` Eduardo Valentin
  2014-06-11 11:42                                   ` Morten Rasmussen
  0 siblings, 1 reply; 71+ messages in thread
From: Eduardo Valentin @ 2014-06-11 11:02 UTC (permalink / raw)
  To: Nicolas Pitre
  Cc: Morten Rasmussen, Ingo Molnar, Peter Zijlstra, Yuyang Du,
	Dirk Brandewie, Rafael J. Wysocki, linux-kernel, linux-pm,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann,
	len.brown, jacob.jun.pan

Hello,

On Mon, Jun 09, 2014 at 09:22:49AM -0400, Nicolas Pitre wrote:
> On Mon, 9 Jun 2014, Morten Rasmussen wrote:
> 
> > On Sat, Jun 07, 2014 at 03:33:58AM +0100, Nicolas Pitre wrote:
> > > On Fri, 6 Jun 2014, Ingo Molnar wrote:
> > > 
> > > > In any case, even with turbo frequencies, switching power use is 
> > > > probably an order of magnitude higher than leakage current power use, 
> > > > on any marketable chip, so we should concentrate on being able to 
> > > > cover this first order effect (P/work ~ V^2), before considering any 
> > > > second order effects (leakage current).
> > > 
> > > Just so that people are aware... We'll have to introduce thermal 
> > > constraint management into the scheduler mix as well at some point.  
> > > Right now what we have is an ad hoc subsystem that simply monitors 
> > > temperature and applies crude cooling strategies when some thresholds are 
> > > met. But a better strategy would imply thermal "provisioning".
> > 
> > There is already work going on to improve thermal management:
> > 
> > http://lwn.net/Articles/599598/
> > 
> > The proposal is based on power/energy models (too). The goal is to

Can you please point me to the other piece of code which is using
power/energy models too?  We are considering having these models within
the thermal software components. But if we already have more than one
user, it might be worth considering a separate API.
 
> > allocate power intelligently based on performance requirements.
> 
> Ah, great!  I missed that.
> 
> > While it is related to energy-aware scheduling and I fully agree that it
> > is something we need to consider, I think it is worth developing the two
> > ideas in parallel and look at sharing things like the power model later
> > once things mature. Energy-aware scheduling is complex enough on its
> > own to keep us entertained for a while :-)
> 
> Absolutely.  This is why I said "at some point".
> 
> 
> Nicolas
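
As a first-order illustration of the P/work ~ V^2 point quoted above:
switching power is roughly P = C * V^2 * f, and since the work rate
scales with f, the energy per cycle reduces to C * V^2. A small sketch
with made-up numbers:

/* Energy per cycle at two operating points; to first order only V
 * matters, which is why higher voltage costs more per unit of work. */
#include <stdio.h>

int main(void)
{
	double C = 1.0e-9;	/* effective switched capacitance (F), made up */
	struct { double v, f; } op[] = {
		{ 0.9, 1.0e9 },	/* low P-state  */
		{ 1.2, 2.0e9 },	/* high P-state */
	};

	for (int i = 0; i < 2; i++) {
		double power = C * op[i].v * op[i].v * op[i].f;	/* watts   */
		double epc = power / op[i].f;			/* J/cycle */
		printf("V=%.1fV f=%.1fGHz P=%.2fW E/cycle=%.3g J\n",
		       op[i].v, op[i].f / 1e9, power, epc);
	}
	return 0;
}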


* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-11 11:02                                 ` Eduardo Valentin
@ 2014-06-11 11:42                                   ` Morten Rasmussen
  2014-06-11 11:43                                     ` Eduardo Valentin
  0 siblings, 1 reply; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-11 11:42 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: Nicolas Pitre, Ingo Molnar, Peter Zijlstra, Yuyang Du,
	Dirk Brandewie, Rafael J. Wysocki, linux-kernel, linux-pm,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann,
	len.brown, jacob.jun.pan

On Wed, Jun 11, 2014 at 12:02:51PM +0100, Eduardo Valentin wrote:
> Hello,
> 
> On Mon, Jun 09, 2014 at 09:22:49AM -0400, Nicolas Pitre wrote:
> > On Mon, 9 Jun 2014, Morten Rasmussen wrote:
> > 
> > > On Sat, Jun 07, 2014 at 03:33:58AM +0100, Nicolas Pitre wrote:
> > > > On Fri, 6 Jun 2014, Ingo Molnar wrote:
> > > > 
> > > > > In any case, even with turbo frequencies, switching power use is 
> > > > > probably an order of magnitude higher than leakage current power use, 
> > > > > on any marketable chip, so we should concentrate on being able to 
> > > > > cover this first order effect (P/work ~ V^2), before considering any 
> > > > > second order effects (leakage current).
> > > > 
> > > > Just so that people are aware... We'll have to introduce thermal 
> > > > constraint management into the scheduler mix as well at some point.  
> > > > Right now what we have is an ad hoc subsystem that simply monitors 
> > > > temperature and applies crude cooling strategies when some thresholds are 
> > > > met. But a better strategy would imply thermal "provisioning".
> > > 
> > > There is already work going on to improve thermal management:
> > > 
> > > http://lwn.net/Articles/599598/
> > > 
> > > The proposal is based on power/energy models (too). The goal is to
> 
> Can you please point me to the other piece of code which is using
> power/energy models too?  We are considering having these models within
> the thermal software components. But if we already have more than one
> user, it might be worth considering a separate API.

The link above is to the thermal management proposal which includes a
power model. This one might work better:

http://article.gmane.org/gmane.linux.power-management.general/45000

The power/energy model in this energy-aware scheduling proposal is
different. An example of the model data is in patch 6 (the start of this
thread) and the actual use of the model is in patch 11 and the following
patches. As said below, the two proposals are independent, but there
might be potential for merging the power/energy models once the
proposals are more mature.

Morten

>  
> > > allocate power intelligently based on performance requirements.
> > 
> > Ah, great!  I missed that.
> > 
> > > While it is related to energy-aware scheduling and I fully agree that it
> > > is something we need to consider, I think it is worth developing the two
> > > ideas in parallel and look at sharing things like the power model later
> > > once things mature. Energy-aware scheduling is complex enough on its
> > > own to keep us entertained for a while :-)
> > 
> > Absolutely.  This is why I said "at some point".
> > 
> > 
> > Nicolas


* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-11 11:42                                   ` Morten Rasmussen
@ 2014-06-11 11:43                                     ` Eduardo Valentin
  2014-06-11 13:37                                       ` Morten Rasmussen
  0 siblings, 1 reply; 71+ messages in thread
From: Eduardo Valentin @ 2014-06-11 11:43 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Nicolas Pitre, Ingo Molnar, Peter Zijlstra, Yuyang Du,
	Dirk Brandewie, Rafael J. Wysocki, linux-kernel, linux-pm,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann,
	len.brown, jacob.jun.pan

On Wed, Jun 11, 2014 at 12:42:18PM +0100, Morten Rasmussen wrote:
> On Wed, Jun 11, 2014 at 12:02:51PM +0100, Eduardo Valentin wrote:
> > Hello,
> > 
> > On Mon, Jun 09, 2014 at 09:22:49AM -0400, Nicolas Pitre wrote:
> > > On Mon, 9 Jun 2014, Morten Rasmussen wrote:
> > > 
> > > > On Sat, Jun 07, 2014 at 03:33:58AM +0100, Nicolas Pitre wrote:
> > > > > On Fri, 6 Jun 2014, Ingo Molnar wrote:
> > > > > 
> > > > > > In any case, even with turbo frequencies, switching power use is 
> > > > > > probably an order of magnitude higher than leakage current power use, 
> > > > > > on any marketable chip, so we should concentrate on being able to 
> > > > > > cover this first order effect (P/work ~ V^2), before considering any 
> > > > > > second order effects (leakage current).
> > > > > 
> > > > > Just so that people are aware... We'll have to introduce thermal 
> > > > > constraint management into the scheduler mix as well at some point.  
> > > > > Right now what we have is an ad hoc subsystem that simply monitors 
> > > > > temperature and applies crude cooling strategies when some thresholds are 
> > > > > met. But a better strategy would imply thermal "provisioning".
> > > > 
> > > > There is already work going on to improve thermal management:
> > > > 
> > > > http://lwn.net/Articles/599598/
> > > > 
> > > > The proposal is based on power/energy models (too). The goal is to
> > 
> > Can you please point me to the other piece of code which is using
> > power/energy models too?  We are considering having these models within
> > the thermal software components. But if we already have more than one
> > user, it might be worth considering a separate API.
> 
> The link above is to the thermal management proposal which includes a
> power model. This one might work better:
> 
> http://article.gmane.org/gmane.linux.power-management.general/45000
> 
> The power/energy model in this energy-aware scheduling proposal is
> different. An example of the model data is in patch 6 (the start of this
> thread) and the actual use of the model is in patch 11 and the following
> patches. As said below, the two proposals are independent, but there
> might be potential for merging the power/energy models once the
> proposals are more mature.

Morten,

For the power allocator thermal governor, I am aware of it, as I am
reviewing it. I am more interested in other users of power models, apart
from the thermal subsystem.

> 
> Morten
> 
> >  
> > > > allocate power intelligently based on performance requirements.
> > > 
> > > Ah, great!  I missed that.
> > > 
> > > > While it is related to energy-aware scheduling and I fully agree that it
> > > > is something we need to consider, I think it is worth developing the two
> > > > ideas in parallel and look at sharing things like the power model later
> > > > once things mature. Energy-aware scheduling is complex enough on its
> > > > own to keep us entertained for a while :-)
> > > 
> > > Absolutely.  This is why I said "at some point".
> > > 
> > > 
> > > Nicolas


* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-11 11:43                                     ` Eduardo Valentin
@ 2014-06-11 13:37                                       ` Morten Rasmussen
  0 siblings, 0 replies; 71+ messages in thread
From: Morten Rasmussen @ 2014-06-11 13:37 UTC (permalink / raw)
  To: Eduardo Valentin
  Cc: Nicolas Pitre, Ingo Molnar, Peter Zijlstra, Yuyang Du,
	Dirk Brandewie, Rafael J. Wysocki, linux-kernel, linux-pm,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann,
	len.brown, jacob.jun.pan

On Wed, Jun 11, 2014 at 12:43:26PM +0100, Eduardo Valentin wrote:
> On Wed, Jun 11, 2014 at 12:42:18PM +0100, Morten Rasmussen wrote:
> > On Wed, Jun 11, 2014 at 12:02:51PM +0100, Eduardo Valentin wrote:
> > > Hello,
> > > 
> > > On Mon, Jun 09, 2014 at 09:22:49AM -0400, Nicolas Pitre wrote:
> > > > On Mon, 9 Jun 2014, Morten Rasmussen wrote:
> > > > 
> > > > > On Sat, Jun 07, 2014 at 03:33:58AM +0100, Nicolas Pitre wrote:
> > > > > > On Fri, 6 Jun 2014, Ingo Molnar wrote:
> > > > > > 
> > > > > > > In any case, even with turbo frequencies, switching power use is 
> > > > > > > probably an order of magnitude higher than leakage current power use, 
> > > > > > > on any marketable chip, so we should concentrate on being able to 
> > > > > > > cover this first order effect (P/work ~ V^2), before considering any 
> > > > > > > second order effects (leakage current).
> > > > > > 
> > > > > > Just so that people are aware... We'll have to introduce thermal 
> > > > > > constraint management into the scheduler mix as well at some point.  
> > > > > > Right now what we have is an ad hoc subsystem that simply monitors 
> > > > > > temperature and applies crude cooling strategies when some thresholds are 
> > > > > > met. But a better strategy would imply thermal "provisioning".
> > > > > 
> > > > > There is already work going on to improve thermal management:
> > > > > 
> > > > > http://lwn.net/Articles/599598/
> > > > > 
> > > > > The proposal is based on power/energy models (too). The goal is to
> > > 
> > > Can you please point me to the other piece of code which is using
> > > power/energy models too?  We are considering having these models within
> > > the thermal software components. But if we already have more than one
> > > user, it might be worth considering a separate API.
> > 
> > The link above is to the thermal management proposal which includes a
> > power model. This one might work better:
> > 
> > http://article.gmane.org/gmane.linux.power-management.general/45000
> > 
> > The power/energy model in this energy-aware scheduling proposal is
> > different. An example of the model data is in patch 6 (the start of this
> > thread) and the actual use of the model is in patch 11 and the following
> > patches. As said below, the two proposals are independent, but there
> > might be potential for merging the power/energy models once the
> > proposals are more mature.
> 
> Morten,
> 
> For the power allocator thermal governor, I am aware of it, as I am
> reviewing it. I am more interested in other users of power models, apart
> from the thermal subsystem.

The user in this proposal is the scheduler. The intention is to
eventually tie cpuidle and cpufreq closer to the scheduler. When/if that
happens, they might become users too.


* Re: [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler
  2014-06-06 14:29             ` Morten Rasmussen
@ 2014-06-12 15:05               ` Vince Weaver
  0 siblings, 0 replies; 71+ messages in thread
From: Vince Weaver @ 2014-06-12 15:05 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, linux-kernel, linux-pm, mingo, rjw,
	vincent.guittot, daniel.lezcano, preeti, Dietmar Eggemann

On Fri, 6 Jun 2014, Morten Rasmussen wrote:

> On Fri, Jun 06, 2014 at 02:43:03PM +0100, Peter Zijlstra wrote:
> > On Fri, Jun 06, 2014 at 02:15:10PM +0100, Morten Rasmussen wrote:
> > > > > ARM TC2 has on-chip energy counters for counting energy consumed by the
> > > > > A7 and A15 clusters. They are fairly accurate. 
> > > > 
> > > > Recent Intel chips have that too; they come packaged as:
> > > > 
> > > >   perf stat -a -e "power/energy-cores/" -- cmd
> > > > 
> > > > (through the perf_event_intel_rapl.c driver), It would be ideal if the
> > > > ARM equivalent was available through a similar interface.
> > > > 
> > > > http://lwn.net/Articles/573602/
> > > 
> > > Nice. On ARM it is not mandatory to have energy counters, and what they
> > > actually measure, if they are implemented, is implementation dependent.
> > > However, each vendor does extensive evaluation and characterization of
> > > their implementation already, so I don't think it would be a problem for
> > > them to provide the numbers.
> > 
> > How is the ARM energy thing exposed? Through the regular PMU but with
> > vendor specific events, or through a separate interface, entirely vendor
> > specific?
> 
> There is already an upstream hwmon driver for TC2 with an easy-to-use
> sysfs interface for all the energy counters. So it is somewhat vendor
> specific at the moment, unfortunately.

What is the plan for future interfaces for energy info?

Intel RAPL of course has a perf_event interface.

However, AMD's (somewhat unfortunately acronymed) Application Power
Management exports similar information via hwmon and the fam15h_power
driver.

And it sounds like ARM systems also put things in hwmon.

User tools like PAPI can sort of abstract this (for example, it supports
getting RAPL data from perf_event while also having a driver for getting
info from hwmon).  But users stuck with perf end up having to use multiple
tools to get energy and performance info simultaneously on non-Intel
hardware.

Vince
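
For reference, reading one of these hwmon energy counters is a short
exercise against the standard sysfs ABI (energy*_input reports
microjoules); the hwmon0/energy1_input path below is illustrative only,
as the right device has to be located by name:

/* Read an energy counter through the hwmon sysfs ABI. */
#include <stdio.h>

int main(void)
{
	const char *path = "/sys/class/hwmon/hwmon0/energy1_input";
	unsigned long long uj;	/* microjoules */
	FILE *f = fopen(path, "r");

	if (!f) {
		perror(path);
		return 1;
	}
	if (fscanf(f, "%llu", &uj) != 1) {
		fclose(f);
		fprintf(stderr, "failed to parse %s\n", path);
		return 1;
	}
	fclose(f);

	printf("energy: %.6f J\n", uj / 1e6);
	return 0;
}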


end of thread

Thread overview: 71+ messages
2014-05-23 18:16 [RFC PATCH 00/16] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 01/16] sched: Documentation for scheduler energy cost model Morten Rasmussen
2014-06-05  8:49   ` Vincent Guittot
2014-06-05 11:35     ` Morten Rasmussen
2014-06-05 15:02       ` Vincent Guittot
2014-05-23 18:16 ` [RFC PATCH 02/16] sched: Introduce CONFIG_SCHED_ENERGY Morten Rasmussen
2014-06-08  6:03   ` Henrik Austad
2014-06-09 10:20     ` Morten Rasmussen
2014-06-10  9:39       ` Peter Zijlstra
2014-06-10 10:06         ` Morten Rasmussen
2014-06-10 10:23           ` Peter Zijlstra
2014-06-10 11:17             ` Henrik Austad
2014-06-10 12:19               ` Peter Zijlstra
2014-06-10 11:24             ` Morten Rasmussen
2014-06-10 12:24               ` Peter Zijlstra
2014-06-10 14:41                 ` Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 03/16] sched: Introduce sd energy data structures Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 04/16] sched: Allocate and initialize sched energy Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 05/16] sched: Add sd energy procfs interface Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 06/16] arm: topology: Define TC2 sched energy and provide it to scheduler Morten Rasmussen
2014-05-30 12:04   ` Peter Zijlstra
2014-06-02 14:15     ` Morten Rasmussen
2014-06-03 11:41       ` Peter Zijlstra
2014-06-04 13:49         ` Morten Rasmussen
2014-06-03 11:44   ` Peter Zijlstra
2014-06-04 15:42     ` Morten Rasmussen
2014-06-04 16:16       ` Peter Zijlstra
2014-06-06 13:15         ` Morten Rasmussen
2014-06-06 13:43           ` Peter Zijlstra
2014-06-06 14:29             ` Morten Rasmussen
2014-06-12 15:05               ` Vince Weaver
2014-06-03 11:50   ` Peter Zijlstra
2014-06-04 16:02     ` Morten Rasmussen
2014-06-04 17:27       ` Peter Zijlstra
2014-06-04 21:56         ` Rafael J. Wysocki
2014-06-05  6:52           ` Peter Zijlstra
2014-06-05 15:03             ` Dirk Brandewie
2014-06-05 20:29               ` Yuyang Du
2014-06-06  8:05                 ` Peter Zijlstra
2014-06-06  0:35                   ` Yuyang Du
2014-06-06 10:50                     ` Peter Zijlstra
2014-06-06 12:13                       ` Ingo Molnar
2014-06-06 12:27                         ` Ingo Molnar
2014-06-06 14:11                           ` Morten Rasmussen
2014-06-07  2:33                           ` Nicolas Pitre
2014-06-09  8:27                             ` Morten Rasmussen
2014-06-09 13:22                               ` Nicolas Pitre
2014-06-11 11:02                                 ` Eduardo Valentin
2014-06-11 11:42                                   ` Morten Rasmussen
2014-06-11 11:43                                     ` Eduardo Valentin
2014-06-11 13:37                                       ` Morten Rasmussen
2014-06-07 23:53                         ` Yuyang Du
2014-06-07 23:26                       ` Yuyang Du
2014-06-09  8:59                         ` Morten Rasmussen
2014-06-09  2:15                           ` Yuyang Du
2014-06-10 10:16                         ` Peter Zijlstra
2014-06-10 17:01                           ` Nicolas Pitre
2014-06-10 18:35                           ` Yuyang Du
2014-06-06 16:27                     ` Jacob Pan
2014-06-06 13:03         ` Morten Rasmussen
2014-06-07  2:52         ` Nicolas Pitre
2014-05-23 18:16 ` [RFC PATCH 07/16] sched: Introduce system-wide sched_energy Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 08/16] sched: Introduce SD_SHARE_CAP_STATES sched_domain flag Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 09/16] sched, cpufreq: Introduce current cpu compute capacity into scheduler Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 10/16] sched, cpufreq: Current compute capacity hack for ARM TC2 Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 11/16] sched: Energy model functions Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 12/16] sched: Task wakeup tracking Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 13/16] sched: Take task wakeups into account in energy estimates Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 14/16] sched: Use energy model in select_idle_sibling Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 15/16] sched: Use energy to guide wakeup task placement Morten Rasmussen
2014-05-23 18:16 ` [RFC PATCH 16/16] sched: Disable wake_affine to broaden the scope of wakeup target cpus Morten Rasmussen
