All of lore.kernel.org
 help / color / mirror / Atom feed
From: Morten Rasmussen <morten.rasmussen@arm.com>
To: peterz@infradead.org, mingo@redhat.com
Cc: vincent.guittot@linaro.org,
	Dietmar Eggemann <Dietmar.Eggemann@arm.com>,
	yuyang.du@intel.com, preeti@linux.vnet.ibm.com,
	mturquette@linaro.org, rjw@rjwysocki.net,
	Juri Lelli <Juri.Lelli@arm.com>,
	sgurrappadi@nvidia.com, pang.xunlei@zte.com.cn,
	linux-kernel@vger.kernel.org, linux-pm@vger.kernel.org,
	morten.rasmussen@arm.com
Subject: [RFCv4 PATCH 13/34] sched: Documentation for scheduler energy cost model
Date: Tue, 12 May 2015 20:38:48 +0100	[thread overview]
Message-ID: <1431459549-18343-14-git-send-email-morten.rasmussen@arm.com> (raw)
In-Reply-To: <1431459549-18343-1-git-send-email-morten.rasmussen@arm.com>

This documentation patch provides an overview of the experimental
scheduler energy costing model, associated data structures, and a
reference recipe on how platforms can be characterized to derive energy
models.

Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
---
 Documentation/scheduler/sched-energy.txt | 363 +++++++++++++++++++++++++++++++
 1 file changed, 363 insertions(+)
 create mode 100644 Documentation/scheduler/sched-energy.txt

diff --git a/Documentation/scheduler/sched-energy.txt b/Documentation/scheduler/sched-energy.txt
new file mode 100644
index 0000000..f2a4c19
--- /dev/null
+++ b/Documentation/scheduler/sched-energy.txt
@@ -0,0 +1,363 @@
+Energy cost model for energy-aware scheduling (EXPERIMENTAL)
+
+Introduction
+=============
+
+The basic energy model uses platform energy data stored in sched_group_energy
+data structures attached to the sched_groups in the sched_domain hierarchy. The
+energy cost model offers two functions that can be used to guide scheduling
+decisions:
+
+1.	static unsigned int sched_group_energy(struct energy_env *eenv)
+2.	static int energy_diff(struct energy_env *eenv)
+
+sched_group_energy() estimates the energy consumed by all cpus in a specific
+sched_group including any shared resources owned exclusively by this group of
+cpus. Resources shared with other cpus are excluded (e.g. later level caches).
+
+energy_diff() estimates the total energy impact of a utilization change. That
+is, adding, removing, or migrating utilization (tasks).
+
+Both functions use a struct energy_env to specify the scenario to be evaluated:
+
+	struct energy_env {
+		struct sched_group      *sg_top;
+		struct sched_group      *sg_cap;
+		int                     cap_idx;
+		int                     usage_delta;
+		int                     src_cpu;
+		int                     dst_cpu;
+		int                     energy;
+	};
+
+sg_top: sched_group to be evaluated. Not used by energy_diff().
+
+sg_cap: sched_group covering the cpus in the same frequency domain. Set by
+sched_group_energy().
+
+cap_idx: Capacity state to be used for energy calculations. Set by
+find_new_capacity().
+
+usage_delta: Amount of utilization to be added, removed, or migrated.
+
+src_cpu: Source cpu from where 'usage_delta' utilization is removed. Should be
+-1 if no source (e.g. task wake-up).
+
+dst_cpu: Destination cpu where 'usage_delta' utilization is added. Should be -1
+if utilization is removed (e.g. terminating tasks).
+
+energy: Result of sched_group_energy().
+
+The metric used to represent utilization is the actual per-entity running time
+averaged over time using a geometric series. Very similar to the existing
+per-entity load-tracking, but _not_ scaled by task priority and capped by the
+capacity of the cpu. The latter property does mean that utilization may
+underestimate the compute requirements for task on fully/over utilized cpus.
+The greatest potential for energy savings without affecting performance too much
+is scenarios where the system isn't fully utilized. If the system is deemed
+fully utilized load-balancing should be done with task load (includes task
+priority) instead in the interest of fairness and performance.
+
+
+Background and Terminology
+===========================
+
+To make it clear from the start:
+
+energy = [joule] (resource like a battery on powered devices)
+power = energy/time = [joule/second] = [watt]
+
+The goal of energy-aware scheduling is to minimize energy, while still getting
+the job done. That is, we want to maximize:
+
+	performance [inst/s]
+	--------------------
+	    power [W]
+
+which is equivalent to minimizing:
+
+	energy [J]
+	-----------
+	instruction
+
+while still getting 'good' performance. It is essentially an alternative
+optimization objective to the current performance-only objective for the
+scheduler. This alternative considers two objectives: energy-efficiency and
+performance. Hence, there needs to be a user controllable knob to switch the
+objective. Since it is early days, this is currently a sched_feature
+(ENERGY_AWARE).
+
+The idea behind introducing an energy cost model is to allow the scheduler to
+evaluate the implications of its decisions rather than applying energy-saving
+techniques blindly that may only have positive effects on some platforms. At
+the same time, the energy cost model must be as simple as possible to minimize
+the scheduler latency impact.
+
+Platform topology
+------------------
+
+The system topology (cpus, caches, and NUMA information, not peripherals) is
+represented in the scheduler by the sched_domain hierarchy which has
+sched_groups attached at each level that covers one or more cpus (see
+sched-domains.txt for more details). To add energy awareness to the scheduler
+we need to consider power and frequency domains.
+
+Power domain:
+
+A power domain is a part of the system that can be powered on/off
+independently. Power domains are typically organized in a hierarchy where you
+may be able to power down just a cpu or a group of cpus along with any
+associated resources (e.g.  shared caches). Powering up a cpu means that all
+power domains it is a part of in the hierarchy must be powered up. Hence, it is
+more expensive to power up the first cpu that belongs to a higher level power
+domain than powering up additional cpus in the same high level domain. Two
+level power domain hierarchy example:
+
+		Power source
+		         +-------------------------------+----...
+per group PD		 G                               G
+		         |           +----------+        |
+		    +--------+-------| Shared   |  (other groups)
+per-cpu PD	    G        G       | resource |
+		    |        |       +----------+
+		+-------+ +-------+
+		| CPU 0 | | CPU 1 |
+		+-------+ +-------+
+
+Frequency domain:
+
+Frequency domains (P-states) typically cover the same group of cpus as one of
+the power domain levels. That is, there might be several smaller power domains
+sharing the same frequency (P-state) or there might be a power domain spanning
+multiple frequency domains.
+
+From a scheduling point of view there is no need to know the actual frequencies
+[Hz]. All the scheduler cares about is the compute capacity available at the
+current state (P-state) the cpu is in and any other available states. For that
+reason, and to also factor in any cpu micro-architecture differences, compute
+capacity scaling states are called 'capacity states' in this document. For SMP
+systems this is equivalent to P-states. For mixed micro-architecture systems
+(like ARM big.LITTLE) it is P-states scaled according to the micro-architecture
+performance relative to the other cpus in the system.
+
+Energy modelling:
+------------------
+
+Due to the hierarchical nature of the power domains, the most obvious way to
+model energy costs is therefore to associate power and energy costs with
+domains (groups of cpus). Energy costs of shared resources are associated with
+the group of cpus that share the resources, only the cost of powering the
+cpu itself and any private resources (e.g. private L1 caches) is associated
+with the per-cpu groups (lowest level).
+
+For example, for an SMP system with per-cpu power domains and a cluster level
+(group of cpus) power domain we get the overall energy costs to be:
+
+	energy = energy_cluster + n * energy_cpu
+
+where 'n' is the number of cpus powered up and energy_cluster is the cost paid
+as soon as any cpu in the cluster is powered up.
+
+The power and frequency domains can naturally be mapped onto the existing
+sched_domain hierarchy and sched_groups by adding the necessary data to the
+existing data structures.
+
+The energy model considers energy consumption from two contributors (shown in
+the illustration below):
+
+1. Busy energy: Energy consumed while a cpu and the higher level groups that it
+belongs to are busy running tasks. Busy energy is associated with the state of
+the cpu, not an event. The time the cpu spends in this state varies. Thus, the
+most obvious platform parameter for this contribution is busy power
+(energy/time).
+
+2. Idle energy: Energy consumed while a cpu and higher level groups that it
+belongs to are idle (in a C-state). Like busy energy, idle energy is associated
+with the state of the cpu. Thus, the platform parameter for this contribution
+is idle power (energy/time).
+
+Energy consumed during transitions from an idle-state (C-state) to a busy state
+(P-staet) or going the other way is ignored by the model to simplify the energy
+model calculations.
+
+
+	Power
+	^
+	|            busy->idle             idle->busy
+	|            transition             transition
+	|
+	|                _                      __
+	|               / \                    /  \__________________
+	|______________/   \                  /
+	|                   \                /
+	|  Busy              \    Idle      /        Busy
+	|  low P-state        \____________/         high P-state
+	|
+	+------------------------------------------------------------> time
+
+Busy    |--------------|                          |-----------------|
+
+Wakeup                 |------|            |------|
+
+Idle                          |------------|
+
+
+The basic algorithm
+====================
+
+The basic idea is to determine the total energy impact when utilization is
+added or removed by estimating the impact at each level in the sched_domain
+hierarchy starting from the bottom (sched_group contains just a single cpu).
+The energy cost comes from busy time (sched_group is awake because one or more
+cpus are busy) and idle time (in an idle-state). Energy model numbers account
+for energy costs associated with all cpus in the sched_group as a group.
+
+	for_each_domain(cpu, sd) {
+		sg = sched_group_of(cpu)
+		energy_before = curr_util(sg) * busy_power(sg)
+				+ (1-curr_util(sg)) * idle_power(sg)
+		energy_after = new_util(sg) * busy_power(sg)
+				+ (1-new_util(sg)) * idle_power(sg)
+		energy_diff += energy_before - energy_after
+
+	}
+
+	return energy_diff
+
+{curr, new}_util: The cpu utilization at the lowest level and the overall
+non-idle time for the entire group for higher levels. Utilization is in the
+range 0.0 to 1.0 in the pseudo-code.
+
+busy_power: The power consumption of the sched_group.
+
+idle_power: The power consumption of the sched_group when idle.
+
+Note: It is a fundamental assumption that the utilization is (roughly) scale
+invariant. Task utilization tracking factors in any frequency scaling and
+performance scaling differences due to difference cpu microarchitectures such
+that task utilization can be used across the entire system.
+
+
+Platform energy data
+=====================
+
+struct sched_group_energy can be attached to sched_groups in the sched_domain
+hierarchy and has the following members:
+
+cap_states:
+	List of struct capacity_state representing the supported capacity states
+	(P-states). struct capacity_state has two members: cap and power, which
+	represents the compute capacity and the busy_power of the state. The
+	list must be ordered by capacity low->high.
+
+nr_cap_states:
+	Number of capacity states in cap_states list.
+
+idle_states:
+	List of struct idle_state containing idle_state power cost for each
+	idle-state support by the sched_group. Note that the energy model
+	calculations will use this table to determine idle power even if no idle
+	state is actually entered by cpuidle. That is, if latency constraints
+	prevents that the group enters a coupled state or no idle-states are
+	supported. Hence, the first entry of the list must be the idle power
+	when idle, but no idle state was actually entered ('active idle'). This
+	state may be left out groups with one cpu if the cpu is guaranteed to
+	enter the state when idle.
+
+nr_idle_states:
+	Number of idle states in idle_states list.
+
+nr_idle_states_below:
+	Number of idle-states below current level. Filled by generic code, not
+	to be provided by the platform.
+
+There are no unit requirements for the energy cost data. Data can be normalized
+with any reference, however, the normalization must be consistent across all
+energy cost data. That is, one bogo-joule/watt must be the same quantity for
+data, but we don't care what it is.
+
+A recipe for platform characterization
+=======================================
+
+Obtaining the actual model data for a particular platform requires some way of
+measuring power/energy. There isn't a tool to help with this (yet). This
+section provides a recipe for use as reference. It covers the steps used to
+characterize the ARM TC2 development platform. This sort of measurements is
+expected to be done anyway when tuning cpuidle and cpufreq for a given
+platform.
+
+The energy model needs two types of data (struct sched_group_energy holds
+these) for each sched_group where energy costs should be taken into account:
+
+1. Capacity state information
+
+A list containing the compute capacity and power consumption when fully
+utilized attributed to the group as a whole for each available capacity state.
+At the lowest level (group contains just a single cpu) this is the power of the
+cpu alone without including power consumed by resources shared with other cpus.
+It basically needs to fit the basic modelling approach described in "Background
+and Terminology" section:
+
+	energy_system = energy_shared + n * energy_cpu
+
+for a system containing 'n' busy cpus. Only 'energy_cpu' should be included at
+the lowest level. 'energy_shared' is included at the next level which
+represents the group of cpus among which the resources are shared.
+
+This model is, of course, a simplification of reality. Thus, power/energy
+attributions might not always exactly represent how the hardware is designed.
+Also, busy power is likely to depend on the workload. It is therefore
+recommended to use a representative mix of workloads when characterizing the
+capacity states.
+
+If the group has no capacity scaling support, the list will contain a single
+state where power is the busy power attributed to the group. The capacity
+should be set to a default value (1024).
+
+When frequency domains include multiple power domains, the group representing
+the frequency domain and all child groups share capacity states. This must be
+indicated by setting the SD_SHARE_CAP_STATES sched_domain flag. All groups at
+all levels that share the capacity state must have the list of capacity states
+with the power set to the contribution of the individual group.
+
+2. Idle power information
+
+Stored in the idle_states list. The power number is the group idle power
+consumption in each idle state as well when the group is idle but has not
+entered an idle-state ('active idle' as mentioned earlier). Due to the way the
+energy model is defined, the idle power of the deepest group idle state can
+alternatively be accounted for in the parent group busy power. In that case the
+group idle state power values are offset such that the idle power of the
+deepest state is zero. It is less intuitive, but it is easier to measure as
+idle power consumed by the group and the busy/idle power of the parent group
+cannot be distinguished without per group measurement points.
+
+Measuring capacity states and idle power:
+
+The capacity states' capacity and power can be estimated by running a benchmark
+workload at each available capacity state. By restricting the benchmark to run
+on subsets of cpus it is possible to extrapolate the power consumption of
+shared resources.
+
+ARM TC2 has two clusters of two and three cpus respectively. Each cluster has a
+shared L2 cache. TC2 has on-chip energy counters per cluster. Running a
+benchmark workload on just one cpu in a cluster means that power is consumed in
+the cluster (higher level group) and a single cpu (lowest level group). Adding
+another benchmark task to another cpu increases the power consumption by the
+amount consumed by the additional cpu. Hence, it is possible to extrapolate the
+cluster busy power.
+
+For platforms that don't have energy counters or equivalent instrumentation
+built-in, it may be possible to use an external DAQ to acquire similar data.
+
+If the benchmark includes some performance score (for example sysbench cpu
+benchmark), this can be used to record the compute capacity.
+
+Measuring idle power requires insight into the idle state implementation on the
+particular platform. Specifically, if the platform has coupled idle-states (or
+package states). To measure non-coupled per-cpu idle-states it is necessary to
+keep one cpu busy to keep any shared resources alive to isolate the idle power
+of the cpu from idle/busy power of the shared resources. The cpu can be tricked
+into different per-cpu idle states by disabling the other states. Based on
+various combinations of measurements with specific cpus busy and disabling
+idle-states it is possible to extrapolate the idle-state power.
-- 
1.9.1


  parent reply	other threads:[~2015-05-12 19:44 UTC|newest]

Thread overview: 56+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2015-05-12 19:38 [RFCv4 PATCH 00/34] sched: Energy cost model for energy-aware scheduling Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 01/34] arm: Frequency invariant scheduler load-tracking support Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 02/34] sched: Make load tracking frequency scale-invariant Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 03/34] arm: vexpress: Add CPU clock-frequencies to TC2 device-tree Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 04/34] sched: Convert arch_scale_cpu_capacity() from weak function to #define Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 05/34] arm: Update arch_scale_cpu_capacity() to reflect change to define Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 06/34] sched: Make usage tracking cpu scale-invariant Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 07/34] arm: Cpu invariant scheduler load-tracking support Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 08/34] sched: Get rid of scaling usage by cpu_capacity_orig Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 09/34] sched: Track blocked utilization contributions Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 10/34] sched: Include blocked utilization in usage tracking Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 11/34] sched: Remove blocked load and utilization contributions of dying tasks Morten Rasmussen
2015-05-13  0:33   ` Sai Gurrappadi
2015-05-13  0:33     ` Sai Gurrappadi
2015-05-13 13:49     ` Morten Rasmussen
2015-05-19 14:22       ` Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 12/34] sched: Initialize CFS task load and usage before placing task on rq Morten Rasmussen
2015-05-12 19:38 ` Morten Rasmussen [this message]
2015-05-20  4:04   ` [RFCv4 PATCH 13/34] sched: Documentation for scheduler energy cost model Kamalesh Babulal
2015-05-20  9:27     ` Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 14/34] sched: Make energy awareness a sched feature Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 15/34] sched: Introduce energy data structures Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 16/34] sched: Allocate and initialize " Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 17/34] sched: Introduce SD_SHARE_CAP_STATES sched_domain flag Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 18/34] arm: topology: Define TC2 energy and provide it to the scheduler Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 19/34] sched: Compute cpu capacity available at current frequency Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 20/34] sched: Relocated get_cpu_usage() and change return type Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 21/34] sched: Highest energy aware balancing sched_domain level pointer Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 22/34] sched: Calculate energy consumption of sched_group Morten Rasmussen
2015-05-21  7:57   ` Kamalesh Babulal
2015-05-22 15:38     ` Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 23/34] sched: Extend sched_group_energy to test load-balancing decisions Morten Rasmussen
2015-05-12 19:38 ` [RFCv4 PATCH 24/34] sched: Estimate energy impact of scheduling decisions Morten Rasmussen
2015-05-12 19:39 ` [RFCv4 PATCH 25/34] sched: Add over-utilization/tipping point indicator Morten Rasmussen
2015-05-22 19:48   ` [PATCH] sched: Fix compiler errors for NO_SMP machines Abel Vesa
2015-05-23 14:52     ` Ingo Molnar
2015-05-23 19:22       ` Abel Vesa
2015-06-30  9:35   ` [RFCv4 PATCH 25/34] sched: Add over-utilization/tipping point indicator pang.xunlei
2015-06-30  9:35     ` pang.xunlei
2015-05-12 19:39 ` [RFCv4 PATCH 26/34] sched: Store system-wide maximum cpu capacity in root domain Morten Rasmussen
2015-05-12 19:39 ` [RFCv4 PATCH 27/34] sched, cpuidle: Track cpuidle state index in the scheduler Morten Rasmussen
2015-05-12 19:39 ` [RFCv4 PATCH 28/34] sched: Count number of shallower idle-states in struct sched_group_energy Morten Rasmussen
2015-05-12 19:39 ` [RFCv4 PATCH 29/34] sched: Determine the current sched_group idle-state Morten Rasmussen
2015-05-12 19:39 ` [RFCv4 PATCH 30/34] sched: Add cpu capacity awareness to wakeup balancing Morten Rasmussen
2015-05-12 19:39 ` [RFCv4 PATCH 31/34] sched: Energy-aware wake-up task placement Morten Rasmussen
2015-05-14 14:03   ` Dietmar Eggemann
     [not found]   ` <OF168B7415.9556008C-ON48257E45.003388D7-48257E45.00349D8D@zte.com.cn>
2015-05-14 15:10     ` Morten Rasmussen
2015-05-12 19:39 ` [RFCv4 PATCH 32/34] sched: Consider a not over-utilized energy-aware system as balanced Morten Rasmussen
2015-05-12 19:39 ` [RFCv4 PATCH 33/34] sched: Enable idle balance to pull single task towards cpu with higher capacity Morten Rasmussen
2015-05-12 19:39 ` [RFCv4 PATCH 34/34] sched: Disable energy-unfriendly nohz kicks Morten Rasmussen
2015-05-12 22:07 ` [RFCv4 PATCH 00/34] sched: Energy cost model for energy-aware scheduling Sai Gurrappadi
2015-05-12 22:07   ` Sai Gurrappadi
2015-05-13 13:47   ` Morten Rasmussen
2015-06-28 20:26     ` Abel Vesa
2015-06-29  9:06       ` pang.xunlei
2015-06-29 10:19         ` Dietmar Eggemann

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=1431459549-18343-14-git-send-email-morten.rasmussen@arm.com \
    --to=morten.rasmussen@arm.com \
    --cc=Dietmar.Eggemann@arm.com \
    --cc=Juri.Lelli@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-pm@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=mturquette@linaro.org \
    --cc=pang.xunlei@zte.com.cn \
    --cc=peterz@infradead.org \
    --cc=preeti@linux.vnet.ibm.com \
    --cc=rjw@rjwysocki.net \
    --cc=sgurrappadi@nvidia.com \
    --cc=vincent.guittot@linaro.org \
    --cc=yuyang.du@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.