linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH v2 0/6] Energy Aware Scheduling
@ 2018-04-06 15:36 Dietmar Eggemann
  2018-04-06 15:36 ` [RFC PATCH v2 1/6] sched/fair: Create util_fits_capacity() Dietmar Eggemann
                   ` (6 more replies)
  0 siblings, 7 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2018-04-06 15:36 UTC (permalink / raw)
  To: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath
  Cc: linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

1. Overview

The Energy Aware Scheduler (EAS) based on Morten Rasmussen's posting on
LKML [1] is currently part of the AOSP Common Kernel and runs on
today's smartphones with Arm's big.LITTLE CPUs.
Based on the experience gained over the last two and a half years in
product development, we propose an energy-model-based task placement
for CPUs with asymmetric core capacities (e.g. Arm big.LITTLE or
DynamIQ), to align with the EAS adopted by the AOSP Common Kernel. We
have developed a simplified energy model, based on the physical
active power/performance curve of each core type using existing
SoC power/performance data already known to the kernel. The energy
model is used to select the most energy-efficient CPU to place each
task, taking utilization into account.

1.1 Energy Model

A CPU with asymmetric core capacities features cores with significantly
different energy and performance characteristics. As the configurations
can vary greatly from one SoC to another, designing an energy-efficient
scheduling heuristic that performs well on a broad spectrum of platforms
appears to be particularly hard.
This proposal attempts to solve this issue by providing the scheduler
with an energy model of the platform which enables energy impact
estimation of scheduling decisions in a generic way. The energy model is
kept very simple as it represents only the active power of CPUs at all
available P-states and relies on existing data in the kernel (only used
by the thermal subsystem so far).
This proposal does not include the power consumption of C-states and
cluster-level resources which were originally introduced in [1] since,
firstly, their impact on task placement decisions appears to be
negligible on modern asymmetric platforms and, secondly, they require
additional infrastructure and data (e.g. new DT entries).
The scheduler is also informed of the span of frequency domains, hence
enabling an accurate accounting of the energy costs of frequency
changes. This appears to be especially important for future Arm CPU
topologies (DynamIQ) where the span of scheduling domains can be
different from the span of frequency domains.
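
Concretely, the model introduced in patch 2/6 boils down to one table of
(capacity, power) pairs per frequency domain, one entry per OPP. The
snippet below is purely illustrative, with made-up numbers for a
hypothetical little CPU (capacity on the usual 1024 scale); the real
tables are built at runtime from data provided by the OPP library:

	/* Illustration only -- real values come from the OPP library. */
	static struct capacity_state example_little_cs[] = {
		{ .cap = 133, .power =  43 },	/*  533 MHz */
		{ .cap = 250, .power = 105 },	/* 1000 MHz */
		{ .cap = 350, .power = 185 },	/* 1400 MHz */
		{ .cap = 446, .power = 292 },	/* 1800 MHz */
	};

	static struct sched_energy_model example_little_em = {
		.nr_cap_states	= ARRAY_SIZE(example_little_cs),
		.cap_states	= example_little_cs,
	};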

1.2 Overutilization/Tipping Point

The primary job of the task scheduler is to deliver the highest
possible throughput with minimal latency. With increasing utilization,
the opportunities for the scheduler to save energy become rarer. There
must be spare CPU time available to place tasks based on utilization in
an energy-aware fashion, i.e. to pack tasks on energy-efficient CPUs
without unnecessarily constraining task throughput. This spare CPU time
decreases towards zero as the utilization of the system rises.
To cope with this situation, we introduce the concept of overutilization
in order to enable/disable EAS depending on system utilization.
The point at which a system switches from being not overutilized to
being overutilized (or vice versa) is called the tipping point. A per
sched domain tipping point indicator implementation is introduced here.
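
In code terms (see patches 1/6 and 3/6), a CPU is considered
overutilized once its utilization no longer fits within roughly 80% of
its capacity, as expressed by capacity_margin:

	/* Sketch of the checks added in patches 1/6 and 3/6. */
	static inline int util_fits_capacity(unsigned long util, unsigned long capacity)
	{
		/* capacity_margin is 1280, i.e. util must stay below ~80% of capacity. */
		return capacity * 1024 > util * capacity_margin;
	}

	static inline int cpu_overutilized(int cpu)
	{
		return !util_fits_capacity(cpu_util(cpu), capacity_of(cpu));
	}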

1.3 Wakeup path

On a system which has an energy model, the energy-aware wakeup path
takes precedence over the affine and capacity-based wakeup paths,
provided that the lowest sched domain of the task's previous CPU is not
overutilized. The energy-aware algorithm tries to find a new target CPU
among the CPUs of the highest non-overutilized domain which includes
both the previous and the current CPU, choosing the CPU on which placing
the task adds the least energy consumption. The energy model is only
enabled on CPUs with asymmetric core capacities (SD_ASYM_CPUCAPACITY).
Such systems typically have 8 or fewer cores.
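
A simplified sketch of the selection logic implemented by
find_energy_efficient_cpu() in patch 5/6 is shown below.
find_max_spare_cap_cpu() is a hypothetical helper standing in for the
inner loop which picks, in each frequency domain, the allowed CPU with
the maximum spare capacity that still fits the task:

	prev_energy = best_energy = compute_energy(p, prev_cpu);
	best_energy_cpu = prev_cpu;

	for_each_freq_domain(fd) {
		cpu = find_max_spare_cap_cpu(fd, p);	/* hypothetical helper */
		if (cpu < 0)
			continue;

		cur_energy = compute_energy(p, cpu);
		if (cur_energy < best_energy) {
			best_energy = cur_energy;
			best_energy_cpu = cpu;
		}
	}

	/* Migrate away from prev_cpu only if it saves at least ~1.5% energy. */
	if ((prev_energy - best_energy) > (prev_energy >> 6))
		return best_energy_cpu;

	return prev_cpu;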

2. Tests

Two fundamentally different tests were executed. Firstly, the energy
test case shows the impact of this patch-set on energy consumption,
using a synthetic set of tasks. Secondly, the performance test case
provides the conventional hackbench metric numbers.

The tests run on two arm64 big.LITTLE platforms: Hikey960 (4xA73 +
4xA53) and Juno r0 (2xA57 + 4xA53).

Base kernel is tip/sched/core (4.16-rc6), with some Hikey960 and
Juno specific patches, the SD_ASYM_CPUCAPACITY flag set at DIE sched
domain level for arm64 and schedutil as cpufreq governor [2].

2.1 Energy test case

10 iterations of between 10 and 50 periodic rt-app tasks (16ms period,
5% duty-cycle) for 30 seconds with energy measurement. Unit is Joules.
The goal is to save energy, so lower is better.

2.1.1 Hikey960

Energy is measured with an ACME Cape on an instrumented board. Numbers
include consumption of big and little CPUs, LPDDR memory, GPU and most
of the other small components on the board. They do not include
consumption of the radio chip (turned-off anyway) and external
connectors.

+----------+-----------------+-------------------------+
|          | Without patches | With patches            |
+----------+--------+--------+------------------+------+ 
| Tasks nb |  Mean  | RSD*   | Mean             | RSD* |
+----------+--------+--------+------------------+------+
|       10 |  41.14 |  1.4%  |  36.51 (-11.25%) | 1.6% |
|       20 |  55.95 |  0.8%  |  50.14 (-10.38%) | 1.9% |
|       30 |  74.37 |  0.2%  |  72.89 ( -1.99%) | 5.3% |
|       40 |  94.12 |  0.7%  |  87.78 ( -6.74%) | 4.5% |
|       50 | 117.88 |  0.2%  | 111.66 ( -5.28%) | 0.9% |
+----------+--------+--------+------------------+------+

2.1.2 Juno r0

Energy is measured with the onboard energy meter. Numbers include
consumption of big and little CPUs.

+----------+-----------------+-------------------------+
|          | Without patches | With patches            |
+----------+--------+--------+------------------+------+
| Tasks nb | Mean   | RSD*   | Mean             | RSD* |
+----------+--------+--------+------------------+------+
|       10 |  11.25 |  3.1%  |   7.07 (-37.16%) | 2.1% |
|       20 |  19.18 |  1.1%  |  12.75 (-33.52%) | 2.2% |
|       30 |  28.81 |  1.9%  |  21.29 (-26.10%) | 1.5% |
|       40 |  36.83 |  1.2%  |  30.72 (-16.59%) | 0.6% |
|       50 |  46.41 |  0.6%  |  46.02 ( -0.01%) | 0.5% |
+----------+--------+--------+------------------+------+

2.2 Performance test case

30 iterations of perf bench sched messaging --pipe --thread --group G
--loop L with G=[1 2 4 8] and L=50000 (Hikey960)/16000 (Juno r0).

2.2.1 Hikey960

The impact of thermal capping was mitigated thanks to a heatsink, a
fan, and a 10 sec delay between two successive executions.

+----------------+-----------------+-------------------------+
|                | Without patches | With patches            |
+--------+-------+---------+-------+-----------------+-------+
| Groups | Tasks | Mean    | RSD*  | Mean            | RSD*  |
+--------+-------+---------+-------+-----------------+-------+
|      1 |    40 |    8.58 | 0.81% | 10.34 (+21.24%) | 4.35% |
|      2 |    80 |   15.33 | 0.79% | 15.56 (+1.51%)  | 1.04% |
|      4 |   160 |   31.75 | 0.52% | 31.85 (+0.29%)  | 0.54% |
|      8 |   320 |   67.00 | 0.36% | 66.79 (-0.30%)  | 0.43% |
+--------+-------+---------+-------+-----------------+-------+

2.2.2 Juno r0

+----------------+-----------------+-------------------------+
|                | Without patches | With patches            |
+--------+-------+---------+-------+-----------------+-------+
| Groups | Tasks | Mean    | RSD*  | Mean            | RSD*  |
+--------+-------+---------+-------+-----------------+-------+
|      1 |    40 |    8.44 | 0.12% |  8.39 (-0.01%)  | 0.10% |
|      2 |    80 |   14.65 | 0.11% | 14.73 ( 0.01%)  | 0.12% |
|      4 |   160 |   27.34 | 0.14% | 27.47 ( 0.00%)  | 0.14% |
|      8 |   320 |   53.88 | 0.25% | 54.34 ( 0.01%)  | 0.30% |
+--------+-------+---------+-------+-----------------+-------+

*RSD: Relative Standard Deviation (std dev / mean)

3. Dependencies

This series depends on additional infrastructure being merged in the
OPP core. As this infrastructure can also be useful for other clients,
the related patches have been posted separately [3].

4. Changes between versions

Changes v1[4]->v2:

- Reworked interface between fair.c and energy.[ch] (Remove #ifdef
  CONFIG_PM_OPP from energy.c) (Greg KH)
- Fixed licence & header issue in energy.[ch] (Greg KH)
- Reordered EAS path in select_task_rq_fair() (Joel)
- Avoid prev_cpu if not allowed in select_task_rq_fair() (Morten/Joel)
- Refactored compute_energy() (Patrick)
- Account for RT/IRQ pressure in task_fits() (Patrick)
- Use UTIL_EST and DL utilization during OPP estimation (Patrick/Juri)
- Optimize selection of CPU candidates in the energy-aware wake-up path
- Rebased on top of tip/sched/core (commit b720342849fe “sched/core:
  Update preempt_notifier_key to modern API”)

[1] https://lkml.org/lkml/2015/7/7/754
[2] http://www.linux-arm.org/git?p=linux-de.git;a=shortlog;h=refs/heads/upstream/eas_v2_base
[3] https://marc.info/?l=linux-pm&m=151635516419249&w=2
[4] https://marc.info/?l=linux-pm&m=152153905805048&w=2

Dietmar Eggemann (1):
  sched/fair: Create util_fits_capacity()

Quentin Perret (4):
  sched: Introduce energy models of CPUs
  sched/fair: Introduce an energy estimation helper function
  sched/fair: Select an energy-efficient CPU on task wake-up
  drivers: base: arch_topology.c: Enable EAS for arm/arm64 platforms

Thara Gopinath (1):
  sched: Add over-utilization/tipping point indicator

 drivers/base/arch_topology.c   |   2 +
 include/linux/sched/energy.h   |  69 ++++++++++++
 include/linux/sched/topology.h |   1 +
 kernel/sched/Makefile          |   3 +
 kernel/sched/energy.c          | 184 ++++++++++++++++++++++++++++++++
 kernel/sched/fair.c            | 234 +++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h           |   3 +-
 kernel/sched/topology.c        |  12 +--
 8 files changed, 491 insertions(+), 17 deletions(-)
 create mode 100644 include/linux/sched/energy.h
 create mode 100644 kernel/sched/energy.c

-- 
2.11.0

^ permalink raw reply	[flat|nested] 44+ messages in thread

* [RFC PATCH v2 1/6] sched/fair: Create util_fits_capacity()
  2018-04-06 15:36 [RFC PATCH v2 0/6] Energy Aware Scheduling Dietmar Eggemann
@ 2018-04-06 15:36 ` Dietmar Eggemann
  2018-04-12  7:02   ` Viresh Kumar
  2018-04-06 15:36 ` [RFC PATCH v2 2/6] sched: Introduce energy models of CPUs Dietmar Eggemann
                   ` (5 subsequent siblings)
  6 siblings, 1 reply; 44+ messages in thread
From: Dietmar Eggemann @ 2018-04-06 15:36 UTC (permalink / raw)
  To: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath
  Cc: linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

The check whether a given utilization fits into a given capacity is
factored out into a separate function.

Currently it is only used in wake_cap() but it will be re-used to figure
out whether a cpu or a scheduler group is over-utilized.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0951d1c58d2f..0a76ad2ef022 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6574,6 +6574,11 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
 	return min_t(unsigned long, util, capacity_orig_of(cpu));
 }
 
+static inline int util_fits_capacity(unsigned long util, unsigned long capacity)
+{
+	return capacity * 1024 > util * capacity_margin;
+}
+
 /*
  * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
  * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
@@ -6595,7 +6600,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	/* Bring task utilization in sync with prev_cpu */
 	sync_entity_load_avg(&p->se);
 
-	return min_cap * 1024 < task_util(p) * capacity_margin;
+	return !util_fits_capacity(task_util(p), min_cap);
 }
 
 /*
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH v2 2/6] sched: Introduce energy models of CPUs
  2018-04-06 15:36 [RFC PATCH v2 0/6] Energy Aware Scheduling Dietmar Eggemann
  2018-04-06 15:36 ` [RFC PATCH v2 1/6] sched/fair: Create util_fits_capacity() Dietmar Eggemann
@ 2018-04-06 15:36 ` Dietmar Eggemann
  2018-04-10 11:54   ` Peter Zijlstra
  2018-04-13  4:02   ` Viresh Kumar
  2018-04-06 15:36 ` [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator Dietmar Eggemann
                   ` (4 subsequent siblings)
  6 siblings, 2 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2018-04-06 15:36 UTC (permalink / raw)
  To: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath
  Cc: linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

From: Quentin Perret <quentin.perret@arm.com>

The energy consumption of each CPU in the system is modeled with a list
of values representing its dissipated power and compute capacity at each
available Operating Performance Point (OPP). These values are derived
from existing information in the kernel (currently used by the thermal
subsystem) and don't require the introduction of new platform-specific
tunables. The energy model is also provided with a simple representation
of all frequency domains as cpumasks, hence enabling the scheduler to be
aware of dependencies between CPUs. The data required to build the energy
model is provided by the OPP library, which gives the scheduler an
abstract view of the platform. The new data structures holding these
models and the routines to populate them are stored in
kernel/sched/energy.c.

For the sake of simplicity, it is assumed in the energy model that all
CPUs in a frequency domain share the same micro-architecture. As long as
this assumption is correct, the energy models of different CPUs belonging
to the same frequency domain are equal. Hence, this commit builds only one
energy model per frequency domain, and links all relevant CPUs to it in
order to save time and memory. If needed for future hardware platforms,
relaxing this assumption should imply relatively simple modifications in
the code but a significantly higher algorithmic complexity.

As it appears that energy-aware scheduling really makes a difference on
heterogeneous systems (e.g. big.LITTLE platforms), it is restricted to
systems having:

   1. SD_ASYM_CPUCAPACITY flag set
   2. Dynamic Voltage and Frequency Scaling (DVFS) is enabled
   3. Available power estimates for the OPPs of all possible CPUs

Moreover, the scheduler is notified of the energy model availability
using a static key in order to minimize the overhead on non-energy-aware
systems.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

---
This patch depends on additional infrastructure being merged in the OPP
core. As this infrastructure can also be useful for other clients, the
related patches have been posted separately [1].

[1] https://marc.info/?l=linux-pm&m=151635516419249&w=2
---
 include/linux/sched/energy.h |  49 ++++++++++++
 kernel/sched/Makefile        |   3 +
 kernel/sched/energy.c        | 184 +++++++++++++++++++++++++++++++++++++++++++
 3 files changed, 236 insertions(+)
 create mode 100644 include/linux/sched/energy.h
 create mode 100644 kernel/sched/energy.c

diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
new file mode 100644
index 000000000000..941071eec013
--- /dev/null
+++ b/include/linux/sched/energy.h
@@ -0,0 +1,49 @@
+// SPDX-License-Identifier: GPL-2.0
+#ifndef _LINUX_SCHED_ENERGY_H
+#define _LINUX_SCHED_ENERGY_H
+
+struct capacity_state {
+	unsigned long cap;	/* compute capacity */
+	unsigned long power;	/* power consumption at this compute capacity */
+};
+
+struct sched_energy_model {
+	int nr_cap_states;
+	struct capacity_state *cap_states;
+};
+
+struct freq_domain {
+	struct list_head next;
+	cpumask_t span;
+};
+
+#if defined(CONFIG_SMP) && defined(CONFIG_PM_OPP)
+extern struct sched_energy_model ** __percpu energy_model;
+extern struct static_key_false sched_energy_present;
+extern struct list_head sched_freq_domains;
+
+static inline bool sched_energy_enabled(void)
+{
+	return static_branch_unlikely(&sched_energy_present);
+}
+
+static inline struct cpumask *freq_domain_span(struct freq_domain *fd)
+{
+	return &fd->span;
+}
+
+extern void init_sched_energy(void);
+
+#define for_each_freq_domain(fdom) \
+	list_for_each_entry(fdom, &sched_freq_domains, next)
+
+#else
+struct freq_domain;
+static inline bool sched_energy_enabled(void) { return false; }
+static inline struct cpumask
+*freq_domain_span(struct freq_domain *fd) { return NULL; }
+static inline void init_sched_energy(void) { }
+#define for_each_freq_domain(fdom) for (; fdom; fdom = NULL)
+#endif
+
+#endif /* _LINUX_SCHED_ENERGY_H */
diff --git a/kernel/sched/Makefile b/kernel/sched/Makefile
index d9a02b318108..15fb3dfd7064 100644
--- a/kernel/sched/Makefile
+++ b/kernel/sched/Makefile
@@ -29,3 +29,6 @@ obj-$(CONFIG_CPU_FREQ) += cpufreq.o
 obj-$(CONFIG_CPU_FREQ_GOV_SCHEDUTIL) += cpufreq_schedutil.o
 obj-$(CONFIG_MEMBARRIER) += membarrier.o
 obj-$(CONFIG_CPU_ISOLATION) += isolation.o
+ifeq ($(CONFIG_PM_OPP),y)
+       obj-$(CONFIG_SMP) += energy.o
+endif
diff --git a/kernel/sched/energy.c b/kernel/sched/energy.c
new file mode 100644
index 000000000000..704bea6e1cad
--- /dev/null
+++ b/kernel/sched/energy.c
@@ -0,0 +1,184 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Energy-aware scheduling models
+ *
+ * Copyright (C) 2018, Arm Ltd.
+ * Written by: Quentin Perret, Arm Ltd.
+ */
+
+#define pr_fmt(fmt) "sched-energy: " fmt
+
+#include <linux/sched/topology.h>
+#include <linux/sched/energy.h>
+#include <linux/pm_opp.h>
+
+#include "sched.h"
+
+DEFINE_STATIC_KEY_FALSE(sched_energy_present);
+struct sched_energy_model ** __percpu energy_model;
+
+/*
+ * A copy of the cpumasks representing the frequency domains is kept private
+ * to the scheduler. They are stacked in a dynamically allocated linked list
+ * as we don't know how many frequency domains the system has.
+ */
+LIST_HEAD(sched_freq_domains);
+
+static struct sched_energy_model *build_energy_model(int cpu)
+{
+	unsigned long cap_scale = arch_scale_cpu_capacity(NULL, cpu);
+	unsigned long cap, freq, power, max_freq = ULONG_MAX;
+	unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
+	struct sched_energy_model *em = NULL;
+	struct device *cpu_dev;
+	struct dev_pm_opp *opp;
+	int opp_cnt, i;
+
+	cpu_dev = get_cpu_device(cpu);
+	if (!cpu_dev) {
+		pr_err("CPU%d: Failed to get device\n", cpu);
+		return NULL;
+	}
+
+	opp_cnt = dev_pm_opp_get_opp_count(cpu_dev);
+	if (opp_cnt <= 0) {
+		pr_err("CPU%d: Failed to get # of available OPPs.\n", cpu);
+		return NULL;
+	}
+
+	opp = dev_pm_opp_find_freq_floor(cpu_dev, &max_freq);
+	if (IS_ERR(opp)) {
+		pr_err("CPU%d: Failed to get max frequency.\n", cpu);
+		return NULL;
+	}
+
+	dev_pm_opp_put(opp);
+	if (!max_freq) {
+		pr_err("CPU%d: Found null max frequency.\n", cpu);
+		return NULL;
+	}
+
+	em = kzalloc(sizeof(*em), GFP_KERNEL);
+	if (!em)
+		return NULL;
+
+	em->cap_states = kcalloc(opp_cnt, sizeof(*em->cap_states), GFP_KERNEL);
+	if (!em->cap_states)
+		goto free_em;
+
+	for (i = 0, freq = 0; i < opp_cnt; i++, freq++) {
+		opp = dev_pm_opp_find_freq_ceil(cpu_dev, &freq);
+		if (IS_ERR(opp)) {
+			pr_err("CPU%d: Failed to get OPP %d.\n", cpu, i+1);
+			goto free_cs;
+		}
+
+		power = dev_pm_opp_get_power(opp);
+		dev_pm_opp_put(opp);
+		if (!power || !freq)
+			goto free_cs;
+
+		cap = freq * cap_scale / max_freq;
+		em->cap_states[i].power = power;
+		em->cap_states[i].cap = cap;
+
+		/*
+		 * The capacity/watts efficiency ratio should decrease as the
+		 * frequency grows on sane platforms. If not, warn the user
+		 * that some high OPPs are more power efficient than some
+		 * of the lower ones.
+		 */
+		opp_eff = (cap << 20) / power;
+		if (opp_eff >= prev_opp_eff)
+			pr_warn("CPU%d: cap/pwr: OPP%d > OPP%d\n", cpu, i, i-1);
+		prev_opp_eff = opp_eff;
+	}
+
+	em->nr_cap_states = opp_cnt;
+	return em;
+
+free_cs:
+	kfree(em->cap_states);
+free_em:
+	kfree(em);
+	return NULL;
+}
+
+static void free_energy_model(void)
+{
+	struct sched_energy_model *em;
+	struct freq_domain *tmp, *pos;
+	int cpu;
+
+	list_for_each_entry_safe(pos, tmp, &sched_freq_domains, next) {
+		cpu = cpumask_first(&(pos->span));
+		em = *per_cpu_ptr(energy_model, cpu);
+		if (em) {
+			kfree(em->cap_states);
+			kfree(em);
+		}
+
+		list_del(&(pos->next));
+		kfree(pos);
+	}
+
+	free_percpu(energy_model);
+}
+
+void init_sched_energy(void)
+{
+	struct freq_domain *fdom;
+	struct sched_energy_model *em;
+	struct sched_domain *sd;
+	struct device *cpu_dev;
+	int cpu, ret, fdom_cpu;
+
+	/* Energy Aware Scheduling is used for asymmetric systems only. */
+	rcu_read_lock();
+	sd = lowest_flag_domain(smp_processor_id(), SD_ASYM_CPUCAPACITY);
+	rcu_read_unlock();
+	if (!sd)
+		return;
+
+	energy_model = alloc_percpu(struct sched_energy_model *);
+	if (!energy_model)
+		goto exit_fail;
+
+	for_each_possible_cpu(cpu) {
+		if (*per_cpu_ptr(energy_model, cpu))
+			continue;
+
+		/* Keep a copy of the sharing_cpus mask */
+		fdom = kzalloc(sizeof(struct freq_domain), GFP_KERNEL);
+		if (!fdom)
+			goto free_em;
+
+		cpu_dev = get_cpu_device(cpu);
+		ret = dev_pm_opp_get_sharing_cpus(cpu_dev, &(fdom->span));
+		if (ret)
+			goto free_em;
+		list_add(&(fdom->next), &sched_freq_domains);
+
+		/*
+		 * Build the energy model of one CPU, and link it to all CPUs
+		 * in its frequency domain. This should be correct as long as
+		 * they share the same micro-architecture.
+		 */
+		fdom_cpu = cpumask_first(&(fdom->span));
+		em = build_energy_model(fdom_cpu);
+		if (!em)
+			goto free_em;
+
+		for_each_cpu(fdom_cpu, &(fdom->span))
+			*per_cpu_ptr(energy_model, fdom_cpu) = em;
+	}
+
+	static_branch_enable(&sched_energy_present);
+
+	pr_info("Energy Aware Scheduling started.\n");
+	return;
+free_em:
+	free_energy_model();
+exit_fail:
+	pr_err("Energy Aware Scheduling initialization failed.\n");
+}
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator
  2018-04-06 15:36 [RFC PATCH v2 0/6] Energy Aware Scheduling Dietmar Eggemann
  2018-04-06 15:36 ` [RFC PATCH v2 1/6] sched/fair: Create util_fits_capacity() Dietmar Eggemann
  2018-04-06 15:36 ` [RFC PATCH v2 2/6] sched: Introduce energy models of CPUs Dietmar Eggemann
@ 2018-04-06 15:36 ` Dietmar Eggemann
  2018-04-13 23:56   ` Joel Fernandes
  2018-04-17 14:25   ` Leo Yan
  2018-04-06 15:36 ` [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function Dietmar Eggemann
                   ` (3 subsequent siblings)
  6 siblings, 2 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2018-04-06 15:36 UTC (permalink / raw)
  To: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath
  Cc: linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

From: Thara Gopinath <thara.gopinath@linaro.org>

Energy-aware scheduling should only operate when the system is not
overutilized. There must be cpu time available to place tasks based on
utilization in an energy-aware fashion, i.e. to pack tasks on
energy-efficient cpus without harming the overall throughput.

In case the system operates above this tipping point the tasks have to
be placed based on task and cpu load in the classical way of spreading
tasks across as many cpus as possible.

The point at which a system switches from being not overutilized to
being overutilized is called the tipping point.

Such a tipping point indicator, with a sched domain as the system
boundary, is introduced here. As soon as one cpu of a sched domain is
overutilized the whole sched domain is declared overutilized as well.
A cpu becomes overutilized when its utilization is higher than 80%
(capacity_margin) of its capacity.

The implementation takes advantage of the sched domain shared data
structure, which is shared across all per-cpu views of a sched domain
level. The new overutilized flag is placed in this shared structure.

Load balancing is skipped in case the energy model is present and the
sched domain is not overutilized because under this condition the
predominantly load-per-capacity driven load-balancer should not
interfere with the energy-aware wakeup placement based on utilization.

In case the total utilization of a sched domain is greater than the
total sched domain capacity, the overutilized flag is also set at the
parent sched domain level to let other sched groups help to reduce the
overutilization of cpus.

Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 include/linux/sched/topology.h |  1 +
 kernel/sched/fair.c            | 62 ++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h           |  1 +
 kernel/sched/topology.c        | 12 +++-----
 4 files changed, 65 insertions(+), 11 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..dd001c232646 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -72,6 +72,7 @@ struct sched_domain_shared {
 	atomic_t	ref;
 	atomic_t	nr_busy_cpus;
 	int		has_idle_cores;
+	int		overutilized;
 };
 
 struct sched_domain {
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0a76ad2ef022..6960e5ef3c14 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+#ifdef CONFIG_SMP
+static inline int cpu_overutilized(int cpu);
+
+static inline int sd_overutilized(struct sched_domain *sd)
+{
+	return READ_ONCE(sd->shared->overutilized);
+}
+
+static inline void update_overutilized_status(struct rq *rq)
+{
+	struct sched_domain *sd;
+
+	rcu_read_lock();
+	sd = rcu_dereference(rq->sd);
+	if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
+		WRITE_ONCE(sd->shared->overutilized, 1);
+	rcu_read_unlock();
+}
+#else
+static inline void update_overutilized_status(struct rq *rq) {}
+#endif /* CONFIG_SMP */
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_group(se);
 	}
 
-	if (!se)
+	if (!se) {
 		add_nr_running(rq, 1);
+		update_overutilized_status(rq);
+	}
 
 	util_est_enqueue(&rq->cfs, p);
 	hrtick_update(rq);
@@ -6579,6 +6603,11 @@ static inline int util_fits_capacity(unsigned long util, unsigned long capacity)
 	return capacity * 1024 > util * capacity_margin;
 }
 
+static inline int cpu_overutilized(int cpu)
+{
+	return !util_fits_capacity(cpu_util(cpu), capacity_of(cpu));
+}
+
 /*
  * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
  * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
@@ -7817,6 +7846,7 @@ struct sd_lb_stats {
 	unsigned long total_running;
 	unsigned long total_load;	/* Total load of all groups in sd */
 	unsigned long total_capacity;	/* Total capacity of all groups in sd */
+	unsigned long total_util;	/* Total util of all groups in sd */
 	unsigned long avg_load;	/* Average load across all groups in sd */
 
 	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
@@ -7837,6 +7867,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
 		.total_running = 0UL,
 		.total_load = 0UL,
 		.total_capacity = 0UL,
+		.total_util = 0UL,
 		.busiest_stat = {
 			.avg_load = 0UL,
 			.sum_nr_running = 0,
@@ -8133,11 +8164,12 @@ static bool update_nohz_stats(struct rq *rq, bool force)
  * @local_group: Does group contain this_cpu.
  * @sgs: variable to hold the statistics for this group.
  * @overload: Indicate more than one runnable task for any CPU.
+ * @overutilized: Indicate overutilization for any CPU.
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
 			struct sched_group *group, int load_idx,
 			int local_group, struct sg_lb_stats *sgs,
-			bool *overload)
+			bool *overload, int *overutilized)
 {
 	unsigned long load;
 	int i, nr_running;
@@ -8174,6 +8206,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		 */
 		if (!nr_running && idle_cpu(i))
 			sgs->idle_cpus++;
+
+		if (cpu_overutilized(i))
+			*overutilized = 1;
 	}
 
 	/* Adjust by relative CPU capacity of the group */
@@ -8301,6 +8336,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sg_lb_stats tmp_sgs;
 	int load_idx, prefer_sibling = 0;
 	bool overload = false;
+	int overutilized = 0;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
@@ -8327,7 +8363,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		}
 
 		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
-						&overload);
+						&overload, &overutilized);
 
 		if (local_group)
 			goto next_group;
@@ -8359,6 +8395,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		sds->total_running += sgs->sum_nr_running;
 		sds->total_load += sgs->group_load;
 		sds->total_capacity += sgs->group_capacity;
+		sds->total_util += sgs->group_util;
 
 		sg = sg->next;
 	} while (sg != env->sd->groups);
@@ -8380,6 +8417,17 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		if (env->dst_rq->rd->overload != overload)
 			env->dst_rq->rd->overload = overload;
 	}
+
+	if (sd_overutilized(env->sd) != overutilized)
+		WRITE_ONCE(env->sd->shared->overutilized, overutilized);
+
+	/*
+	 * If the domain util is greater that domain capacity, load balancing
+	 * needs to be done at the next sched domain level as well.
+	 */
+	if (env->sd->parent &&
+	    !util_fits_capacity(sds->total_util, sds->total_capacity))
+		WRITE_ONCE(env->sd->parent->shared->overutilized, 1);
 }
 
 /**
@@ -9255,6 +9303,9 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
 		}
 		max_cost += sd->max_newidle_lb_cost;
 
+		if (sched_energy_enabled() && !sd_overutilized(sd))
+			continue;
+
 		if (!(sd->flags & SD_LOAD_BALANCE))
 			continue;
 
@@ -9822,6 +9873,9 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
 			break;
 		}
 
+		if (sched_energy_enabled() && !sd_overutilized(sd))
+			continue;
+
 		if (sd->flags & SD_BALANCE_NEWIDLE) {
 			t0 = sched_clock_cpu(this_cpu);
 
@@ -9955,6 +10009,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
+
+	update_overutilized_status(rq);
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index c3deaee7a7a2..5d552c0d7109 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -11,6 +11,7 @@
 #include <linux/sched/cputime.h>
 #include <linux/sched/deadline.h>
 #include <linux/sched/debug.h>
+#include <linux/sched/energy.h>
 #include <linux/sched/hotplug.h>
 #include <linux/sched/idle.h>
 #include <linux/sched/init.h>
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 64cc564f5255..c8b7c7665ab2 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -1184,15 +1184,11 @@ sd_init(struct sched_domain_topology_level *tl,
 		sd->idle_idx = 1;
 	}
 
-	/*
-	 * For all levels sharing cache; connect a sched_domain_shared
-	 * instance.
-	 */
-	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
-		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
-		atomic_inc(&sd->shared->ref);
+	sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
+	atomic_inc(&sd->shared->ref);
+
+	if (sd->flags & SD_SHARE_PKG_RESOURCES)
 		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
-	}
 
 	sd->private = sdd;
 
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-06 15:36 [RFC PATCH v2 0/6] Energy Aware Scheduling Dietmar Eggemann
                   ` (2 preceding siblings ...)
  2018-04-06 15:36 ` [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator Dietmar Eggemann
@ 2018-04-06 15:36 ` Dietmar Eggemann
  2018-04-10 12:51   ` Peter Zijlstra
                     ` (4 more replies)
  2018-04-06 15:36 ` [RFC PATCH v2 5/6] sched/fair: Select an energy-efficient CPU on task wake-up Dietmar Eggemann
                   ` (2 subsequent siblings)
  6 siblings, 5 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2018-04-06 15:36 UTC (permalink / raw)
  To: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath
  Cc: linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

From: Quentin Perret <quentin.perret@arm.com>

In preparation for the definition of an energy-aware wakeup path, a
helper function is provided to estimate the impact on system energy when
a specific task wakes up on a specific CPU. compute_energy() estimates
the OPPs to be reached by all frequency domains and the consumption of
each online CPU according to its energy model and its percentage of busy
time.
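
As a purely illustrative example with made-up numbers: assume a
frequency domain with two online CPUs whose estimated utilizations are
200 and 100, so max_util = 200 and sum_util = 300. find_cap_state()
adds a ~25% margin to max_util (250) and returns the first capacity
state with cap >= 250, say {cap = 300, power = 150}. The estimated
contribution of this domain is then power * sum_util / cap =
150 * 300 / 300 = 150 in the abstract unit of the model, and
compute_energy() sums this figure over all frequency domains.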

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 include/linux/sched/energy.h | 20 +++++++++++++
 kernel/sched/fair.c          | 68 ++++++++++++++++++++++++++++++++++++++++++++
 kernel/sched/sched.h         |  2 +-
 3 files changed, 89 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
index 941071eec013..b4110b145228 100644
--- a/include/linux/sched/energy.h
+++ b/include/linux/sched/energy.h
@@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void)
 	return static_branch_unlikely(&sched_energy_present);
 }
 
+static inline
+struct capacity_state *find_cap_state(int cpu, unsigned long util)
+{
+	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
+	struct capacity_state *cs = NULL;
+	int i;
+
+	util += util >> 2;
+
+	for (i = 0; i < em->nr_cap_states; i++) {
+		cs = &em->cap_states[i];
+		if (cs->cap >= util)
+			break;
+	}
+
+	return cs;
+}
+
 static inline struct cpumask *freq_domain_span(struct freq_domain *fd)
 {
 	return &fd->span;
@@ -42,6 +60,8 @@ struct freq_domain;
 static inline bool sched_energy_enabled(void) { return false; }
 static inline struct cpumask
 *freq_domain_span(struct freq_domain *fd) { return NULL; }
+static inline struct capacity_state
+*find_cap_state(int cpu, unsigned long util) { return NULL; }
 static inline void init_sched_energy(void) { }
 #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL)
 #endif
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6960e5ef3c14..8cb9fb04fff2 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 }
 
 /*
+ * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
+ */
+static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
+{
+	unsigned long util, util_est;
+	struct cfs_rq *cfs_rq;
+
+	/* Task is where it should be, or has no impact on cpu */
+	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
+		return cpu_util(cpu);
+
+	cfs_rq = &cpu_rq(cpu)->cfs;
+	util = READ_ONCE(cfs_rq->avg.util_avg);
+
+	if (dst_cpu == cpu)
+		util += task_util(p);
+	else
+		util = max_t(long, util - task_util(p), 0);
+
+	if (sched_feat(UTIL_EST)) {
+		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+		if (dst_cpu == cpu)
+			util_est += _task_util_est(p);
+		else
+			util_est = max_t(long, util_est - _task_util_est(p), 0);
+		util = max(util, util_est);
+	}
+
+	return min_t(unsigned long, util, capacity_orig_of(cpu));
+}
+
+/*
+ * Estimates the system level energy assuming that p wakes-up on dst_cpu.
+ *
+ * compute_energy() is safe to call only if an energy model is available for
+ * the platform, which is when sched_energy_enabled() is true.
+ */
+static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
+{
+	unsigned long util, max_util, sum_util;
+	struct capacity_state *cs;
+	unsigned long energy = 0;
+	struct freq_domain *fd;
+	int cpu;
+
+	for_each_freq_domain(fd) {
+		max_util = sum_util = 0;
+		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
+			util = cpu_util_next(cpu, p, dst_cpu);
+			util += cpu_util_dl(cpu_rq(cpu));
+			max_util = max(util, max_util);
+			sum_util += util;
+		}
+
+		/*
+		 * Here we assume that the capacity states of CPUs belonging to
+		 * the same frequency domains are shared. Hence, we look at the
+		 * capacity state of the first CPU and re-use it for all.
+		 */
+		cpu = cpumask_first(freq_domain_span(fd));
+		cs = find_cap_state(cpu, max_util);
+		energy += cs->power * sum_util / cs->cap;
+	}
+
+	return energy;
+}
+
+/*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
  * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 5d552c0d7109..6eb38f41d5d9 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 # define arch_scale_freq_invariant()	false
 #endif
 
-#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+#ifdef CONFIG_SMP
 static inline unsigned long cpu_util_dl(struct rq *rq)
 {
 	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH v2 5/6] sched/fair: Select an energy-efficient CPU on task wake-up
  2018-04-06 15:36 [RFC PATCH v2 0/6] Energy Aware Scheduling Dietmar Eggemann
                   ` (3 preceding siblings ...)
  2018-04-06 15:36 ` [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function Dietmar Eggemann
@ 2018-04-06 15:36 ` Dietmar Eggemann
  2018-04-09 16:30   ` Peter Zijlstra
                     ` (2 more replies)
  2018-04-06 15:36 ` [RFC PATCH v2 6/6] drivers: base: arch_topology.c: Enable EAS for arm/arm64 platforms Dietmar Eggemann
  2018-04-17 12:50 ` [RFC PATCH v2 0/6] Energy Aware Scheduling Leo Yan
  6 siblings, 3 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2018-04-06 15:36 UTC (permalink / raw)
  To: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath
  Cc: linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

From: Quentin Perret <quentin.perret@arm.com>

In case an energy model is available, waking tasks are re-routed into a
new energy-aware placement algorithm. The eligible CPUs to be used in the
energy-aware wakeup path are restricted to the highest non-overutilized
sched_domain containing prev_cpu and this_cpu. If no such domain is found,
the tasks go through the usual wake-up path, hence energy-aware placement
happens only in lightly utilized scenarios.

The selection of the most energy-efficient CPU for a task is achieved by
estimating the impact on system-level active energy resulting from the
placement of the task on the CPU with the highest spare capacity in each
frequency domain. The best CPU energy-wise is then selected if it saves
a large enough amount of energy with respect to prev_cpu.

Although it has already shown significant benefits on some existing
targets, this approach cannot scale to platforms with numerous CPUs.
This patch is an attempt to do something useful as writing a fast
heuristic that performs reasonably well on a broad spectrum of
architectures isn't an easy task. As a consequence, the scope of
usability of the energy-aware wake-up path is restricted to systems
with the SD_ASYM_CPUCAPACITY flag set. These systems not only show the
most promising opportunities for saving energy but also typically
feature a limited number of logical CPUs.

Moreover, the energy-aware wake-up path is accessible only if
sched_energy_enabled() is true. For systems which don't meet all
dependencies for EAS (CONFIG_PM_OPP for ex.) at compile time,
sched_energy_enabled() defaults to a constant "false" value, hence letting
the compiler remove the unused EAS code entirely.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 kernel/sched/fair.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++---
 1 file changed, 93 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8cb9fb04fff2..5ebb2d0306c7 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6700,6 +6700,81 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
 	return energy;
 }
 
+static int find_energy_efficient_cpu(struct sched_domain *sd,
+					struct task_struct *p, int prev_cpu)
+{
+	unsigned long cur_energy, prev_energy, best_energy, cpu_cap;
+	unsigned long task_util = task_util_est(p);
+	int cpu, best_energy_cpu = prev_cpu;
+	struct freq_domain *fd;
+
+	if (!task_util)
+		return prev_cpu;
+
+	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
+		prev_energy = best_energy = compute_energy(p, prev_cpu);
+	else
+		prev_energy = best_energy = ULONG_MAX;
+
+	for_each_freq_domain(fd) {
+		unsigned long spare_cap, max_spare_cap = 0;
+		int max_spare_cap_cpu = -1;
+		unsigned long util;
+
+		/* Find the CPU with the max spare cap in the freq. dom. */
+		for_each_cpu_and(cpu, freq_domain_span(fd), sched_domain_span(sd)) {
+			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
+				continue;
+
+			if (cpu == prev_cpu)
+				continue;
+
+			util = cpu_util_wake(cpu, p);
+			cpu_cap = capacity_of(cpu);
+			if (!util_fits_capacity(util + task_util, cpu_cap))
+				continue;
+
+			spare_cap = cpu_cap - util;
+			if (spare_cap > max_spare_cap) {
+				max_spare_cap = spare_cap;
+				max_spare_cap_cpu = cpu;
+			}
+		}
+
+		/* Evaluate the energy impact of using this CPU. */
+		if (max_spare_cap_cpu >= 0) {
+			cur_energy = compute_energy(p, max_spare_cap_cpu);
+			if (cur_energy < best_energy) {
+				best_energy = cur_energy;
+				best_energy_cpu = max_spare_cap_cpu;
+			}
+		}
+	}
+
+	/*
+	 * We pick the best CPU only if it saves at least 1.5% of the
+	 * energy used by prev_cpu.
+	 */
+	if ((prev_energy - best_energy) > (prev_energy >> 6))
+		return best_energy_cpu;
+
+	return prev_cpu;
+}
+
+static inline bool wake_energy(struct task_struct *p, int prev_cpu)
+{
+	struct sched_domain *sd;
+
+	if (!sched_energy_enabled())
+		return false;
+
+	sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
+	if (!sd || sd_overutilized(sd))
+		return false;
+
+	return true;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6716,18 +6791,22 @@ static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
+	struct sched_domain *energy_sd = NULL;
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
-	int want_affine = 0;
+	int want_affine = 0, want_energy = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	rcu_read_lock();
+
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
+		want_energy = wake_energy(p, prev_cpu);
 		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
-			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
+			      && cpumask_test_cpu(cpu, &p->cpus_allowed)
+			      && !want_energy;
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;
@@ -6742,6 +6821,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 			break;
 		}
 
+		/*
+		 * Energy-aware task placement is performed on the highest
+		 * non-overutilized domain spanning over cpu and prev_cpu.
+		 */
+		if (want_energy && !sd_overutilized(tmp) &&
+		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
+			energy_sd = tmp;
+
 		if (tmp->flags & sd_flag)
 			sd = tmp;
 		else if (!want_affine)
@@ -6765,7 +6852,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		sync_entity_load_avg(&p->se);
 	}
 
-	if (!sd) {
+	if (energy_sd) {
+		new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
+	} else if (!sd) {
 pick_cpu:
 		if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
 			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* [RFC PATCH v2 6/6] drivers: base: arch_topology.c: Enable EAS for arm/arm64 platforms
  2018-04-06 15:36 [RFC PATCH v2 0/6] Energy Aware Scheduling Dietmar Eggemann
                   ` (4 preceding siblings ...)
  2018-04-06 15:36 ` [RFC PATCH v2 5/6] sched/fair: Select an energy-efficient CPU on task wake-up Dietmar Eggemann
@ 2018-04-06 15:36 ` Dietmar Eggemann
  2018-04-17 12:50 ` [RFC PATCH v2 0/6] Energy Aware Scheduling Leo Yan
  6 siblings, 0 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2018-04-06 15:36 UTC (permalink / raw)
  To: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath
  Cc: linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

From: Quentin Perret <quentin.perret@arm.com>

Energy Aware Scheduling (EAS) has to be started from the arch code.
This commit enables it from the arch topology driver for arm/arm64
systems, hence enabling a better support for Arm big.LITTLE and future
DynamIQ architectures.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
---
 drivers/base/arch_topology.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index 52ec5174bcb1..25a70c21860f 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -15,6 +15,7 @@
 #include <linux/slab.h>
 #include <linux/string.h>
 #include <linux/sched/topology.h>
+#include <linux/sched/energy.h>
 
 DEFINE_PER_CPU(unsigned long, freq_scale) = SCHED_CAPACITY_SCALE;
 
@@ -247,6 +248,7 @@ static void __init parsing_done_workfn(struct work_struct *work)
 	cpufreq_unregister_notifier(&init_cpu_capacity_notifier,
 					 CPUFREQ_POLICY_NOTIFIER);
 	free_cpumask_var(cpus_to_visit);
+	init_sched_energy();
 }
 
 #else
-- 
2.11.0

^ permalink raw reply related	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 5/6] sched/fair: Select an energy-efficient CPU on task wake-up
  2018-04-06 15:36 ` [RFC PATCH v2 5/6] sched/fair: Select an energy-efficient CPU on task wake-up Dietmar Eggemann
@ 2018-04-09 16:30   ` Peter Zijlstra
  2018-04-09 16:43     ` Quentin Perret
  2018-04-10 17:29   ` Peter Zijlstra
  2018-04-17 15:39   ` Leo Yan
  2 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2018-04-09 16:30 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Quentin Perret, Thara Gopinath, linux-pm,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Fri, Apr 06, 2018 at 04:36:06PM +0100, Dietmar Eggemann wrote:
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		record_wakee(p);
> +		want_energy = wake_energy(p, prev_cpu);
>  		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> -			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
> +			      && cpumask_test_cpu(cpu, &p->cpus_allowed)
> +			      && !want_energy;

Could you please fix that and put the operators at the end of the
previous line?

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 5/6] sched/fair: Select an energy-efficient CPU on task wake-up
  2018-04-09 16:30   ` Peter Zijlstra
@ 2018-04-09 16:43     ` Quentin Perret
  0 siblings, 0 replies; 44+ messages in thread
From: Quentin Perret @ 2018-04-09 16:43 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, linux-kernel, Thara Gopinath, linux-pm,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Monday 09 Apr 2018 at 18:30:29 (+0200), Peter Zijlstra wrote:
> On Fri, Apr 06, 2018 at 04:36:06PM +0100, Dietmar Eggemann wrote:
> >  	if (sd_flag & SD_BALANCE_WAKE) {
> >  		record_wakee(p);
> > +		want_energy = wake_energy(p, prev_cpu);
> >  		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> > -			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
> > +			      && cpumask_test_cpu(cpu, &p->cpus_allowed)
> > +			      && !want_energy;
> 
> Could you please fix that and put the operators at the end of the
> previous line?

Will do.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 2/6] sched: Introduce energy models of CPUs
  2018-04-06 15:36 ` [RFC PATCH v2 2/6] sched: Introduce energy models of CPUs Dietmar Eggemann
@ 2018-04-10 11:54   ` Peter Zijlstra
  2018-04-10 12:03     ` Dietmar Eggemann
  2018-04-13  4:02   ` Viresh Kumar
  1 sibling, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2018-04-10 11:54 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Quentin Perret, Thara Gopinath, linux-pm,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Fri, Apr 06, 2018 at 04:36:03PM +0100, Dietmar Eggemann wrote:
> +		/*
> +		 * Build the energy model of one CPU, and link it to all CPUs
> +		 * in its frequency domain. This should be correct as long as
> +		 * they share the same micro-architecture.
> +		 */

Aside from the whole PM_OPP question; you should assert that assumption.
Put an explicit check for the uarch in and FAIL the init if that isn't
met.

I don't think it makes _ANY_ kind of sense to share a frequency domain
across uarchs and we should be very clear we're not going to support
anything like that.

I know DynamiQ strictly speaking allows that, but since it's insane, we
should consider that a bug in DynamiQ.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 2/6] sched: Introduce energy models of CPUs
  2018-04-10 11:54   ` Peter Zijlstra
@ 2018-04-10 12:03     ` Dietmar Eggemann
  0 siblings, 0 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2018-04-10 12:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: linux-kernel, Quentin Perret, Thara Gopinath, linux-pm,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On 04/10/2018 01:54 PM, Peter Zijlstra wrote:
> On Fri, Apr 06, 2018 at 04:36:03PM +0100, Dietmar Eggemann wrote:
>> +		/*
>> +		 * Build the energy model of one CPU, and link it to all CPUs
>> +		 * in its frequency domain. This should be correct as long as
>> +		 * they share the same micro-architecture.
>> +		 */
> 
> Aside from the whole PM_OPP question; you should assert that assumption.
> Put an explicit check for the uarch in and FAIL the init if that isn't
> met.
> 
> I don't think it makes _ANY_ kind of sense to share a frequency domain
> across uarchs and we should be very clear we're not going to support
> anything like that.
> 
> I know DynamiQ strictly speaking allows that, but since it's insane, we
> should consider that a bug in DynamiQ.

Totally agree! We will add this assert. One open question of the current 
EAS design solved ;-)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-06 15:36 ` [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function Dietmar Eggemann
@ 2018-04-10 12:51   ` Peter Zijlstra
  2018-04-10 13:56     ` Quentin Perret
  2018-04-13  6:27   ` Viresh Kumar
                     ` (3 subsequent siblings)
  4 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2018-04-10 12:51 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Quentin Perret, Thara Gopinath, linux-pm,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> +static inline
> +struct capacity_state *find_cap_state(int cpu, unsigned long util)
> +{
> +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> +	struct capacity_state *cs = NULL;
> +	int i;
> +
> +	util += util >> 2;
> +
> +	for (i = 0; i < em->nr_cap_states; i++) {
> +		cs = &em->cap_states[i];
> +		if (cs->cap >= util)
> +			break;
> +	}
> +
> +	return cs;
> +}

So in the last thread there was some discussion about this; in
particular on how this related to schedutil and if we should tie it into
that.

I think for starters tying it to schedutil is not a bad idea; ideally
people _should_ migrate towards using that.

Also; I think it makes sense to better integrate cpufreq and the
energy-model values like what rjw already suggested, such that maybe we
can have cpufreq_driver_resolve_freq() return a structure containing the
relevant information for the selected frequency.

But implementing the frequency selection thing in multiple places like
now sounds like a very bad idea to me.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-10 12:51   ` Peter Zijlstra
@ 2018-04-10 13:56     ` Quentin Perret
  2018-04-10 14:08       ` Peter Zijlstra
  0 siblings, 1 reply; 44+ messages in thread
From: Quentin Perret @ 2018-04-10 13:56 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, linux-kernel, Thara Gopinath, linux-pm,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Tuesday 10 Apr 2018 at 14:51:05 (+0200), Peter Zijlstra wrote:
> On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> > +static inline
> > +struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > +{
> > +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > +	struct capacity_state *cs = NULL;
> > +	int i;
> > +
> > +	util += util >> 2;
> > +
> > +	for (i = 0; i < em->nr_cap_states; i++) {
> > +		cs = &em->cap_states[i];
> > +		if (cs->cap >= util)
> > +			break;
> > +	}
> > +
> > +	return cs;
> > +}
> 
> So in the last thread there was some discussion about this; in
> particular on how this related to schedutil and if we should tie it into
> that.
> 
> I think for starters tying it to schedutil is not a bad idea; ideally
> people _should_ migrate towards using that.
> 
> Also; I think it makes sense to better integrate cpufreq and the
> energy-model values like what rjw already suggested, such that maybe we
> can have cpufreq_driver_resolve_freq() return a structure containing the
> relevant information for the selected frequency.

I guess if we want to do that in the wake-up path, we would also need to
add a new parameter to it to make sure we don't actually call into
cpufreq_driver->resolve_freq() ...

But then, we could sort of rely on cpufreq_schedutil.c::get_next_freq()
to replace find_cap_state() ... Is this what you had in mind ?

> 
> But implementing the frequency selection thing in multiple places like
> now sounds like a very bad idea to me.

Understood. Making sure we share the same code everywhere might have
consequences though. I guess we'll have to either accept the cost of
function calls in the wake-up path, or accept inlining those functions,
for example. Or maybe you had something else in mind?

Anyways, that's probably another good discussion topic for OSPM
next week :-)

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-10 13:56     ` Quentin Perret
@ 2018-04-10 14:08       ` Peter Zijlstra
  0 siblings, 0 replies; 44+ messages in thread
From: Peter Zijlstra @ 2018-04-10 14:08 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Dietmar Eggemann, linux-kernel, Thara Gopinath, linux-pm,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Tue, Apr 10, 2018 at 02:56:41PM +0100, Quentin Perret wrote:
> > So in the last thread there was some discussion about this; in
> > particular on how this related to schedutil and if we should tie it into
> > that.
> > 
> > I think for starters tying it to schedutil is not a bad idea; ideally
> > people _should_ migrate towards using that.
> > 
> > Also; I think it makes sense to better integrate cpufreq and the
> > energy-model values like what rjw already suggested, such that maybe we
> > can have cpufreq_driver_resolve_freq() return a structure containing the
> > relevant information for the selected frequency.
> 
> I guess if we want to do that in the wake-up path, we would also need to
> add a new parameter to it to make sure we don't actually call into
> cpufreq_driver->resolve_freq() ...
> 
> But then, we could sort of rely on cpufreq_schedutil.c::get_next_freq()
> to replace find_cap_state() ... Is this what you had in mind ?

Yes, something along those lines; we could also of course factor
get_next_freq() into two parts.
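
Roughly like the sketch below; map_util_freq() is a made-up name, the
formula is the one get_next_freq() uses today, and get_next_freq()
would keep the policy/limits handling on top of it:

/* pure util -> frequency mapping, reusable outside of schedutil */
static inline unsigned int map_util_freq(unsigned long util,
					 unsigned int max_freq,
					 unsigned long max_cap)
{
	return (max_freq + (max_freq >> 2)) * util / max_cap;
}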

> > But implementing the frequency selection thing in multiple places like
> > now sounds like a very bad idea to me.
> 
> Understood. Making sure we share the same code everywhere might have
> consequences though. I guess we'll have to either accept the cost of
> function calls in the wake-up path, or accept inlining those functions,
> for example. Or maybe you had something else in mind?
> 
> Anyways, that's probably another good discussion topic for OSPM
> next week :-)

Yes, that wants a wee bit of discussion. Ideally we'd have a shared data
structure we can iterate instead of a chain of indirect calls.
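
Purely illustrative (all names invented), but think of a flat table
that both schedutil and the scheduler walk directly:

struct freq_cap_entry {
	unsigned int	freq;
	unsigned long	cap;
	unsigned long	power;
};

struct freq_cap_table {
	int			nr_entries;
	struct freq_cap_entry	entries[];	/* sorted by frequency */
};

/* lookup is then a plain iteration, no driver callbacks involved */
static const struct freq_cap_entry *
find_entry(const struct freq_cap_table *tbl, unsigned long util)
{
	int i;

	/* assumes at least one entry; clamps to the highest one */
	for (i = 0; i < tbl->nr_entries - 1; i++) {
		if (tbl->entries[i].cap >= util)
			break;
	}

	return &tbl->entries[i];
}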

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 5/6] sched/fair: Select an energy-efficient CPU on task wake-up
  2018-04-06 15:36 ` [RFC PATCH v2 5/6] sched/fair: Select an energy-efficient CPU on task wake-up Dietmar Eggemann
  2018-04-09 16:30   ` Peter Zijlstra
@ 2018-04-10 17:29   ` Peter Zijlstra
  2018-04-10 18:14     ` Quentin Perret
  2018-04-17 15:39   ` Leo Yan
  2 siblings, 1 reply; 44+ messages in thread
From: Peter Zijlstra @ 2018-04-10 17:29 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Quentin Perret, Thara Gopinath, linux-pm,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Fri, Apr 06, 2018 at 04:36:06PM +0100, Dietmar Eggemann wrote:
> +	for_each_freq_domain(fd) {
> +		unsigned long spare_cap, max_spare_cap = 0;
> +		int max_spare_cap_cpu = -1;
> +		unsigned long util;
> +
> +		/* Find the CPU with the max spare cap in the freq. dom. */
> +		for_each_cpu_and(cpu, freq_domain_span(fd), sched_domain_span(sd)) {
> +			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> +				continue;
> +
> +			if (cpu == prev_cpu)
> +				continue;
> +
> +			util = cpu_util_wake(cpu, p);
> +			cpu_cap = capacity_of(cpu);
> +			if (!util_fits_capacity(util + task_util, cpu_cap))
> +				continue;
> +
> +			spare_cap = cpu_cap - util;
> +			if (spare_cap > max_spare_cap) {
> +				max_spare_cap = spare_cap;
> +				max_spare_cap_cpu = cpu;
> +			}
> +		}
> +
> +		/* Evaluate the energy impact of using this CPU. */
> +		if (max_spare_cap_cpu >= 0) {
> +			cur_energy = compute_energy(p, max_spare_cap_cpu);
> +			if (cur_energy < best_energy) {
> +				best_energy = cur_energy;
> +				best_energy_cpu = max_spare_cap_cpu;
> +			}
> +		}
> +	}

If each CPU has its own frequency domain, then the above loop ends up
being O(n^2), no? Is there really nothing we can do about that? Also, I
feel that warrants a comment warning about this.

Someone, somewhere will try and build a 64+64 cpu system and get
surprised it doesn't work :-)
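
E.g. a comment along these lines above the outer loop (wording is just
a suggestion):

/*
 * NOTE: this is O(n^2) in the number of CPUs when every CPU is its
 * own frequency domain (for_each_freq_domain() x for_each_cpu_and()),
 * so it must not be used on platforms with many single-CPU domains.
 */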

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 5/6] sched/fair: Select an energy-efficient CPU on task wake-up
  2018-04-10 17:29   ` Peter Zijlstra
@ 2018-04-10 18:14     ` Quentin Perret
  0 siblings, 0 replies; 44+ messages in thread
From: Quentin Perret @ 2018-04-10 18:14 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Dietmar Eggemann, linux-kernel, Thara Gopinath, linux-pm,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Tuesday 10 Apr 2018 at 19:29:32 (+0200), Peter Zijlstra wrote:
> On Fri, Apr 06, 2018 at 04:36:06PM +0100, Dietmar Eggemann wrote:
> > +	for_each_freq_domain(fd) {
> > +		unsigned long spare_cap, max_spare_cap = 0;
> > +		int max_spare_cap_cpu = -1;
> > +		unsigned long util;
> > +
> > +		/* Find the CPU with the max spare cap in the freq. dom. */
> > +		for_each_cpu_and(cpu, freq_domain_span(fd), sched_domain_span(sd)) {
> > +			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> > +				continue;
> > +
> > +			if (cpu == prev_cpu)
> > +				continue;
> > +
> > +			util = cpu_util_wake(cpu, p);
> > +			cpu_cap = capacity_of(cpu);
> > +			if (!util_fits_capacity(util + task_util, cpu_cap))
> > +				continue;
> > +
> > +			spare_cap = cpu_cap - util;
> > +			if (spare_cap > max_spare_cap) {
> > +				max_spare_cap = spare_cap;
> > +				max_spare_cap_cpu = cpu;
> > +			}
> > +		}
> > +
> > +		/* Evaluate the energy impact of using this CPU. */
> > +		if (max_spare_cap_cpu >= 0) {
> > +			cur_energy = compute_energy(p, max_spare_cap_cpu);
> > +			if (cur_energy < best_energy) {
> > +				best_energy = cur_energy;
> > +				best_energy_cpu = max_spare_cap_cpu;
> > +			}
> > +		}
> > +	}
> 
> If each CPU has its own frequency domain, then the above loop ends up
> being O(n^2), no?

Hmmm, yes, that should be O(n^2) indeed.

> Is there really nothing we can do about that?

So, the only thing I see just now would be to make compute_energy()
smarter somehow. Today we compute the energy consumed by each frequency
domain and then sum them up to get the system energy. We re-compute the
energy of each frequency domain, even when it is not involved in the
migration. In the case you describe, we will end up re-computing the
energy of many frequency domains on which nothing happens every time
we re-call compute_energy(). So there is probably something we could do
by caching those values somehow.
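
Purely as an illustration of the idea (fd->cached_energy and
compute_fd_energy() don't exist, they just stand for a cached value and
for the per-domain part of today's loop):

static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
{
	int prev_cpu = task_cpu(p);
	unsigned long energy = 0;
	struct freq_domain *fd;

	for_each_freq_domain(fd) {
		/*
		 * Only domains containing prev_cpu or dst_cpu see their
		 * OPP/utilization change; reuse a cached value for the
		 * others instead of recomputing them.
		 */
		if (!cpumask_test_cpu(prev_cpu, freq_domain_span(fd)) &&
		    !cpumask_test_cpu(dst_cpu, freq_domain_span(fd)))
			energy += fd->cached_energy;
		else
			energy += compute_fd_energy(fd, p, dst_cpu);
	}

	return energy;
}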

Otherwise, on systems with 2 frequency domains (e.g. big.LITTLE), the
current code should behave relatively well. And I think that covers a
large portion of the real-world systems for which EAS is useful, as of
today at least ... :-)

> Also, I
> feel that warrants a comment warning about this.
> 
> Someone, somewhere will try and build a 64+64 cpu system and get
> surprised it doesn't work :-)

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 1/6] sched/fair: Create util_fits_capacity()
  2018-04-06 15:36 ` [RFC PATCH v2 1/6] sched/fair: Create util_fits_capacity() Dietmar Eggemann
@ 2018-04-12  7:02   ` Viresh Kumar
  2018-04-12  8:20     ` Dietmar Eggemann
  0 siblings, 1 reply; 44+ messages in thread
From: Viresh Kumar @ 2018-04-12  7:02 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Todd Kjos, Joel Fernandes, Juri Lelli,
	Steve Muckle, Eduardo Valentin

On 06-04-18, 16:36, Dietmar Eggemann wrote:
> The functionality that a given utilization fits into a given capacity
> is factored out into a separate function.
> 
> Currently it is only used in wake_cap() but will be re-used to figure
> out if a cpu or a scheduler group is over-utilized.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/fair.c | 7 ++++++-
>  1 file changed, 6 insertions(+), 1 deletion(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0951d1c58d2f..0a76ad2ef022 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6574,6 +6574,11 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
>  	return min_t(unsigned long, util, capacity_orig_of(cpu));
>  }
>  
> +static inline int util_fits_capacity(unsigned long util, unsigned long capacity)
> +{
> +	return capacity * 1024 > util * capacity_margin;

This changes the behavior slightly compared to existing code. If that
wasn't intentional, perhaps you should use >= here.

> +}
> +
>  /*
>   * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
>   * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> @@ -6595,7 +6600,7 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>  	/* Bring task utilization in sync with prev_cpu */
>  	sync_entity_load_avg(&p->se);
>  
> -	return min_cap * 1024 < task_util(p) * capacity_margin;
> +	return !util_fits_capacity(task_util(p), min_cap);
>  }
>  
>  /*
> -- 
> 2.11.0

-- 
viresh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 1/6] sched/fair: Create util_fits_capacity()
  2018-04-12  7:02   ` Viresh Kumar
@ 2018-04-12  8:20     ` Dietmar Eggemann
  0 siblings, 0 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2018-04-12  8:20 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Todd Kjos, Joel Fernandes, Juri Lelli,
	Steve Muckle, Eduardo Valentin

On 04/12/2018 09:02 AM, Viresh Kumar wrote:
> On 06-04-18, 16:36, Dietmar Eggemann wrote:
>> The functionality that a given utilization fits into a given capacity
>> is factored out into a separate function.
>>
>> Currently it is only used in wake_cap() but will be re-used to figure
>> out if a cpu or a scheduler group is over-utilized.
>>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> ---
>>   kernel/sched/fair.c | 7 ++++++-
>>   1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> index 0951d1c58d2f..0a76ad2ef022 100644
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -6574,6 +6574,11 @@ static unsigned long cpu_util_wake(int cpu, struct task_struct *p)
>>   	return min_t(unsigned long, util, capacity_orig_of(cpu));
>>   }
>>   
>> +static inline int util_fits_capacity(unsigned long util, unsigned long capacity)
>> +{
>> +	return capacity * 1024 > util * capacity_margin;
> 
> This changes the behavior slightly compared to existing code. If that
> wasn't intentional, perhaps you should use >= here.

You're right here ... Already on our v3 list. Thanks!
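
I.e. for v3 presumably just:

static inline int util_fits_capacity(unsigned long util, unsigned long capacity)
{
	/* >= keeps the behavior of the current wake_cap() check */
	return capacity * 1024 >= util * capacity_margin;
}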

The 'misfit' patch-set comes with a similar function,
task_fits_capacity(), so we have to align with that patch-set on this
as well.

[...]

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 2/6] sched: Introduce energy models of CPUs
  2018-04-06 15:36 ` [RFC PATCH v2 2/6] sched: Introduce energy models of CPUs Dietmar Eggemann
  2018-04-10 11:54   ` Peter Zijlstra
@ 2018-04-13  4:02   ` Viresh Kumar
  2018-04-13  8:37     ` Quentin Perret
  1 sibling, 1 reply; 44+ messages in thread
From: Viresh Kumar @ 2018-04-13  4:02 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Todd Kjos, Joel Fernandes, Juri Lelli,
	Steve Muckle, Eduardo Valentin

On 06-04-18, 16:36, Dietmar Eggemann wrote:
> diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h

> +#if defined(CONFIG_SMP) && defined(CONFIG_PM_OPP)
> +extern struct sched_energy_model ** __percpu energy_model;
> +extern struct static_key_false sched_energy_present;
> +extern struct list_head sched_freq_domains;
> +
> +static inline bool sched_energy_enabled(void)
> +{
> +	return static_branch_unlikely(&sched_energy_present);
> +}
> +
> +static inline struct cpumask *freq_domain_span(struct freq_domain *fd)
> +{
> +	return &fd->span;
> +}
> +
> +extern void init_sched_energy(void);
> +
> +#define for_each_freq_domain(fdom) \
> +	list_for_each_entry(fdom, &sched_freq_domains, next)
> +
> +#else
> +struct freq_domain;
> +static inline bool sched_energy_enabled(void) { return false; }
> +static inline struct cpumask
> +*freq_domain_span(struct freq_domain *fd) { return NULL; }
> +static inline void init_sched_energy(void) { }
> +#define for_each_freq_domain(fdom) for (; fdom; fdom = NULL)

I am not sure if this is correct. fdom would normally be a local
uninitialized variable, and with the above we may end up running the
loop once with an invalid pointer. Maybe rewrite it as:

for (fdom = NULL; fdom; )
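
i.e. the stub in the #else branch would read:

#define for_each_freq_domain(fdom)	for (fdom = NULL; fdom; )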


And for the whole OPP discussion, perhaps we should have another
architecture specific callback which the scheduler can call to get a
ready-made energy model with all the structures filled in. That way
the OPP specific stuff will move to the architecture specific
callback.
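
Something like the prototype below is what I'm thinking of (name and
signature are invented, just to illustrate):

/*
 * Arch code builds the model (from OPPs, firmware, DT, ...) and
 * returns the per-CPU part of it; NULL means no model, EAS stays off.
 */
struct sched_energy_model *arch_get_energy_model(int cpu);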

-- 
viresh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-06 15:36 ` [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function Dietmar Eggemann
  2018-04-10 12:51   ` Peter Zijlstra
@ 2018-04-13  6:27   ` Viresh Kumar
  2018-04-17 15:22   ` Leo Yan
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 44+ messages in thread
From: Viresh Kumar @ 2018-04-13  6:27 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Todd Kjos, Joel Fernandes, Juri Lelli,
	Steve Muckle, Eduardo Valentin

On 06-04-18, 16:36, Dietmar Eggemann wrote:
> +static inline struct capacity_state
> +*find_cap_state(int cpu, unsigned long util) { return NULL; }

I saw this somewhere else as well in this series. I believe the line break
should happen after "*" as "struct capacity_state *" should be read together as
one entity.
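
i.e.:

static inline struct capacity_state *
find_cap_state(int cpu, unsigned long util) { return NULL; }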

-- 
viresh

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 2/6] sched: Introduce energy models of CPUs
  2018-04-13  4:02   ` Viresh Kumar
@ 2018-04-13  8:37     ` Quentin Perret
  0 siblings, 0 replies; 44+ messages in thread
From: Quentin Perret @ 2018-04-13  8:37 UTC (permalink / raw)
  To: Viresh Kumar
  Cc: Dietmar Eggemann, linux-kernel, Peter Zijlstra, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Todd Kjos, Joel Fernandes, Juri Lelli,
	Steve Muckle, Eduardo Valentin

On Friday 13 Apr 2018 at 09:32:53 (+0530), Viresh Kumar wrote:
[...]
> And for the whole OPP discussion, perhaps we should have another
> architecture specific callback which the scheduler can call to get a
> ready-made energy model with all the structures filled in. That way
> the OPP specific stuff will move to the architecture specific
> callback.

Yes, that's another possible solution indeed. Actually, it's already on
the list of ideas to be discussed in OSPM ;-)

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator
  2018-04-06 15:36 ` [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator Dietmar Eggemann
@ 2018-04-13 23:56   ` Joel Fernandes
  2018-04-18 11:17     ` Quentin Perret
  2018-04-17 14:25   ` Leo Yan
  1 sibling, 1 reply; 44+ messages in thread
From: Joel Fernandes @ 2018-04-13 23:56 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: LKML, Peter Zijlstra, Quentin Perret, Thara Gopinath, Linux PM,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Juri Lelli,
	Steve Muckle, Eduardo Valentin

Hi,

On Fri, Apr 6, 2018 at 8:36 AM, Dietmar Eggemann
<dietmar.eggemann@arm.com> wrote:
> From: Thara Gopinath <thara.gopinath@linaro.org>
>
> Energy-aware scheduling should only operate when the system is not
> overutilized. There must be cpu time available to place tasks based on
> utilization in an energy-aware fashion, i.e. to pack tasks on
> energy-efficient cpus without harming the overall throughput.
>
> In case the system operates above this tipping point the tasks have to
> be placed based on task and cpu load in the classical way of spreading
> tasks across as many cpus as possible.
>
> The point in which a system switches from being not overutilized to
> being overutilized is called the tipping point.
>
> Such a tipping point indicator on a sched domain as the system
> boundary is introduced here. As soon as one cpu of a sched domain is
> overutilized the whole sched domain is declared overutilized as well.
> A cpu becomes overutilized when its utilization is higher that 80%
> (capacity_margin) of its capacity.
>
> The implementation takes advantage of the shared sched domain which is
> shared across all per-cpu views of a sched domain level. The new
> overutilized flag is placed in this shared sched domain.
>
> Load balancing is skipped in case the energy model is present and the
> sched domain is not overutilized because under this condition the
> predominantly load-per-capacity driven load-balancer should not
> interfere with the energy-aware wakeup placement based on utilization.
>
> In case the total utilization of a sched domain is greater than the
> total sched domain capacity the overutilized flag is set at the parent
> sched domain level to let other sched groups help getting rid of the
> overutilization of cpus.
>
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 62 ++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h           |  1 +
>  kernel/sched/topology.c        | 12 +++-----
>  4 files changed, 65 insertions(+), 11 deletions(-)
>
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 26347741ba50..dd001c232646 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -72,6 +72,7 @@ struct sched_domain_shared {
>         atomic_t        ref;
>         atomic_t        nr_busy_cpus;
>         int             has_idle_cores;
> +       int             overutilized;
>  };
>
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a76ad2ef022..6960e5ef3c14 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
>  }
>  #endif
>
> +#ifdef CONFIG_SMP
> +static inline int cpu_overutilized(int cpu);
> +
> +static inline int sd_overutilized(struct sched_domain *sd)
> +{
> +       return READ_ONCE(sd->shared->overutilized);
> +}
> +
> +static inline void update_overutilized_status(struct rq *rq)
> +{
> +       struct sched_domain *sd;
> +
> +       rcu_read_lock();
> +       sd = rcu_dereference(rq->sd);
> +       if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
> +               WRITE_ONCE(sd->shared->overutilized, 1);
> +       rcu_read_unlock();
> +}
> +#else
> +static inline void update_overutilized_status(struct rq *rq) {}
> +#endif /* CONFIG_SMP */
> +
>  /*
>   * The enqueue_task method is called before nr_running is
>   * increased. Here we update the fair scheduling stats and
> @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 update_cfs_group(se);
>         }
>
> -       if (!se)
> +       if (!se) {
>                 add_nr_running(rq, 1);
> +               update_overutilized_status(rq);
> +       }

I'm wondering if it makes sense to consider scenarios where other
scheduling classes cause CPUs in the domain to go above the tipping
point. In that case too, it makes sense not to do EAS in that domain
because of the overutilization.

I guess task_fits uses cpu_util, which is CFS PELT only at the
moment... so this may require some other method, like aggregating CFS
PELT with RT PELT and the DL running bandwidth or something.
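
Something like the rough sketch below is what I mean; cpu_util_rt() is
hypothetical (there is no RT PELT signal today), only the DL running
bandwidth exists:

static unsigned long cpu_total_util(int cpu)
{
	struct rq *rq = cpu_rq(cpu);
	unsigned long util = cpu_util(cpu);	/* CFS PELT */

	util += cpu_util_dl(rq);		/* DL running bandwidth */
	util += cpu_util_rt(rq);		/* hypothetical RT PELT */

	return min(util, capacity_orig_of(cpu));
}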

thanks,

- Joel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 0/6] Energy Aware Scheduling
  2018-04-06 15:36 [RFC PATCH v2 0/6] Energy Aware Scheduling Dietmar Eggemann
                   ` (5 preceding siblings ...)
  2018-04-06 15:36 ` [RFC PATCH v2 6/6] drivers: base: arch_topology.c: Enable EAS for arm/arm64 platforms Dietmar Eggemann
@ 2018-04-17 12:50 ` Leo Yan
  2018-04-17 17:22   ` Dietmar Eggemann
  6 siblings, 1 reply; 44+ messages in thread
From: Leo Yan @ 2018-04-17 12:50 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

Hi Dietmar,

On Fri, Apr 06, 2018 at 04:36:01PM +0100, Dietmar Eggemann wrote:
> 1. Overview
> 
> The Energy Aware Scheduler (EAS) based on Morten Rasmussen's posting on
> LKML [1] is currently part of the AOSP Common Kernel and runs on
> today's smartphones with Arm's big.LITTLE CPUs.
> Based on the experience gained over the last two and a half years in
> product development, we propose an energy model based task placement
> for CPUs with asymmetric core capacities (e.g. Arm big.LITTLE or
> DynamIQ), to align with the EAS adopted by the AOSP Common Kernel. We
> have developed a simplified energy model, based on the physical
> active power/performance curve of each core type using existing
> SoC power/performance data already known to the kernel. The energy
> model is used to select the most energy-efficient CPU to place each
> task, taking utilization into account.
> 
> 1.1 Energy Model
> 
> A CPU with asymmetric core capacities features cores with significantly
> different energy and performance characteristics. As the configurations
> can vary greatly from one SoC to another, designing an energy-efficient
> scheduling heuristic that performs well on a broad spectrum of platforms
> appears to be particularly hard.
> This proposal attempts to solve this issue by providing the scheduler
> with an energy model of the platform which enables energy impact
> estimation of scheduling decisions in a generic way. The energy model is
> kept very simple as it represents only the active power of CPUs at all
> available P-states and relies on existing data in the kernel (only used
> by the thermal subsystem so far).
> This proposal does not include the power consumption of C-states and
> cluster-level resources which were originally introduced in [1] since
> firstly, their impact on task placement decisions appears to be
> neglectable on modern asymmetric platforms and secondly, they require
> additional infrastructure and data (e.g new DT entries).

Seems to me, if we move forward a bit with the energy model, we can use
a simpler method to generate the power consumption:

  Power(@Freq) = Power(cpu_util=100%@Freq) - Power(cpu_util=0%@Freq)

From the above formula, the power data includes CPU and cluster level
power (and includes both dynamic power and static leakage), but it is
quite straightforward to measure.

I read a bit of Quentin's slides on the simplified power modeling
experiments [1]. IIUC the simplified power modeling is still based on
distinct CPU and cluster C-state and P-state power data, and just
selects the CPU P-state power data for the scheduler. I wonder if we
can simplify the power measurement, so that the power data can be
generated in a single test run and without any post-processing.

This might need more detailed experiments to support the idea; I just
want to know what you guys think about this?

This is a side topic for this patch series, so whatever the conclusion,
I think it will not impact the implementation or upstreaming of this
patch series.

[1] http://connect.linaro.org/resource/hkg18/hkg18-501/

[...]

> 2.1.1 Hikey960
> 
> Energy is measured with an ACME Cape on an instrumented board. Numbers
> include consumption of big and little CPUs, LPDDR memory, GPU and most
> of the other small components on the board. They do not include
> consumption of the radio chip (turned-off anyway) and external
> connectors.

So the measurement point on Hikey960 is for the SoC but not for the
whole board, right?

> +----------+-----------------+-------------------------+
> |          | Without patches | With patches            |
> +----------+--------+--------+------------------+------+ 
> | Tasks nb |  Mean  | RSD*   | Mean             | RSD* |
> +----------+--------+--------+------------------+------+
> |       10 |  41.14 |  1.4%  |  36.51 (-11.25%) | 1.6% |
> |       20 |  55.95 |  0.8%  |  50.14 (-10.38%) | 1.9% |
> |       30 |  74.37 |  0.2%  |  72.89 ( -1.99%) | 5.3% |
> |       40 |  94.12 |  0.7%  |  87.78 ( -6.74%) | 4.5% |
> |       50 | 117.88 |  0.2%  | 111.66 ( -5.28%) | 0.9% |
> +----------+--------+--------+------------------+------+


> 
> 2.1.2 Juno r0
> 
> Energy is measured with the onboard energy meter. Numbers include
> consumption of big and little CPUs.
> 
> +----------+-----------------+-------------------------+
> |          | Without patches | With patches            |
> +----------+--------+--------+------------------+------+
> | Tasks nb | Mean   | RSD*   | Mean             | RSD* |
> +----------+--------+--------+------------------+------+
> |       10 |  11.25 |  3.1%  |   7.07 (-37.16%) | 2.1% |
> |       20 |  19.18 |  1.1%  |  12.75 (-33.52%) | 2.2% |
> |       30 |  28.81 |  1.9%  |  21.29 (-26.10%) | 1.5% |
> |       40 |  36.83 |  1.2%  |  30.72 (-16.59%) | 0.6% |
> |       50 |  46.41 |  0.6%  |  46.02 ( -0.01%) | 0.5% |
> +----------+--------+--------+------------------+------+
> 
> 2.2 Performance test case
> 
> 30 iterations of perf bench sched messaging --pipe --thread --group G
> --loop L with G=[1 2 4 8] and L=50000 (Hikey960)/16000 (Juno r0).

What's the reason for selecting different loop numbers for Hikey960 and
Juno? Is it based on the testing time?

[...]

Thanks,
Leo Yan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator
  2018-04-06 15:36 ` [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator Dietmar Eggemann
  2018-04-13 23:56   ` Joel Fernandes
@ 2018-04-17 14:25   ` Leo Yan
  2018-04-17 17:39     ` Dietmar Eggemann
  1 sibling, 1 reply; 44+ messages in thread
From: Leo Yan @ 2018-04-17 14:25 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Fri, Apr 06, 2018 at 04:36:04PM +0100, Dietmar Eggemann wrote:
> From: Thara Gopinath <thara.gopinath@linaro.org>
> 
> Energy-aware scheduling should only operate when the system is not
> overutilized. There must be cpu time available to place tasks based on
> utilization in an energy-aware fashion, i.e. to pack tasks on
> energy-efficient cpus without harming the overall throughput.
> 
> In case the system operates above this tipping point the tasks have to
> be placed based on task and cpu load in the classical way of spreading
> tasks across as many cpus as possible.
> 
> The point in which a system switches from being not overutilized to
> being overutilized is called the tipping point.
> 
> Such a tipping point indicator on a sched domain as the system
> boundary is introduced here. As soon as one cpu of a sched domain is
> overutilized the whole sched domain is declared overutilized as well.
> A cpu becomes overutilized when its utilization is higher that 80%
> (capacity_margin) of its capacity.
> 
> The implementation takes advantage of the shared sched domain which is
> shared across all per-cpu views of a sched domain level. The new
> overutilized flag is placed in this shared sched domain.
> 
> Load balancing is skipped in case the energy model is present and the
> sched domain is not overutilized because under this condition the
> predominantly load-per-capacity driven load-balancer should not
> interfere with the energy-aware wakeup placement based on utilization.
> 
> In case the total utilization of a sched domain is greater than the
> total sched domain capacity the overutilized flag is set at the parent
> sched domain level to let other sched groups help getting rid of the
> overutilization of cpus.
> 
> Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  include/linux/sched/topology.h |  1 +
>  kernel/sched/fair.c            | 62 ++++++++++++++++++++++++++++++++++++++++--
>  kernel/sched/sched.h           |  1 +
>  kernel/sched/topology.c        | 12 +++-----
>  4 files changed, 65 insertions(+), 11 deletions(-)
> 
> diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> index 26347741ba50..dd001c232646 100644
> --- a/include/linux/sched/topology.h
> +++ b/include/linux/sched/topology.h
> @@ -72,6 +72,7 @@ struct sched_domain_shared {
>  	atomic_t	ref;
>  	atomic_t	nr_busy_cpus;
>  	int		has_idle_cores;
> +	int		overutilized;
>  };
>  
>  struct sched_domain {
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 0a76ad2ef022..6960e5ef3c14 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
>  }
>  #endif
>  
> +#ifdef CONFIG_SMP
> +static inline int cpu_overutilized(int cpu);
> +
> +static inline int sd_overutilized(struct sched_domain *sd)
> +{
> +	return READ_ONCE(sd->shared->overutilized);
> +}
> +
> +static inline void update_overutilized_status(struct rq *rq)
> +{
> +	struct sched_domain *sd;
> +
> +	rcu_read_lock();
> +	sd = rcu_dereference(rq->sd);
> +	if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
> +		WRITE_ONCE(sd->shared->overutilized, 1);
> +	rcu_read_unlock();
> +}
> +#else
> +static inline void update_overutilized_status(struct rq *rq) {}
> +#endif /* CONFIG_SMP */
> +
>  /*
>   * The enqueue_task method is called before nr_running is
>   * increased. Here we update the fair scheduling stats and
> @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		update_cfs_group(se);
>  	}
>  
> -	if (!se)
> +	if (!se) {
>  		add_nr_running(rq, 1);
> +		update_overutilized_status(rq);
> +	}

Maybe this isn't a good question, but why is the overutilized flag
only updated in the enqueue path and not in the dequeue path?

>  	util_est_enqueue(&rq->cfs, p);
>  	hrtick_update(rq);
> @@ -6579,6 +6603,11 @@ static inline int util_fits_capacity(unsigned long util, unsigned long capacity)
>  	return capacity * 1024 > util * capacity_margin;
>  }
>  
> +static inline int cpu_overutilized(int cpu)
> +{
> +	return !util_fits_capacity(cpu_util(cpu), capacity_of(cpu));
> +}
> +
>  /*
>   * Disable WAKE_AFFINE in the case where task @p doesn't fit in the
>   * capacity of either the waking CPU @cpu or the previous CPU @prev_cpu.
> @@ -7817,6 +7846,7 @@ struct sd_lb_stats {
>  	unsigned long total_running;
>  	unsigned long total_load;	/* Total load of all groups in sd */
>  	unsigned long total_capacity;	/* Total capacity of all groups in sd */
> +	unsigned long total_util;	/* Total util of all groups in sd */
>  	unsigned long avg_load;	/* Average load across all groups in sd */
>  
>  	struct sg_lb_stats busiest_stat;/* Statistics of the busiest group */
> @@ -7837,6 +7867,7 @@ static inline void init_sd_lb_stats(struct sd_lb_stats *sds)
>  		.total_running = 0UL,
>  		.total_load = 0UL,
>  		.total_capacity = 0UL,
> +		.total_util = 0UL,
>  		.busiest_stat = {
>  			.avg_load = 0UL,
>  			.sum_nr_running = 0,
> @@ -8133,11 +8164,12 @@ static bool update_nohz_stats(struct rq *rq, bool force)
>   * @local_group: Does group contain this_cpu.
>   * @sgs: variable to hold the statistics for this group.
>   * @overload: Indicate more than one runnable task for any CPU.
> + * @overutilized: Indicate overutilization for any CPU.
>   */
>  static inline void update_sg_lb_stats(struct lb_env *env,
>  			struct sched_group *group, int load_idx,
>  			int local_group, struct sg_lb_stats *sgs,
> -			bool *overload)
> +			bool *overload, int *overutilized)
>  {
>  	unsigned long load;
>  	int i, nr_running;
> @@ -8174,6 +8206,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
>  		 */
>  		if (!nr_running && idle_cpu(i))
>  			sgs->idle_cpus++;
> +
> +		if (cpu_overutilized(i))
> +			*overutilized = 1;
>  	}
>  
>  	/* Adjust by relative CPU capacity of the group */
> @@ -8301,6 +8336,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  	struct sg_lb_stats tmp_sgs;
>  	int load_idx, prefer_sibling = 0;
>  	bool overload = false;
> +	int overutilized = 0;
>  
>  	if (child && child->flags & SD_PREFER_SIBLING)
>  		prefer_sibling = 1;
> @@ -8327,7 +8363,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		}
>  
>  		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
> -						&overload);
> +						&overload, &overutilized);
>  
>  		if (local_group)
>  			goto next_group;
> @@ -8359,6 +8395,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		sds->total_running += sgs->sum_nr_running;
>  		sds->total_load += sgs->group_load;
>  		sds->total_capacity += sgs->group_capacity;
> +		sds->total_util += sgs->group_util;
>  
>  		sg = sg->next;
>  	} while (sg != env->sd->groups);
> @@ -8380,6 +8417,17 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
>  		if (env->dst_rq->rd->overload != overload)
>  			env->dst_rq->rd->overload = overload;
>  	}
> +
> +	if (sd_overutilized(env->sd) != overutilized)
> +		WRITE_ONCE(env->sd->shared->overutilized, overutilized);
> +
> +	/*
> +	 * If the domain util is greater that domain capacity, load balancing
> +	 * needs to be done at the next sched domain level as well.
> +	 */
> +	if (env->sd->parent &&
> +	    !util_fits_capacity(sds->total_util, sds->total_capacity))
> +		WRITE_ONCE(env->sd->parent->shared->overutilized, 1);
>  }
>  
>  /**
> @@ -9255,6 +9303,9 @@ static void rebalance_domains(struct rq *rq, enum cpu_idle_type idle)
>  		}
>  		max_cost += sd->max_newidle_lb_cost;
>  
> +		if (sched_energy_enabled() && !sd_overutilized(sd))
> +			continue;
> +
>  		if (!(sd->flags & SD_LOAD_BALANCE))
>  			continue;
>  
> @@ -9822,6 +9873,9 @@ static int idle_balance(struct rq *this_rq, struct rq_flags *rf)
>  			break;
>  		}
>  
> +		if (sched_energy_enabled() && !sd_overutilized(sd))
> +			continue;
> +
>  		if (sd->flags & SD_BALANCE_NEWIDLE) {
>  			t0 = sched_clock_cpu(this_cpu);
>  
> @@ -9955,6 +10009,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>  
>  	if (static_branch_unlikely(&sched_numa_balancing))
>  		task_tick_numa(rq, curr);
> +
> +	update_overutilized_status(rq);

Can the sched tick clear the overutilized flag if the CPU is under the
tipping point?

Thanks,
Leo Yan

>  }
>  
>  /*
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index c3deaee7a7a2..5d552c0d7109 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -11,6 +11,7 @@
>  #include <linux/sched/cputime.h>
>  #include <linux/sched/deadline.h>
>  #include <linux/sched/debug.h>
> +#include <linux/sched/energy.h>
>  #include <linux/sched/hotplug.h>
>  #include <linux/sched/idle.h>
>  #include <linux/sched/init.h>
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index 64cc564f5255..c8b7c7665ab2 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -1184,15 +1184,11 @@ sd_init(struct sched_domain_topology_level *tl,
>  		sd->idle_idx = 1;
>  	}
>  
> -	/*
> -	 * For all levels sharing cache; connect a sched_domain_shared
> -	 * instance.
> -	 */
> -	if (sd->flags & SD_SHARE_PKG_RESOURCES) {
> -		sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> -		atomic_inc(&sd->shared->ref);
> +	sd->shared = *per_cpu_ptr(sdd->sds, sd_id);
> +	atomic_inc(&sd->shared->ref);
> +
> +	if (sd->flags & SD_SHARE_PKG_RESOURCES)
>  		atomic_set(&sd->shared->nr_busy_cpus, sd_weight);
> -	}
>  
>  	sd->private = sdd;
>  
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-06 15:36 ` [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function Dietmar Eggemann
  2018-04-10 12:51   ` Peter Zijlstra
  2018-04-13  6:27   ` Viresh Kumar
@ 2018-04-17 15:22   ` Leo Yan
  2018-04-18  8:13     ` Quentin Perret
  2018-04-18  9:23   ` Leo Yan
  2018-04-18 12:15   ` Leo Yan
  4 siblings, 1 reply; 44+ messages in thread
From: Leo Yan @ 2018-04-17 15:22 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>
> 
> In preparation for the definition of an energy-aware wakeup path, a
> helper function is provided to estimate the consequence on system energy
> when a specific task wakes-up on a specific CPU. compute_energy()
> estimates the OPPs to be reached by all frequency domains and estimates
> the consumption of each online CPU according to its energy model and its
> percentage of busy time.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  include/linux/sched/energy.h | 20 +++++++++++++
>  kernel/sched/fair.c          | 68 ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h         |  2 +-
>  3 files changed, 89 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
> index 941071eec013..b4110b145228 100644
> --- a/include/linux/sched/energy.h
> +++ b/include/linux/sched/energy.h
> @@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void)
>  	return static_branch_unlikely(&sched_energy_present);
>  }
>  
> +static inline
> +struct capacity_state *find_cap_state(int cpu, unsigned long util)
> +{
> +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> +	struct capacity_state *cs = NULL;
> +	int i;
> +
> +	util += util >> 2;
> +
> +	for (i = 0; i < em->nr_cap_states; i++) {
> +		cs = &em->cap_states[i];
> +		if (cs->cap >= util)
> +			break;
> +	}
> +
> +	return cs;

'cs' can be returned as NULL here (e.g. if nr_cap_states is 0).

> +}
> +
>  static inline struct cpumask *freq_domain_span(struct freq_domain *fd)
>  {
>  	return &fd->span;
> @@ -42,6 +60,8 @@ struct freq_domain;
>  static inline bool sched_energy_enabled(void) { return false; }
>  static inline struct cpumask
>  *freq_domain_span(struct freq_domain *fd) { return NULL; }
> +static inline struct capacity_state
> +*find_cap_state(int cpu, unsigned long util) { return NULL; }
>  static inline void init_sched_energy(void) { }
>  #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL)
>  #endif
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6960e5ef3c14..8cb9fb04fff2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>  }
>  
>  /*
> + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> + */
> +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> +{
> +	unsigned long util, util_est;
> +	struct cfs_rq *cfs_rq;
> +
> +	/* Task is where it should be, or has no impact on cpu */
> +	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> +		return cpu_util(cpu);
> +
> +	cfs_rq = &cpu_rq(cpu)->cfs;
> +	util = READ_ONCE(cfs_rq->avg.util_avg);
> +
> +	if (dst_cpu == cpu)
> +		util += task_util(p);
> +	else
> +		util = max_t(long, util - task_util(p), 0);

I tried to understand the logic here; the code below is clearer to me:

        int prev_cpu = task_cpu(p);

        cfs_rq = &cpu_rq(cpu)->cfs;
        util = READ_ONCE(cfs_rq->avg.util_avg);

        /* Bail out if src and dst CPUs are the same one */
        if (prev_cpu == cpu && dst_cpu == cpu)
                return util;

        /* Remove task utilization for src CPU */
        if (cpu == prev_cpu)
                util = max_t(long, util - task_util(p), 0);

        /* Add task utilization for dst CPU */
        if (dst_cpu == cpu)
                util += task_util(p);

BTW, the CPU utilization is a decayed value while task_util() is not,
so 'util - task_util(p)' calculates a smaller value than the pure
utilization of the prev CPU, right?

Another question: can we reuse the function cpu_util_wake() and just
compensate the task util for the dst CPU?

> +	if (sched_feat(UTIL_EST)) {
> +		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> +		if (dst_cpu == cpu)
> +			util_est += _task_util_est(p);
> +		else
> +			util_est = max_t(long, util_est - _task_util_est(p), 0);
> +		util = max(util, util_est);
> +	}
> +
> +	return min_t(unsigned long, util, capacity_orig_of(cpu));
> +}
> +
> +/*
> + * Estimates the system level energy assuming that p wakes-up on dst_cpu.
> + *
> + * compute_energy() is safe to call only if an energy model is available for
> + * the platform, which is when sched_energy_enabled() is true.
> + */
> +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> +{
> +	unsigned long util, max_util, sum_util;
> +	struct capacity_state *cs;
> +	unsigned long energy = 0;
> +	struct freq_domain *fd;
> +	int cpu;
> +
> +	for_each_freq_domain(fd) {
> +		max_util = sum_util = 0;
> +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
> +			util = cpu_util_next(cpu, p, dst_cpu);
> +			util += cpu_util_dl(cpu_rq(cpu));
> +			max_util = max(util, max_util);
> +			sum_util += util;
> +		}
> +
> +		/*
> +		 * Here we assume that the capacity states of CPUs belonging to
> +		 * the same frequency domains are shared. Hence, we look at the
> +		 * capacity state of the first CPU and re-use it for all.
> +		 */
> +		cpu = cpumask_first(freq_domain_span(fd));
> +		cs = find_cap_state(cpu, max_util);
> +		energy += cs->power * sum_util / cs->cap;
> +	}

This means all online CPUs will be iterated over for the calculation,
so the complexity is O(n)...

Thanks,
Leo Yan

> +	return energy;
> +}
> +
> +/*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>   * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5d552c0d7109..6eb38f41d5d9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>  # define arch_scale_freq_invariant()	false
>  #endif
>  
> -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> +#ifdef CONFIG_SMP
>  static inline unsigned long cpu_util_dl(struct rq *rq)
>  {
>  	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 5/6] sched/fair: Select an energy-efficient CPU on task wake-up
  2018-04-06 15:36 ` [RFC PATCH v2 5/6] sched/fair: Select an energy-efficient CPU on task wake-up Dietmar Eggemann
  2018-04-09 16:30   ` Peter Zijlstra
  2018-04-10 17:29   ` Peter Zijlstra
@ 2018-04-17 15:39   ` Leo Yan
  2018-04-18  7:57     ` Quentin Perret
  2 siblings, 1 reply; 44+ messages in thread
From: Leo Yan @ 2018-04-17 15:39 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Fri, Apr 06, 2018 at 04:36:06PM +0100, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>
> 
> In case an energy model is available, waking tasks are re-routed into a
> new energy-aware placement algorithm. The eligible CPUs to be used in the
> energy-aware wakeup path are restricted to the highest non-overutilized
> sched_domain containing prev_cpu and this_cpu. If no such domain is found,
> the tasks go through the usual wake-up path, hence energy-aware placement
> happens only in lightly utilized scenarios.
> 
> The selection of the most energy-efficient CPU for a task is achieved by
> estimating the impact on system-level active energy resulting from the
> placement of the task on the CPU with the highest spare capacity in each
> frequency domain. The best CPU energy-wise is then selected if it saves
> a large enough amount of energy with respect to prev_cpu.
> 
> Although it has already shown significant benefits on some existing
> targets, this approach cannot scale to platforms with numerous CPUs.
> This patch is an attempt to do something useful as writing a fast
> heuristic that performs reasonably well on a broad spectrum of
> architectures isn't an easy task. As a consequence, the scope of
> usability of the energy-aware wake-up path is restricted to systems
> with the SD_ASYM_CPUCAPACITY flag set. These systems not only show the
> most promising opportunities for saving energy but also typically
> feature a limited number of logical CPUs.
> 
> Moreover, the energy-aware wake-up path is accessible only if
> sched_energy_enabled() is true. For systems which don't meet all
> dependencies for EAS (CONFIG_PM_OPP for ex.) at compile time,
> sched_enegy_enabled() defaults to a constant "false" value, hence letting
> the compiler remove the unused EAS code entirely.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  kernel/sched/fair.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++---
>  1 file changed, 93 insertions(+), 4 deletions(-)
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 8cb9fb04fff2..5ebb2d0306c7 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6700,6 +6700,81 @@ static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
>  	return energy;
>  }
>  
> +static int find_energy_efficient_cpu(struct sched_domain *sd,
> +					struct task_struct *p, int prev_cpu)
> +{
> +	unsigned long cur_energy, prev_energy, best_energy, cpu_cap;
> +	unsigned long task_util = task_util_est(p);
> +	int cpu, best_energy_cpu = prev_cpu;
> +	struct freq_domain *fd;
> +
> +	if (!task_util)
> +		return prev_cpu;
> +
> +	if (cpumask_test_cpu(prev_cpu, &p->cpus_allowed))
> +		prev_energy = best_energy = compute_energy(p, prev_cpu);
> +	else
> +		prev_energy = best_energy = ULONG_MAX;
> +
> +	for_each_freq_domain(fd) {
> +		unsigned long spare_cap, max_spare_cap = 0;
> +		int max_spare_cap_cpu = -1;
> +		unsigned long util;
> +
> +		/* Find the CPU with the max spare cap in the freq. dom. */
> +		for_each_cpu_and(cpu, freq_domain_span(fd), sched_domain_span(sd)) {
> +			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> +				continue;
> +
> +			if (cpu == prev_cpu)
> +				continue;
> +
> +			util = cpu_util_wake(cpu, p);
> +			cpu_cap = capacity_of(cpu);
> +			if (!util_fits_capacity(util + task_util, cpu_cap))
> +				continue;
> +
> +			spare_cap = cpu_cap - util;
> +			if (spare_cap > max_spare_cap) {
> +				max_spare_cap = spare_cap;
> +				max_spare_cap_cpu = cpu;
> +			}
> +		}

If we have two clusters, and we iterate over the big cluster first,
then max_spare_cap is a big value for the big cluster and the LITTLE
cluster later has no chance to get a higher spare_cap value. In this
case, will the LITTLE CPU be skipped for the energy computation?

> +
> +		/* Evaluate the energy impact of using this CPU. */
> +		if (max_spare_cap_cpu >= 0) {
> +			cur_energy = compute_energy(p, max_spare_cap_cpu);
> +			if (cur_energy < best_energy) {
> +				best_energy = cur_energy;
> +				best_energy_cpu = max_spare_cap_cpu;
> +			}
> +		}
> +	}
> +
> +	/*
> +	 * We pick the best CPU only if it saves at least 1.5% of the
> +	 * energy used by prev_cpu.
> +	 */
> +	if ((prev_energy - best_energy) > (prev_energy >> 6))
> +		return best_energy_cpu;
> +
> +	return prev_cpu;
> +}
> +
> +static inline bool wake_energy(struct task_struct *p, int prev_cpu)
> +{
> +	struct sched_domain *sd;
> +
> +	if (!sched_energy_enabled())
> +		return false;
> +
> +	sd = rcu_dereference_sched(cpu_rq(prev_cpu)->sd);
> +	if (!sd || sd_overutilized(sd))
> +		return false;
> +
> +	return true;
> +}
> +
>  /*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> @@ -6716,18 +6791,22 @@ static int
>  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
>  {
>  	struct sched_domain *tmp, *affine_sd = NULL, *sd = NULL;
> +	struct sched_domain *energy_sd = NULL;
>  	int cpu = smp_processor_id();
>  	int new_cpu = prev_cpu;
> -	int want_affine = 0;
> +	int want_affine = 0, want_energy = 0;
>  	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>  
> +	rcu_read_lock();
> +
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		record_wakee(p);
> +		want_energy = wake_energy(p, prev_cpu);
>  		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> -			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
> +			      && cpumask_test_cpu(cpu, &p->cpus_allowed)
> +			      && !want_energy;
>  	}
>  
> -	rcu_read_lock();
>  	for_each_domain(cpu, tmp) {
>  		if (!(tmp->flags & SD_LOAD_BALANCE))
>  			break;
> @@ -6742,6 +6821,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  			break;
>  		}
>  
> +		/*
> +		 * Energy-aware task placement is performed on the highest
> +		 * non-overutilized domain spanning over cpu and prev_cpu.
> +		 */
> +		if (want_energy && !sd_overutilized(tmp) &&
> +		    cpumask_test_cpu(prev_cpu, sched_domain_span(tmp)))
> +			energy_sd = tmp;
> +
>  		if (tmp->flags & sd_flag)
>  			sd = tmp;
>  		else if (!want_affine)
> @@ -6765,7 +6852,9 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
>  		sync_entity_load_avg(&p->se);
>  	}
>  
> -	if (!sd) {
> +	if (energy_sd) {
> +		new_cpu = find_energy_efficient_cpu(energy_sd, p, prev_cpu);
> +	} else if (!sd) {
>  pick_cpu:
>  		if (sd_flag & SD_BALANCE_WAKE) { /* XXX always ? */
>  			new_cpu = select_idle_sibling(p, prev_cpu, new_cpu);
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 0/6] Energy Aware Scheduling
  2018-04-17 12:50 ` [RFC PATCH v2 0/6] Energy Aware Scheduling Leo Yan
@ 2018-04-17 17:22   ` Dietmar Eggemann
  0 siblings, 0 replies; 44+ messages in thread
From: Dietmar Eggemann @ 2018-04-17 17:22 UTC (permalink / raw)
  To: Leo Yan
  Cc: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

Hi Leo,

On 04/17/2018 02:50 PM, Leo Yan wrote:
> Hi Dietmar,
> 
> On Fri, Apr 06, 2018 at 04:36:01PM +0100, Dietmar Eggemann wrote:

[...]

>> 1.1 Energy Model
>>
>> A CPU with asymmetric core capacities features cores with significantly
>> different energy and performance characteristics. As the configurations
>> can vary greatly from one SoC to another, designing an energy-efficient
>> scheduling heuristic that performs well on a broad spectrum of platforms
>> appears to be particularly hard.
>> This proposal attempts to solve this issue by providing the scheduler
>> with an energy model of the platform which enables energy impact
>> estimation of scheduling decisions in a generic way. The energy model is
>> kept very simple as it represents only the active power of CPUs at all
>> available P-states and relies on existing data in the kernel (only used
>> by the thermal subsystem so far).
>> This proposal does not include the power consumption of C-states and
>> cluster-level resources which were originally introduced in [1] since
>> firstly, their impact on task placement decisions appears to be
>> neglectable on modern asymmetric platforms and secondly, they require
>> additional infrastructure and data (e.g new DT entries).
> 
> Seems to me, if we move forward a bit with the energy model, we can use
> a simpler method to generate the power consumption:
> 
>   Power(@Freq) = Power(cpu_util=100%@Freq) - Power(cpu_util=0%@Freq)
> 
> From the above formula, the power data includes CPU and cluster level
> power (and includes both dynamic power and static leakage), but it is
> quite straightforward to measure.
> 
> I read a bit of Quentin's slides on the simplified power modeling
> experiments [1]. IIUC the simplified power modeling is still based on
> distinct CPU and cluster C-state and P-state power data, and just
> selects the CPU P-state power data for the scheduler. I wonder if we
> can simplify the power measurement, so that the power data can be
> generated in a single test run and without any post-processing.
> 
> This might need more detailed experiments to support the idea; I just
> want to know what you guys think about this?
> 
> This is a side topic for this patch series, so whatever the conclusion,
> I think it will not impact the implementation or upstreaming of this
> patch series.
> 
> [1] http://connect.linaro.org/resource/hkg18/hkg18-501/

The simplified Energy Model in this patch set only contains the per-cpu 
p-state power data. This allows us to rely only on the knowledge of 
which OPPs (OPP frequency/max frequency) we have for the individual 
frequency domains and on the CPU DT property 'dynamic-power-coefficient'. 
This is even encapsulated in the new PM_OPP library function 
dev_pm_opp_get_power().
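To give a rough idea, the per-OPP power number boils down to something 
like the sketch below (assuming the mW = C * f[MHz] * V[mV]^2 convention 
used by the thermal/IPA code; this is only an illustration, not the 
actual dev_pm_opp_get_power() implementation):

	/* Estimate the active power of one OPP from the DT coefficient. */
	static u64 est_opp_power(u32 dyn_coeff, u32 freq_mhz, u32 volt_mv)
	{
		u64 power = (u64)dyn_coeff * freq_mhz * volt_mv * volt_mv;

		/* Scale the intermediate product back down to mW. */
		do_div(power, 1000000000);

		return power;
	}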

Please note that this has to be redesigned since neither Rafael nor 
Peter likes the idea of using the PM_OPP library here. But we will 
continue to use only per-cpu p-state power data.

[...]

>> 30 iterations of perf bench sched messaging --pipe --thread --group G
>> --loop L with G=[1 2 4 8] and L=50000 (Hikey960)/16000 (Juno r0).
> 
> What's the reason for selecting different loop numbers for Hikey960
> and Juno? Is it based on the testing time?

The Juno r0 board has only ~0.3x the performance of the Hikey960. We 
wanted to have roughly comparable test execution times. We're only 
interested in the per-platform difference between running w/ and w/o 
this code.

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator
  2018-04-17 14:25   ` Leo Yan
@ 2018-04-17 17:39     ` Dietmar Eggemann
  2018-04-18  0:18       ` Leo Yan
  0 siblings, 1 reply; 44+ messages in thread
From: Dietmar Eggemann @ 2018-04-17 17:39 UTC (permalink / raw)
  To: Leo Yan
  Cc: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On 04/17/2018 04:25 PM, Leo Yan wrote:

>> @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>   		update_cfs_group(se);
>>   	}
>>   
>> -	if (!se)
>> +	if (!se) {
>>   		add_nr_running(rq, 1);
>> +		update_overutilized_status(rq);
>> +	}
> 
> Maybe this isn't a good question, but why is the overutilized flag
> only updated in the enqueue path and not in the dequeue path?

[...]

>> @@ -9955,6 +10009,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
>>   
>>   	if (static_branch_unlikely(&sched_numa_balancing))
>>   		task_tick_numa(rq, curr);
>> +
>> +	update_overutilized_status(rq);
> 
> Can the sched tick clear the overutilized flag if we are under the
> tipping point?

No, only the load balancer for this particular sched domain level is 
able to clear the flag. We want to use the existing iteration over all 
cpus of the sched domain span to reset the flag.
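Roughly along these lines (only a sketch of the idea, not the exact 
code from the patch):

	/*
	 * Recompute the flag from scratch while the load balancer walks
	 * the CPUs spanned by the sched domain, so it can also be cleared
	 * again once no CPU in the span is overutilized anymore.
	 */
	static void update_sd_overutilized(struct sched_domain *sd)
	{
		int cpu, overutilized = 0;

		for_each_cpu(cpu, sched_domain_span(sd)) {
			if (cpu_overutilized(cpu)) {
				overutilized = 1;
				break;
			}
		}

		WRITE_ONCE(sd->shared->overutilized, overutilized);
	}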

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator
  2018-04-17 17:39     ` Dietmar Eggemann
@ 2018-04-18  0:18       ` Leo Yan
  0 siblings, 0 replies; 44+ messages in thread
From: Leo Yan @ 2018-04-18  0:18 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Tue, Apr 17, 2018 at 07:39:21PM +0200, Dietmar Eggemann wrote:
> On 04/17/2018 04:25 PM, Leo Yan wrote:
> 
> >>@@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>  		update_cfs_group(se);
> >>  	}
> >>-	if (!se)
> >>+	if (!se) {
> >>  		add_nr_running(rq, 1);
> >>+		update_overutilized_status(rq);
> >>+	}
> >
> >Maybe this isn't a good question, why only update overutilized flag
> >for enqueue flow but not for dequeue flow?
> 
> [...]
> 
> >>@@ -9955,6 +10009,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
> >>  	if (static_branch_unlikely(&sched_numa_balancing))
> >>  		task_tick_numa(rq, curr);
> >>+
> >>+	update_overutilized_status(rq);
> >
> >Can sched tick clear overutilized flag if under tipping point?
> 
> No, only the load balancer for this particular sched domain level is able to
> clear the flag. We want to use the existing iteration over all cpus of the
> sched domain span to reset the flag.

Yes, and sorry for introducing noise.  The overutilized flag is shared
by all CPUs in the sched domain, so one CPU can set the overutilized
flag for itself, but it cannot clear the flag, because the flag might
have been set by other CPUs in the same sched domain.
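For example, in a 4-CPU sched domain, CPU0 might set the shared flag
because it is overutilized; even if CPU1's own utilization later drops
below the tipping point, CPU1 must not clear the flag on its own,
because CPU0 (or another CPU in the span) may still be overutilized.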

Thanks,
Leo Yan

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 5/6] sched/fair: Select an energy-efficient CPU on task wake-up
  2018-04-17 15:39   ` Leo Yan
@ 2018-04-18  7:57     ` Quentin Perret
  0 siblings, 0 replies; 44+ messages in thread
From: Quentin Perret @ 2018-04-18  7:57 UTC (permalink / raw)
  To: Leo Yan
  Cc: Dietmar Eggemann, linux-kernel, Peter Zijlstra, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

Hi Leo,

On Tuesday 17 Apr 2018 at 23:39:44 (+0800), Leo Yan wrote:
> > +	for_each_freq_domain(fd) {
> > +		unsigned long spare_cap, max_spare_cap = 0;
> > +		int max_spare_cap_cpu = -1;
> > +		unsigned long util;
> > +
> > +		/* Find the CPU with the max spare cap in the freq. dom. */
> > +		for_each_cpu_and(cpu, freq_domain_span(fd), sched_domain_span(sd)) {
> > +			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
> > +				continue;
> > +
> > +			if (cpu == prev_cpu)
> > +				continue;
> > +
> > +			util = cpu_util_wake(cpu, p);
> > +			cpu_cap = capacity_of(cpu);
> > +			if (!util_fits_capacity(util + task_util, cpu_cap))
> > +				continue;
> > +
> > +			spare_cap = cpu_cap - util;
> > +			if (spare_cap > max_spare_cap) {
> > +				max_spare_cap = spare_cap;
> > +				max_spare_cap_cpu = cpu;
> > +			}
> > +		}
> 
> If we have two clusters and the big cluster is iterated first, then
> max_spare_cap is a big value for the big cluster, and the LITTLE
> cluster later has no chance to reach a higher spare_cap value.  In
> that case, will the LITTLE CPUs be skipped for the energy computation?

max_spare_cap is reset to 0 at the top of the for_each_freq_domain()
loop above so that shouldn't happen.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-17 15:22   ` Leo Yan
@ 2018-04-18  8:13     ` Quentin Perret
  2018-04-18  9:19       ` Leo Yan
  0 siblings, 1 reply; 44+ messages in thread
From: Quentin Perret @ 2018-04-18  8:13 UTC (permalink / raw)
  To: Leo Yan
  Cc: Dietmar Eggemann, linux-kernel, Peter Zijlstra, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Tuesday 17 Apr 2018 at 23:22:13 (+0800), Leo Yan wrote:
> On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> > From: Quentin Perret <quentin.perret@arm.com>
> > 
> > In preparation for the definition of an energy-aware wakeup path, a
> > helper function is provided to estimate the consequence on system energy
> > when a specific task wakes-up on a specific CPU. compute_energy()
> > estimates the OPPs to be reached by all frequency domains and estimates
> > the consumption of each online CPU according to its energy model and its
> > percentage of busy time.
> > 
> > Cc: Ingo Molnar <mingo@redhat.com>
> > Cc: Peter Zijlstra <peterz@infradead.org>
> > Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > ---
> >  include/linux/sched/energy.h | 20 +++++++++++++
> >  kernel/sched/fair.c          | 68 ++++++++++++++++++++++++++++++++++++++++++++
> >  kernel/sched/sched.h         |  2 +-
> >  3 files changed, 89 insertions(+), 1 deletion(-)
> > 
> > diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
> > index 941071eec013..b4110b145228 100644
> > --- a/include/linux/sched/energy.h
> > +++ b/include/linux/sched/energy.h
> > @@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void)
> >  	return static_branch_unlikely(&sched_energy_present);
> >  }
> >  
> > +static inline
> > +struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > +{
> > +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > +	struct capacity_state *cs = NULL;
> > +	int i;
> > +
> > +	util += util >> 2;
> > +
> > +	for (i = 0; i < em->nr_cap_states; i++) {
> > +		cs = &em->cap_states[i];
> > +		if (cs->cap >= util)
> > +			break;
> > +	}
> > +
> > +	return cs;
> 
> It is possible for 'cs' to be returned as NULL.

Only if em->nr_cap_states == 0, and that shouldn't be possible if
sched_energy_present == true, so this code should be safe :-)

> 
> > +}
> > +
> >  static inline struct cpumask *freq_domain_span(struct freq_domain *fd)
> >  {
> >  	return &fd->span;
> > @@ -42,6 +60,8 @@ struct freq_domain;
> >  static inline bool sched_energy_enabled(void) { return false; }
> >  static inline struct cpumask
> >  *freq_domain_span(struct freq_domain *fd) { return NULL; }
> > +static inline struct capacity_state
> > +*find_cap_state(int cpu, unsigned long util) { return NULL; }
> >  static inline void init_sched_energy(void) { }
> >  #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL)
> >  #endif
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 6960e5ef3c14..8cb9fb04fff2 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> >  }
> >  
> >  /*
> > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> > + */
> > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > +{
> > +	unsigned long util, util_est;
> > +	struct cfs_rq *cfs_rq;
> > +
> > +	/* Task is where it should be, or has no impact on cpu */
> > +	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> > +		return cpu_util(cpu);
> > +
> > +	cfs_rq = &cpu_rq(cpu)->cfs;
> > +	util = READ_ONCE(cfs_rq->avg.util_avg);
> > +
> > +	if (dst_cpu == cpu)
> > +		util += task_util(p);
> > +	else
> > +		util = max_t(long, util - task_util(p), 0);
> 
> I tried to understand the logic here; the code below is clearer to
> me:
> 
>         int prev_cpu = task_cpu(p);
> 
>         cfs_rq = &cpu_rq(cpu)->cfs;
>         util = READ_ONCE(cfs_rq->avg.util_avg);
> 
>         /* Bail out if src and dst CPUs are the same one */
>         if (prev_cpu == cpu && dst_cpu == cpu)
>                 return util;
> 
>         /* Remove task utilization for src CPU */
>         if (cpu == prev_cpu)
>                 util = max_t(long, util - task_util(p), 0);
> 
>         /* Add task utilization for dst CPU */
>         if (dst_cpu == cpu)
>                 util += task_util(p);
> 
> BTW, CPU utilization is a decayed value and task_util() is not a
> decayed value, so 'util - task_util(p)' calculates a smaller value
> than the prev CPU's pure utilization, right?

task_util() is the raw PELT signal, without UTIL_EST, so I think it's
fine to do `util - task_util()`.

> 
> Another question: can we reuse the function cpu_util_wake() and just
> compensate for the task util on the dst cpu?

Well, it's not that simple. cpu_util_wake() will give you the max between
the util_avg and the util_est value, so which task_util() should you add
to it? The util_avg or the util_est value?

Here we are trying to predict what the cpu_util signal will be in the
future, so the only always-correct implementation of this function has
to predict the CPU util_avg and util_est signals in parallel and take
the max of the two.

> 
> > +	if (sched_feat(UTIL_EST)) {
> > +		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > +		if (dst_cpu == cpu)
> > +			util_est += _task_util_est(p);
> > +		else
> > +			util_est = max_t(long, util_est - _task_util_est(p), 0);
> > +		util = max(util, util_est);
> > +	}
> > +
> > +	return min_t(unsigned long, util, capacity_orig_of(cpu));
> > +}
> > +
> > +/*
> > + * Estimates the system level energy assuming that p wakes-up on dst_cpu.
> > + *
> > + * compute_energy() is safe to call only if an energy model is available for
> > + * the platform, which is when sched_energy_enabled() is true.
> > + */
> > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> > +{
> > +	unsigned long util, max_util, sum_util;
> > +	struct capacity_state *cs;
> > +	unsigned long energy = 0;
> > +	struct freq_domain *fd;
> > +	int cpu;
> > +
> > +	for_each_freq_domain(fd) {
> > +		max_util = sum_util = 0;
> > +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
> > +			util = cpu_util_next(cpu, p, dst_cpu);
> > +			util += cpu_util_dl(cpu_rq(cpu));
> > +			max_util = max(util, max_util);
> > +			sum_util += util;
> > +		}
> > +
> > +		/*
> > +		 * Here we assume that the capacity states of CPUs belonging to
> > +		 * the same frequency domains are shared. Hence, we look at the
> > +		 * capacity state of the first CPU and re-use it for all.
> > +		 */
> > +		cpu = cpumask_first(freq_domain_span(fd));
> > +		cs = find_cap_state(cpu, max_util);
> > +		energy += cs->power * sum_util / cs->cap;
> > +	}
> 
> This means all CPUs will be iterated for calculation, the complexity is
> O(n)...
> 
> > +	return energy;
> > +}
> > +
> > +/*
> >   * select_task_rq_fair: Select target runqueue for the waking task in domains
> >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> >   * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
> > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > index 5d552c0d7109..6eb38f41d5d9 100644
> > --- a/kernel/sched/sched.h
> > +++ b/kernel/sched/sched.h
> > @@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> >  # define arch_scale_freq_invariant()	false
> >  #endif
> >  
> > -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > +#ifdef CONFIG_SMP
> >  static inline unsigned long cpu_util_dl(struct rq *rq)
> >  {
> >  	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > -- 
> > 2.11.0
> > 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-18  8:13     ` Quentin Perret
@ 2018-04-18  9:19       ` Leo Yan
  2018-04-18 11:06         ` Quentin Perret
  0 siblings, 1 reply; 44+ messages in thread
From: Leo Yan @ 2018-04-18  9:19 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Dietmar Eggemann, linux-kernel, Peter Zijlstra, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Wed, Apr 18, 2018 at 09:13:39AM +0100, Quentin Perret wrote:
> On Tuesday 17 Apr 2018 at 23:22:13 (+0800), Leo Yan wrote:
> > On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> > > From: Quentin Perret <quentin.perret@arm.com>
> > > 
> > > In preparation for the definition of an energy-aware wakeup path, a
> > > helper function is provided to estimate the consequence on system energy
> > > when a specific task wakes-up on a specific CPU. compute_energy()
> > > estimates the OPPs to be reached by all frequency domains and estimates
> > > the consumption of each online CPU according to its energy model and its
> > > percentage of busy time.
> > > 
> > > Cc: Ingo Molnar <mingo@redhat.com>
> > > Cc: Peter Zijlstra <peterz@infradead.org>
> > > Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> > > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > > ---
> > >  include/linux/sched/energy.h | 20 +++++++++++++
> > >  kernel/sched/fair.c          | 68 ++++++++++++++++++++++++++++++++++++++++++++
> > >  kernel/sched/sched.h         |  2 +-
> > >  3 files changed, 89 insertions(+), 1 deletion(-)
> > > 
> > > diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
> > > index 941071eec013..b4110b145228 100644
> > > --- a/include/linux/sched/energy.h
> > > +++ b/include/linux/sched/energy.h
> > > @@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void)
> > >  	return static_branch_unlikely(&sched_energy_present);
> > >  }
> > >  
> > > +static inline
> > > +struct capacity_state *find_cap_state(int cpu, unsigned long util)
> > > +{
> > > +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> > > +	struct capacity_state *cs = NULL;
> > > +	int i;
> > > +
> > > +	util += util >> 2;
> > > +
> > > +	for (i = 0; i < em->nr_cap_states; i++) {
> > > +		cs = &em->cap_states[i];
> > > +		if (cs->cap >= util)
> > > +			break;
> > > +	}
> > > +
> > > +	return cs;
> > 
> > 'cs' is possible to return NULL.
> 
> Only if em-nr_cap_states==0, and that shouldn't be possible if
> sched_energy_present==True, so this code should be safe :-)

You are right. Thanks for the explanation.

> > > +}
> > > +
> > >  static inline struct cpumask *freq_domain_span(struct freq_domain *fd)
> > >  {
> > >  	return &fd->span;
> > > @@ -42,6 +60,8 @@ struct freq_domain;
> > >  static inline bool sched_energy_enabled(void) { return false; }
> > >  static inline struct cpumask
> > >  *freq_domain_span(struct freq_domain *fd) { return NULL; }
> > > +static inline struct capacity_state
> > > +*find_cap_state(int cpu, unsigned long util) { return NULL; }
> > >  static inline void init_sched_energy(void) { }
> > >  #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL)
> > >  #endif
> > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > index 6960e5ef3c14..8cb9fb04fff2 100644
> > > --- a/kernel/sched/fair.c
> > > +++ b/kernel/sched/fair.c
> > > @@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
> > >  }
> > >  
> > >  /*
> > > + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> > > + */
> > > +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> > > +{
> > > +	unsigned long util, util_est;
> > > +	struct cfs_rq *cfs_rq;
> > > +
> > > +	/* Task is where it should be, or has no impact on cpu */
> > > +	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> > > +		return cpu_util(cpu);
> > > +
> > > +	cfs_rq = &cpu_rq(cpu)->cfs;
> > > +	util = READ_ONCE(cfs_rq->avg.util_avg);
> > > +
> > > +	if (dst_cpu == cpu)
> > > +		util += task_util(p);
> > > +	else
> > > +		util = max_t(long, util - task_util(p), 0);
> > 
> > I tried to understand the logic at here, below code is more clear for
> > myself:
> > 
> >         int prev_cpu = task_cpu(p);
> > 
> >         cfs_rq = &cpu_rq(cpu)->cfs;
> >         util = READ_ONCE(cfs_rq->avg.util_avg);
> > 
> >         /* Bail out if src and dst CPUs are the same one */
> >         if (prev_cpu == cpu && dst_cpu == cpu)
> >                 return util;
> > 
> >         /* Remove task utilization for src CPU */
> >         if (cpu == prev_cpu)
> >                 util = max_t(long, util - task_util(p), 0);
> > 
> >         /* Add task utilization for dst CPU */
> >         if (dst_cpu == cpu)
> >                 util += task_util(p);
> > 
> > BTW, CPU utilization is decayed value and task_util() is not decayed
> > value, so 'util - task_util(p)' calculates a smaller value than the
> > prev CPU pure utilization, right?
> 
> task_util() is the raw PELT signal, without UTIL_EST, so I think it's
> fine to do `util - task_util()`.
> 
> > 
> > Another question is can we reuse the function cpu_util_wake() and
> > just compenstate task util for dst cpu?
> 
> Well it's not that simple. cpu_util_wake() will give you the max between
> the util_avg and the util_est value, so which task_util() should you add
> to it ? The util_avg or the uti_est value ?

If the 'UTIL_EST' feature is enabled, then add the task's util_est;
otherwise add the task's util_avg value.

I think cpu_util_wake() has similar logic to what we have here; it
merely returns the CPU-level util, but here we need to accumulate the
CPU-level util + the task-level util.  So it seems to me the logic is:

  cpu_util_wake() + task_util_wake()

> Here we are trying to predict what will be the cpu_util signal in the
> future, so the only always-correct implementation of this function has
> to predict what will be the CPU util_avg and util_est signals in
> parallel and take the max of the two.

I totally agree with this; I just want to check whether we can reuse
existing code, so we can have more consistent logic across the scheduler.

> > > +	if (sched_feat(UTIL_EST)) {
> > > +		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> > > +		if (dst_cpu == cpu)
> > > +			util_est += _task_util_est(p);
> > > +		else
> > > +			util_est = max_t(long, util_est - _task_util_est(p), 0);
> > > +		util = max(util, util_est);
> > > +	}
> > > +
> > > +	return min_t(unsigned long, util, capacity_orig_of(cpu));
> > > +}
> > > +
> > > +/*
> > > + * Estimates the system level energy assuming that p wakes-up on dst_cpu.
> > > + *
> > > + * compute_energy() is safe to call only if an energy model is available for
> > > + * the platform, which is when sched_energy_enabled() is true.
> > > + */
> > > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> > > +{
> > > +	unsigned long util, max_util, sum_util;
> > > +	struct capacity_state *cs;
> > > +	unsigned long energy = 0;
> > > +	struct freq_domain *fd;
> > > +	int cpu;
> > > +
> > > +	for_each_freq_domain(fd) {
> > > +		max_util = sum_util = 0;
> > > +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
> > > +			util = cpu_util_next(cpu, p, dst_cpu);
> > > +			util += cpu_util_dl(cpu_rq(cpu));
> > > +			max_util = max(util, max_util);
> > > +			sum_util += util;
> > > +		}
> > > +
> > > +		/*
> > > +		 * Here we assume that the capacity states of CPUs belonging to
> > > +		 * the same frequency domains are shared. Hence, we look at the
> > > +		 * capacity state of the first CPU and re-use it for all.
> > > +		 */
> > > +		cpu = cpumask_first(freq_domain_span(fd));
> > > +		cs = find_cap_state(cpu, max_util);
> > > +		energy += cs->power * sum_util / cs->cap;
> > > +	}
> > 
> > This means all CPUs will be iterated for calculation, the complexity is
> > O(n)...
> > 
> > > +	return energy;
> > > +}
> > > +
> > > +/*
> > >   * select_task_rq_fair: Select target runqueue for the waking task in domains
> > >   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
> > >   * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
> > > diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> > > index 5d552c0d7109..6eb38f41d5d9 100644
> > > --- a/kernel/sched/sched.h
> > > +++ b/kernel/sched/sched.h
> > > @@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
> > >  # define arch_scale_freq_invariant()	false
> > >  #endif
> > >  
> > > -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> > > +#ifdef CONFIG_SMP
> > >  static inline unsigned long cpu_util_dl(struct rq *rq)
> > >  {
> > >  	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> > > -- 
> > > 2.11.0
> > > 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-06 15:36 ` [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function Dietmar Eggemann
                     ` (2 preceding siblings ...)
  2018-04-17 15:22   ` Leo Yan
@ 2018-04-18  9:23   ` Leo Yan
  2018-04-20 14:51     ` Quentin Perret
  2018-04-18 12:15   ` Leo Yan
  4 siblings, 1 reply; 44+ messages in thread
From: Leo Yan @ 2018-04-18  9:23 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:

[...]

> +/*
> + * Estimates the system level energy assuming that p wakes-up on dst_cpu.
> + *
> + * compute_energy() is safe to call only if an energy model is available for
> + * the platform, which is when sched_energy_enabled() is true.
> + */
> +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> +{
> +	unsigned long util, max_util, sum_util;
> +	struct capacity_state *cs;
> +	unsigned long energy = 0;
> +	struct freq_domain *fd;
> +	int cpu;
> +
> +	for_each_freq_domain(fd) {
> +		max_util = sum_util = 0;
> +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
> +			util = cpu_util_next(cpu, p, dst_cpu);
> +			util += cpu_util_dl(cpu_rq(cpu));
> +			max_util = max(util, max_util);
> +			sum_util += util;
> +		}
> +
> +		/*
> +		 * Here we assume that the capacity states of CPUs belonging to
> +		 * the same frequency domains are shared. Hence, we look at the
> +		 * capacity state of the first CPU and re-use it for all.
> +		 */
> +		cpu = cpumask_first(freq_domain_span(fd));
> +		cs = find_cap_state(cpu, max_util);
> +		energy += cs->power * sum_util / cs->cap;

I am a bit worried about the resolution issue, especially when the
capacity value is quite high and sum_util is a small value.
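For example (made-up numbers): with cs->cap = 1024, cs->power = 50 and
sum_util = 10, the integer division gives 50 * 10 / 1024 = 0, so a small
busy-time contribution can disappear completely from the energy sum.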

> +	}
> +
> +	return energy;
> +}

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-18  9:19       ` Leo Yan
@ 2018-04-18 11:06         ` Quentin Perret
  0 siblings, 0 replies; 44+ messages in thread
From: Quentin Perret @ 2018-04-18 11:06 UTC (permalink / raw)
  To: Leo Yan
  Cc: Dietmar Eggemann, linux-kernel, Peter Zijlstra, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Wednesday 18 Apr 2018 at 17:19:28 (+0800), Leo Yan wrote:
> > > BTW, CPU utilization is decayed value and task_util() is not decayed
> > > value, so 'util - task_util(p)' calculates a smaller value than the
> > > prev CPU pure utilization, right?
> > 
> > task_util() is the raw PELT signal, without UTIL_EST, so I think it's
> > fine to do `util - task_util()`.
> > 
> > > 
> > > Another question is can we reuse the function cpu_util_wake() and
> > > just compenstate task util for dst cpu?
> > 
> > Well it's not that simple. cpu_util_wake() will give you the max between
> > the util_avg and the util_est value, so which task_util() should you add
> > to it ? The util_avg or the uti_est value ?
> 
> If feature 'UTIL_EST' is enabled, then add task's util_est; otherwise
> add task util_avg value.

I don't think this is correct. If UTIL_EST is enabled, cpu_util_wake()
will return the max between the util_avg and the util_est. When you call
it, you don't know what you get. Adding the _task_util_est() to
cpu_util_wake() is wrong if cpu_util_avg > cpu_util_est for example.
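For instance (made-up numbers): with a CPU util_avg of 600, a CPU
util_est of 300, a task util_avg of 50 and a task util_est of 200, the
prediction done in cpu_util_next() is max(600 + 50, 300 + 200) = 650,
whereas cpu_util_wake() + _task_util_est() would give
max(600, 300) + 200 = 800.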

> 
> I think cpu_util_wake() has similiar logic with here, it merely returns
> CPU level util; but here needs to accumulate CPU level util + task level
> util.  So seems to me, the logic is:
> 
>   cpu_util_wake() + task_util_wake()

I'm not sure I get that one... What is task_util_wake()?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator
  2018-04-13 23:56   ` Joel Fernandes
@ 2018-04-18 11:17     ` Quentin Perret
  2018-04-20  8:13       ` Joel Fernandes
  0 siblings, 1 reply; 44+ messages in thread
From: Quentin Perret @ 2018-04-18 11:17 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Dietmar Eggemann, LKML, Peter Zijlstra, Thara Gopinath, Linux PM,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Juri Lelli,
	Steve Muckle, Eduardo Valentin

On Friday 13 Apr 2018 at 16:56:39 (-0700), Joel Fernandes wrote:
> Hi,
> 
> On Fri, Apr 6, 2018 at 8:36 AM, Dietmar Eggemann
> <dietmar.eggemann@arm.com> wrote:
> > From: Thara Gopinath <thara.gopinath@linaro.org>
> >
> > Energy-aware scheduling should only operate when the system is not
> > overutilized. There must be cpu time available to place tasks based on
> > utilization in an energy-aware fashion, i.e. to pack tasks on
> > energy-efficient cpus without harming the overall throughput.
> >
> > In case the system operates above this tipping point the tasks have to
> > be placed based on task and cpu load in the classical way of spreading
> > tasks across as many cpus as possible.
> >
> > The point in which a system switches from being not overutilized to
> > being overutilized is called the tipping point.
> >
> > Such a tipping point indicator on a sched domain as the system
> > boundary is introduced here. As soon as one cpu of a sched domain is
> > overutilized the whole sched domain is declared overutilized as well.
> > A cpu becomes overutilized when its utilization is higher than 80%
> > (capacity_margin) of its capacity.
> >
> > The implementation takes advantage of the shared sched domain which is
> > shared across all per-cpu views of a sched domain level. The new
> > overutilized flag is placed in this shared sched domain.
> >
> > Load balancing is skipped in case the energy model is present and the
> > sched domain is not overutilized because under this condition the
> > predominantly load-per-capacity driven load-balancer should not
> > interfere with the energy-aware wakeup placement based on utilization.
> >
> > In case the total utilization of a sched domain is greater than the
> > total sched domain capacity the overutilized flag is set at the parent
> > sched domain level to let other sched groups help getting rid of the
> > overutilization of cpus.
> >
> > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> > ---
> >  include/linux/sched/topology.h |  1 +
> >  kernel/sched/fair.c            | 62 ++++++++++++++++++++++++++++++++++++++++--
> >  kernel/sched/sched.h           |  1 +
> >  kernel/sched/topology.c        | 12 +++-----
> >  4 files changed, 65 insertions(+), 11 deletions(-)
> >
> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> > index 26347741ba50..dd001c232646 100644
> > --- a/include/linux/sched/topology.h
> > +++ b/include/linux/sched/topology.h
> > @@ -72,6 +72,7 @@ struct sched_domain_shared {
> >         atomic_t        ref;
> >         atomic_t        nr_busy_cpus;
> >         int             has_idle_cores;
> > +       int             overutilized;
> >  };
> >
> >  struct sched_domain {
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 0a76ad2ef022..6960e5ef3c14 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
> >  }
> >  #endif
> >
> > +#ifdef CONFIG_SMP
> > +static inline int cpu_overutilized(int cpu);
> > +
> > +static inline int sd_overutilized(struct sched_domain *sd)
> > +{
> > +       return READ_ONCE(sd->shared->overutilized);
> > +}
> > +
> > +static inline void update_overutilized_status(struct rq *rq)
> > +{
> > +       struct sched_domain *sd;
> > +
> > +       rcu_read_lock();
> > +       sd = rcu_dereference(rq->sd);
> > +       if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
> > +               WRITE_ONCE(sd->shared->overutilized, 1);
> > +       rcu_read_unlock();
> > +}
> > +#else
> > +static inline void update_overutilized_status(struct rq *rq) {}
> > +#endif /* CONFIG_SMP */
> > +
> >  /*
> >   * The enqueue_task method is called before nr_running is
> >   * increased. Here we update the fair scheduling stats and
> > @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >                 update_cfs_group(se);
> >         }
> >
> > -       if (!se)
> > +       if (!se) {
> >                 add_nr_running(rq, 1);
> > +               update_overutilized_status(rq);
> > +       }
> 
> I'm wondering if it makes sense to consider scenarios where other
> classes cause CPUs in the domain to go above the tipping point. In
> that case too, it makes sense not to do EAS in that domain because of
> the overutilization.
> 
> I guess task_fits() uses cpu_util(), which is CFS PELT only at the
> moment... so this may require some other method, like an aggregation
> of CFS PELT with RT PELT and the DL running bandwidth or something.
>

So at the moment in cpu_overutilized() we compare cpu_util() to
capacity_of(), which should include RT and IRQ pressure IIRC. But
you're right, we might be able to do more here... Perhaps we could
also use cpu_util_dl(), which is available in sched.h now?
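Something like this, maybe (just a sketch of the idea, not tested):

	static inline int cpu_overutilized(int cpu)
	{
		unsigned long util = cpu_util(cpu) + cpu_util_dl(cpu_rq(cpu));

		return !util_fits_capacity(util, capacity_of(cpu));
	}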

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-06 15:36 ` [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function Dietmar Eggemann
                     ` (3 preceding siblings ...)
  2018-04-18  9:23   ` Leo Yan
@ 2018-04-18 12:15   ` Leo Yan
  2018-04-20 14:42     ` Quentin Perret
  4 siblings, 1 reply; 44+ messages in thread
From: Leo Yan @ 2018-04-18 12:15 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: linux-kernel, Peter Zijlstra, Quentin Perret, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

Hi Quentin,

On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> From: Quentin Perret <quentin.perret@arm.com>
> 
> In preparation for the definition of an energy-aware wakeup path, a
> helper function is provided to estimate the consequence on system energy
> when a specific task wakes-up on a specific CPU. compute_energy()
> estimates the OPPs to be reached by all frequency domains and estimates
> the consumption of each online CPU according to its energy model and its
> percentage of busy time.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> ---
>  include/linux/sched/energy.h | 20 +++++++++++++
>  kernel/sched/fair.c          | 68 ++++++++++++++++++++++++++++++++++++++++++++
>  kernel/sched/sched.h         |  2 +-
>  3 files changed, 89 insertions(+), 1 deletion(-)
> 
> diff --git a/include/linux/sched/energy.h b/include/linux/sched/energy.h
> index 941071eec013..b4110b145228 100644
> --- a/include/linux/sched/energy.h
> +++ b/include/linux/sched/energy.h
> @@ -27,6 +27,24 @@ static inline bool sched_energy_enabled(void)
>  	return static_branch_unlikely(&sched_energy_present);
>  }
>  
> +static inline
> +struct capacity_state *find_cap_state(int cpu, unsigned long util)
> +{
> +	struct sched_energy_model *em = *per_cpu_ptr(energy_model, cpu);
> +	struct capacity_state *cs = NULL;
> +	int i;
> +
> +	util += util >> 2;
> +
> +	for (i = 0; i < em->nr_cap_states; i++) {
> +		cs = &em->cap_states[i];
> +		if (cs->cap >= util)
> +			break;
> +	}
> +
> +	return cs;
> +}
> +
>  static inline struct cpumask *freq_domain_span(struct freq_domain *fd)
>  {
>  	return &fd->span;
> @@ -42,6 +60,8 @@ struct freq_domain;
>  static inline bool sched_energy_enabled(void) { return false; }
>  static inline struct cpumask
>  *freq_domain_span(struct freq_domain *fd) { return NULL; }
> +static inline struct capacity_state
> +*find_cap_state(int cpu, unsigned long util) { return NULL; }
>  static inline void init_sched_energy(void) { }
>  #define for_each_freq_domain(fdom) for (; fdom; fdom = NULL)
>  #endif
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 6960e5ef3c14..8cb9fb04fff2 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6633,6 +6633,74 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
>  }
>  
>  /*
> + * Returns the util of "cpu" if "p" wakes up on "dst_cpu".
> + */
> +static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
> +{
> +	unsigned long util, util_est;
> +	struct cfs_rq *cfs_rq;
> +
> +	/* Task is where it should be, or has no impact on cpu */
> +	if ((task_cpu(p) == dst_cpu) || (cpu != task_cpu(p) && cpu != dst_cpu))
> +		return cpu_util(cpu);
> +
> +	cfs_rq = &cpu_rq(cpu)->cfs;
> +	util = READ_ONCE(cfs_rq->avg.util_avg);
> +
> +	if (dst_cpu == cpu)
> +		util += task_util(p);
> +	else
> +		util = max_t(long, util - task_util(p), 0);
> +
> +	if (sched_feat(UTIL_EST)) {
> +		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> +		if (dst_cpu == cpu)
> +			util_est += _task_util_est(p);
> +		else
> +			util_est = max_t(long, util_est - _task_util_est(p), 0);
> +		util = max(util, util_est);
> +	}
> +
> +	return min_t(unsigned long, util, capacity_orig_of(cpu));
> +}
> +
> +/*
> + * Estimates the system level energy assuming that p wakes-up on dst_cpu.
> + *
> + * compute_energy() is safe to call only if an energy model is available for
> + * the platform, which is when sched_energy_enabled() is true.
> + */
> +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> +{
> +	unsigned long util, max_util, sum_util;
> +	struct capacity_state *cs;
> +	unsigned long energy = 0;
> +	struct freq_domain *fd;
> +	int cpu;
> +
> +	for_each_freq_domain(fd) {
> +		max_util = sum_util = 0;
> +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
> +			util = cpu_util_next(cpu, p, dst_cpu);
> +			util += cpu_util_dl(cpu_rq(cpu));
> +			max_util = max(util, max_util);
> +			sum_util += util;
> +		}
> +
> +		/*
> +		 * Here we assume that the capacity states of CPUs belonging to
> +		 * the same frequency domains are shared. Hence, we look at the
> +		 * capacity state of the first CPU and re-use it for all.
> +		 */
> +		cpu = cpumask_first(freq_domain_span(fd));
> +		cs = find_cap_state(cpu, max_util);
> +		energy += cs->power * sum_util / cs->cap;
> +	}

Sorry for the mess of spreading my questions over several replies; I
will try to ask my questions in a single reply next time.  Below are
some more questions which I think are worth bringing up:

The code for the energy computation is quite neat and simple, but I
think it mixes two concepts of CPU util: one is the estimated CPU util
which is used to select the CPU OPP in schedutil, the other is the raw
CPU util according to the CPU's real running time. For example,
cpu_util_next() predicts the CPU util, but this value might be much
higher than cpu_util(), especially with the UTIL_EST feature enabled
(I have only a shallow understanding of UTIL_EST, so correct me as
needed); yet this patch computes both the CPU capacity state and the
energy from this single CPU utilization value (and it will be an
inflated value with UTIL_EST enabled).  Is this on purpose, to keep the
implementation simple?

IMHO, cpu_util_next() can be used to predict the CPU capacity state; on
the other hand, shouldn't we use the CPU util without the UTIL_EST
contribution for 'sum_util'? That would reflect the actual CPU
utilization more reasonably.

Furthermore, if we consider an RT thread running on a CPU coupled with
the 'schedutil' governor, the CPU will run at the maximum frequency, but
we cannot say the CPU has 100% utilization.  The RT thread case is not
handled in this patch.
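To make the suggestion above a bit more concrete, the inner loop could
hypothetically look like below, where cpu_util_next_noest() would be a
UTIL_EST-free variant of cpu_util_next() (this is only an illustration,
not a proposal for the actual patch):

		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
			/* UTIL_EST-inflated estimate, for OPP selection only */
			util = cpu_util_next(cpu, p, dst_cpu) +
			       cpu_util_dl(cpu_rq(cpu));
			max_util = max(util, max_util);

			/* raw running-time utilization for the busy-time term */
			sum_util += cpu_util_next_noest(cpu, p, dst_cpu) +
				    cpu_util_dl(cpu_rq(cpu));
		}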

> +
> +	return energy;
> +}
> +
> +/*
>   * select_task_rq_fair: Select target runqueue for the waking task in domains
>   * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
>   * SD_BALANCE_FORK, or SD_BALANCE_EXEC.
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 5d552c0d7109..6eb38f41d5d9 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2156,7 +2156,7 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>  # define arch_scale_freq_invariant()	false
>  #endif
>  
> -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> +#ifdef CONFIG_SMP
>  static inline unsigned long cpu_util_dl(struct rq *rq)
>  {
>  	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
> -- 
> 2.11.0
> 

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator
  2018-04-18 11:17     ` Quentin Perret
@ 2018-04-20  8:13       ` Joel Fernandes
  2018-04-20  8:14         ` Joel Fernandes
  0 siblings, 1 reply; 44+ messages in thread
From: Joel Fernandes @ 2018-04-20  8:13 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Dietmar Eggemann, LKML, Peter Zijlstra, Thara Gopinath, Linux PM,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Juri Lelli,
	Steve Muckle, Eduardo Valentin

On Wed, Apr 18, 2018 at 4:17 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> On Friday 13 Apr 2018 at 16:56:39 (-0700), Joel Fernandes wrote:
>> Hi,
>>
>> On Fri, Apr 6, 2018 at 8:36 AM, Dietmar Eggemann
>> <dietmar.eggemann@arm.com> wrote:
>> > From: Thara Gopinath <thara.gopinath@linaro.org>
>> >
>> > Energy-aware scheduling should only operate when the system is not
>> > overutilized. There must be cpu time available to place tasks based on
>> > utilization in an energy-aware fashion, i.e. to pack tasks on
>> > energy-efficient cpus without harming the overall throughput.
>> >
>> > In case the system operates above this tipping point the tasks have to
>> > be placed based on task and cpu load in the classical way of spreading
>> > tasks across as many cpus as possible.
>> >
>> > The point in which a system switches from being not overutilized to
>> > being overutilized is called the tipping point.
>> >
>> > Such a tipping point indicator on a sched domain as the system
>> > boundary is introduced here. As soon as one cpu of a sched domain is
>> > overutilized the whole sched domain is declared overutilized as well.
>> > A cpu becomes overutilized when its utilization is higher that 80%
>> > (capacity_margin) of its capacity.
>> >
>> > The implementation takes advantage of the shared sched domain which is
>> > shared across all per-cpu views of a sched domain level. The new
>> > overutilized flag is placed in this shared sched domain.
>> >
>> > Load balancing is skipped in case the energy model is present and the
>> > sched domain is not overutilized because under this condition the
>> > predominantly load-per-capacity driven load-balancer should not
>> > interfere with the energy-aware wakeup placement based on utilization.
>> >
>> > In case the total utilization of a sched domain is greater than the
>> > total sched domain capacity the overutilized flag is set at the parent
>> > sched domain level to let other sched groups help getting rid of the
>> > overutilization of cpus.
>> >
>> > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>> > ---
>> >  include/linux/sched/topology.h |  1 +
>> >  kernel/sched/fair.c            | 62 ++++++++++++++++++++++++++++++++++++++++--
>> >  kernel/sched/sched.h           |  1 +
>> >  kernel/sched/topology.c        | 12 +++-----
>> >  4 files changed, 65 insertions(+), 11 deletions(-)
>> >
>> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>> > index 26347741ba50..dd001c232646 100644
>> > --- a/include/linux/sched/topology.h
>> > +++ b/include/linux/sched/topology.h
>> > @@ -72,6 +72,7 @@ struct sched_domain_shared {
>> >         atomic_t        ref;
>> >         atomic_t        nr_busy_cpus;
>> >         int             has_idle_cores;
>> > +       int             overutilized;
>> >  };
>> >
>> >  struct sched_domain {
>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>> > index 0a76ad2ef022..6960e5ef3c14 100644
>> > --- a/kernel/sched/fair.c
>> > +++ b/kernel/sched/fair.c
>> > @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
>> >  }
>> >  #endif
>> >
>> > +#ifdef CONFIG_SMP
>> > +static inline int cpu_overutilized(int cpu);
>> > +
>> > +static inline int sd_overutilized(struct sched_domain *sd)
>> > +{
>> > +       return READ_ONCE(sd->shared->overutilized);
>> > +}
>> > +
>> > +static inline void update_overutilized_status(struct rq *rq)
>> > +{
>> > +       struct sched_domain *sd;
>> > +
>> > +       rcu_read_lock();
>> > +       sd = rcu_dereference(rq->sd);
>> > +       if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
>> > +               WRITE_ONCE(sd->shared->overutilized, 1);
>> > +       rcu_read_unlock();
>> > +}
>> > +#else
>> > +static inline void update_overutilized_status(struct rq *rq) {}
>> > +#endif /* CONFIG_SMP */
>> > +
>> >  /*
>> >   * The enqueue_task method is called before nr_running is
>> >   * increased. Here we update the fair scheduling stats and
>> > @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>> >                 update_cfs_group(se);
>> >         }
>> >
>> > -       if (!se)
>> > +       if (!se) {
>> >                 add_nr_running(rq, 1);
>> > +               update_overutilized_status(rq);
>> > +       }
>>
>> I'm wondering if it makes sense for considering scenarios whether
>> other classes cause CPUs in the domain to go above the tipping point.
>> Then in that case also, it makes sense to not to do EAS in that domain
>> because of the overutilization.
>>
>> I guess task_fits using cpu_util which is PELT only at the moment...
>> so may require some other method like aggregation of CFS PELT, with
>> RT-PELT and DL running bw or something.
>>
>
> So at the moment in cpu_overutilized() we comapre cpu_util() to
> capacity_of() which should include RT and IRQ pressure IIRC. But
> you're right, we might be able to do more here... Perhaps we
> could also use cpu_util_dl() which is available in sched.h now ?

Yes, that should be OK, and when the RT utilization stuff is available,
that can be included in the equation as well (probably for now you
could use rt_avg).

Another crazy idea is to check the contribution of the higher classes
in one shot with (capacity_orig_of - capacity_of), although I think that
method would be less instantaneous/accurate.
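Roughly what I mean is something like this (just a sketch, not tested):

	static inline int cpu_overutilized(int cpu)
	{
		/* Capacity currently taken away by higher classes (RT/DL/IRQ) */
		unsigned long higher = capacity_orig_of(cpu) - capacity_of(cpu);

		return !util_fits_capacity(cpu_util(cpu) + higher,
					   capacity_orig_of(cpu));
	}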

thanks,

- Joel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator
  2018-04-20  8:13       ` Joel Fernandes
@ 2018-04-20  8:14         ` Joel Fernandes
  2018-04-20  8:31           ` Quentin Perret
  0 siblings, 1 reply; 44+ messages in thread
From: Joel Fernandes @ 2018-04-20  8:14 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Dietmar Eggemann, LKML, Peter Zijlstra, Thara Gopinath, Linux PM,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Juri Lelli,
	Steve Muckle, Eduardo Valentin

On Fri, Apr 20, 2018 at 1:13 AM, Joel Fernandes <joelaf@google.com> wrote:
> On Wed, Apr 18, 2018 at 4:17 AM, Quentin Perret <quentin.perret@arm.com> wrote:
>> On Friday 13 Apr 2018 at 16:56:39 (-0700), Joel Fernandes wrote:
>>> Hi,
>>>
>>> On Fri, Apr 6, 2018 at 8:36 AM, Dietmar Eggemann
>>> <dietmar.eggemann@arm.com> wrote:
>>> > From: Thara Gopinath <thara.gopinath@linaro.org>
>>> >
>>> > Energy-aware scheduling should only operate when the system is not
>>> > overutilized. There must be cpu time available to place tasks based on
>>> > utilization in an energy-aware fashion, i.e. to pack tasks on
>>> > energy-efficient cpus without harming the overall throughput.
>>> >
>>> > In case the system operates above this tipping point the tasks have to
>>> > be placed based on task and cpu load in the classical way of spreading
>>> > tasks across as many cpus as possible.
>>> >
>>> > The point in which a system switches from being not overutilized to
>>> > being overutilized is called the tipping point.
>>> >
>>> > Such a tipping point indicator on a sched domain as the system
>>> > boundary is introduced here. As soon as one cpu of a sched domain is
>>> > overutilized the whole sched domain is declared overutilized as well.
>>> > A cpu becomes overutilized when its utilization is higher that 80%
>>> > (capacity_margin) of its capacity.
>>> >
>>> > The implementation takes advantage of the shared sched domain which is
>>> > shared across all per-cpu views of a sched domain level. The new
>>> > overutilized flag is placed in this shared sched domain.
>>> >
>>> > Load balancing is skipped in case the energy model is present and the
>>> > sched domain is not overutilized because under this condition the
>>> > predominantly load-per-capacity driven load-balancer should not
>>> > interfere with the energy-aware wakeup placement based on utilization.
>>> >
>>> > In case the total utilization of a sched domain is greater than the
>>> > total sched domain capacity the overutilized flag is set at the parent
>>> > sched domain level to let other sched groups help getting rid of the
>>> > overutilization of cpus.
>>> >
>>> > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
>>> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
>>> > ---
>>> >  include/linux/sched/topology.h |  1 +
>>> >  kernel/sched/fair.c            | 62 ++++++++++++++++++++++++++++++++++++++++--
>>> >  kernel/sched/sched.h           |  1 +
>>> >  kernel/sched/topology.c        | 12 +++-----
>>> >  4 files changed, 65 insertions(+), 11 deletions(-)
>>> >
>>> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
>>> > index 26347741ba50..dd001c232646 100644
>>> > --- a/include/linux/sched/topology.h
>>> > +++ b/include/linux/sched/topology.h
>>> > @@ -72,6 +72,7 @@ struct sched_domain_shared {
>>> >         atomic_t        ref;
>>> >         atomic_t        nr_busy_cpus;
>>> >         int             has_idle_cores;
>>> > +       int             overutilized;
>>> >  };
>>> >
>>> >  struct sched_domain {
>>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
>>> > index 0a76ad2ef022..6960e5ef3c14 100644
>>> > --- a/kernel/sched/fair.c
>>> > +++ b/kernel/sched/fair.c
>>> > @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
>>> >  }
>>> >  #endif
>>> >
>>> > +#ifdef CONFIG_SMP
>>> > +static inline int cpu_overutilized(int cpu);
>>> > +
>>> > +static inline int sd_overutilized(struct sched_domain *sd)
>>> > +{
>>> > +       return READ_ONCE(sd->shared->overutilized);
>>> > +}
>>> > +
>>> > +static inline void update_overutilized_status(struct rq *rq)
>>> > +{
>>> > +       struct sched_domain *sd;
>>> > +
>>> > +       rcu_read_lock();
>>> > +       sd = rcu_dereference(rq->sd);
>>> > +       if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
>>> > +               WRITE_ONCE(sd->shared->overutilized, 1);
>>> > +       rcu_read_unlock();
>>> > +}
>>> > +#else
>>> > +static inline void update_overutilized_status(struct rq *rq) {}
>>> > +#endif /* CONFIG_SMP */
>>> > +
>>> >  /*
>>> >   * The enqueue_task method is called before nr_running is
>>> >   * increased. Here we update the fair scheduling stats and
>>> > @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>>> >                 update_cfs_group(se);
>>> >         }
>>> >
>>> > -       if (!se)
>>> > +       if (!se) {
>>> >                 add_nr_running(rq, 1);
>>> > +               update_overutilized_status(rq);
>>> > +       }
>>>
>>> I'm wondering if it makes sense for considering scenarios whether
>>> other classes cause CPUs in the domain to go above the tipping point.
>>> Then in that case also, it makes sense to not to do EAS in that domain
>>> because of the overutilization.
>>>
>>> I guess task_fits using cpu_util which is PELT only at the moment...
>>> so may require some other method like aggregation of CFS PELT, with
>>> RT-PELT and DL running bw or something.
>>>
>>
>> So at the moment in cpu_overutilized() we comapre cpu_util() to
>> capacity_of() which should include RT and IRQ pressure IIRC. But
>> you're right, we might be able to do more here... Perhaps we
>> could also use cpu_util_dl() which is available in sched.h now ?
>
> Yes, should be Ok, and then when RT utilization stuff is available,
> then that can be included in the equation as well (probably for now
> you could use rt_avg).
>
> Another crazy idea is to check the contribution of higher classes in
> one-shot with (capacity_orig_of - capacity_of) although I think that
> method would be less instantaneous/accurate.

Just to add to the last point: capacity_of() also factors in the IRQ
contribution, if I remember correctly, which is probably a good thing?

- Joel

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator
  2018-04-20  8:14         ` Joel Fernandes
@ 2018-04-20  8:31           ` Quentin Perret
  2018-04-20  8:57             ` Juri Lelli
  0 siblings, 1 reply; 44+ messages in thread
From: Quentin Perret @ 2018-04-20  8:31 UTC (permalink / raw)
  To: Joel Fernandes
  Cc: Dietmar Eggemann, LKML, Peter Zijlstra, Thara Gopinath, Linux PM,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Juri Lelli,
	Steve Muckle, Eduardo Valentin

On Friday 20 Apr 2018 at 01:14:35 (-0700), Joel Fernandes wrote:
> On Fri, Apr 20, 2018 at 1:13 AM, Joel Fernandes <joelaf@google.com> wrote:
> > On Wed, Apr 18, 2018 at 4:17 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> >> On Friday 13 Apr 2018 at 16:56:39 (-0700), Joel Fernandes wrote:
> >>> Hi,
> >>>
> >>> On Fri, Apr 6, 2018 at 8:36 AM, Dietmar Eggemann
> >>> <dietmar.eggemann@arm.com> wrote:
> >>> > From: Thara Gopinath <thara.gopinath@linaro.org>
> >>> >
> >>> > Energy-aware scheduling should only operate when the system is not
> >>> > overutilized. There must be cpu time available to place tasks based on
> >>> > utilization in an energy-aware fashion, i.e. to pack tasks on
> >>> > energy-efficient cpus without harming the overall throughput.
> >>> >
> >>> > In case the system operates above this tipping point the tasks have to
> >>> > be placed based on task and cpu load in the classical way of spreading
> >>> > tasks across as many cpus as possible.
> >>> >
> >>> > The point in which a system switches from being not overutilized to
> >>> > being overutilized is called the tipping point.
> >>> >
> >>> > Such a tipping point indicator on a sched domain as the system
> >>> > boundary is introduced here. As soon as one cpu of a sched domain is
> >>> > overutilized the whole sched domain is declared overutilized as well.
> >>> > A cpu becomes overutilized when its utilization is higher that 80%
> >>> > (capacity_margin) of its capacity.
> >>> >
> >>> > The implementation takes advantage of the shared sched domain which is
> >>> > shared across all per-cpu views of a sched domain level. The new
> >>> > overutilized flag is placed in this shared sched domain.
> >>> >
> >>> > Load balancing is skipped in case the energy model is present and the
> >>> > sched domain is not overutilized because under this condition the
> >>> > predominantly load-per-capacity driven load-balancer should not
> >>> > interfere with the energy-aware wakeup placement based on utilization.
> >>> >
> >>> > In case the total utilization of a sched domain is greater than the
> >>> > total sched domain capacity the overutilized flag is set at the parent
> >>> > sched domain level to let other sched groups help getting rid of the
> >>> > overutilization of cpus.
> >>> >
> >>> > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> >>> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >>> > ---
> >>> >  include/linux/sched/topology.h |  1 +
> >>> >  kernel/sched/fair.c            | 62 ++++++++++++++++++++++++++++++++++++++++--
> >>> >  kernel/sched/sched.h           |  1 +
> >>> >  kernel/sched/topology.c        | 12 +++-----
> >>> >  4 files changed, 65 insertions(+), 11 deletions(-)
> >>> >
> >>> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> >>> > index 26347741ba50..dd001c232646 100644
> >>> > --- a/include/linux/sched/topology.h
> >>> > +++ b/include/linux/sched/topology.h
> >>> > @@ -72,6 +72,7 @@ struct sched_domain_shared {
> >>> >         atomic_t        ref;
> >>> >         atomic_t        nr_busy_cpus;
> >>> >         int             has_idle_cores;
> >>> > +       int             overutilized;
> >>> >  };
> >>> >
> >>> >  struct sched_domain {
> >>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> > index 0a76ad2ef022..6960e5ef3c14 100644
> >>> > --- a/kernel/sched/fair.c
> >>> > +++ b/kernel/sched/fair.c
> >>> > @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
> >>> >  }
> >>> >  #endif
> >>> >
> >>> > +#ifdef CONFIG_SMP
> >>> > +static inline int cpu_overutilized(int cpu);
> >>> > +
> >>> > +static inline int sd_overutilized(struct sched_domain *sd)
> >>> > +{
> >>> > +       return READ_ONCE(sd->shared->overutilized);
> >>> > +}
> >>> > +
> >>> > +static inline void update_overutilized_status(struct rq *rq)
> >>> > +{
> >>> > +       struct sched_domain *sd;
> >>> > +
> >>> > +       rcu_read_lock();
> >>> > +       sd = rcu_dereference(rq->sd);
> >>> > +       if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
> >>> > +               WRITE_ONCE(sd->shared->overutilized, 1);
> >>> > +       rcu_read_unlock();
> >>> > +}
> >>> > +#else
> >>> > +static inline void update_overutilized_status(struct rq *rq) {}
> >>> > +#endif /* CONFIG_SMP */
> >>> > +
> >>> >  /*
> >>> >   * The enqueue_task method is called before nr_running is
> >>> >   * increased. Here we update the fair scheduling stats and
> >>> > @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>> >                 update_cfs_group(se);
> >>> >         }
> >>> >
> >>> > -       if (!se)
> >>> > +       if (!se) {
> >>> >                 add_nr_running(rq, 1);
> >>> > +               update_overutilized_status(rq);
> >>> > +       }
> >>>
> >>> I'm wondering if it makes sense to consider scenarios where other
> >>> classes cause CPUs in the domain to go above the tipping point.
> >>> In that case too, it makes sense not to do EAS in that domain
> >>> because of the overutilization.
> >>>
> >>> I guess task_fits uses cpu_util, which is PELT-only at the moment...
> >>> so this may require some other method, like an aggregation of CFS PELT
> >>> with RT-PELT and the DL running bandwidth or something.
> >>>
> >>
> >> So at the moment in cpu_overutilized() we compare cpu_util() to
> >> capacity_of() which should include RT and IRQ pressure IIRC. But
> >> you're right, we might be able to do more here... Perhaps we
> >> could also use cpu_util_dl() which is available in sched.h now ?
> >
> > Yes, that should be OK, and when the RT utilization stuff is available
> > it can be included in the equation as well (for now you could probably
> > use rt_avg).
> >
> > Another crazy idea is to check the contribution of the higher classes in
> > one shot with (capacity_orig_of - capacity_of), although I think that
> > method would be less instantaneous/accurate.
> 
> Just to add to the last point, the capacity_of also factors in the IRQ
> contribution if I remember correctly, which is probably a good thing?
> 

I think so too yes. But actually, since we compare cpu_util() to
capacity_of() in cpu_overutilized(), the current implementation should
already be fairly similar to the "capacity_orig_of - capacity_of"
implementation you're suggesting I guess.
And I agree that when Vincent's RT PELT patches get merged we should
probably use that :-)
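
For reference, a minimal sketch of the check being discussed, following
the capacity_margin comparison described in the changelog (an
illustration of the idea rather than the literal patch code):

	/* util fits if it stays below 80% (1024/1280) of the capacity */
	static inline bool util_fits_capacity(unsigned long util,
					      unsigned long capacity)
	{
		return capacity * 1024 > util * capacity_margin;
	}

	static inline int cpu_overutilized(int cpu)
	{
		/* capacity_of() already reflects RT/IRQ pressure */
		return !util_fits_capacity(cpu_util(cpu), capacity_of(cpu));
	}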

Thanks !
Quentin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator
  2018-04-20  8:31           ` Quentin Perret
@ 2018-04-20  8:57             ` Juri Lelli
  0 siblings, 0 replies; 44+ messages in thread
From: Juri Lelli @ 2018-04-20  8:57 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Joel Fernandes, Dietmar Eggemann, LKML, Peter Zijlstra,
	Thara Gopinath, Linux PM, Morten Rasmussen, Chris Redpath,
	Patrick Bellasi, Valentin Schneider, Rafael J . Wysocki,
	Greg Kroah-Hartman, Vincent Guittot, Viresh Kumar, Todd Kjos,
	Steve Muckle, Eduardo Valentin

On 20/04/18 09:31, Quentin Perret wrote:
> On Friday 20 Apr 2018 at 01:14:35 (-0700), Joel Fernandes wrote:
> > On Fri, Apr 20, 2018 at 1:13 AM, Joel Fernandes <joelaf@google.com> wrote:
> > > On Wed, Apr 18, 2018 at 4:17 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> > >> On Friday 13 Apr 2018 at 16:56:39 (-0700), Joel Fernandes wrote:

[...]

> > >>>
> > >>> I'm wondering if it makes sense to consider scenarios where other
> > >>> classes cause CPUs in the domain to go above the tipping point.
> > >>> In that case too, it makes sense not to do EAS in that domain
> > >>> because of the overutilization.
> > >>>
> > >>> I guess task_fits uses cpu_util, which is PELT-only at the moment...
> > >>> so this may require some other method, like an aggregation of CFS PELT
> > >>> with RT-PELT and the DL running bandwidth or something.
> > >>>
> > >>
> > >> So at the moment in cpu_overutilized() we compare cpu_util() to
> > >> capacity_of() which should include RT and IRQ pressure IIRC. But
> > >> you're right, we might be able to do more here... Perhaps we
> > >> could also use cpu_util_dl() which is available in sched.h now ?
> > >
> > > Yes, that should be OK, and when the RT utilization stuff is available
> > > it can be included in the equation as well (for now you could probably
> > > use rt_avg).
> > >
> > > Another crazy idea is to check the contribution of the higher classes in
> > > one shot with (capacity_orig_of - capacity_of), although I think that
> > > method would be less instantaneous/accurate.
> > 
> > Just to add to the last point, the capacity_of also factors in the IRQ
> > contribution if I remember correctly, which is probably a good thing?
> > 
> 
> I think so too yes. But actually, since we compare cpu_util() to
> capacity_of() in cpu_overutilized(), the current implementation should
> already be fairly similar to the "capacity_orig_of - capacity_of"
> implementation you're suggesting I guess.
> And I agree that when Vincent's RT PELT patches get merged we should
> probably use that :-)

Mind that rt_avg contains the DL contribution as well ATM:

https://elixir.bootlin.com/linux/v4.17-rc1/source/kernel/sched/deadline.c#L1182

So you shouldn't add the newer DL utilization signal on top of it.
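
To illustrate the double counting concern with a sketch (assuming the
helpers discussed in this thread; the exact expression is illustrative,
not taken from the patches):

	/*
	 * capacity_of() is already reduced according to rq->rt_avg, and
	 * rt_avg currently accumulates both RT and DL runtime, so doing
	 *
	 *	util = cpu_util(cpu) + cpu_util_dl(cpu_rq(cpu));
	 *	overutilized = !util_fits_capacity(util, capacity_of(cpu));
	 *
	 * would count the DL bandwidth twice: once as extra utilization
	 * and once as reduced capacity.
	 */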

OTOH, using RT PELT (once in) is of course to be preferred.

Best,

- Juri

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-18 12:15   ` Leo Yan
@ 2018-04-20 14:42     ` Quentin Perret
  2018-04-20 16:27       ` Leo Yan
  0 siblings, 1 reply; 44+ messages in thread
From: Quentin Perret @ 2018-04-20 14:42 UTC (permalink / raw)
  To: Leo Yan
  Cc: Dietmar Eggemann, linux-kernel, Peter Zijlstra, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

Hi Leo,

On Wednesday 18 Apr 2018 at 20:15:47 (+0800), Leo Yan wrote:
> Sorry for the mess of spreading my questions over several replies; later
> I will try to ask my questions in a single reply.  Below are more
> questions which are good to bring up:
> 
> The code for the energy computation is quite neat and simple, but I think
> the energy computation mixes two concepts of CPU util: one concept is
> the estimated CPU util which is used to select the CPU OPP in schedutil,
> the other concept is the raw CPU util according to the CPU's real running
> time; for example, cpu_util_next() predicts the CPU util, but this value
> might be much higher than cpu_util(), especially after the UTIL_EST
> feature is enabled (I have only a shallow understanding of UTIL_EST, so
> correct me as needed);

I'm not sure I understand what you mean by higher than cpu_util()
here... In which case would that happen?

cpu_util_next() is basically used to figure out what will be the
cpu_util() of CPU A after task p has been enqueued on CPU B (no matter
what A and B are).

> but this patch simply computes the CPU capacity and energy with a single
> CPU utilization value (and it will be an inflated value after enabling
> UTIL_EST).  Is this intended to keep the implementation simple?
> 
> IMHO, cpu_util_next() can be used to predict the CPU capacity; on the
> other hand, should we use the CPU util without the UTIL_EST capping for
> 'sum_util'?  That might reflect the CPU utilization more reasonably.

Why would a decayed utilisation be a better estimate of the time that
a task is going to spend on a CPU ?

> 
> Furthermore, if we consider an RT thread running on a CPU with the
> 'schedutil' governor, the CPU will run at the maximum frequency, but we
> cannot say the CPU has 100% utilization.  The RT thread case is not
> handled in this patch.

Right, we don't account for RT tasks in the OPP prediction for now.
Vincent's patches to have a util_avg for RT runqueues could help us
do that I suppose ...
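
If/when such a signal exists, one could imagine folding it into the
per-CPU sum in compute_energy() next to the DL part, along these lines
(sketch only; cpu_util_rt() is a hypothetical helper here, not an
existing API):

	util = cpu_util_next(cpu, p, dst_cpu);
	util += cpu_util_dl(cpu_rq(cpu));
	util += cpu_util_rt(cpu_rq(cpu));	/* hypothetical helper */
	max_util = max(util, max_util);
	sum_util += util;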

Thanks !
Quentin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-18  9:23   ` Leo Yan
@ 2018-04-20 14:51     ` Quentin Perret
  0 siblings, 0 replies; 44+ messages in thread
From: Quentin Perret @ 2018-04-20 14:51 UTC (permalink / raw)
  To: Leo Yan
  Cc: Dietmar Eggemann, linux-kernel, Peter Zijlstra, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Wednesday 18 Apr 2018 at 17:23:16 (+0800), Leo Yan wrote:
> On Fri, Apr 06, 2018 at 04:36:05PM +0100, Dietmar Eggemann wrote:
> 
> [...]
> 
> > +/*
> > + * Estimates the system level energy assuming that p wakes-up on dst_cpu.
> > + *
> > + * compute_energy() is safe to call only if an energy model is available for
> > + * the platform, which is when sched_energy_enabled() is true.
> > + */
> > +static unsigned long compute_energy(struct task_struct *p, int dst_cpu)
> > +{
> > +	unsigned long util, max_util, sum_util;
> > +	struct capacity_state *cs;
> > +	unsigned long energy = 0;
> > +	struct freq_domain *fd;
> > +	int cpu;
> > +
> > +	for_each_freq_domain(fd) {
> > +		max_util = sum_util = 0;
> > +		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
> > +			util = cpu_util_next(cpu, p, dst_cpu);
> > +			util += cpu_util_dl(cpu_rq(cpu));
> > +			max_util = max(util, max_util);
> > +			sum_util += util;
> > +		}
> > +
> > +		/*
> > +		 * Here we assume that the capacity states of CPUs belonging to
> > +		 * the same frequency domains are shared. Hence, we look at the
> > +		 * capacity state of the first CPU and re-use it for all.
> > +		 */
> > +		cpu = cpumask_first(freq_domain_span(fd));
> > +		cs = find_cap_state(cpu, max_util);
> > +		energy += cs->power * sum_util / cs->cap;
> 
> I am a bit worried about the resolution issue, especially when the
> capacity value is quite high and sum_util is a small value.

Good point. As of now, the cs->power values happen to be in micro-watts
for the platforms we've been testing with, so they're typically high
enough to avoid significant resolution problems, I guess.
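
As a made-up example of the truncation concern in the
'cs->power * sum_util / cs->cap' computation quoted above (the numbers
are purely illustrative, not from a real platform):

	/* cs->power in milli-watts: small dividend, heavy truncation  */
	200 * 3 / 1024    = 0	/* the exact value ~0.59 is lost entirely */

	/* same platform with cs->power in micro-watts                 */
	200000 * 3 / 1024 = 585	/* exact value ~585.9, well under 1% off  */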

Now, the energy model code has to be reworked significantly as we have
to remove the dependency on PM_OPP, so I'll try to make sure to keep
this issue in mind for the next version.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-20 14:42     ` Quentin Perret
@ 2018-04-20 16:27       ` Leo Yan
  2018-04-25  8:23         ` Quentin Perret
  0 siblings, 1 reply; 44+ messages in thread
From: Leo Yan @ 2018-04-20 16:27 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Dietmar Eggemann, linux-kernel, Peter Zijlstra, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

On Fri, Apr 20, 2018 at 03:42:45PM +0100, Quentin Perret wrote:
> Hi Leo,
> 
> On Wednesday 18 Apr 2018 at 20:15:47 (+0800), Leo Yan wrote:
> > Sorry for the mess of spreading my questions over several replies; later
> > I will try to ask my questions in a single reply.  Below are more
> > questions which are good to bring up:
> > 
> > The code for the energy computation is quite neat and simple, but I think
> > the energy computation mixes two concepts of CPU util: one concept is
> > the estimated CPU util which is used to select the CPU OPP in schedutil,
> > the other concept is the raw CPU util according to the CPU's real running
> > time; for example, cpu_util_next() predicts the CPU util, but this value
> > might be much higher than cpu_util(), especially after the UTIL_EST
> > feature is enabled (I have only a shallow understanding of UTIL_EST, so
> > correct me as needed);
> 
> I'm not sure I understand what you mean by higher than cpu_util()
> here... In which case would that happen?

After the UTIL_EST feature is enabled, cpu_util_next() returns a higher
value than cpu_util(); see the code below, 'util = max(util, util_est);'.
As a result, cpu_util_next() takes into account the extra compensation
introduced by UTIL_EST.

	if (sched_feat(UTIL_EST)) {
	        util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
	        if (dst_cpu == cpu)
	                util_est += _task_util_est(p);
	        else
	                util_est = max_t(long, util_est - _task_util_est(p), 0);
	        util = max(util, util_est);
	}

> cpu_util_next() is basically used to figure out what will be the
> cpu_util() of CPU A after task p has been enqueued on CPU B (no matter
> what A and B are).

Same as the description above: cpu_util_next() is not the same thing
as cpu_util(); cpu_util_next() takes into account the extra compensation
introduced by UTIL_EST.

> > but this patch simply computes the CPU capacity and energy with a single
> > CPU utilization value (and it will be an inflated value after enabling
> > UTIL_EST).  Is this intended to keep the implementation simple?
> > 
> > IMHO, cpu_util_next() can be used to predict the CPU capacity; on the
> > other hand, should we use the CPU util without the UTIL_EST capping for
> > 'sum_util'?  That might reflect the CPU utilization more reasonably.
> 
> Why would a decayed utilisation be a better estimate of the time that
> a task is going to spend on a CPU ?

IIUC, in the scheduler wakeup path task_util() is the task utilisation
from before the task slept, so it's not a decayed value.  cpu_util() is
a decayed value, but is that just because we want to reflect the CPU's
historic utilisation in the recent past?  This is the reason I bring up
using 'cpu_util() + task_util()' as the estimation.

I understand this patch tries to use the pre-decayed value; please check
whether the example below has an issue or not:
if one CPU's cfs_rq->avg.util_est.enqueued is quite a high value, and this
CPU then enters idle and sleeps for a long while, then using
cfs_rq->avg.util_est.enqueued to estimate the CPU utilisation might
deviate a lot from the CPU's actual run time if we place the woken task
on it.  On the other hand, cpu_util() can decay over the CPU's idle
time...

> > Furthermore, if we consider an RT thread running on a CPU with the
> > 'schedutil' governor, the CPU will run at the maximum frequency, but we
> > cannot say the CPU has 100% utilization.  The RT thread case is not
> > handled in this patch.
> 
> Right, we don't account for RT tasks in the OPP prediction for now.
> Vincent's patches to have a util_avg for RT runqueues could help us
> do that I suppose ...

Good to know this.

> Thanks !
> Quentin

^ permalink raw reply	[flat|nested] 44+ messages in thread

* Re: [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function
  2018-04-20 16:27       ` Leo Yan
@ 2018-04-25  8:23         ` Quentin Perret
  0 siblings, 0 replies; 44+ messages in thread
From: Quentin Perret @ 2018-04-25  8:23 UTC (permalink / raw)
  To: Leo Yan
  Cc: Dietmar Eggemann, linux-kernel, Peter Zijlstra, Thara Gopinath,
	linux-pm, Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Rafael J . Wysocki, Greg Kroah-Hartman,
	Vincent Guittot, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Juri Lelli, Steve Muckle, Eduardo Valentin

Hi Leo,

Sorry for the delay in responding...

On Saturday 21 Apr 2018 at 00:27:53 (+0800), Leo Yan wrote:
> On Fri, Apr 20, 2018 at 03:42:45PM +0100, Quentin Perret wrote:
> > Hi Leo,
> > 
> > On Wednesday 18 Apr 2018 at 20:15:47 (+0800), Leo Yan wrote:
> > > Sorry for the mess of spreading my questions over several replies; later
> > > I will try to ask my questions in a single reply.  Below are more
> > > questions which are good to bring up:
> > > 
> > > The code for the energy computation is quite neat and simple, but I think
> > > the energy computation mixes two concepts of CPU util: one concept is
> > > the estimated CPU util which is used to select the CPU OPP in schedutil,
> > > the other concept is the raw CPU util according to the CPU's real running
> > > time; for example, cpu_util_next() predicts the CPU util, but this value
> > > might be much higher than cpu_util(), especially after the UTIL_EST
> > > feature is enabled (I have only a shallow understanding of UTIL_EST, so
> > > correct me as needed);
> > 
> > I'm not sure I understand what you mean by higher than cpu_util()
> > here... In which case would that happen?
> 
> After the UTIL_EST feature is enabled, cpu_util_next() returns a higher
> value than cpu_util(); see the code below, 'util = max(util, util_est);'.
> As a result, cpu_util_next() takes into account the extra compensation
> introduced by UTIL_EST.
> 
> 	if (sched_feat(UTIL_EST)) {
> 	        util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
> 	        if (dst_cpu == cpu)
> 	                util_est += _task_util_est(p);
> 	        else
> 	                util_est = max_t(long, util_est - _task_util_est(p), 0);
> 	        util = max(util, util_est);
> 	}

So, cpu_util() accounts for the UTIL_EST compensation:

	static inline unsigned long cpu_util(int cpu)
	{
		struct cfs_rq *cfs_rq;
		unsigned int util;

		cfs_rq = &cpu_rq(cpu)->cfs;
		util = READ_ONCE(cfs_rq->avg.util_avg);

		if (sched_feat(UTIL_EST))
			util = max(util, READ_ONCE(cfs_rq->avg.util_est.enqueued));

		return min_t(unsigned long, util, capacity_orig_of(cpu));
	}

So cpu_util_next() just mimics that.
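
To make that concrete, here is a minimal sketch of the kind of
estimation cpu_util_next() performs (reconstructed and simplified for
this discussion, so treat the details as an approximation of the patch
rather than the patch itself):

	static unsigned long cpu_util_next(int cpu, struct task_struct *p,
					   int dst_cpu)
	{
		struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
		unsigned long util, util_est;

		/* p neither leaves nor joins this CPU: nothing changes */
		if (task_cpu(p) == dst_cpu || (cpu != task_cpu(p) && cpu != dst_cpu))
			return cpu_util(cpu);

		util = READ_ONCE(cfs_rq->avg.util_avg);

		/* Move p's (decayed) util_avg contribution to/from this CPU */
		if (dst_cpu == cpu)
			util += task_util(p);
		else
			util = max_t(long, util - task_util(p), 0);

		if (sched_feat(UTIL_EST)) {
			util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);

			/* ... and the same for the (non-decayed) util_est part */
			if (dst_cpu == cpu)
				util_est += _task_util_est(p);
			else
				util_est = max_t(long, util_est - _task_util_est(p), 0);

			util = max(util, util_est);
		}

		/* Clamp, exactly like cpu_util() does */
		return min_t(unsigned long, util, capacity_orig_of(cpu));
	}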

> 
> > cpu_util_next() is basically used to figure out what will be the
> > cpu_util() of CPU A after task p has been enqueued on CPU B (no matter
> > what A and B are).
> 
> Same as the description above: cpu_util_next() is not the same thing
> as cpu_util(); cpu_util_next() takes into account the extra compensation
> introduced by UTIL_EST.
> 
> > > but this patch simply computes the CPU capacity and energy with a single
> > > CPU utilization value (and it will be an inflated value after enabling
> > > UTIL_EST).  Is this intended to keep the implementation simple?
> > > 
> > > IMHO, cpu_util_next() can be used to predict the CPU capacity; on the
> > > other hand, should we use the CPU util without the UTIL_EST capping for
> > > 'sum_util'?  That might reflect the CPU utilization more reasonably.
> > 
> > Why would a decayed utilisation be a better estimate of the time that
> > a task is going to spend on a CPU ?
> 
> IIUC, in the scheduler wakeup path task_util() is the task utilisation
> from before the task slept, so it's not a decayed value.

I don't think this is correct. sync_entity_load_avg() is called in
select_task_rq_fair() so task_util() *is* decayed upon wakeup.
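
For context, that synchronisation sits in the wakeup path roughly as
follows (a simplified sketch of the mainline code around this time; the
exact call sites are an approximation):

	/* select_task_rq_fair() -> wake_cap() / find_idlest_cpu() */
	sync_entity_load_avg(&p->se);	/* ages p's util_avg up to "now" */
	...
	task_util(p);			/* hence returns a decayed value */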

> cpu_util() is
> a decayed value,

This is not necessarily correct either. As mentioned above, cpu_util()
includes the UTIL_EST compensation, so the value isn't necessarily
decayed.

> but is that just because we want to reflect the CPU's historic
> utilisation in the recent past?  This is the reason I bring up
> using 'cpu_util() + task_util()' as the estimation.
> 
> I understand this patch tries to use the pre-decayed value;

No, this patch tries to estimate what will be the return value of
cpu_util() if the task is enqueued on a specific CPU. This value can be
the util_avg (decayed) or the util_est (non-decayed) depending on the
conditions.

> please check
> whether the example below has an issue or not:
> if one CPU's cfs_rq->avg.util_est.enqueued is quite a high value, and this
> CPU then enters idle and sleeps for a long while, then using
> cfs_rq->avg.util_est.enqueued to estimate the CPU utilisation might
> deviate a lot from the CPU's actual run time if we place the woken task
> on it.  On the other hand, cpu_util() can decay over the CPU's idle
> time...
> 
> > > Furthermore, if we consider an RT thread running on a CPU with the
> > > 'schedutil' governor, the CPU will run at the maximum frequency, but we
> > > cannot say the CPU has 100% utilization.  The RT thread case is not
> > > handled in this patch.
> > 
> > Right, we don't account for RT tasks in the OPP prediction for now.
> > Vincent's patches to have a util_avg for RT runqueues could help us
> > do that I suppose ...
> 
> Good to know this.
> 
> > Thanks !
> > Quentin

^ permalink raw reply	[flat|nested] 44+ messages in thread

end of thread, other threads:[~2018-04-25  8:23 UTC | newest]

Thread overview: 44+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-04-06 15:36 [RFC PATCH v2 0/6] Energy Aware Scheduling Dietmar Eggemann
2018-04-06 15:36 ` [RFC PATCH v2 1/6] sched/fair: Create util_fits_capacity() Dietmar Eggemann
2018-04-12  7:02   ` Viresh Kumar
2018-04-12  8:20     ` Dietmar Eggemann
2018-04-06 15:36 ` [RFC PATCH v2 2/6] sched: Introduce energy models of CPUs Dietmar Eggemann
2018-04-10 11:54   ` Peter Zijlstra
2018-04-10 12:03     ` Dietmar Eggemann
2018-04-13  4:02   ` Viresh Kumar
2018-04-13  8:37     ` Quentin Perret
2018-04-06 15:36 ` [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator Dietmar Eggemann
2018-04-13 23:56   ` Joel Fernandes
2018-04-18 11:17     ` Quentin Perret
2018-04-20  8:13       ` Joel Fernandes
2018-04-20  8:14         ` Joel Fernandes
2018-04-20  8:31           ` Quentin Perret
2018-04-20  8:57             ` Juri Lelli
2018-04-17 14:25   ` Leo Yan
2018-04-17 17:39     ` Dietmar Eggemann
2018-04-18  0:18       ` Leo Yan
2018-04-06 15:36 ` [RFC PATCH v2 4/6] sched/fair: Introduce an energy estimation helper function Dietmar Eggemann
2018-04-10 12:51   ` Peter Zijlstra
2018-04-10 13:56     ` Quentin Perret
2018-04-10 14:08       ` Peter Zijlstra
2018-04-13  6:27   ` Viresh Kumar
2018-04-17 15:22   ` Leo Yan
2018-04-18  8:13     ` Quentin Perret
2018-04-18  9:19       ` Leo Yan
2018-04-18 11:06         ` Quentin Perret
2018-04-18  9:23   ` Leo Yan
2018-04-20 14:51     ` Quentin Perret
2018-04-18 12:15   ` Leo Yan
2018-04-20 14:42     ` Quentin Perret
2018-04-20 16:27       ` Leo Yan
2018-04-25  8:23         ` Quentin Perret
2018-04-06 15:36 ` [RFC PATCH v2 5/6] sched/fair: Select an energy-efficient CPU on task wake-up Dietmar Eggemann
2018-04-09 16:30   ` Peter Zijlstra
2018-04-09 16:43     ` Quentin Perret
2018-04-10 17:29   ` Peter Zijlstra
2018-04-10 18:14     ` Quentin Perret
2018-04-17 15:39   ` Leo Yan
2018-04-18  7:57     ` Quentin Perret
2018-04-06 15:36 ` [RFC PATCH v2 6/6] drivers: base: arch_topology.c: Enable EAS for arm/arm64 platforms Dietmar Eggemann
2018-04-17 12:50 ` [RFC PATCH v2 0/6] Energy Aware Scheduling Leo Yan
2018-04-17 17:22   ` Dietmar Eggemann

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).