* [PATCH v5 00/14] Energy Aware Scheduling
@ 2018-07-24 12:25 Quentin Perret
  2018-07-24 12:25 ` [PATCH v5 01/14] sched: Relocate arch_scale_cpu_capacity Quentin Perret
                   ` (13 more replies)
  0 siblings, 14 replies; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

The Energy Aware Scheduler (EAS) based on Morten Rasmussen's posting on
LKML [1] is currently part of the AOSP Common Kernel and runs on today's
smartphones with Arm's big.LITTLE CPUs. This series implements a new and
largely simplified version of EAS based on an Energy Model (EM) of the
platform with only cost information for the active states of the CPUs.

The patch-set is organized in three main parts.
1. Patches 01-04/14 introduce a centralized and independent EM
   management framework
2. Patches 05-12/14 make use of the EM in the scheduler to bias task
   placement decisions
3. Patches 13-14/14 give an Arm64 example on how to register an Energy
   Model in the new framework


1. The Energy Model Framework

The energy consumed by the CPUs can be provided to the OS in different
ways depending on the source of information (firmware or device tree for
example). The EM framework introduced in patch 03/14 addresses this
issue by aggregating the data coming from the drivers in a standard way
and making it available to interested clients (thermal or the task
scheduler, for example).

Although this series comprises patches for the task scheduler as a user
of the EM, the framework itself as introduced in patch 03/14 is meant to
be _independent_ from any other subsystem.

The overall design of the EM framework is depicted in the diagram below
(focused on Arm drivers for the example, but applicable to any
architecture).

      +---------------+  +-----------------+  +-------------+
      | Thermal (IPA) |  | Scheduler (EAS) |  |    Other    |
      +---------------+  +-----------------+  +-------------+
              |                   | em_fd_energy()   |
              |                   | em_cpu_get()     |
              +-----------+       |       +----------+
                          |       |       |
                          v       v       v
                       +---------------------+
                       |                     |
                       |    Energy Model     |
                       |                     |
                       |     Framework       |
                       |                     |
                       +---------------------+
                          ^       ^       ^
                          |       |       | em_register_freq_domain()
               +----------+       |       +---------+
               |                  |                 |
       +---------------+  +---------------+  +--------------+
       |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
       +---------------+  +---------------+  +--------------+
               ^                  ^                 ^
               |                  |                 |
       +--------------+   +---------------+  +--------------+
       | Device Tree  |   |   Firmware    |  |      ?       |
       +--------------+   +---------------+  +--------------+

Drivers can register data in the EM framework using the
em_register_freq_domain() API. They are expected to provide a callback
function that the EM framework can use to build energy cost tables and
store them in shared data structures. Then, clients such as the task
scheduler are allowed to read those shared structures using the
em_fd_energy() and em_cpu_get() APIs. More details about the different
APIs of the framework can be found in patch 03/14.
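
As an illustration of the registration flow described above, here is a
minimal user-space sketch (not the kernel code itself). It assumes a
made-up table of operating points, and the demo_* names are hypothetical
stand-ins: demo_active_power() plays the role of the driver callback, and
demo_build_table() mimics how the framework could walk the capacity states
and derive the 'cost' field (power * max_frequency / frequency).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operating points of one CPU: frequency (kHz), power (mW). */
struct demo_opp {
	unsigned long khz;
	unsigned long mw;
};

static const struct demo_opp demo_opps[] = {
	{  500000, 100 },
	{ 1000000, 300 },
	{ 1500000, 700 },
};
#define DEMO_NR_OPPS (sizeof(demo_opps) / sizeof(demo_opps[0]))

/* Same fields as struct em_cap_state in patch 03/14. */
struct demo_cap_state {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* mW */
	unsigned long cost;		/* power * max_frequency / frequency */
};

/*
 * Follows the active_power() contract: round '*freq' up to the lowest
 * known state at or above it, and report that state's power.
 * Returns 0 on success, -1 if no such state exists.
 */
static int demo_active_power(unsigned long *power, unsigned long *freq, int cpu)
{
	size_t i;

	(void)cpu;	/* every CPU shares the same table in this sketch */
	for (i = 0; i < DEMO_NR_OPPS; i++) {
		if (demo_opps[i].khz >= *freq) {
			*freq = demo_opps[i].khz;
			*power = demo_opps[i].mw;
			return 0;
		}
	}
	return -1;
}

/*
 * Walk the states from the lowest one upwards, the way the framework
 * does when a frequency domain is registered, then derive the costs.
 */
static int demo_build_table(struct demo_cap_state *table, int nr_states)
{
	unsigned long power, freq = 0, fmax;
	int i;

	assert(table && nr_states > 0);
	for (i = 0; i < nr_states; i++, freq++) {
		if (demo_active_power(&power, &freq, 0))
			return -1;
		table[i].frequency = freq;
		table[i].power = power;
	}

	fmax = table[nr_states - 1].frequency;
	for (i = 0; i < nr_states; i++)
		table[i].cost = table[i].power * fmax / table[i].frequency;

	return 0;
}
```

With the made-up numbers above, calling demo_build_table() for three
states yields costs of 300, 450 and 700, while asking for a fourth state
fails because the callback cannot find a state above 1500000 kHz.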


2. Energy-aware task placement in the task scheduler

Patches 05-12/14 make use of the newly introduced EM in the scheduler to
bias task placement decisions. When the system is detected as
non-“overutilized”, an EM is available, and the platform has an
asymmetric CPU capacity topology (e.g. big.LITTLE), the consequences on
energy of placing a waking task on a CPU are taken into account to avoid
energy-inefficient CPUs if possible. Patches 05-07/14 modify the
scheduler topology code in order to: 1) check if all conditions for EAS
are met when the scheduling domains are built; and 2) create data
structures holding references to the EM tables that can be accessed in
latency sensitive code paths (e.g. wake-up path).

An “overutilized” flag (patches 08-09/14) is attached to the root domain,
and is set whenever a CPU is utilized at more than 80% of its capacity.
Patches 10-12/14 introduce the new energy-aware wake-up path which makes
use of the data structures introduced in patches 05-07/14 whenever the
system isn’t overutilized.
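
The 80% tipping point can be sketched in user space as follows. This is
only an illustration of the rule stated above, not the code of patches
08-09/14: the demo_* helpers are hypothetical, and utilization and
capacity are assumed to be expressed on the same scale (1024 being the
capacity of the biggest CPU).

```c
#include <stdbool.h>
#include <stddef.h>

/* A CPU crosses the tipping point above 80% of its capacity. */
static bool demo_cpu_overutilized(unsigned long util, unsigned long capacity)
{
	/* util / capacity > 80 / 100, written without a division. */
	return util * 100 > capacity * 80;
}

/* System-wide flag: set as soon as any single CPU is over the threshold. */
static bool demo_system_overutilized(const unsigned long *util,
				     const unsigned long *capacity,
				     size_t nr_cpus)
{
	size_t i;

	for (i = 0; i < nr_cpus; i++) {
		if (demo_cpu_overutilized(util[i], capacity[i]))
			return true;
	}
	return false;
}
```

For example, a little CPU of capacity 446 with a utilization of 400 is
enough to flag the whole system, whereas the same utilization on a big
CPU of capacity 1024 is not.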


3. Arm example of driver modifications to register an EM

Patches 13-14/14 show an example of how drivers should be modified to
register an EM in the new framework. The patches target Arm drivers,
as an example, but the same ideas should be applicable to other
architectures. Patch 13/14 rebuilds the scheduling domains once CPUFreq
is up and running, and after the asymmetry of the system has been
discovered. Patch 14/14 changes the cpufreq-dt driver (used for testing
on Hikey960, see Section 4.) to provide estimated power values to the
EM framework using coefficients read from DT. This patch has been made
simple and self-contained intentionally in order to show an example of
usage of the EM framework.


4. Test results

Two fundamentally different tests were executed. First, the energy test
case shows the impact of this patch-set on energy consumption using a
synthetic set of tasks. Second, the performance test case provides the
conventional hackbench metric numbers.

The tests run on two arm64 big.LITTLE platforms: Hikey960 (4xA73 +
4xA53) and Juno r0 (2xA57 + 4xA53).

Base kernel is tip/sched/core (4.18-rc5), with some Hikey960 and Juno
specific patches, the SD_ASYM_CPUCAPACITY flag set at DIE sched domain
level for arm64 and schedutil as cpufreq governor [2].

4.1 Energy test case

10 iterations of between 10 and 50 periodic rt-app tasks (16ms period,
5% duty-cycle) for 30 seconds with energy measurement. Unit is Joules.
The goal is to save energy, so lower is better.

4.1.1 Hikey960

Energy is measured with an ACME Cape on an instrumented board. Numbers
include consumption of big and little CPUs, LPDDR memory, GPU and most
of the other small components on the board. They do not include
consumption of the radio chip (turned-off anyway) and external
connectors.

+----------+-----------------+-------------------------+
|          | Without patches | With patches            |
+----------+--------+--------+------------------+------+
| Tasks nb |  Mean  | RSD*   | Mean             | RSD* |
+----------+--------+--------+------------------+------+
|       10 |  32.16 |   1.3% |  30.36  (-5.60%) | 1.2% |
|       20 |  50.28 |   1.3% |  44.79 (-10.92%) | 0.6% |
|       30 |  67.59 |   6.1% |  59.32 (-12.24%) | 1.4% |
|       40 |  91.47 |   2.8% |  85.96  (-6.02%) | 3.7% |
|       50 | 131.39 |   6.6% | 111.42 (-15.20%) | 4.8% |
+----------+--------+--------+------------------+------+

4.1.2 Juno r0

Energy is measured with the onboard energy meter. Numbers include
consumption of big and little CPUs.

+----------+-----------------+------------------------+
|          | Without patches | With patches           |
+----------+--------+--------+-----------------+------+
| Tasks nb |  Mean  | RSD*   | Mean            | RSD* |
+----------+--------+--------+-----------------+------+
|       10 |  11.07 |   3.2% |  8.04 (-27.37%) | 2.2% |
|       20 |  20.14 |   4.2% | 14.20 (-29.49%) | 1.4% |
|       30 |  32.67 |   3.5% | 24.06 (-26.35%) | 3.0% |
|       40 |  46.23 |   1.0% | 36.87 (-20.24%) | 7.3% |
|       50 |  57.36 |   0.5% | 54.69 ( -4.65%) | 0.7% |
+----------+--------+--------+-----------------+------+

4.2 Performance test case

30 iterations of perf bench sched messaging --pipe --thread --group G
--loop L with G=[1 2 4 8] and L=50000 (Hikey960)/16000 (Juno r0).

4.2.1 Hikey960

The impact of thermal capping was mitigated thanks to a heatsink, a
fan, and a 10 sec delay between two successive executions.

+----------------+-----------------+------------------------+
|                | Without patches | With patches           |
+--------+-------+---------+-------+----------------+-------+
| Groups | Tasks | Mean    | RSD*  | Mean           | RSD*  |
+--------+-------+---------+-------+----------------+-------+
|      1 |    40 |    8.01 | 1.13% |  8.01 (+0.00%) | 1.40% |
|      2 |    80 |   14.57 | 0.53% | 14.57 (+0.00%) | 0.63% |
|      4 |   160 |   29.92 | 0.60% | 30.79 (+2.91%) | 0.49% |
|      8 |   320 |   63.42 | 0.68% | 65.27 (+2.92%) | 0.43% |
+--------+-------+---------+-------+----------------+-------+

4.2.2 Juno r0

+----------------+-----------------+------------------------+
|                | Without patches | With patches           |
+--------+-------+---------+-------+----------------+-------+
| Groups | Tasks | Mean    | RSD*  | Mean           | RSD*  |
+--------+-------+---------+-------+----------------+-------+
|      1 |    40 |    7.76 | 0.11% |  7.83 (+0.90%) | 0.11% |
|      2 |    80 |   14.22 | 0.14% | 14.41 (+1.34%) | 0.15% |
|      4 |   160 |   26.95 | 0.34% | 27.08 (+0.48%) | 0.24% |
|      8 |   320 |   54.38 | 1.65% | 55.94 (+2.87%) | 3.70% |
+--------+-------+---------+-------+----------------+-------+

*RSD: Relative Standard Deviation (std dev / mean)


5. Version history:

Changes v4[3]->v5:
- Removed the RCU protection of the EM tables and the associated
  need for em_rescale_cpu_capacity().
- Factorized schedutil’s PELT aggregation function with EAS
- Improved comments/doc in the EM framework
- Added check on the uarch of CPUs in one fd in the EM framework
- Reduced CONFIG_ENERGY_MODEL ifdefery in kernel/sched/topology.c
- Cleaned-up update_sg_lb_stats parameters
- Improved comments in compute_energy() to explain the multi-rd
  scenarios

Changes v3[4]->v4:
- Replaced spinlock in EM framework by smp_store_release/READ_ONCE
- Fixed missing locks to protect rcu_assign_pointer in EM framework
- Fixed capacity calculation in EM framework on 32 bits system
- Fixed compilation issue for CONFIG_ENERGY_MODEL=n
- Removed cpumask from struct em_freq_domain, now dynamically allocated
- Power costs of the EM are specified in milliwatts
- Added example of CPUFreq driver modification
- Added doc/comments in the EM framework and better commit header
- Fixed integration issue with util_est in cpu_util_next()
- Changed scheduler topology code to have one freq. dom. list per rd
- Split sched topology patch in smaller patches
- Added doc/comments explaining the heuristic in the wake-up path
- Changed energy threshold for migration from 1.5% to 6%

Changes v2[5]->v3:
- Removed the PM_OPP dependency by implementing a new EM framework
- Modified the scheduler topology code to take references on the EM data
  structures
- Simplified the overutilization mechanism into a system-wide flag
- Reworked the integration in the wake-up path using the sd_ea shortcut
- Rebased on tip/sched/core (247f2f6f3c70 "sched/core: Don't schedule
  threads on pre-empted vCPUs")

Changes v1[6]->v2:
- Reworked interface between fair.c and energy.[ch] (Remove #ifdef
  CONFIG_PM_OPP from energy.c) (Greg KH)
- Fixed licence & header issue in energy.[ch] (Greg KH)
- Reordered EAS path in select_task_rq_fair() (Joel)
- Avoid prev_cpu if not allowed in select_task_rq_fair() (Morten/Joel)
- Refactored compute_energy() (Patrick)
- Account for RT/IRQ pressure in task_fits() (Patrick)
- Use UTIL_EST and DL utilization during OPP estimation (Patrick/Juri)
- Optimize selection of CPU candidates in the energy-aware wake-up path
- Rebased on top of tip/sched/core (commit b720342849fe “sched/core:
  Update Preempt_notifier_key to modern API”)

[1] https://lkml.org/lkml/2015/7/7/754
[2] http://www.linux-arm.org/git?p=linux-qp.git;a=shortlog;h=refs/heads/upstream/eas_v5
[3] https://marc.info/?l=linux-kernel&m=153018606728533&w=2
[4] https://marc.info/?l=linux-kernel&m=152691273111941&w=2
[5] https://marc.info/?l=linux-kernel&m=152302902427143&w=2
[6] https://marc.info/?l=linux-kernel&m=152153905805048&w=2


Morten Rasmussen (1):
  sched: Add over-utilization/tipping point indicator

Quentin Perret (13):
  sched: Relocate arch_scale_cpu_capacity
  sched/cpufreq: Factor out utilization to frequency mapping
  PM: Introduce an Energy Model management framework
  PM / EM: Expose the Energy Model in sysfs
  sched/topology: Reference the Energy Model of CPUs when available
  sched/topology: Lowest energy aware balancing sched_domain level
    pointer
  sched/topology: Introduce sched_energy_present static key
  sched/fair: Clean-up update_sg_lb_stats parameters
  sched/cpufreq: Refactor the utilization aggregation method
  sched/fair: Introduce an energy estimation helper function
  sched/fair: Select an energy-efficient CPU on task wake-up
  OPTIONAL: arch_topology: Start Energy Aware Scheduling
  OPTIONAL: cpufreq: dt: Register an Energy Model

 drivers/base/arch_topology.c     |   2 +
 drivers/cpufreq/cpufreq-dt.c     |  45 ++++-
 include/linux/energy_model.h     | 162 +++++++++++++++++
 include/linux/sched/cpufreq.h    |   6 +
 include/linux/sched/topology.h   |  19 ++
 kernel/power/Kconfig             |  15 ++
 kernel/power/Makefile            |   2 +
 kernel/power/energy_model.c      | 289 +++++++++++++++++++++++++++++++
 kernel/sched/cpufreq_schedutil.c |  89 ++++++----
 kernel/sched/fair.c              | 273 ++++++++++++++++++++++++++---
 kernel/sched/sched.h             |  85 ++++++---
 kernel/sched/topology.c          | 214 ++++++++++++++++++++++-
 12 files changed, 1125 insertions(+), 76 deletions(-)
 create mode 100644 include/linux/energy_model.h
 create mode 100644 kernel/power/energy_model.c

-- 
2.18.0



* [PATCH v5 01/14] sched: Relocate arch_scale_cpu_capacity
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  2018-07-24 12:25 ` [PATCH v5 02/14] sched/cpufreq: Factor out utilization to frequency mapping Quentin Perret
                   ` (12 subsequent siblings)
  13 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

By default, arch_scale_cpu_capacity() is only visible from within the
kernel/sched folder. Relocate it to include/linux/sched/topology.h to
make it visible to other clients needing to know about the capacity of
CPUs, such as the Energy Model framework.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 include/linux/sched/topology.h | 19 +++++++++++++++++++
 kernel/sched/sched.h           | 21 ---------------------
 2 files changed, 19 insertions(+), 21 deletions(-)

diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
index 26347741ba50..1e24e88bee6d 100644
--- a/include/linux/sched/topology.h
+++ b/include/linux/sched/topology.h
@@ -202,6 +202,17 @@ extern void set_sched_topology(struct sched_domain_topology_level *tl);
 # define SD_INIT_NAME(type)
 #endif
 
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
+{
+	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
+		return sd->smt_gain / sd->span_weight;
+
+	return SCHED_CAPACITY_SCALE;
+}
+#endif
+
 #else /* CONFIG_SMP */
 
 struct sched_domain_attr;
@@ -217,6 +228,14 @@ static inline bool cpus_share_cache(int this_cpu, int that_cpu)
 	return true;
 }
 
+#ifndef arch_scale_cpu_capacity
+static __always_inline
+unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
+{
+	return SCHED_CAPACITY_SCALE;
+}
+#endif
+
 #endif	/* !CONFIG_SMP */
 
 static inline int task_node(const struct task_struct *p)
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index ebb4b3c3ece7..2a72f1b9be0f 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1751,27 +1751,6 @@ unsigned long arch_scale_freq_capacity(int cpu)
 }
 #endif
 
-#ifdef CONFIG_SMP
-#ifndef arch_scale_cpu_capacity
-static __always_inline
-unsigned long arch_scale_cpu_capacity(struct sched_domain *sd, int cpu)
-{
-	if (sd && (sd->flags & SD_SHARE_CPUCAPACITY) && (sd->span_weight > 1))
-		return sd->smt_gain / sd->span_weight;
-
-	return SCHED_CAPACITY_SCALE;
-}
-#endif
-#else
-#ifndef arch_scale_cpu_capacity
-static __always_inline
-unsigned long arch_scale_cpu_capacity(void __always_unused *sd, int cpu)
-{
-	return SCHED_CAPACITY_SCALE;
-}
-#endif
-#endif
-
 struct rq *__task_rq_lock(struct task_struct *p, struct rq_flags *rf)
 	__acquires(rq->lock);
 
-- 
2.18.0



* [PATCH v5 02/14] sched/cpufreq: Factor out utilization to frequency mapping
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
  2018-07-24 12:25 ` [PATCH v5 01/14] sched: Relocate arch_scale_cpu_capacity Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  2018-07-24 12:25 ` [PATCH v5 03/14] PM: Introduce an Energy Model management framework Quentin Perret
                   ` (11 subsequent siblings)
  13 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

The schedutil governor maps utilization values to frequencies by applying
a 25% margin. Since this sort of mapping mechanism can be needed by other
users (e.g. EAS), factor the utilization-to-frequency mapping code out
of schedutil and move it to include/linux/sched/cpufreq.h to avoid code
duplication. The new map_util_freq() function is inlined to avoid
overheads.
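
To make the effect of the 25% margin concrete, the mapping can be
reproduced outside the kernel. The helper below is a user-space copy of
the arithmetic for illustration only (demo_map_util_freq() is a
hypothetical name), and the values discussed after it assume a maximum
frequency of 1000000 kHz and a CPU capacity of 1024.

```c
/*
 * Same arithmetic as map_util_freq(): request 1.25 * freq * util / cap,
 * where 'freq >> 2' adds the 25% margin without a multiplication.
 */
static unsigned long demo_map_util_freq(unsigned long util,
					unsigned long freq,
					unsigned long cap)
{
	return (freq + (freq >> 2)) * util / cap;
}
```

Under those assumptions, a utilization of 512 (50%) maps to 625000 kHz,
and the requested frequency exceeds the maximum once the utilization goes
past roughly 80% of the capacity (819 maps to 999755 kHz, 820 to
1000976 kHz).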

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 include/linux/sched/cpufreq.h    | 6 ++++++
 kernel/sched/cpufreq_schedutil.c | 3 ++-
 2 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/sched/cpufreq.h b/include/linux/sched/cpufreq.h
index 59667444669f..afa940cd50dc 100644
--- a/include/linux/sched/cpufreq.h
+++ b/include/linux/sched/cpufreq.h
@@ -20,6 +20,12 @@ void cpufreq_add_update_util_hook(int cpu, struct update_util_data *data,
                        void (*func)(struct update_util_data *data, u64 time,
 				    unsigned int flags));
 void cpufreq_remove_update_util_hook(int cpu);
+
+static inline unsigned long map_util_freq(unsigned long util,
+					unsigned long freq, unsigned long cap)
+{
+	return (freq + (freq >> 2)) * util / cap;
+}
 #endif /* CONFIG_CPU_FREQ */
 
 #endif /* _LINUX_SCHED_CPUFREQ_H */
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 97dcd4472a0e..810193c8e4cd 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -13,6 +13,7 @@
 
 #include "sched.h"
 
+#include <linux/sched/cpufreq.h>
 #include <trace/events/power.h>
 
 struct sugov_tunables {
@@ -167,7 +168,7 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
 	unsigned int freq = arch_scale_freq_invariant() ?
 				policy->cpuinfo.max_freq : policy->cur;
 
-	freq = (freq + (freq >> 2)) * util / max;
+	freq = map_util_freq(util, freq, max);
 
 	if (freq == sg_policy->cached_raw_freq && !sg_policy->need_freq_update)
 		return sg_policy->next_freq;
-- 
2.18.0



* [PATCH v5 03/14] PM: Introduce an Energy Model management framework
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
  2018-07-24 12:25 ` [PATCH v5 01/14] sched: Relocate arch_scale_cpu_capacity Quentin Perret
  2018-07-24 12:25 ` [PATCH v5 02/14] sched/cpufreq: Factor out utilization to frequency mapping Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  2018-08-09 21:52   ` Rafael J. Wysocki
  2018-07-24 12:25 ` [PATCH v5 04/14] PM / EM: Expose the Energy Model in sysfs Quentin Perret
                   ` (10 subsequent siblings)
  13 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

Several subsystems in the kernel (task scheduler and/or thermal at the
time of writing) can benefit from knowing about the energy consumed by
CPUs. Yet, this information can come from different sources (DT or
firmware for example), in different formats, hence making it hard to
exploit without a standard API.

As an attempt to address this, introduce a centralized Energy Model
(EM) management framework which aggregates the power values provided
by drivers into a table for each frequency domain in the system. The
power cost tables are made available to interested clients (e.g. task
scheduler or thermal) via platform-agnostic APIs. The overall design
is represented by the diagram below (focused on Arm-related drivers as
an example, but applicable to any architecture):

     +---------------+  +-----------------+  +-------------+
     | Thermal (IPA) |  | Scheduler (EAS) |  |    Other    |
     +---------------+  +-----------------+  +-------------+
             |                 | em_fd_energy()     |
             |                 | em_cpu_get()       |
             +-----------+     |         +----------+
                         |     |         |
                         v     v         v
                      +---------------------+
                      |                     |
                      |    Energy Model     |
                      |                     |
                      |     Framework       |
                      |                     |
                      +---------------------+
                         ^       ^       ^
                         |       |       | em_register_freq_domain()
              +----------+       |       +---------+
              |                  |                 |
      +---------------+  +---------------+  +--------------+
      |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
      +---------------+  +---------------+  +--------------+
              ^                  ^                 ^
              |                  |                 |
      +--------------+   +---------------+  +--------------+
      | Device Tree  |   |   Firmware    |  |      ?       |
      +--------------+   +---------------+  +--------------+

Drivers (typically, but not limited to, CPUFreq drivers) can register
data in the EM framework using the em_register_freq_domain() API. The
calling driver must provide a callback function with a standardized
signature that will be used by the EM framework to build the power
cost tables of the frequency domain. This design should offer a lot of
flexibility to calling drivers which are free to read information
from any location and to use any technique to compute power costs.
Moreover, the capacity states registered by drivers in the EM framework
are not required to match real performance states of the target. This
is particularly important on targets where the performance states are
not known by the OS.

On the client side, the EM framework offers APIs to access the power
cost tables of a CPU (em_cpu_get()), and to estimate the energy
consumed by the CPUs of a frequency domain (em_fd_energy()). Clients
such as the task scheduler can then use these APIs to access the shared
data structures holding the Energy Model of CPUs.
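
As a worked example of the client side, the estimation can be sketched in
user space as shown below. This is an illustration under assumptions, not
the kernel implementation: the demo_* types are hypothetical, the
capacity of the CPUs is stored in the domain instead of being read from
arch_scale_cpu_capacity(), and the table values are made up.

```c
#include <assert.h>

struct demo_cap_state {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* mW */
	unsigned long cost;		/* power * max_frequency / frequency */
};

struct demo_freq_domain {
	const struct demo_cap_state *table;	/* ascending frequencies */
	int nr_cap_states;
	unsigned long scale_cpu;	/* capacity of the CPUs of the domain */
};

/* Estimate the energy of a domain from its max and summed utilization. */
static unsigned long demo_fd_energy(const struct demo_freq_domain *fd,
				    unsigned long max_util,
				    unsigned long sum_util)
{
	const struct demo_cap_state *cs;
	unsigned long freq;
	int i;

	assert(fd && fd->nr_cap_states > 0);

	/* Frequency requested for the busiest CPU, with the 25% margin. */
	cs = &fd->table[fd->nr_cap_states - 1];
	freq = (cs->frequency + (cs->frequency >> 2)) * max_util / fd->scale_cpu;

	/* Lowest state able to serve the request, else the highest one. */
	for (i = 0; i < fd->nr_cap_states; i++) {
		cs = &fd->table[i];
		if (cs->frequency >= freq)
			break;
	}

	return cs->cost * sum_util / fd->scale_cpu;
}
```

With a three-state table of (500000 kHz, 100 mW, cost 300), (1000000 kHz,
300 mW, cost 450) and (1500000 kHz, 700 mW, cost 700) and a capacity of
1024, a domain with max_util = 400 and sum_util = 500 is predicted to run
at the middle state, giving 450 * 500 / 1024 = 219 in the abstract units
of the model.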

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 include/linux/energy_model.h | 161 ++++++++++++++++++++++++++++
 kernel/power/Kconfig         |  15 +++
 kernel/power/Makefile        |   2 +
 kernel/power/energy_model.c  | 199 +++++++++++++++++++++++++++++++++++
 4 files changed, 377 insertions(+)
 create mode 100644 include/linux/energy_model.h
 create mode 100644 kernel/power/energy_model.c

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
new file mode 100644
index 000000000000..be822ce05c17
--- /dev/null
+++ b/include/linux/energy_model.h
@@ -0,0 +1,161 @@
+/* SPDX-License-Identifier: GPL-2.0 */
+#ifndef _LINUX_ENERGY_MODEL_H
+#define _LINUX_ENERGY_MODEL_H
+#include <linux/cpumask.h>
+#include <linux/jump_label.h>
+#include <linux/kobject.h>
+#include <linux/rcupdate.h>
+#include <linux/sched/cpufreq.h>
+#include <linux/sched/topology.h>
+#include <linux/types.h>
+
+#ifdef CONFIG_ENERGY_MODEL
+struct em_cap_state {
+	unsigned long frequency; /* Kilo-hertz */
+	unsigned long power; /* Milli-watts */
+	unsigned long cost; /* power * max_frequency / frequency */
+};
+
+struct em_freq_domain {
+	struct em_cap_state *table; /* Capacity states, in ascending order. */
+	int nr_cap_states;
+	unsigned long cpus[0]; /* CPUs of the frequency domain. */
+};
+
+#define EM_CPU_MAX_POWER 0xFFFF
+
+struct em_data_callback {
+	/**
+	 * active_power() - Provide power at the next capacity state of a CPU
+	 * @power	: Active power at the capacity state in mW (modified)
+	 * @freq	: Frequency at the capacity state in kHz (modified)
+	 * @cpu		: CPU for which we do this operation
+	 *
+	 * active_power() must find the lowest capacity state of 'cpu' above
+	 * 'freq' and update 'power' and 'freq' to the matching active power
+	 * and frequency.
+	 *
+	 * The power is the one of a single CPU in the domain, expressed in
+	 * milli-watts. It is expected to fit in the [0, EM_CPU_MAX_POWER]
+	 * range.
+	 *
+	 * Return 0 on success.
+	 */
+	int (*active_power)(unsigned long *power, unsigned long *freq, int cpu);
+};
+#define EM_DATA_CB(_active_power_cb) { .active_power = &_active_power_cb }
+
+struct em_freq_domain *em_cpu_get(int cpu);
+int em_register_freq_domain(cpumask_t *span, unsigned int nr_states,
+						struct em_data_callback *cb);
+
+/**
+ * em_fd_energy() - Estimates the energy consumed by the CPUs of a freq. domain
+ * @fd		: frequency domain for which energy has to be estimated
+ * @max_util	: highest utilization among CPUs of the domain
+ * @sum_util	: sum of the utilization of all CPUs in the domain
+ *
+ * Return: the sum of the energy consumed by the CPUs of the domain assuming
+ * a capacity state satisfying the max utilization of the domain.
+ */
+static inline unsigned long em_fd_energy(struct em_freq_domain *fd,
+				unsigned long max_util, unsigned long sum_util)
+{
+	unsigned long freq, scale_cpu;
+	struct em_cap_state *cs;
+	int i, cpu;
+
+	/*
+	 * In order to predict the capacity state, map the utilization of the
+	 * most utilized CPU of the frequency domain to a requested frequency,
+	 * like schedutil.
+	 */
+	cpu = cpumask_first(to_cpumask(fd->cpus));
+	scale_cpu = arch_scale_cpu_capacity(NULL, cpu);
+	cs = &fd->table[fd->nr_cap_states - 1];
+	freq = map_util_freq(max_util, cs->frequency, scale_cpu);
+
+	/*
+	 * Find the lowest capacity state of the Energy Model above the
+	 * requested frequency.
+	 */
+	for (i = 0; i < fd->nr_cap_states; i++) {
+		cs = &fd->table[i];
+		if (cs->frequency >= freq)
+			break;
+	}
+
+	/*
+	 * The capacity of a CPU in the domain at that capacity state (cs)
+	 * can be computed as:
+	 *
+	 *             cs->freq * scale_cpu
+	 *   cs->cap = --------------------                          (1)
+	 *                 cpu_max_freq
+	 *
+	 * So, the energy consumed by this CPU at that capacity state is:
+	 *
+	 *             cs->power * cpu_util
+	 *   cpu_nrg = --------------------                          (2)
+	 *                   cs->cap
+	 *
+	 * since 'cpu_util / cs->cap' represents its percentage of busy time.
+	 * By injecting (1) in (2), 'cpu_nrg' can be re-expressed as a product
+	 * of two terms:
+	 *
+	 *             cs->power * cpu_max_freq   cpu_util
+	 *   cpu_nrg = ------------------------ * ---------          (3)
+	 *                    cs->freq            scale_cpu
+	 *
+	 * The first term is static, and is stored in the em_cap_state struct
+	 * as 'cs->cost'.
+	 *
+	 * Since all CPUs of the domain have the same micro-architecture, they
+	 * share the same 'cs->cost', and the same CPU capacity. Hence, the
+	 * total energy of the domain (which is the simple sum of the energy of
+	 * all of its CPUs) can be factorized as:
+	 *
+	 *            cs->cost * \Sum cpu_util
+	 *   fd_nrg = ------------------------                       (4)
+	 *                  scale_cpu
+	 */
+	return cs->cost * sum_util / scale_cpu;
+}
+
+/**
+ * em_fd_nr_cap_states() - Get the number of capacity states of a freq. domain
+ * @fd		: frequency domain for which this must be done
+ *
+ * Return: the number of capacity states in the frequency domain table
+ */
+static inline int em_fd_nr_cap_states(struct em_freq_domain *fd)
+{
+	return fd->nr_cap_states;
+}
+
+#else
+struct em_freq_domain {};
+struct em_data_callback {};
+#define EM_DATA_CB(_active_power_cb) { }
+
+static inline int em_register_freq_domain(cpumask_t *span,
+			unsigned int nr_states, struct em_data_callback *cb)
+{
+	return -EINVAL;
+}
+static inline struct em_freq_domain *em_cpu_get(int cpu)
+{
+	return NULL;
+}
+static inline unsigned long em_fd_energy(struct em_freq_domain *fd,
+			unsigned long max_util, unsigned long sum_util)
+{
+	return 0;
+}
+static inline int em_fd_nr_cap_states(struct em_freq_domain *fd)
+{
+	return 0;
+}
+#endif
+
+#endif
diff --git a/kernel/power/Kconfig b/kernel/power/Kconfig
index e880ca22c5a5..6f6db452dc7d 100644
--- a/kernel/power/Kconfig
+++ b/kernel/power/Kconfig
@@ -297,3 +297,18 @@ config PM_GENERIC_DOMAINS_OF
 
 config CPU_PM
 	bool
+
+config ENERGY_MODEL
+	bool "Energy Model for CPUs"
+	depends on SMP
+	depends on CPU_FREQ
+	default n
+	help
+	  Several subsystems (thermal and/or the task scheduler for example)
+	  can leverage information about the energy consumed by CPUs to make
+	  smarter decisions. This config option enables the framework from
+	  which subsystems can access the energy models.
+
+	  The exact usage of the energy model is subsystem-dependent.
+
+	  If in doubt, say N.
diff --git a/kernel/power/Makefile b/kernel/power/Makefile
index a3f79f0eef36..e7e47d9be1e5 100644
--- a/kernel/power/Makefile
+++ b/kernel/power/Makefile
@@ -15,3 +15,5 @@ obj-$(CONFIG_PM_AUTOSLEEP)	+= autosleep.o
 obj-$(CONFIG_PM_WAKELOCKS)	+= wakelock.o
 
 obj-$(CONFIG_MAGIC_SYSRQ)	+= poweroff.o
+
+obj-$(CONFIG_ENERGY_MODEL)	+= energy_model.o
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
new file mode 100644
index 000000000000..39740fe728ea
--- /dev/null
+++ b/kernel/power/energy_model.c
@@ -0,0 +1,199 @@
+// SPDX-License-Identifier: GPL-2.0
+/*
+ * Energy Model of CPUs
+ *
+ * Copyright (c) 2018, Arm ltd.
+ * Written by: Quentin Perret, Arm ltd.
+ */
+
+#define pr_fmt(fmt) "energy_model: " fmt
+
+#include <linux/cpu.h>
+#include <linux/cpumask.h>
+#include <linux/energy_model.h>
+#include <linux/sched/topology.h>
+#include <linux/slab.h>
+
+/* Mapping of each CPU to the frequency domain to which it belongs. */
+static DEFINE_PER_CPU(struct em_freq_domain *, em_data);
+
+/*
+ * Mutex serializing the registrations of frequency domains and letting
+ * callbacks defined by drivers sleep.
+ */
+static DEFINE_MUTEX(em_fd_mutex);
+
+static struct em_freq_domain *em_create_fd(cpumask_t *span, int nr_states,
+						struct em_data_callback *cb)
+{
+	unsigned long opp_eff, prev_opp_eff = ULONG_MAX;
+	unsigned long power, freq, prev_freq = 0;
+	int i, ret, cpu = cpumask_first(span);
+	struct em_cap_state *table;
+	struct em_freq_domain *fd;
+	u64 fmax;
+
+	if (!cb->active_power)
+		return NULL;
+
+	fd = kzalloc(sizeof(*fd) + cpumask_size(), GFP_KERNEL);
+	if (!fd)
+		return NULL;
+
+	table = kcalloc(nr_states, sizeof(*table), GFP_KERNEL);
+	if (!table)
+		goto free_fd;
+
+	/* Build the list of capacity states for this frequency domain */
+	for (i = 0, freq = 0; i < nr_states; i++, freq++) {
+		/*
+		 * active_power() is a driver callback which ceils 'freq' to
+		 * the lowest capacity state of 'cpu' above 'freq' and updates
+		 * 'power' and 'freq' accordingly.
+		 */
+		ret = cb->active_power(&power, &freq, cpu);
+		if (ret) {
+			pr_err("fd%d: invalid cap. state: %d\n", cpu, ret);
+			goto free_cs_table;
+		}
+
+		/*
+		 * We expect the driver callback to increase the frequency for
+		 * higher capacity states.
+		 */
+		if (freq <= prev_freq) {
+			pr_err("fd%d: non-increasing freq: %lu\n", cpu, freq);
+			goto free_cs_table;
+		}
+
+		/*
+		 * The power returned by active_power() is expected to be in
+		 * milli-watts and to fit into 16 bits.
+		 */
+		if (!power || power > EM_CPU_MAX_POWER) {
+			pr_err("fd%d: power out of scale: %lu\n", cpu, power);
+			goto free_cs_table;
+		}
+
+		table[i].power = power;
+		table[i].frequency = prev_freq = freq;
+
+		/*
+		 * The hertz/watts efficiency ratio should decrease as the
+		 * frequency grows on sane platforms. But this isn't always
+		 * true in practice so warn the user if a higher OPP is more
+		 * power efficient than a lower one.
+		 */
+		opp_eff = freq / power;
+		if (opp_eff >= prev_opp_eff)
+			pr_warn("fd%d: hertz/watts ratio non-monotonically decreasing: OPP%d >= OPP%d\n",
+					cpu, i, i - 1);
+		prev_opp_eff = opp_eff;
+	}
+
+	/* Compute the cost of each capacity_state. */
+	fmax = (u64) table[nr_states - 1].frequency;
+	for (i = 0; i < nr_states; i++) {
+		table[i].cost = div64_u64(fmax * table[i].power,
+					  table[i].frequency);
+	}
+
+	fd->table = table;
+	fd->nr_cap_states = nr_states;
+	cpumask_copy(to_cpumask(fd->cpus), span);
+
+	return fd;
+
+free_cs_table:
+	kfree(table);
+free_fd:
+	kfree(fd);
+
+	return NULL;
+}
+
+/**
+ * em_cpu_get() - Return the frequency domain for a CPU
+ * @cpu : CPU to find the frequency domain for
+ *
+ * Return: the frequency domain to which 'cpu' belongs, or NULL if it doesn't
+ * exist.
+ */
+struct em_freq_domain *em_cpu_get(int cpu)
+{
+	return READ_ONCE(per_cpu(em_data, cpu));
+}
+EXPORT_SYMBOL_GPL(em_cpu_get);
+
+/**
+ * em_register_freq_domain() - Register the Energy Model of a frequency domain
+ * @span	: Mask of CPUs in the frequency domain
+ * @nr_states	: Number of capacity states to register
+ * @cb		: Callback functions providing the data of the Energy Model
+ *
+ * Create Energy Model tables for a frequency domain using the callbacks
+ * defined in cb.
+ *
+ * If multiple clients register the same frequency domain, all but the first
+ * registration will be ignored.
+ *
+ * Return 0 on success
+ */
+int em_register_freq_domain(cpumask_t *span, unsigned int nr_states,
+						struct em_data_callback *cb)
+{
+	unsigned long cap, prev_cap = 0;
+	struct em_freq_domain *fd;
+	int cpu, ret = 0;
+
+	if (!span || !nr_states || !cb)
+		return -EINVAL;
+
+	/*
+	 * Use a mutex to serialize the registration of frequency domains and
+	 * let the driver-defined callback functions sleep.
+	 */
+	mutex_lock(&em_fd_mutex);
+
+	for_each_cpu(cpu, span) {
+		/* Make sure we don't register again an existing domain. */
+		if (READ_ONCE(per_cpu(em_data, cpu))) {
+			ret = -EEXIST;
+			goto unlock;
+		}
+
+		/*
+		 * All CPUs of a domain must have the same micro-architecture
+		 * since they all share the same table.
+		 */
+		cap = arch_scale_cpu_capacity(NULL, cpu);
+		if (prev_cap && prev_cap != cap) {
+			pr_err("CPUs of %*pbl must have the same capacity\n",
+							cpumask_pr_args(span));
+			ret = -EINVAL;
+			goto unlock;
+		}
+		prev_cap = cap;
+	}
+
+	/* Create the frequency domain and add it to the Energy Model. */
+	fd = em_create_fd(span, nr_states, cb);
+	if (!fd) {
+		ret = -EINVAL;
+		goto unlock;
+	}
+
+	for_each_cpu(cpu, span) {
+		/*
+		 * The per-cpu array can be concurrently accessed from
+		 * em_cpu_get().
+		 */
+		smp_store_release(per_cpu_ptr(&em_data, cpu), fd);
+	}
+
+	pr_debug("Created freq domain %*pbl\n", cpumask_pr_args(span));
+unlock:
+	mutex_unlock(&em_fd_mutex);
+	return ret;
+}
+EXPORT_SYMBOL_GPL(em_register_freq_domain);
-- 
2.18.0
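
The cost computation at the end of em_create_fd() above can be
reproduced in user space. The following is only a minimal sketch under
stated assumptions: 'struct cap_state' and compute_costs() are
hypothetical stand-ins for struct em_cap_state and the kernel loop, the
table is assumed to be sorted by ascending frequency, and plain 64-bit
division replaces div64_u64().

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical user-space mirror of the patch's struct em_cap_state. */
struct cap_state {
	unsigned long frequency;	/* any unit; only ratios matter here */
	unsigned long power;		/* milli-watts, as in the patch */
	unsigned long cost;		/* filled in by compute_costs() */
};

/*
 * cost = fmax * power / frequency, where fmax is the frequency of the
 * last (highest) capacity state of the table.
 */
static void compute_costs(struct cap_state *table, size_t nr_states)
{
	uint64_t fmax;
	size_t i;

	assert(table && nr_states > 0);

	fmax = table[nr_states - 1].frequency;
	for (i = 0; i < nr_states; i++) {
		/* A zero frequency would divide by zero below. */
		assert(table[i].frequency > 0);
		table[i].cost = (unsigned long)(fmax * table[i].power /
						table[i].frequency);
	}
}
```

With an illustrative three-state table of (500000, 100), (1000000, 300)
and (2000000, 1000), the costs come out as 400, 600 and 1000: the cost
of the highest state equals its power, and lower states are scaled up by
fmax / frequency.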


* [PATCH v5 04/14] PM / EM: Expose the Energy Model in sysfs
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
                   ` (2 preceding siblings ...)
  2018-07-24 12:25 ` [PATCH v5 03/14] PM: Introduce an Energy Model management framework Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  2018-07-24 12:25 ` [PATCH v5 05/14] sched/topology: Reference the Energy Model of CPUs when available Quentin Perret
                   ` (9 subsequent siblings)
  13 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

Expose the Energy Model (read-only) of all frequency domains in sysfs
for convenience. To do so, add a kobject to the CPU subsystem and attach
a kobject for each frequency domain under it.

For example, on a platform with two frequency domains, the resulting
hierarchy is as follows:

   /sys/devices/system/cpu/energy_model
   ├── fd0
   │   ├── cost
   │   ├── cpus
   │   ├── frequency
   │   └── power
   └── fd4
       ├── cost
       ├── cpus
       ├── frequency
       └── power

In this implementation, the kobject abstraction is only used as a
convenient way of exposing data to sysfs. However, it could also be
used in the future to allocate and release frequency domains in a more
dynamic way using reference counting.
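
Each of the 'power', 'frequency' and 'cost' files holds one value per
capacity state, separated by spaces and terminated by a newline. The
formatting can be sketched in user space as follows. This is only an
approximation under stated assumptions: format_table_attr() is a
hypothetical helper, ATTR_LEN mirrors EM_ATTR_LEN, and snprintf() plus a
clamp stands in for the kernel's scnprintf().

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

#define ATTR_LEN 13	/* mirrors EM_ATTR_LEN in the patch */

/*
 * Format 'n' values as "v0 v1 ... \n" into 'buf' of 'size' bytes and
 * return the number of characters written, excluding the final NUL.
 */
static size_t format_table_attr(const unsigned long *vals, size_t n,
				char *buf, size_t size)
{
	size_t cnt = 0, i;

	assert(buf && size >= 2);

	for (i = 0; i < n; i++) {
		int len;

		/* Keep room for one full entry, the newline and the NUL. */
		if (cnt + ATTR_LEN + 2 > size)
			break;

		len = snprintf(buf + cnt, ATTR_LEN + 1, "%lu ", vals[i]);
		if (len < 0)
			break;
		/* snprintf() returns the untruncated length: clamp it. */
		if (len > ATTR_LEN)
			len = ATTR_LEN;
		cnt += (size_t)len;
	}

	buf[cnt++] = '\n';
	buf[cnt] = '\0';
	return cnt;
}
```

For three illustrative frequencies 500000, 1000000 and 2000000, the
buffer holds "500000 1000000 2000000 " followed by a newline. Note the
trailing space before the newline, which matches the "%lu " format used
by the show_table_attr() macro.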

Cc: Peter Zijlstra <peterz@infradead.org>
Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 include/linux/energy_model.h |  1 +
 kernel/power/energy_model.c  | 90 ++++++++++++++++++++++++++++++++++++
 2 files changed, 91 insertions(+)

diff --git a/include/linux/energy_model.h b/include/linux/energy_model.h
index be822ce05c17..4c3a98bd1df4 100644
--- a/include/linux/energy_model.h
+++ b/include/linux/energy_model.h
@@ -19,6 +19,7 @@ struct em_cap_state {
 struct em_freq_domain {
 	struct em_cap_state *table; /* Capacity states, in ascending order. */
 	int nr_cap_states;
+	struct kobject kobj;
 	unsigned long cpus[0]; /* CPUs of the frequency domain. */
 };
 
diff --git a/kernel/power/energy_model.c b/kernel/power/energy_model.c
index 39740fe728ea..aa6c0bc5e784 100644
--- a/kernel/power/energy_model.c
+++ b/kernel/power/energy_model.c
@@ -23,6 +23,82 @@ static DEFINE_PER_CPU(struct em_freq_domain *, em_data);
  */
 static DEFINE_MUTEX(em_fd_mutex);
 
+static struct kobject *em_kobject;
+
+/* Getters for the attributes of em_freq_domain objects */
+struct em_fd_attr {
+	struct attribute attr;
+	ssize_t (*show)(struct em_freq_domain *fd, char *buf);
+	ssize_t (*store)(struct em_freq_domain *fd, const char *buf, size_t s);
+};
+
+#define EM_ATTR_LEN 13
+#define show_table_attr(_attr) \
+static ssize_t show_##_attr(struct em_freq_domain *fd, char *buf) \
+{ \
+	ssize_t cnt = 0; \
+	int i; \
+	for (i = 0; i < fd->nr_cap_states; i++) { \
+		if (cnt >= (ssize_t) (PAGE_SIZE / sizeof(char) \
+				      - (EM_ATTR_LEN + 2))) \
+			goto out; \
+		cnt += scnprintf(&buf[cnt], EM_ATTR_LEN + 1, "%lu ", \
+				 fd->table[i]._attr); \
+	} \
+out: \
+	cnt += sprintf(&buf[cnt], "\n"); \
+	return cnt; \
+}
+
+show_table_attr(power);
+show_table_attr(frequency);
+show_table_attr(cost);
+
+static ssize_t show_cpus(struct em_freq_domain *fd, char *buf)
+{
+	return sprintf(buf, "%*pbl\n", cpumask_pr_args(to_cpumask(fd->cpus)));
+}
+
+#define fd_attr(_name) em_fd_##_name##_attr
+#define define_fd_attr(_name) static struct em_fd_attr fd_attr(_name) = \
+		__ATTR(_name, 0444, show_##_name, NULL)
+
+define_fd_attr(power);
+define_fd_attr(frequency);
+define_fd_attr(cost);
+define_fd_attr(cpus);
+
+static struct attribute *em_fd_default_attrs[] = {
+	&fd_attr(power).attr,
+	&fd_attr(frequency).attr,
+	&fd_attr(cost).attr,
+	&fd_attr(cpus).attr,
+	NULL
+};
+
+#define to_fd(k) container_of(k, struct em_freq_domain, kobj)
+#define to_fd_attr(a) container_of(a, struct em_fd_attr, attr)
+
+static ssize_t show(struct kobject *kobj, struct attribute *attr, char *buf)
+{
+	struct em_freq_domain *fd = to_fd(kobj);
+	struct em_fd_attr *fd_attr = to_fd_attr(attr);
+	ssize_t ret;
+
+	ret = fd_attr->show(fd, buf);
+
+	return ret;
+}
+
+static const struct sysfs_ops em_fd_sysfs_ops = {
+	.show	= show,
+};
+
+static struct kobj_type ktype_em_fd = {
+	.sysfs_ops	= &em_fd_sysfs_ops,
+	.default_attrs	= em_fd_default_attrs,
+};
+
 static struct em_freq_domain *em_create_fd(cpumask_t *span, int nr_states,
 						struct em_data_callback *cb)
 {
@@ -102,6 +178,11 @@ static struct em_freq_domain *em_create_fd(cpumask_t *span, int nr_states,
 	fd->nr_cap_states = nr_states;
 	cpumask_copy(to_cpumask(fd->cpus), span);
 
+	ret = kobject_init_and_add(&fd->kobj, &ktype_em_fd, em_kobject,
+				   "fd%u", cpu);
+	if (ret)
+		pr_err("fd%d: failed kobject_init_and_add(): %d\n", cpu, ret);
+
 	return fd;
 
 free_cs_table:
@@ -155,6 +236,15 @@ int em_register_freq_domain(cpumask_t *span, unsigned int nr_states,
 	 */
 	mutex_lock(&em_fd_mutex);
 
+	if (!em_kobject) {
+		em_kobject = kobject_create_and_add("energy_model",
+						&cpu_subsys.dev_root->kobj);
+		if (!em_kobject) {
+			ret = -ENODEV;
+			goto unlock;
+		}
+	}
+
 	for_each_cpu(cpu, span) {
 		/* Make sure we don't register again an existing domain. */
 		if (READ_ONCE(per_cpu(em_data, cpu))) {
-- 
2.18.0


* [PATCH v5 05/14] sched/topology: Reference the Energy Model of CPUs when available
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
                   ` (3 preceding siblings ...)
  2018-07-24 12:25 ` [PATCH v5 04/14] PM / EM: Expose the Energy Model in sysfs Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  2018-07-24 12:25 ` [PATCH v5 06/14] sched/topology: Lowest energy aware balancing sched_domain level pointer Quentin Perret
                   ` (8 subsequent siblings)
  13 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

The existing scheduling domain hierarchy is defined to map to the cache
topology of the system. However, Energy Aware Scheduling (EAS) requires
more knowledge about the platform, and specifically needs to know about
the span of Frequency Domains (FD), which do not always align with
caches.

To address this issue, use the Energy Model (EM) of the system to extend
the scheduler topology code with a representation of the FDs, alongside
the scheduling domains. More specifically, a linked list of FDs is
attached to each root domain. When multiple root domains are in use,
each list contains only the FDs covering the CPUs of its root domain. If
an FD spans CPUs of two different root domains, it will be
duplicated in both lists.

The lists are fully maintained by the scheduler from
partition_sched_domains() in order to cope with hotplug and cpuset
changes. As for scheduling domains, the lists are protected by RCU to
ensure safe concurrent updates.
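
The list construction described above can be sketched in user space as
follows. This is a simplified illustration, not the scheduler code:
CPU masks are plain 64-bit bitmasks, 'em_table' is a hypothetical
stand-in for em_cpu_get(), and RCU publication and deferred freeing are
left out entirely.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define NR_CPUS 8

/* Stand-in for struct em_freq_domain: only the CPU span, as a bitmask. */
struct em_fd {
	uint64_t cpus;
};

/* Stand-in for the patch's struct freq_domain (without the rcu_head). */
struct fd_node {
	const struct em_fd *obj;
	struct fd_node *next;
};

/* Hypothetical replacement for em_cpu_get(): CPU -> frequency domain. */
static const struct em_fd *em_table[NR_CPUS];

static void free_list(struct fd_node *fd)
{
	while (fd) {
		struct fd_node *tmp = fd->next;

		free(fd);
		fd = tmp;
	}
}

/* Return the list entry whose span contains 'cpu', or NULL. */
static struct fd_node *find_fd(struct fd_node *fd, int cpu)
{
	assert(cpu >= 0 && cpu < NR_CPUS);

	for (; fd; fd = fd->next)
		if (fd->obj->cpus & (UINT64_C(1) << cpu))
			return fd;
	return NULL;
}

static int list_len(const struct fd_node *fd)
{
	int n = 0;

	for (; fd; fd = fd->next)
		n++;
	return n;
}

/* Build the list for a root domain spanning 'cpu_map'; NULL on failure. */
static struct fd_node *build_list(uint64_t cpu_map)
{
	struct fd_node *fd = NULL, *tmp;
	int cpu;

	for (cpu = 0; cpu < NR_CPUS; cpu++) {
		if (!(cpu_map & (UINT64_C(1) << cpu)))
			continue;
		/* Skip CPUs already covered by an entry of the list. */
		if (find_fd(fd, cpu))
			continue;
		/* No Energy Model for this CPU: give up on the whole list. */
		if (!em_table[cpu])
			goto fail;
		tmp = calloc(1, sizeof(*tmp));
		if (!tmp)
			goto fail;
		tmp->obj = em_table[cpu];
		tmp->next = fd;		/* prepend, as in the patch */
		fd = tmp;
	}
	return fd;

fail:
	free_list(fd);
	return NULL;
}
```

With two illustrative domains covering CPUs 0-3 and 4-7, a root domain
spanning CPUs 2-5 gets a list with both entries, which corresponds to
the duplication case mentioned above.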

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/sched.h    |  23 +++++++
 kernel/sched/topology.c | 139 ++++++++++++++++++++++++++++++++++++++--
 2 files changed, 158 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 2a72f1b9be0f..fdf6924d53e7 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -44,6 +44,7 @@
 #include <linux/ctype.h>
 #include <linux/debugfs.h>
 #include <linux/delayacct.h>
+#include <linux/energy_model.h>
 #include <linux/init_task.h>
 #include <linux/kprobes.h>
 #include <linux/kthread.h>
@@ -700,6 +701,12 @@ static inline bool sched_asym_prefer(int a, int b)
 	return arch_asym_cpu_priority(a) > arch_asym_cpu_priority(b);
 }
 
+struct freq_domain {
+	struct em_freq_domain *obj;
+	struct freq_domain *next;
+	struct rcu_head rcu;
+};
+
 /*
  * We add the notion of a root-domain which will be used to define per-domain
  * variables. Each exclusive cpuset essentially defines an island domain by
@@ -748,6 +755,14 @@ struct root_domain {
 	struct cpupri		cpupri;
 
 	unsigned long		max_cpu_capacity;
+
+#ifdef CONFIG_ENERGY_MODEL
+	/*
+	 * NULL-terminated list of frequency domains intersecting with the
+	 * CPUs of the rd. Protected by RCU.
+	 */
+	struct freq_domain *fd;
+#endif
 };
 
 extern struct root_domain def_root_domain;
@@ -2203,3 +2218,11 @@ static inline unsigned long cpu_util_irq(struct rq *rq)
 
 #endif
 #endif
+
+#ifdef CONFIG_SMP
+#ifdef CONFIG_ENERGY_MODEL
+#define freq_domain_span(fd) (to_cpumask(((fd)->obj->cpus)))
+#else
+#define freq_domain_span(fd) NULL
+#endif
+#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 05a831427bc7..ade1eae9d21b 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -201,6 +201,121 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 	return 1;
 }
 
+#ifdef CONFIG_ENERGY_MODEL
+static void free_fd(struct freq_domain *fd)
+{
+	struct freq_domain *tmp;
+
+	while (fd) {
+		tmp = fd->next;
+		kfree(fd);
+		fd = tmp;
+	}
+}
+
+static void free_rd_fd(struct root_domain *rd)
+{
+	free_fd(rd->fd);
+}
+
+static struct freq_domain *find_fd(struct freq_domain *fd, int cpu)
+{
+	while (fd) {
+		if (cpumask_test_cpu(cpu, freq_domain_span(fd)))
+			return fd;
+		fd = fd->next;
+	}
+
+	return NULL;
+}
+
+static struct freq_domain *fd_init(int cpu)
+{
+	struct em_freq_domain *obj = em_cpu_get(cpu);
+	struct freq_domain *fd;
+
+	if (!obj) {
+		if (sched_debug())
+			pr_info("%s: no EM found for CPU%d\n", __func__, cpu);
+		return NULL;
+	}
+
+	fd = kzalloc(sizeof(*fd), GFP_KERNEL);
+	if (!fd)
+		return NULL;
+	fd->obj = obj;
+
+	return fd;
+}
+
+static void freq_domain_debug(const struct cpumask *cpu_map,
+						struct freq_domain *fd)
+{
+	if (!sched_debug() || !fd)
+		return;
+
+	printk(KERN_DEBUG "root_domain %*pbl: fd:", cpumask_pr_args(cpu_map));
+
+	while (fd) {
+		printk(KERN_CONT " { fd%d cpus=%*pbl nr_cstate=%d }",
+				cpumask_first(freq_domain_span(fd)),
+				cpumask_pr_args(freq_domain_span(fd)),
+				em_fd_nr_cap_states(fd->obj));
+		fd = fd->next;
+	}
+
+	printk(KERN_CONT "\n");
+}
+
+static void destroy_freq_domain_rcu(struct rcu_head *rp)
+{
+	struct freq_domain *fd;
+
+	fd = container_of(rp, struct freq_domain, rcu);
+	free_fd(fd);
+}
+
+static void build_freq_domains(const struct cpumask *cpu_map)
+{
+	struct freq_domain *fd = NULL, *tmp;
+	int cpu = cpumask_first(cpu_map);
+	struct root_domain *rd = cpu_rq(cpu)->rd;
+	int i;
+
+	for_each_cpu(i, cpu_map) {
+		/* Skip already covered CPUs. */
+		if (find_fd(fd, i))
+			continue;
+
+		/* Create the new fd and add it to the local list. */
+		tmp = fd_init(i);
+		if (!tmp)
+			goto free;
+		tmp->next = fd;
+		fd = tmp;
+	}
+
+	freq_domain_debug(cpu_map, fd);
+
+	/* Attach the new list of frequency domains to the root domain. */
+	tmp = rd->fd;
+	rcu_assign_pointer(rd->fd, fd);
+	if (tmp)
+		call_rcu(&tmp->rcu, destroy_freq_domain_rcu);
+
+	return;
+
+free:
+	free_fd(fd);
+	tmp = rd->fd;
+	rcu_assign_pointer(rd->fd, NULL);
+	if (tmp)
+		call_rcu(&tmp->rcu, destroy_freq_domain_rcu);
+}
+#else
+static void free_rd_fd(struct root_domain *rd) { }
+#endif
+
 static void free_rootdomain(struct rcu_head *rcu)
 {
 	struct root_domain *rd = container_of(rcu, struct root_domain, rcu);
@@ -211,6 +326,7 @@ static void free_rootdomain(struct rcu_head *rcu)
 	free_cpumask_var(rd->rto_mask);
 	free_cpumask_var(rd->online);
 	free_cpumask_var(rd->span);
+	free_rd_fd(rd);
 	kfree(rd);
 }
 
@@ -1882,8 +1998,8 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	/* Destroy deleted domains: */
 	for (i = 0; i < ndoms_cur; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
-			if (cpumask_equal(doms_cur[i], doms_new[j])
-			    && dattrs_equal(dattr_cur, i, dattr_new, j))
+			if (cpumask_equal(doms_cur[i], doms_new[j]) &&
+			    dattrs_equal(dattr_cur, i, dattr_new, j))
 				goto match1;
 		}
 		/* No match - a current sched domain not in new doms_new[] */
@@ -1903,8 +2019,8 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 	/* Build new domains: */
 	for (i = 0; i < ndoms_new; i++) {
 		for (j = 0; j < n && !new_topology; j++) {
-			if (cpumask_equal(doms_new[i], doms_cur[j])
-			    && dattrs_equal(dattr_new, i, dattr_cur, j))
+			if (cpumask_equal(doms_new[i], doms_cur[j]) &&
+			    dattrs_equal(dattr_new, i, dattr_cur, j))
 				goto match2;
 		}
 		/* No match - add a new doms_new */
@@ -1913,6 +2029,21 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 		;
 	}
 
+#ifdef CONFIG_ENERGY_MODEL
+	/* Build freq domains: */
+	for (i = 0; i < ndoms_new; i++) {
+		for (j = 0; j < n; j++) {
+			if (cpumask_equal(doms_new[i], doms_cur[j]) &&
+			    cpu_rq(cpumask_first(doms_cur[j]))->rd->fd)
+				goto match3;
+		}
+		/* No match - add freq domains for a new rd */
+		build_freq_domains(doms_new[i]);
+match3:
+		;
+	}
+#endif
+
 	/* Remember the new sched domains: */
 	if (doms_cur != &fallback_doms)
 		free_sched_domains(doms_cur, ndoms_cur);
-- 
2.18.0


* [PATCH v5 06/14] sched/topology: Lowest energy aware balancing sched_domain level pointer
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
                   ` (4 preceding siblings ...)
  2018-07-24 12:25 ` [PATCH v5 05/14] sched/topology: Reference the Energy Model of CPUs when available Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  2018-07-26 16:00   ` Valentin Schneider
  2018-07-24 12:25 ` [PATCH v5 07/14] sched/topology: Introduce sched_energy_present static key Quentin Perret
                   ` (7 subsequent siblings)
  13 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

Add another member to the family of per-cpu sched_domain shortcut
pointers. This one, sd_ea, points to the lowest level at which energy
aware scheduling should be used.

Generally speaking, the largest opportunity to save energy via scheduling
comes from a smarter exploitation of heterogeneous platforms (e.g.
big.LITTLE). Consequently, the sd_ea shortcut is wired to the lowest
scheduling domain at which the SD_ASYM_CPUCAPACITY flag is set. For
example, it is possible to apply Energy-Aware Scheduling within a socket
on a multi-socket system, as long as each socket has an asymmetric
topology. Cross-socket wake-up balancing will only happen when the
system is over-utilized, or this_cpu and prev_cpu are in different
sockets.
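
The lookup of the lowest flagged level can be illustrated with a small
user-space sketch. 'struct domain' is a hypothetical, stripped-down
stand-in for struct sched_domain, and the flag value below is chosen for
illustration only; it is not the kernel's definition.

```c
#include <assert.h>
#include <stddef.h>

#define SD_ASYM_CPUCAPACITY 0x1	/* illustrative value only */

/* Stripped-down stand-in for struct sched_domain. */
struct domain {
	unsigned int flags;
	struct domain *parent;	/* next level up, NULL at the top */
};

/*
 * Walk from the lowest level upwards and return the first domain with
 * 'flag' set, or NULL if no level has it.
 */
static struct domain *lowest_flag_domain(struct domain *sd, unsigned int flag)
{
	assert(flag != 0);

	for (; sd; sd = sd->parent)
		if (sd->flags & flag)
			break;
	return sd;
}
```

On a three-level hierarchy where only the two upper levels carry the
flag, the walk stops at the middle level. On a fully symmetric
hierarchy it returns NULL, in which case sd_ea stays unset.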

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/sched.h    | 1 +
 kernel/sched/topology.c | 4 ++++
 2 files changed, 5 insertions(+)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index fdf6924d53e7..25d64a0b6fe0 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -1198,6 +1198,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
 DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DECLARE_PER_CPU(struct sched_domain *, sd_numa);
 DECLARE_PER_CPU(struct sched_domain *, sd_asym);
+DECLARE_PER_CPU(struct sched_domain *, sd_ea);
 
 struct sched_group_capacity {
 	atomic_t		ref;
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index ade1eae9d21b..8f3f746b0d5e 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -514,6 +514,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
 DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
 DEFINE_PER_CPU(struct sched_domain *, sd_numa);
 DEFINE_PER_CPU(struct sched_domain *, sd_asym);
+DEFINE_PER_CPU(struct sched_domain *, sd_ea);
 
 static void update_top_cache_domain(int cpu)
 {
@@ -539,6 +540,9 @@ static void update_top_cache_domain(int cpu)
 
 	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
 	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
+
+	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
+	rcu_assign_pointer(per_cpu(sd_ea, cpu), sd);
 }
 
 /*
-- 
2.18.0


* [PATCH v5 07/14] sched/topology: Introduce sched_energy_present static key
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
                   ` (5 preceding siblings ...)
  2018-07-24 12:25 ` [PATCH v5 06/14] sched/topology: Lowest energy aware balancing sched_domain level pointer Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  2018-07-24 12:25 ` [PATCH v5 08/14] sched/fair: Clean-up update_sg_lb_stats parameters Quentin Perret
                   ` (6 subsequent siblings)
  13 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

In order to ensure a minimal performance impact on non-energy-aware
systems, introduce a static_key guarding the access to Energy-Aware
Scheduling (EAS) code.

The static key is set iff all the following conditions are met for at
least one root domain:
  1. all online CPUs of the root domain are covered by the Energy
     Model (EM);
  2. the complexity of the root domain's EM is low enough to keep
     scheduling overheads low;
  3. the root domain has an asymmetric CPU capacity topology (detected
     by looking for the SD_ASYM_CPUCAPACITY flag in the sched_domain
     hierarchy).

The static key is checked in the rd_freq_domain() function which returns
the frequency domains of a root domain when they are available. As EAS
cannot be enabled with CONFIG_ENERGY_MODEL=n, rd_freq_domain() is
stubbed to 'NULL' to let the compiler remove the unused EAS code by
constant propagation.
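
The complexity bound behind condition 2 can be checked by hand. The
sketch below mirrors the test added to build_freq_domains() by this
patch; em_too_complex() is a hypothetical helper name used only for
illustration.

```c
#include <assert.h>

#define EM_MAX_COMPLEXITY 2048	/* same value as in the patch */

/*
 * nr_fd:   number of frequency domains of the root domain
 * nr_cs:   sum of the capacity state counts of all frequency domains
 * nr_cpus: number of CPUs of the root domain
 *
 * Returns 1 if the Energy Model is considered too complex, 0 otherwise.
 */
static int em_too_complex(int nr_fd, int nr_cs, int nr_cpus)
{
	assert(nr_fd >= 0 && nr_cs >= 0 && nr_cpus >= 0);

	return nr_fd * (nr_cs + nr_cpus) > EM_MAX_COMPLEXITY;
}
```

For 16 CPUs with per-CPU DVFS and 7 capacity states each, the
complexity is 16 * (112 + 16) = 2048, which is still accepted. With 8
states each it becomes 16 * (128 + 16) = 2304 and EAS is not enabled
for that root domain.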

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/sched.h    | 17 ++++++++++
 kernel/sched/topology.c | 73 ++++++++++++++++++++++++++++++++++++++++-
 2 files changed, 89 insertions(+), 1 deletion(-)

diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 25d64a0b6fe0..a317457804dd 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2222,8 +2222,25 @@ static inline unsigned long cpu_util_irq(struct rq *rq)
 
 #ifdef CONFIG_SMP
 #ifdef CONFIG_ENERGY_MODEL
+extern struct static_key_false sched_energy_present;
+/**
+ * rd_freq_domain - Get the frequency domains of a root domain.
+ *
+ * Must be called from a RCU read-side critical section.
+ */
+static inline struct freq_domain *rd_freq_domain(struct root_domain *rd)
+{
+	if (!static_branch_unlikely(&sched_energy_present))
+		return NULL;
+
+	return rcu_dereference(rd->fd);
+}
 #define freq_domain_span(fd) (to_cpumask(((fd)->obj->cpus)))
 #else
+static inline struct freq_domain *rd_freq_domain(struct root_domain *rd)
+{
+	return NULL;
+}
 #define freq_domain_span(fd) NULL
 #endif
 #endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index 8f3f746b0d5e..483bb7bf7af6 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -275,12 +275,32 @@ static void destroy_freq_domain_rcu(struct rcu_head *rp)
 	free_fd(fd);
 }
 
+/*
+ * The complexity of the Energy Model is defined as: nr_fd * (nr_cpus + nr_cs)
+ * with: 'nr_fd' the number of frequency domains; 'nr_cpus' the number of CPUs;
+ * and 'nr_cs' the sum of the capacity states numbers of all frequency domains.
+ *
+ * It is generally not a good idea to use such a model in the wake-up path on
+ * very complex platforms because of the associated scheduling overheads. The
+ * arbitrary constraint below prevents that. It makes EAS usable up to 16 CPUs
+ * with per-CPU DVFS and fewer than 8 capacity states each, for example.
+ */
+#define EM_MAX_COMPLEXITY 2048
+
+
 static void build_freq_domains(const struct cpumask *cpu_map)
 {
+	int i, nr_fd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
 	struct freq_domain *fd = NULL, *tmp;
 	int cpu = cpumask_first(cpu_map);
 	struct root_domain *rd = cpu_rq(cpu)->rd;
-	int i;
+
+	/* EAS is enabled for asymmetric CPU capacity topologies. */
+	if (!per_cpu(sd_ea, cpu)) {
+		if (sched_debug())
+			pr_info("rd %*pbl: !sd_ea\n", cpumask_pr_args(cpu_map));
+		goto free;
+	}
 
 	for_each_cpu(i, cpu_map) {
 		/* Skip already covered CPUs. */
@@ -293,6 +313,18 @@ static void build_freq_domains(const struct cpumask *cpu_map)
 			goto free;
 		tmp->next = fd;
 		fd = tmp;
+
+		/* Count freq. doms and perf states for the complexity check. */
+		nr_fd++;
+		nr_cs += em_fd_nr_cap_states(fd->obj);
+	}
+
+	/* Bail out if the Energy Model complexity is too high. */
+	if (nr_fd * (nr_cs + nr_cpus) > EM_MAX_COMPLEXITY) {
+		if (sched_debug())
+			pr_info("rd %*pbl: EM complexity is too high\n",
+						cpumask_pr_args(cpu_map));
+		goto free;
 	}
 
 	freq_domain_debug(cpu_map, fd);
@@ -312,6 +344,44 @@ static void build_freq_domains(const struct cpumask *cpu_map)
 	if (tmp)
 		call_rcu(&tmp->rcu, destroy_freq_domain_rcu);
 }
+
+/*
+ * This static_key is set if at least one root domain meets all the following
+ * conditions:
+ *    1. all CPUs of the root domain are covered by the EM;
+ *    2. the EM complexity is low enough to keep scheduling overheads low;
+ *    3. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
+ */
+DEFINE_STATIC_KEY_FALSE(sched_energy_present);
+
+static void sched_energy_start(int ndoms_new, cpumask_var_t doms_new[])
+{
+	/*
+	 * The conditions for EAS to start are checked during the creation of
+	 * root domains. If one of them meets all conditions, it will have a
+	 * non-null list of frequency domains.
+	 */
+	while (ndoms_new) {
+		if (cpu_rq(cpumask_first(doms_new[ndoms_new - 1]))->rd->fd)
+			goto enable;
+		ndoms_new--;
+	}
+
+	if (static_branch_unlikely(&sched_energy_present)) {
+		if (sched_debug())
+			pr_info("%s: stopping EAS\n", __func__);
+		static_branch_disable_cpuslocked(&sched_energy_present);
+	}
+
+	return;
+
+enable:
+	if (!static_branch_unlikely(&sched_energy_present)) {
+		if (sched_debug())
+			pr_info("%s: starting EAS\n", __func__);
+		static_branch_enable_cpuslocked(&sched_energy_present);
+	}
+}
 #else
 static void free_rd_fd(struct root_domain *rd) { }
 #endif
@@ -2046,6 +2116,7 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 match3:
 		;
 	}
+	sched_energy_start(ndoms_new, doms_new);
 #endif
 
 	/* Remember the new sched domains: */
-- 
2.18.0


* [PATCH v5 08/14] sched/fair: Clean-up update_sg_lb_stats parameters
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
                   ` (6 preceding siblings ...)
  2018-07-24 12:25 ` [PATCH v5 07/14] sched/topology: Introduce sched_energy_present static key Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  2018-07-24 12:25 ` [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator Quentin Perret
                   ` (5 subsequent siblings)
  13 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

In preparation for the introduction of a new root domain flag which can
be set during load balance (the 'overutilized' flag), clean up the set
of parameters passed to update_sg_lb_stats(). More specifically, the
'local_group' and 'load_idx' parameters can be removed since they can
easily be reconstructed from within the function.

While at it, transform the 'overload' parameter into a flag stored in
the 'sg_status' parameter and change the root_domain's overload field to
an 'int' since sizeof(_Bool) is implementation-defined.
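
The flag-based interface can be sketched in user space as follows.
update_group_status() is a hypothetical, heavily reduced stand-in for
update_sg_lb_stats(): it only looks at the number of running tasks per
CPU and ignores all other statistics.

```c
#include <assert.h>
#include <stddef.h>

#define SG_OVERLOAD 0x1	/* More than one runnable task on a CPU. */

/*
 * Accumulate status flags for one group into '*sg_status'. Flags set
 * by earlier groups are preserved, so one variable can be shared by
 * all the groups visited during a load-balance pass.
 */
static void update_group_status(const unsigned int *nr_running,
				size_t nr_cpus, int *sg_status)
{
	size_t i;

	assert(nr_running && sg_status);

	for (i = 0; i < nr_cpus; i++)
		if (nr_running[i] > 1)
			*sg_status |= SG_OVERLOAD;
}
```

Using a bitmask rather than a 'bool *' leaves room for further status
bits, such as the 'overutilized' indicator mentioned above, without
changing the function signature again.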

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Valentin Schneider <valentin.schneider@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/fair.c  | 26 +++++++++++---------------
 kernel/sched/sched.h |  5 ++++-
 2 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index d5f7d521e448..fcc97fa3be94 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -7802,16 +7802,16 @@ static bool update_nohz_stats(struct rq *rq, bool force)
  * update_sg_lb_stats - Update sched_group's statistics for load balancing.
  * @env: The load balancing environment.
  * @group: sched_group whose statistics are to be updated.
- * @load_idx: Load index of sched_domain of this_cpu for load calc.
- * @local_group: Does group contain this_cpu.
  * @sgs: variable to hold the statistics for this group.
- * @overload: Indicate more than one runnable task for any CPU.
+ * @sg_status: Holds flag indicating the status of the sched_group
  */
 static inline void update_sg_lb_stats(struct lb_env *env,
-			struct sched_group *group, int load_idx,
-			int local_group, struct sg_lb_stats *sgs,
-			bool *overload)
+				      struct sched_group *group,
+				      struct sg_lb_stats *sgs,
+				      int *sg_status)
 {
+	int local_group = cpumask_test_cpu(env->dst_cpu, sched_group_span(group));
+	int load_idx = get_sd_load_idx(env->sd, env->idle);
 	unsigned long load;
 	int i, nr_running;
 
@@ -7835,7 +7835,7 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 
 		nr_running = rq->nr_running;
 		if (nr_running > 1)
-			*overload = true;
+			*sg_status |= SG_OVERLOAD;
 
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
@@ -7972,8 +7972,8 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 	struct sched_group *sg = env->sd->groups;
 	struct sg_lb_stats *local = &sds->local_stat;
 	struct sg_lb_stats tmp_sgs;
-	int load_idx, prefer_sibling = 0;
-	bool overload = false;
+	int prefer_sibling = 0;
+	int sg_status = 0;
 
 	if (child && child->flags & SD_PREFER_SIBLING)
 		prefer_sibling = 1;
@@ -7983,8 +7983,6 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		env->flags |= LBF_NOHZ_STATS;
 #endif
 
-	load_idx = get_sd_load_idx(env->sd, env->idle);
-
 	do {
 		struct sg_lb_stats *sgs = &tmp_sgs;
 		int local_group;
@@ -7999,8 +7997,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 				update_group_capacity(env->sd, env->dst_cpu);
 		}
 
-		update_sg_lb_stats(env, sg, load_idx, local_group, sgs,
-						&overload);
+		update_sg_lb_stats(env, sg, sgs, &sg_status);
 
 		if (local_group)
 			goto next_group;
@@ -8050,8 +8047,7 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 
 	if (!env->sd->parent) {
 		/* update overload indicator if we are at root domain */
-		if (env->dst_rq->rd->overload != overload)
-			env->dst_rq->rd->overload = overload;
+		env->dst_rq->rd->overload = sg_status & SG_OVERLOAD;
 	}
 }
 
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index a317457804dd..99b5bf245eab 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -707,6 +707,9 @@ struct freq_domain {
 	struct rcu_head rcu;
 };
 
+/* Scheduling group status flags */
+#define SG_OVERLOAD		0x1 /* More than one runnable task on a CPU. */
+
 /*
  * We add the notion of a root-domain which will be used to define per-domain
  * variables. Each exclusive cpuset essentially defines an island domain by
@@ -723,7 +726,7 @@ struct root_domain {
 	cpumask_var_t		online;
 
 	/* Indicate more than one runnable task for any CPU */
-	bool			overload;
+	int			overload;
 
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
-- 
2.18.0


* [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
                   ` (7 preceding siblings ...)
  2018-07-24 12:25 ` [PATCH v5 08/14] sched/fair: Clean-up update_sg_lb_stats parameters Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  2018-08-02 12:26   ` Peter Zijlstra
  2018-08-09  9:30   ` Vincent Guittot
  2018-07-24 12:25 ` [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method Quentin Perret
                   ` (4 subsequent siblings)
  13 siblings, 2 replies; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

From: Morten Rasmussen <morten.rasmussen@arm.com>

Energy-aware scheduling is only meant to be active while the system is
_not_ over-utilized. That is, there are spare cycles available to shift
tasks around based on their actual utilization to get a more
energy-efficient task distribution without depriving any tasks. When
above the tipping point, task placement is done the traditional way based
on load_avg, spreading the tasks across as many cpus as possible based
on priority-scaled load to preserve smp_nice. Below the tipping point we
want to use util_avg instead. We need to define a criterion for when we
make the switch.

The util_avg for each cpu converges towards 100% (1024) regardless of
how many additional tasks we may put on it. If we define
over-utilized as:

sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)

some individual cpus may be over-utilized running multiple tasks even
when the above condition is false. That should be okay as long as we try
to spread the tasks out to avoid per-cpu over-utilization as much as
possible and if all tasks have the _same_ priority. If the latter isn't
true, we have to consider priority to preserve smp_nice.

For example, we could have n_cpus nice=-10 util_avg=55% tasks and
n_cpus/2 nice=0 util_avg=60% tasks. Balancing based on util_avg we are
likely to end up with nice=-10 tasks sharing cpus and nice=0 tasks
getting their own, as we have 1.5*n_cpus tasks in total and 55%+55% is
less over-utilized than 55%+60% for those cpus that have to be shared.
The system utilization is only 85% of the system capacity, but we are
breaking smp_nice.

To be sure not to break smp_nice, we have defined over-utilization
conservatively as when any cpu in the system is fully utilized at its
highest frequency instead:

cpu_rq(any).cfs.avg.util_avg + margin > cpu_rq(any).capacity

IOW, as soon as one cpu is (nearly) 100% utilized, we switch to load_avg
to factor in priority to preserve smp_nice.
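
As an illustration only, the per-cpu condition above can be modelled in
user space as follows. This is a minimal sketch, not the kernel code:
struct cpu_sample and both helpers are hypothetical names, and the
margin value of 1280 (about 20% of headroom) is an assumption about
capacity_margin, which this message does not spell out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Assumed margin: 1280/1024 leaves roughly 20% of headroom. */
#define CAPACITY_MARGIN 1280UL

/* Hypothetical stand-in for the per-cpu values the scheduler reads. */
struct cpu_sample {
	unsigned long util;     /* utilization on a 0..1024 scale */
	unsigned long capacity; /* available capacity, same scale */
};

/* True when util, inflated by the margin, no longer fits in capacity. */
static bool sample_overutilized(const struct cpu_sample *s)
{
	return s->capacity * 1024 < s->util * CAPACITY_MARGIN;
}

/* The system counts as over-utilized as soon as any single cpu is. */
static bool system_overutilized(const struct cpu_sample *cpus, size_t n)
{
	assert(cpus != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (sample_overutilized(&cpus[i]))
			return true;
	}
	return false;
}
```

With these assumed numbers, a cpu of capacity 446 tips over once its
utilization exceeds roughly 357, even if every other cpu is mostly idle.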

With this definition, we can skip periodic load-balance as no cpu has an
always-running task when the system is not over-utilized. All tasks will
be periodic and we can balance them at wake-up. This conservative
condition does however mean that some scenarios that could benefit from
energy-aware decisions even if one cpu is fully utilized would not get
those benefits.

For systems where some cpus might have reduced capacity (RT-pressure
and/or big.LITTLE), we want periodic load-balance checks as soon as just
a single cpu is fully utilized, as it might be one of those with reduced
capacity, and in that case we want to migrate it.

cc: Ingo Molnar <mingo@redhat.com>
cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Morten Rasmussen <morten.rasmussen@arm.com>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/fair.c  | 48 ++++++++++++++++++++++++++++++++++++++++++--
 kernel/sched/sched.h |  4 ++++
 2 files changed, 50 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fcc97fa3be94..4aaa9132e840 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5043,6 +5043,24 @@ static inline void hrtick_update(struct rq *rq)
 }
 #endif
 
+#ifdef CONFIG_SMP
+static inline unsigned long cpu_util(int cpu);
+static unsigned long capacity_of(int cpu);
+
+static inline bool cpu_overutilized(int cpu)
+{
+	return (capacity_of(cpu) * 1024) < (cpu_util(cpu) * capacity_margin);
+}
+
+static inline void update_overutilized_status(struct rq *rq)
+{
+	if (!READ_ONCE(rq->rd->overutilized) && cpu_overutilized(rq->cpu))
+		WRITE_ONCE(rq->rd->overutilized, SG_OVERUTILIZED);
+}
+#else
+static inline void update_overutilized_status(struct rq *rq) { }
+#endif
+
 /*
  * The enqueue_task method is called before nr_running is
  * increased. Here we update the fair scheduling stats and
@@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
 		update_cfs_group(se);
 	}
 
-	if (!se)
+	if (!se) {
 		add_nr_running(rq, 1);
+		/*
+		 * The utilization of a new task is 'wrong' so wait for it
+		 * to build some utilization history before trying to detect
+		 * the overutilized flag.
+		 */
+		if (flags & ENQUEUE_WAKEUP)
+			update_overutilized_status(rq);
+
+	}
 
 	hrtick_update(rq);
 }
@@ -7837,6 +7864,9 @@ static inline void update_sg_lb_stats(struct lb_env *env,
 		if (nr_running > 1)
 			*sg_status |= SG_OVERLOAD;
 
+		if (cpu_overutilized(i))
+			*sg_status |= SG_OVERUTILIZED;
+
 #ifdef CONFIG_NUMA_BALANCING
 		sgs->nr_numa_running += rq->nr_numa_running;
 		sgs->nr_preferred_running += rq->nr_preferred_running;
@@ -8046,8 +8076,15 @@ static inline void update_sd_lb_stats(struct lb_env *env, struct sd_lb_stats *sd
 		env->fbq_type = fbq_classify_group(&sds->busiest_stat);
 
 	if (!env->sd->parent) {
+		struct root_domain *rd = env->dst_rq->rd;
+
 		/* update overload indicator if we are at root domain */
-		env->dst_rq->rd->overload = sg_status & SG_OVERLOAD;
+		rd->overload = sg_status & SG_OVERLOAD;
+
+		/* Update over-utilization (tipping point, U >= 0) indicator */
+		WRITE_ONCE(rd->overutilized, sg_status & SG_OVERUTILIZED);
+	} else if (sg_status & SG_OVERUTILIZED) {
+		WRITE_ONCE(env->dst_rq->rd->overutilized, SG_OVERUTILIZED);
 	}
 }
 
@@ -8267,6 +8304,11 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
 	 * this level.
 	 */
 	update_sd_lb_stats(env, &sds);
+
+	if (rd_freq_domain(env->dst_rq->rd) &&
+				!READ_ONCE(env->dst_rq->rd->overutilized))
+		goto out_balanced;
+
 	local = &sds.local_stat;
 	busiest = &sds.busiest_stat;
 
@@ -9624,6 +9666,8 @@ static void task_tick_fair(struct rq *rq, struct task_struct *curr, int queued)
 
 	if (static_branch_unlikely(&sched_numa_balancing))
 		task_tick_numa(rq, curr);
+
+	update_overutilized_status(task_rq(curr));
 }
 
 /*
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 99b5bf245eab..6d08ccd1e7a4 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -709,6 +709,7 @@ struct freq_domain {
 
 /* Scheduling group status flags */
 #define SG_OVERLOAD		0x1 /* More than one runnable task on a CPU. */
+#define SG_OVERUTILIZED		0x2 /* One or more CPUs are over-utilized. */
 
 /*
  * We add the notion of a root-domain which will be used to define per-domain
@@ -728,6 +729,9 @@ struct root_domain {
 	/* Indicate more than one runnable task for any CPU */
 	int			overload;
 
+	/* Indicate one or more cpus over-utilized (tipping point) */
+	int			overutilized;
+
 	/*
 	 * The bit corresponding to a CPU gets set here if such CPU has more
 	 * than one runnable -deadline task (as it is below for RT tasks).
-- 
2.18.0


* [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
                   ` (8 preceding siblings ...)
  2018-07-24 12:25 ` [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  2018-07-30 19:35   ` skannan
  2018-07-24 12:25 ` [PATCH v5 11/14] sched/fair: Introduce an energy estimation helper function Quentin Perret
                   ` (3 subsequent siblings)
  13 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

Schedutil aggregates the PELT signals of CFS, RT, DL and IRQ in order
to decide which frequency to request. Energy Aware Scheduling (EAS)
needs to be able to predict those requests to assess the energy impact
of scheduling decisions. However, the PELT signals aggregation is only
done in schedutil for now, hence making it hard to synchronize it with
EAS.

To address this issue, introduce schedutil_freq_util() to perform the
aforementioned aggregation and make it available to other parts of the
scheduler. Since frequency selection and energy estimation still need
to deal with RT and DL signals slightly differently, schedutil_freq_util()
is called with a different 'type' parameter in those two contexts, and
returns an aggregated utilization signal accordingly.
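
To make the difference between the two contexts concrete, here is a
simplified user-space model of the aggregation. It is a sketch under
stated assumptions rather than the kernel function: struct cpu_signals,
enum util_type and aggregate_util() are hypothetical names, and the IRQ
handling (saturating when irq >= max, then scaling the remaining
utilization by the non-IRQ share of time) is assumed, since those lines
are outside the hunks of this patch.

```c
#include <assert.h>
#include <stdbool.h>

enum util_type { FREQUENCY_UTIL, ENERGY_UTIL };

/* Hypothetical snapshot of one CPU's signals, all on a 0..max scale. */
struct cpu_signals {
	unsigned long util_cfs;
	unsigned long util_rt;
	unsigned long util_dl;  /* DL running-time estimate */
	unsigned long bw_dl;    /* DL reserved bandwidth */
	unsigned long irq;
	bool rt_runnable;
	unsigned long max;      /* CPU capacity */
};

static unsigned long aggregate_util(const struct cpu_signals *s,
				    enum util_type type)
{
	unsigned long util;

	assert(s->max > 0);

	/* Frequency selection asks for max whenever RT is runnable. */
	if (type == FREQUENCY_UTIL && s->rt_runnable)
		return s->max;

	/* Assumption: IRQ time filling the CPU saturates it. */
	if (s->irq >= s->max)
		return s->max;

	util = s->util_cfs + s->util_rt;

	if (type == FREQUENCY_UTIL) {
		/* Only test for saturation; DL is added later as bandwidth. */
		if (util + s->util_dl >= s->max)
			return s->max;
	} else {
		/* Energy estimation wants running time, so include util_dl. */
		util += s->util_dl;
		if (util >= s->max)
			return s->max;
	}

	/* Assumption: scale by the share of time not spent in IRQ. */
	util = util * (s->max - s->irq) / s->max;
	util += s->irq;

	if (type == FREQUENCY_UTIL)
		util += s->bw_dl;

	return util < s->max ? util : s->max;
}
```

For the same CPU, the frequency variant can thus return a larger value
than the energy variant, because it grants the full DL bandwidth and
jumps to max when RT tasks are runnable.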

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Suggested-by: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/cpufreq_schedutil.c | 86 +++++++++++++++++++++-----------
 kernel/sched/sched.h             | 14 ++++++
 2 files changed, 72 insertions(+), 28 deletions(-)

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index 810193c8e4cd..af86050edcf5 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -198,15 +198,15 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
  * based on the task model parameters and gives the minimal utilization
  * required to meet deadlines.
  */
-static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
+unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
+				  enum schedutil_type type)
 {
-	struct rq *rq = cpu_rq(sg_cpu->cpu);
+	struct rq *rq = cpu_rq(cpu);
 	unsigned long util, irq, max;
 
-	sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
-	sg_cpu->bw_dl = cpu_bw_dl(rq);
+	max = arch_scale_cpu_capacity(NULL, cpu);
 
-	if (rt_rq_is_runnable(&rq->rt))
+	if (type == frequency_util && rt_rq_is_runnable(&rq->rt))
 		return max;
 
 	/*
@@ -224,20 +224,33 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
 	 * utilization (PELT windows are synchronized) we can directly add them
 	 * to obtain the CPU's actual utilization.
 	 */
-	util = cpu_util_cfs(rq);
+	util = util_cfs;
 	util += cpu_util_rt(rq);
 
-	/*
-	 * We do not make cpu_util_dl() a permanent part of this sum because we
-	 * want to use cpu_bw_dl() later on, but we need to check if the
-	 * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
-	 * f_max when there is no idle time.
-	 *
-	 * NOTE: numerical errors or stop class might cause us to not quite hit
-	 * saturation when we should -- something for later.
-	 */
-	if ((util + cpu_util_dl(rq)) >= max)
-		return max;
+	if (type == frequency_util) {
+		/*
+		 * For frequency selection we do not make cpu_util_dl() a
+		 * permanent part of this sum because we want to use
+		 * cpu_bw_dl() later on, but we need to check if the
+		 * CFS+RT+DL sum is saturated (ie. no idle time) such
+		 * that we select f_max when there is no idle time.
+		 *
+		 * NOTE: numerical errors or stop class might cause us
+		 * to not quite hit saturation when we should --
+		 * something for later.
+		 */
+
+		if ((util + cpu_util_dl(rq)) >= max)
+			return max;
+	} else {
+		/*
+		 * OTOH, for energy computation we need the estimated
+		 * running time, so include util_dl and ignore dl_bw.
+		 */
+		util += cpu_util_dl(rq);
+		if (util >= max)
+			return max;
+	}
 
 	/*
 	 * There is still idle time; further improve the number by using the
@@ -252,17 +265,34 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
 	util /= max;
 	util += irq;
 
-	/*
-	 * Bandwidth required by DEADLINE must always be granted while, for
-	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
-	 * to gracefully reduce the frequency when no tasks show up for longer
-	 * periods of time.
-	 *
-	 * Ideally we would like to set bw_dl as min/guaranteed freq and util +
-	 * bw_dl as requested freq. However, cpufreq is not yet ready for such
-	 * an interface. So, we only do the latter for now.
-	 */
-	return min(max, util + sg_cpu->bw_dl);
+	if (type == frequency_util) {
+		/*
+		 * Bandwidth required by DEADLINE must always be granted
+		 * while, for FAIR and RT, we use blocked utilization of
+		 * IDLE CPUs as a mechanism to gracefully reduce the
+		 * frequency when no tasks show up for longer periods of
+		 * time.
+		 *
+		 * Ideally we would like to set bw_dl as min/guaranteed
+		 * freq and util + bw_dl as requested freq. However,
+		 * cpufreq is not yet ready for such an interface. So,
+		 * we only do the latter for now.
+		 */
+		util += cpu_bw_dl(rq);
+	}
+
+	return min(max, util);
+}
+
+static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
+{
+	struct rq *rq = cpu_rq(sg_cpu->cpu);
+	unsigned long util = cpu_util_cfs(rq);
+
+	sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
+	sg_cpu->bw_dl = cpu_bw_dl(rq);
+
+	return schedutil_freq_util(sg_cpu->cpu, util, frequency_util);
 }
 
 /**
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 6d08ccd1e7a4..51e7f113ee23 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2185,7 +2185,15 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 # define arch_scale_freq_invariant()	false
 #endif
 
+enum schedutil_type {
+	frequency_util,
+	energy_util,
+};
+
 #ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
+unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
+				  enum schedutil_type type);
+
 static inline unsigned long cpu_bw_dl(struct rq *rq)
 {
 	return (rq->dl.running_bw * SCHED_CAPACITY_SCALE) >> BW_SHIFT;
@@ -2225,6 +2233,12 @@ static inline unsigned long cpu_util_irq(struct rq *rq)
 }
 
 #endif
+#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
+static inline unsigned long schedutil_freq_util(int cpu, unsigned long util,
+				  enum schedutil_type type)
+{
+	return util;
+}
 #endif
 
 #ifdef CONFIG_SMP
-- 
2.18.0


* [PATCH v5 11/14] sched/fair: Introduce an energy estimation helper function
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
                   ` (9 preceding siblings ...)
  2018-07-24 12:25 ` [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  2018-07-24 12:25 ` [PATCH v5 12/14] sched/fair: Select an energy-efficient CPU on task wake-up Quentin Perret
                   ` (2 subsequent siblings)
  13 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

In preparation for the definition of an energy-aware wakeup path,
introduce a helper function to estimate the consequence on system energy
when a specific task wakes up on a specific CPU. compute_energy()
estimates the capacity state to be reached by all frequency domains and
estimates the consumption of each online CPU according to its Energy
Model and its percentage of busy time.
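
The shape of that estimation can be sketched in user space as follows.
Everything here is hypothetical and for illustration only: struct
cap_state, struct fd_model and both functions are made-up names, and the
cost model in fd_energy() (pick the lowest capacity state that fits the
maximum utilization, then scale its power by the summed busy time) is an
assumption standing in for em_fd_energy(), which is not part of this
patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical capacity state: compute capacity and active power. */
struct cap_state {
	unsigned long cap;    /* capacity at this OPP, 0..1024 scale */
	unsigned long power;  /* active power at this OPP, arbitrary unit */
};

/* Hypothetical frequency domain with predicted per-CPU utilization. */
struct fd_model {
	const unsigned long *cpu_util;
	size_t nr_cpus;
	const struct cap_state *cs;   /* sorted by ascending capacity */
	size_t nr_cs;
	const struct fd_model *next;
};

/* Assumed cost model: lowest state fitting max_util, scaled by busy time. */
static unsigned long fd_energy(const struct fd_model *fd,
			       unsigned long max_util, unsigned long sum_util)
{
	const struct cap_state *cs;

	assert(fd->nr_cs > 0);

	/* Fall back to the highest state if nothing fits. */
	cs = &fd->cs[fd->nr_cs - 1];
	for (size_t i = 0; i < fd->nr_cs; i++) {
		if (fd->cs[i].cap >= max_util) {
			cs = &fd->cs[i];
			break;
		}
	}
	return cs->power * sum_util / cs->cap;
}

/* Walk every frequency domain and accumulate its estimated energy. */
static unsigned long estimate_energy(const struct fd_model *fd)
{
	unsigned long energy = 0;

	for (; fd; fd = fd->next) {
		unsigned long max_util = 0, sum_util = 0;

		for (size_t i = 0; i < fd->nr_cpus; i++) {
			if (fd->cpu_util[i] > max_util)
				max_util = fd->cpu_util[i];
			sum_util += fd->cpu_util[i];
		}
		energy += fd_energy(fd, max_util, sum_util);
	}
	return energy;
}
```

The maximum utilization drives the choice of operating point for the
whole domain, while the sum of utilizations determines how long the CPUs
are busy at that operating point.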

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/fair.c | 77 +++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 77 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 4aaa9132e840..dce2b1160cf4 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6292,6 +6292,83 @@ static int wake_cap(struct task_struct *p, int cpu, int prev_cpu)
 	return min_cap * 1024 < task_util(p) * capacity_margin;
 }
 
+/*
+ * Predicts what cpu_util(@cpu) would return if @p was migrated (and enqueued)
+ * to @dst_cpu.
+ */
+static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
+{
+	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
+	unsigned long util_est, util = READ_ONCE(cfs_rq->avg.util_avg);
+
+	/*
+	 * If @p migrates from @cpu to another, remove its contribution. Or,
+	 * if @p migrates from another CPU to @cpu, add its contribution. In
+	 * the other cases, @cpu is not impacted by the migration, so the
+	 * util_avg should already be correct.
+	 */
+	if (task_cpu(p) == cpu && dst_cpu != cpu)
+		util = max_t(long, util - task_util(p), 0);
+	else if (task_cpu(p) != cpu && dst_cpu == cpu)
+		util += task_util(p);
+
+	if (sched_feat(UTIL_EST)) {
+		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
+
+		/*
+		 * During wake-up, the task isn't enqueued yet and doesn't
+		 * appear in the cfs_rq->avg.util_est.enqueued of any rq,
+		 * so just add it (if needed) to "simulate" what will be
+		 * cpu_util() after the task has been enqueued.
+		 */
+		if (dst_cpu == cpu)
+			util_est += _task_util_est(p);
+
+		util = max(util, util_est);
+	}
+
+	return min_t(unsigned long, util, capacity_orig_of(cpu));
+}
+
+/*
+ * compute_energy(): Estimates the energy that would be consumed if @p was
+ * migrated to @dst_cpu. compute_energy() predicts what will be the utilization
+ * landscape of the CPUs after the task migration, and uses the Energy Model
+ * to compute what would be the energy if we decided to actually migrate that
+ * task.
+ */
+static long compute_energy(struct task_struct *p, int dst_cpu,
+							struct freq_domain *fd)
+{
+	long util, max_util, sum_util, energy = 0;
+	int cpu;
+
+	while (fd) {
+		max_util = sum_util = 0;
+		/*
+		 * The frequency of CPUs of the current rd can be driven by
+		 * CPUs of another rd if they belong to the same frequency
+		 * domain. So, account for the utilization of these CPUs too
+		 * by masking fd with cpu_online_mask instead of the rd span.
+		 *
+		 * If an entire frequency domain is outside of the current rd,
+		 * it will not appear in its fd list and will not be accounted
+		 * by compute_energy().
+		 */
+		for_each_cpu_and(cpu, freq_domain_span(fd), cpu_online_mask) {
+			util = cpu_util_next(cpu, p, dst_cpu);
+			util = schedutil_freq_util(cpu, util, energy_util);
+			max_util = max(util, max_util);
+			sum_util += util;
+		}
+
+		energy += em_fd_energy(fd->obj, max_util, sum_util);
+		fd = fd->next;
+	}
+
+	return energy;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
-- 
2.18.0


* [PATCH v5 12/14] sched/fair: Select an energy-efficient CPU on task wake-up
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
                   ` (10 preceding siblings ...)
  2018-07-24 12:25 ` [PATCH v5 11/14] sched/fair: Introduce an energy estimation helper function Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  2018-08-02 13:54   ` Peter Zijlstra
  2018-07-24 12:25 ` [PATCH v5 13/14] OPTIONAL: arch_topology: Start Energy Aware Scheduling Quentin Perret
  2018-07-24 12:25 ` [PATCH v5 14/14] OPTIONAL: cpufreq: dt: Register an Energy Model Quentin Perret
  13 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

If an Energy Model (EM) is available and if the system isn't
overutilized, re-route waking tasks into an energy-aware placement
algorithm. The selection of an energy-efficient CPU for a task
is achieved by estimating the impact on system-level active energy
resulting from the placement of the task on the CPU with the highest
spare capacity in each frequency domain. This strategy spreads tasks in
a frequency domain and avoids overly aggressive task packing. The best
CPU energy-wise is then selected if it saves a large enough amount of
energy with respect to prev_cpu.
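
The final comparison can be illustrated with a small user-space sketch.
It only models the last step (choosing among already-estimated
candidates), and struct candidate and pick_cpu() are hypothetical names.
The threshold of prev_energy / 16, i.e. 6.25%, mirrors the shift by 4
used in the patch.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical candidate: one CPU per frequency domain and its estimate. */
struct candidate {
	int cpu;
	unsigned long energy;
};

/*
 * Return the cheapest candidate, but only if it saves more than 1/16 of
 * the energy estimated for prev_cpu; otherwise stay on prev_cpu.
 */
static int pick_cpu(int prev_cpu, unsigned long prev_energy,
		    const struct candidate *cands, size_t n)
{
	unsigned long best_energy = prev_energy;
	int best_cpu = prev_cpu;

	assert(cands != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (cands[i].energy < best_energy) {
			best_energy = cands[i].energy;
			best_cpu = cands[i].cpu;
		}
	}

	if (prev_energy - best_energy > (prev_energy >> 4))
		return best_cpu;

	return prev_cpu;
}
```

With prev_energy = 1600, a candidate must come in below 1500 to win, so
small estimated gains do not cause a migration away from prev_cpu.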

Although it has already shown significant benefits on some existing
targets, this approach cannot scale to platforms with numerous CPUs.
This is an attempt to do something useful as writing a fast heuristic
that performs reasonably well on a broad spectrum of architectures isn't
an easy task. As such, the scope of usability of the energy-aware
wake-up path is restricted to systems with the SD_ASYM_CPUCAPACITY flag
set, and where the EM isn't too complex.

Cc: Ingo Molnar <mingo@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 kernel/sched/fair.c | 124 ++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 120 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index dce2b1160cf4..c1b789b80cec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6369,6 +6369,113 @@ static long compute_energy(struct task_struct *p, int dst_cpu,
 	return energy;
 }
 
+/*
+ * find_energy_efficient_cpu(): Find most energy-efficient target CPU for the
+ * waking task. find_energy_efficient_cpu() looks for the CPU with maximum
+ * spare capacity in each frequency domain and uses it as a potential
+ * candidate to execute the task. Then, it uses the Energy Model to figure
+ * out which of the CPU candidates is the most energy-efficient.
+ *
+ * The rationale for this heuristic is as follows. In a frequency domain,
+ * all the most energy efficient CPU candidates (according to the Energy
+ * Model) are those for which we'll request a low frequency. When there are
+ * several CPUs for which the frequency request will be the same, we don't
+ * have enough data to break the tie between them, because the Energy Model
+ * only includes active power costs. With this model, if we assume that
+ * frequency requests follow utilization (e.g. using schedutil), the CPU with
+ * the maximum spare capacity in a frequency domain is guaranteed to be among
+ * the best candidates of the frequency domain.
+ *
+ * In practice, it could be preferable from an energy standpoint to pack
+ * small tasks on a CPU in order to let other CPUs go in deeper idle states,
+ * but that could also hurt our chances to go cluster idle, and we have no
+ * ways to tell with the current Energy Model if this is actually a good
+ * idea or not. So, find_energy_efficient_cpu() basically favors
+ * cluster-packing, and spreading inside a cluster. That should at least be
+ * a good thing for latency, and this is consistent with the idea that most
+ * of the energy savings of EAS come from the asymmetry of the system, and
+ * not so much from breaking the tie between identical CPUs. That's also the
+ * reason why EAS is enabled in the topology code only for systems where
+ * SD_ASYM_CPUCAPACITY is set.
+ */
+static int find_energy_efficient_cpu(struct task_struct *p, int prev_cpu,
+							struct freq_domain *fd)
+{
+	unsigned long prev_energy = ULONG_MAX, best_energy = ULONG_MAX;
+	int cpu, best_energy_cpu = prev_cpu;
+	struct freq_domain *head = fd;
+	unsigned long cpu_cap, util;
+	struct sched_domain *sd;
+
+	sync_entity_load_avg(&p->se);
+
+	if (!task_util_est(p))
+		return prev_cpu;
+
+	/*
+	 * Energy-aware wake-up happens on the lowest sched_domain starting
+	 * from sd_ea spanning over this_cpu and prev_cpu.
+	 */
+	sd = rcu_dereference(*this_cpu_ptr(&sd_ea));
+	while (sd && !cpumask_test_cpu(prev_cpu, sched_domain_span(sd)))
+		sd = sd->parent;
+	if (!sd)
+		return prev_cpu;
+
+	while (fd) {
+		unsigned long cur_energy, spare_cap, max_spare_cap = 0;
+		int max_spare_cap_cpu = -1;
+
+		for_each_cpu_and(cpu, freq_domain_span(fd), sched_domain_span(sd)) {
+			if (!cpumask_test_cpu(cpu, &p->cpus_allowed))
+				continue;
+
+			/* Skip CPUs that will be overutilized. */
+			util = cpu_util_next(cpu, p, cpu);
+			cpu_cap = capacity_of(cpu);
+			if (cpu_cap * 1024 < util * capacity_margin)
+				continue;
+
+			/* Always use prev_cpu as a candidate. */
+			if (cpu == prev_cpu) {
+				prev_energy = compute_energy(p, prev_cpu, head);
+				if (prev_energy < best_energy)
+					best_energy = prev_energy;
+				continue;
+			}
+
+			/*
+			 * Find the CPU with the maximum spare capacity in
+			 * the frequency domain
+			 */
+			spare_cap = cpu_cap - util;
+			if (spare_cap > max_spare_cap) {
+				max_spare_cap = spare_cap;
+				max_spare_cap_cpu = cpu;
+			}
+		}
+
+		/* Evaluate the energy impact of using this CPU. */
+		if (max_spare_cap_cpu >= 0) {
+			cur_energy = compute_energy(p, max_spare_cap_cpu, head);
+			if (cur_energy < best_energy) {
+				best_energy = cur_energy;
+				best_energy_cpu = max_spare_cap_cpu;
+			}
+		}
+		fd = fd->next;
+	}
+
+	/*
+	 * Pick the best CPU only if it saves at least 6% of the
+	 * energy used by prev_cpu.
+	 */
+	if ((prev_energy - best_energy) > (prev_energy >> 4))
+		return best_energy_cpu;
+
+	return prev_cpu;
+}
+
 /*
  * select_task_rq_fair: Select target runqueue for the waking task in domains
  * that have the 'sd_flag' flag set. In practice, this is SD_BALANCE_WAKE,
@@ -6385,18 +6492,26 @@ static int
 select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
 {
 	struct sched_domain *tmp, *sd = NULL;
+	struct freq_domain *fd;
 	int cpu = smp_processor_id();
 	int new_cpu = prev_cpu;
-	int want_affine = 0;
+	int want_affine = 0, want_energy = 0;
 	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
 
+	rcu_read_lock();
 	if (sd_flag & SD_BALANCE_WAKE) {
 		record_wakee(p);
-		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
-			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
+		fd = rd_freq_domain(cpu_rq(cpu)->rd);
+		want_energy = fd && !READ_ONCE(cpu_rq(cpu)->rd->overutilized);
+		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
+			      cpumask_test_cpu(cpu, &p->cpus_allowed);
+	}
+
+	if (want_energy) {
+		new_cpu = find_energy_efficient_cpu(p, prev_cpu, fd);
+		goto unlock;
 	}
 
-	rcu_read_lock();
 	for_each_domain(cpu, tmp) {
 		if (!(tmp->flags & SD_LOAD_BALANCE))
 			break;
@@ -6431,6 +6546,7 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_f
 		if (want_affine)
 			current->recent_used_cpu = cpu;
 	}
+unlock:
 	rcu_read_unlock();
 
 	return new_cpu;
-- 
2.18.0


* [PATCH v5 13/14] OPTIONAL: arch_topology: Start Energy Aware Scheduling
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
                   ` (11 preceding siblings ...)
  2018-07-24 12:25 ` [PATCH v5 12/14] sched/fair: Select an energy-efficient CPU on task wake-up Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  2018-07-24 12:25 ` [PATCH v5 14/14] OPTIONAL: cpufreq: dt: Register an Energy Model Quentin Perret
  13 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

Energy Aware Scheduling (EAS) starts when the scheduling domains are
built if the Energy Model (EM) is present. However, in the typical case
of Arm/Arm64 systems, the EM is provided after the scheduling domains
are first built at boot time, which results in EAS staying disabled.

Fix this issue by re-building the scheduling domain from the arch
topology driver, once CPUfreq is up and running and the asymmetry in CPU
capacities has been detected.

Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 drivers/base/arch_topology.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/base/arch_topology.c b/drivers/base/arch_topology.c
index e7cb0c6ade81..5b9f107f2e4a 100644
--- a/drivers/base/arch_topology.c
+++ b/drivers/base/arch_topology.c
@@ -10,6 +10,7 @@
 #include <linux/arch_topology.h>
 #include <linux/cpu.h>
 #include <linux/cpufreq.h>
+#include <linux/cpuset.h>
 #include <linux/device.h>
 #include <linux/of.h>
 #include <linux/slab.h>
@@ -247,6 +248,7 @@ static void parsing_done_workfn(struct work_struct *work)
 	cpufreq_unregister_notifier(&init_cpu_capacity_notifier,
 					 CPUFREQ_POLICY_NOTIFIER);
 	free_cpumask_var(cpus_to_visit);
+	rebuild_sched_domains();
 }
 
 #else
-- 
2.18.0


^ permalink raw reply	[flat|nested] 72+ messages in thread

* [PATCH v5 14/14] OPTIONAL: cpufreq: dt: Register an Energy Model
  2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
                   ` (12 preceding siblings ...)
  2018-07-24 12:25 ` [PATCH v5 13/14] OPTIONAL: arch_topology: Start Energy Aware Scheduling Quentin Perret
@ 2018-07-24 12:25 ` Quentin Perret
  13 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-07-24 12:25 UTC (permalink / raw)
  To: peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, quentin.perret

*******************************************************************
* This patch illustrates the usage of the newly introduced Energy *
* Model framework and isn't supposed to be merged as-is.          *
*******************************************************************

The Energy Model framework provides an API to register the active power
of CPUs. Call this API from the cpufreq-dt driver with an estimation
of the power as P = C * V^2 * f with C, V, and f respectively the
capacitance of the CPU and the voltage and frequency of the OPP.

The CPU capacitance is read from the "dynamic-power-coefficient" DT
binding (originally introduced for thermal/IPA), and the voltage and
frequency values from PM_OPP.
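
As a rough illustration of the estimate described above, the arithmetic
can be sketched in plain user-space C. The function name est_power_mw()
and its parameters are made up for this example. The coefficient is
assumed to be expressed in uW/MHz/V^2, which is what makes the division
by 10^9 come out in mW; treat the units as an assumption, not as a
statement about any particular platform.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper mirroring the arithmetic of the estimate:
 * P = C * V^2 * f, with
 *   cap : dynamic power coefficient, assumed to be in uW/MHz/V^2
 *   mv  : OPP voltage in millivolts
 *   mhz : OPP frequency in megahertz
 * (mV)^2 contributes a factor of 10^-6 and uW -> mW another 10^-3,
 * hence the division by 10^9. The result is truncated to whole mW.
 */
uint64_t est_power_mw(uint32_t cap, uint64_t mv, uint64_t mhz)
{
	/*
	 * Loose sanity bounds for the sketch; within them the 64-bit
	 * product below cannot overflow.
	 */
	assert(cap <= 1000000 && mv <= 10000 && mhz <= 100000);

	return (uint64_t)cap * mv * mv * mhz / 1000000000ULL;
}
```

With a coefficient of 100, an OPP at 1000 mV and 1000 MHz comes out at
100 mW under these assumptions.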

Cc: "Rafael J. Wysocki" <rjw@rjwysocki.net>
Cc: Viresh Kumar <viresh.kumar@linaro.org>
Signed-off-by: Quentin Perret <quentin.perret@arm.com>
---
 drivers/cpufreq/cpufreq-dt.c | 45 +++++++++++++++++++++++++++++++++++-
 1 file changed, 44 insertions(+), 1 deletion(-)

diff --git a/drivers/cpufreq/cpufreq-dt.c b/drivers/cpufreq/cpufreq-dt.c
index 0a9ebf00be46..399f54d0abf2 100644
--- a/drivers/cpufreq/cpufreq-dt.c
+++ b/drivers/cpufreq/cpufreq-dt.c
@@ -16,6 +16,7 @@
 #include <linux/cpu_cooling.h>
 #include <linux/cpufreq.h>
 #include <linux/cpumask.h>
+#include <linux/energy_model.h>
 #include <linux/err.h>
 #include <linux/module.h>
 #include <linux/of.h>
@@ -149,8 +150,47 @@ static int resources_available(void)
 	return 0;
 }
 
+static int of_est_power(unsigned long *mW, unsigned long *KHz, int cpu)
+{
+	unsigned long mV, Hz, MHz;
+	struct device *cpu_dev;
+	struct dev_pm_opp *opp;
+	struct device_node *np;
+	u32 cap;
+	u64 tmp;
+
+	cpu_dev = get_cpu_device(cpu);
+	if (!cpu_dev)
+		return -ENODEV;
+
+	np = of_node_get(cpu_dev->of_node);
+	if (!np)
+		return -EINVAL;
+
+	if (of_property_read_u32(np, "dynamic-power-coefficient", &cap))
+		return -EINVAL;
+
+	Hz = *KHz * 1000;
+	opp = dev_pm_opp_find_freq_ceil(cpu_dev, &Hz);
+	if (IS_ERR(opp))
+		return -EINVAL;
+
+	mV = dev_pm_opp_get_voltage(opp) / 1000;
+	dev_pm_opp_put(opp);
+
+	MHz = Hz / 1000000;
+	tmp = (u64)cap * mV * mV * MHz;
+	do_div(tmp, 1000000000);
+
+	*mW = (unsigned long)tmp;
+	*KHz = Hz / 1000;
+
+	return 0;
+}
+
 static int cpufreq_init(struct cpufreq_policy *policy)
 {
+	struct em_data_callback em_cb = EM_DATA_CB(of_est_power);
 	struct cpufreq_frequency_table *freq_table;
 	struct opp_table *opp_table = NULL;
 	struct private_data *priv;
@@ -159,7 +199,7 @@ static int cpufreq_init(struct cpufreq_policy *policy)
 	unsigned int transition_latency;
 	bool fallback = false;
 	const char *name;
-	int ret;
+	int ret, nr_opp;
 
 	cpu_dev = get_cpu_device(policy->cpu);
 	if (!cpu_dev) {
@@ -226,6 +266,7 @@ static int cpufreq_init(struct cpufreq_policy *policy)
 		ret = -EPROBE_DEFER;
 		goto out_free_opp;
 	}
+	nr_opp = ret;
 
 	if (fallback) {
 		cpumask_setall(policy->cpus);
@@ -278,6 +319,8 @@ static int cpufreq_init(struct cpufreq_policy *policy)
 	policy->cpuinfo.transition_latency = transition_latency;
 	policy->dvfs_possible_from_any_cpu = true;
 
+	em_register_freq_domain(policy->cpus, nr_opp, &em_cb);
+
 	return 0;
 
 out_free_cpufreq_table:
-- 
2.18.0


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 06/14] sched/topology: Lowest energy aware balancing sched_domain level pointer
  2018-07-24 12:25 ` [PATCH v5 06/14] sched/topology: Lowest energy aware balancing sched_domain level pointer Quentin Perret
@ 2018-07-26 16:00   ` Valentin Schneider
  2018-07-26 17:01     ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Valentin Schneider @ 2018-07-26 16:00 UTC (permalink / raw)
  To: Quentin Perret, peterz, rjw, linux-kernel, linux-pm
  Cc: gregkh, mingo, dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, vincent.guittot, thara.gopinath, viresh.kumar,
	tkjos, joel, smuckle, adharmap, skannan, pkondeti, juri.lelli,
	edubezval, srinivas.pandruvada, currojerez, javi.merino

Hi,

On 24/07/18 13:25, Quentin Perret wrote:
> Add another member to the family of per-cpu sched_domain shortcut
> pointers. This one, sd_ea, points to the lowest level at which energy
> aware scheduling should be used.
> 
> Generally speaking, the largest opportunity to save energy via scheduling
> comes from a smarter exploitation of heterogeneous platforms (i.e.
> big.LITTLE). Consequently, the sd_ea shortcut is wired to the lowest
> scheduling domain at which the SD_ASYM_CPUCAPACITY flag is set. For
> example, it is possible to apply Energy-Aware Scheduling within a socket
> on a multi-socket system, as long as each socket has an asymmetric
> topology. Cross-sockets wake-up balancing will only happen when the
> system is over-utilized, or this_cpu and prev_cpu are in different
> sockets.
> 
> cc: Ingo Molnar <mingo@redhat.com>
> cc: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Morten Rasmussen <morten.rasmussen@arm.com>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> ---
>  kernel/sched/sched.h    | 1 +
>  kernel/sched/topology.c | 4 ++++
>  2 files changed, 5 insertions(+)
> 
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index fdf6924d53e7..25d64a0b6fe0 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1198,6 +1198,7 @@ DECLARE_PER_CPU(int, sd_llc_id);
>  DECLARE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
>  DECLARE_PER_CPU(struct sched_domain *, sd_numa);
>  DECLARE_PER_CPU(struct sched_domain *, sd_asym);
> +DECLARE_PER_CPU(struct sched_domain *, sd_ea);

There's already the asym-packing shortcut which is making naming a bit more
tedious, but should that really be named energy-aware? IMO it's just the
lowest level at which we can see asymmetry, so perhaps it should be named
as such, i.e. something like sd_asym_capa (and perhaps rename the other one
as sd_asym_pack)?
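
To make the "lowest level at which we can see asymmetry" point concrete,
here is a stand-alone sketch of the lookup. The struct, the flag values
and the function below are simplified stand-ins invented for this
example, not the kernel's definitions; the only assumption carried over
is that domains are linked from the lowest level upwards through a
parent pointer.

```c
#include <assert.h>
#include <stddef.h>

/* Illustrative flag values; the kernel's real SD_* values are not used. */
#define DEMO_SD_ASYM_PACKING		0x1
#define DEMO_SD_ASYM_CPUCAPACITY	0x2

/* Simplified stand-in for struct sched_domain. */
struct demo_domain {
	struct demo_domain *parent;	/* next level up, NULL at the top */
	int flags;
};

/*
 * Walk up from the lowest domain of a CPU and return the first level
 * that has 'flag' set, or NULL if no level has it.
 */
struct demo_domain *demo_lowest_flag_domain(struct demo_domain *lowest,
					    int flag)
{
	struct demo_domain *sd;

	assert(flag != 0);

	for (sd = lowest; sd; sd = sd->parent) {
		if (sd->flags & flag)
			break;
	}

	return sd;
}
```

On a system where only the top level spans CPUs of different capacities,
this returns that top level, and nothing in the lookup itself is specific
to energy-aware scheduling.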

>  
>  struct sched_group_capacity {
>  	atomic_t		ref;
> diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
> index ade1eae9d21b..8f3f746b0d5e 100644
> --- a/kernel/sched/topology.c
> +++ b/kernel/sched/topology.c
> @@ -514,6 +514,7 @@ DEFINE_PER_CPU(int, sd_llc_id);
>  DEFINE_PER_CPU(struct sched_domain_shared *, sd_llc_shared);
>  DEFINE_PER_CPU(struct sched_domain *, sd_numa);
>  DEFINE_PER_CPU(struct sched_domain *, sd_asym);
> +DEFINE_PER_CPU(struct sched_domain *, sd_ea);
>  
>  static void update_top_cache_domain(int cpu)
>  {
> @@ -539,6 +540,9 @@ static void update_top_cache_domain(int cpu)
>  
>  	sd = highest_flag_domain(cpu, SD_ASYM_PACKING);
>  	rcu_assign_pointer(per_cpu(sd_asym, cpu), sd);
> +
> +	sd = lowest_flag_domain(cpu, SD_ASYM_CPUCAPACITY);
> +	rcu_assign_pointer(per_cpu(sd_ea, cpu), sd);
>  }
>  
>  /*
> 

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 06/14] sched/topology: Lowest energy aware balancing sched_domain level pointer
  2018-07-26 16:00   ` Valentin Schneider
@ 2018-07-26 17:01     ` Quentin Perret
  0 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-07-26 17:01 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: peterz, rjw, linux-kernel, linux-pm, gregkh, mingo,
	dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, vincent.guittot, thara.gopinath, viresh.kumar,
	tkjos, joel, smuckle, adharmap, skannan, pkondeti, juri.lelli,
	edubezval, srinivas.pandruvada, currojerez, javi.merino

On Thursday 26 Jul 2018 at 17:00:50 (+0100), Valentin Schneider wrote:
> > +DECLARE_PER_CPU(struct sched_domain *, sd_ea);
> 
> There's already the asym-packing shortcut which is making naming a bit more
> tedious, but should that really be named energy-aware? IMO it's just the
> lowest level at which we can see asymmetry, so perhaps it should be named
> as such, i.e. something like sd_asym_capa (and perhaps rename the other one
> as sd_asym_pack)?

So the 'sd_ea' name comes from how we use this shortcut, and also from
the fact that this is what the EAS shortcut has always been called,
historically.

But TBH, if nobody is against that, I don't mind changing the name at all.
And maybe we should go for the longer names? 'sd_asym_packing' and
'sd_asym_cpucapacity' aren't that long, I think. And they're
descriptive :-)

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-07-24 12:25 ` [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method Quentin Perret
@ 2018-07-30 19:35   ` skannan
  2018-07-31  7:59     ` Quentin Perret
  2018-08-02 12:33     ` Peter Zijlstra
  0 siblings, 2 replies; 72+ messages in thread
From: skannan @ 2018-07-30 19:35 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, rjw, linux-kernel, linux-pm, gregkh, mingo,
	dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, linux-pm-owner

On 2018-07-24 05:25, Quentin Perret wrote:
> Schedutil aggregates the PELT signals of CFS, RT, DL and IRQ in order
> to decide which frequency to request. Energy Aware Scheduling (EAS)
> needs to be able to predict those requests to assess the energy impact
> of scheduling decisions. However, the PELT signals aggregation is only
> done in schedutil for now, hence making it hard to synchronize it with
> EAS.
> 
> To address this issue, introduce schedutil_freq_util() to perform the
> aforementioned aggregation and make it available to other parts of the
> scheduler. Since frequency selection and energy estimation still need
> to deal with RT and DL signals slightly differently, schedutil_freq_util()
> is called with a different 'type' parameter in those two contexts, and
> returns an aggregated utilization signal accordingly.
> 
> Cc: Ingo Molnar <mingo@redhat.com>
> Cc: Peter Zijlstra <peterz@infradead.org>
> Suggested-by: Peter Zijlstra <peterz@infradead.org>
> Signed-off-by: Quentin Perret <quentin.perret@arm.com>
> ---
>  kernel/sched/cpufreq_schedutil.c | 86 +++++++++++++++++++++-----------
>  kernel/sched/sched.h             | 14 ++++++
>  2 files changed, 72 insertions(+), 28 deletions(-)
> 
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index 810193c8e4cd..af86050edcf5 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -198,15 +198,15 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>   * based on the task model parameters and gives the minimal utilization
>   * required to meet deadlines.
>   */
> -static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> +unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> +				  enum schedutil_type type)
>  {
> -	struct rq *rq = cpu_rq(sg_cpu->cpu);
> +	struct rq *rq = cpu_rq(cpu);
>  	unsigned long util, irq, max;
> 
> -	sg_cpu->max = max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> -	sg_cpu->bw_dl = cpu_bw_dl(rq);
> +	max = arch_scale_cpu_capacity(NULL, cpu);
> 
> -	if (rt_rq_is_runnable(&rq->rt))
> +	if (type == frequency_util && rt_rq_is_runnable(&rq->rt))
>  		return max;
> 
>  	/*
> @@ -224,20 +224,33 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
>  	 * utilization (PELT windows are synchronized) we can directly add them
>  	 * to obtain the CPU's actual utilization.
>  	 */
> -	util = cpu_util_cfs(rq);
> +	util = util_cfs;
>  	util += cpu_util_rt(rq);
> 
> -	/*
> -	 * We do not make cpu_util_dl() a permanent part of this sum because we
> -	 * want to use cpu_bw_dl() later on, but we need to check if the
> -	 * CFS+RT+DL sum is saturated (ie. no idle time) such that we select
> -	 * f_max when there is no idle time.
> -	 *
> -	 * NOTE: numerical errors or stop class might cause us to not quite hit
> -	 * saturation when we should -- something for later.
> -	 */
> -	if ((util + cpu_util_dl(rq)) >= max)
> -		return max;
> +	if (type == frequency_util) {
> +		/*
> +		 * For frequency selection we do not make cpu_util_dl() a
> +		 * permanent part of this sum because we want to use
> +		 * cpu_bw_dl() later on, but we need to check if the
> +		 * CFS+RT+DL sum is saturated (ie. no idle time) such
> +		 * that we select f_max when there is no idle time.
> +		 *
> +		 * NOTE: numerical errors or stop class might cause us
> +		 * to not quite hit saturation when we should --
> +		 * something for later.
> +		 */
> +
> +		if ((util + cpu_util_dl(rq)) >= max)
> +			return max;
> +	} else {
> +		/*
> +		 * OTOH, for energy computation we need the estimated
> +		 * running time, so include util_dl and ignore dl_bw.
> +		 */
> +		util += cpu_util_dl(rq);
> +		if (util >= max)
> +			return max;
> +	}

If it's going to be a different aggregation from what's done for 
frequency guidance, I don't see the point of having this inside 
schedutil. Why not keep it inside the scheduler files? Also, it seems 
weird to use a governor's code when it might not actually be in use. 
What if someone is using ondemand, conservative, performance, etc?

> 
>  	/*
>  	 * There is still idle time; further improve the number by using the
> @@ -252,17 +265,34 @@ static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
>  	util /= max;
>  	util += irq;
> 
> -	/*
> -	 * Bandwidth required by DEADLINE must always be granted while, for
> -	 * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
> -	 * to gracefully reduce the frequency when no tasks show up for longer
> -	 * periods of time.
> -	 *
> -	 * Ideally we would like to set bw_dl as min/guaranteed freq and util +
> -	 * bw_dl as requested freq. However, cpufreq is not yet ready for such
> -	 * an interface. So, we only do the latter for now.
> -	 */
> -	return min(max, util + sg_cpu->bw_dl);
> +	if (type == frequency_util) {
> +		/*
> +		 * Bandwidth required by DEADLINE must always be granted
> +		 * while, for FAIR and RT, we use blocked utilization of
> +		 * IDLE CPUs as a mechanism to gracefully reduce the
> +		 * frequency when no tasks show up for longer periods of
> +		 * time.
> +		 *
> +		 * Ideally we would like to set bw_dl as min/guaranteed
> +		 * freq and util + bw_dl as requested freq. However,
> +		 * cpufreq is not yet ready for such an interface. So,
> +		 * we only do the latter for now.
> +		 */
> +		util += cpu_bw_dl(rq);
> +	}

Instead of all this indentation, can't you just return early without 
doing the code inside the if?

> +
> +	return min(max, util);
> +}
> +
> +static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
> +{
> +	struct rq *rq = cpu_rq(sg_cpu->cpu);
> +	unsigned long util = cpu_util_cfs(rq);
> +
> +	sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
> +	sg_cpu->bw_dl = cpu_bw_dl(rq);
> +
> +	return schedutil_freq_util(sg_cpu->cpu, util, frequency_util);
>  }
> 
>  /**
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 6d08ccd1e7a4..51e7f113ee23 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2185,7 +2185,15 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>  # define arch_scale_freq_invariant()	false
>  #endif
> 
> +enum schedutil_type {
> +	frequency_util,
> +	energy_util,
> +};

Please don't use lower case for enums. It's extremely confusing.

Thanks,
Saravana

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-07-30 19:35   ` skannan
@ 2018-07-31  7:59     ` Quentin Perret
  2018-07-31 19:31       ` skannan
  2018-08-02 12:33     ` Peter Zijlstra
  1 sibling, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-07-31  7:59 UTC (permalink / raw)
  To: skannan
  Cc: peterz, rjw, linux-kernel, linux-pm, gregkh, mingo,
	dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, linux-pm-owner

On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
[...]
> If it's going to be a different aggregation from what's done for frequency
> guidance, I don't see the point of having this inside schedutil. Why not
> keep it inside the scheduler files?

This code basically results from a discussion we had with Peter on v4.
Keeping everything centralized can make sense from a maintenance
perspective, I think. That makes it easy to see the impact of any change
to utilization signals for both EAS and schedutil.


> Also, it seems weird to use a governor's
> code when it might not actually be in use. What if someone is using
> ondemand, conservative, performance, etc?

Yeah I thought about that too ... I would say that even if you don't
use schedutil, it is probably fair for the scheduler to assume that OPPs
follow utilization, at least in a very loose way. So yes, if you use
ondemand with EAS you won't have a perfectly consistent match between
the frequency requests and what EAS predicts, and that might result in
sub-optimal decisions in some cases, but I'm not sure if we can do
anything better at this stage.

Also, if you do use schedutil, EAS will accurately predict what the
frequency _request_ will be, but that gives you no guarantee whatsoever
that you'll actually get it for real (because you're throttled, or
because of thermal capping, or simply because the HW refuses it for some
reason ...).

There will be inconsistencies between EAS' predictions and the actual
frequencies, and we have to live with that. The best we can do is make
sure we're at least internally consistent from the scheduler's
standpoint, and that's why I think it can make sense to factorize as
many things as possible with schedutil where applicable.
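
To illustrate the two aggregation modes, here is a stand-alone sketch of
the logic in the patch. The per-class signals are passed in as plain
numbers instead of being read from the runqueue, and all the names
(demo_util_type, struct demo_signals, demo_freq_util) are invented for
this example, so take it as an approximation of the idea and not as the
actual code.

```c
#include <assert.h>
#include <stdbool.h>

enum demo_util_type { DEMO_FREQUENCY_UTIL, DEMO_ENERGY_UTIL };

/* Per-CPU inputs, all on the same scale as 'max' (the CPU capacity). */
struct demo_signals {
	unsigned long cfs, rt, dl, irq;	/* measured utilization per class */
	unsigned long bw_dl;		/* bandwidth reserved for DEADLINE */
	bool rt_runnable;
};

static unsigned long demo_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

unsigned long demo_freq_util(const struct demo_signals *s, unsigned long max,
			     enum demo_util_type type)
{
	unsigned long util;

	assert(max > 0);

	/* Policy: a runnable RT task asks for the max frequency. */
	if (type == DEMO_FREQUENCY_UTIL && s->rt_runnable)
		return max;

	/* IRQ/steal time alone saturates the CPU. */
	if (s->irq >= max)
		return max;

	util = s->cfs + s->rt;

	if (type == DEMO_FREQUENCY_UTIL) {
		/* DL only takes part in the saturation check here. */
		if (util + s->dl >= max)
			return max;
	} else {
		/* Energy estimation wants the measured DL running time. */
		util += s->dl;
		if (util >= max)
			return max;
	}

	/* Scale task utilization by the time not spent in IRQ context. */
	util = util * (max - s->irq) / max + s->irq;

	/* Frequency selection adds the reserved DL bandwidth instead. */
	if (type == DEMO_FREQUENCY_UTIL)
		util += s->bw_dl;

	return demo_min(max, util);
}
```

The two modes only differ in how they treat RT and DL: the RT check at
the top and the bw_dl addition at the bottom are frequency-selection
policy, while the energy mode sticks to measured utilization.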

> > +	if (type == frequency_util) {
> > +		/*
> > +		 * Bandwidth required by DEADLINE must always be granted
> > +		 * while, for FAIR and RT, we use blocked utilization of
> > +		 * IDLE CPUs as a mechanism to gracefully reduce the
> > +		 * frequency when no tasks show up for longer periods of
> > +		 * time.
> > +		 *
> > +		 * Ideally we would like to set bw_dl as min/guaranteed
> > +		 * freq and util + bw_dl as requested freq. However,
> > +		 * cpufreq is not yet ready for such an interface. So,
> > +		 * we only do the latter for now.
> > +		 */
> > +		util += cpu_bw_dl(rq);
> > +	}
> 
> Instead of all this indentation, can't you just return early without doing
> the code inside the if?

But then I'll need to duplicate the 'min' below, so not sure if it's
worth it ?
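
For reference, the two shapes being discussed would look roughly like
this. It is a stand-alone sketch with made-up names (tail_single_exit,
tail_early_return, tail_check), operating on plain numbers and covering
only the end of the function: the early-return form drops one
indentation level but repeats the clamp.

```c
#include <assert.h>

enum tail_type { TAIL_FREQUENCY_UTIL, TAIL_ENERGY_UTIL };

static unsigned long tail_min(unsigned long a, unsigned long b)
{
	return a < b ? a : b;
}

/* Shape used in the patch: one exit point, bw_dl added inside the if. */
unsigned long tail_single_exit(unsigned long util, unsigned long bw_dl,
			       unsigned long max, enum tail_type type)
{
	if (type == TAIL_FREQUENCY_UTIL)
		util += bw_dl;

	return tail_min(max, util);
}

/* Early-return shape: less nesting, but the clamp appears twice. */
unsigned long tail_early_return(unsigned long util, unsigned long bw_dl,
				unsigned long max, enum tail_type type)
{
	if (type != TAIL_FREQUENCY_UTIL)
		return tail_min(max, util);

	return tail_min(max, util + bw_dl);
}

/* Check that both shapes agree for one set of inputs. */
void tail_check(unsigned long util, unsigned long bw_dl, unsigned long max)
{
	assert(tail_single_exit(util, bw_dl, max, TAIL_FREQUENCY_UTIL) ==
	       tail_early_return(util, bw_dl, max, TAIL_FREQUENCY_UTIL));
	assert(tail_single_exit(util, bw_dl, max, TAIL_ENERGY_UTIL) ==
	       tail_early_return(util, bw_dl, max, TAIL_ENERGY_UTIL));
}
```

Both versions return the same values, so the choice is purely about
readability.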

> > +enum schedutil_type {
> > +	frequency_util,
> > +	energy_util,
> > +};
> 
> Please don't use lower case for enums. It's extremely confusing.

Ok, I'll change that in v6.

Thanks !
Quentin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-07-31  7:59     ` Quentin Perret
@ 2018-07-31 19:31       ` skannan
  2018-08-01  7:32         ` Rafael J. Wysocki
  0 siblings, 1 reply; 72+ messages in thread
From: skannan @ 2018-07-31 19:31 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, rjw, linux-kernel, linux-pm, gregkh, mingo,
	dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, linux-pm-owner

On 2018-07-31 00:59, Quentin Perret wrote:
> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
> [...]
>> If it's going to be a different aggregation from what's done for frequency
>> guidance, I don't see the point of having this inside schedutil. Why not
>> keep it inside the scheduler files?
> 
> This code basically results from a discussion we had with Peter on v4.
> Keeping everything centralized can make sense from a maintenance
> perspective, I think. That makes it easy to see the impact of any change
> to utilization signals for both EAS and schedutil.

In that case, I'd argue it makes more sense to keep the code centralized 
in the scheduler. The scheduler can let schedutil know about the 
utilization after it aggregates them. There's no need for a cpufreq 
governor to know that there are scheduling classes or how many there 
are. And the scheduler can then choose to aggregate one way for task 
packing and another way for frequency guidance.

It just seems so weird to have logic that's very essential for task 
placement to be inside a cpufreq governor.

>> Also, it seems weird to use a governor's
>> code when it might not actually be in use. What if someone is using
>> ondemand, conservative, performance, etc?
> 
> Yeah I thought about that too ... I would say that even if you don't
> use schedutil, it is probably a fair assumption from the scheduler's
> standpoint to assume that somewhat OPPs follow utilization (in a very
> loose way). So yes, if you use ondemand with EAS you won't have a
> perfectly consistent match between the frequency requests and what EAS
> predicts, and that might result in sub-optimal decisions in some cases,
> but I'm not sure if we can do anything better at this stage.
> 
> Also, if you do use schedutil, EAS will accurately predict what will be
> the frequency _request_, but that gives you no guarantee whatsoever that
> you'll actually get it for real (because you're throttled, or because of
> thermal capping, or simply because the HW refuses it for some reason ...).
> 
> There will be inconsistencies between EAS' predictions and the actual
> frequencies, and we have to live with that. The best we can do is make
> sure we're at least internally consistent from the scheduler's
> standpoint, and that's why I think it can make sense to factorize as
> many things as possible with schedutil where applicable.
> 
>> > +	if (type == frequency_util) {
>> > +		/*
>> > +		 * Bandwidth required by DEADLINE must always be granted
>> > +		 * while, for FAIR and RT, we use blocked utilization of
>> > +		 * IDLE CPUs as a mechanism to gracefully reduce the
>> > +		 * frequency when no tasks show up for longer periods of
>> > +		 * time.
>> > +		 *
>> > +		 * Ideally we would like to set bw_dl as min/guaranteed
>> > +		 * freq and util + bw_dl as requested freq. However,
>> > +		 * cpufreq is not yet ready for such an interface. So,
>> > +		 * we only do the latter for now.
>> > +		 */
>> > +		util += cpu_bw_dl(rq);
>> > +	}
>> 
>> Instead of all this indentation, can't you just return early without doing
>> the code inside the if?
> 
> But then I'll need to duplicate the 'min' below, so not sure if it's
> worth it ?

I feel like less indentation where reasonably possible leads to more 
readability. But I don't have a strong opinion in this specific case.

>> > +enum schedutil_type {
>> > +	frequency_util,
>> > +	energy_util,
>> > +};
>> 
>> Please don't use lower case for enums. It's extremely confusing.
> 
> Ok, I'll change that in v6.

Thanks.

-Saravana

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-07-31 19:31       ` skannan
@ 2018-08-01  7:32         ` Rafael J. Wysocki
  2018-08-01  8:23           ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Rafael J. Wysocki @ 2018-08-01  7:32 UTC (permalink / raw)
  To: Saravana Kannan, Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, Linux Kernel Mailing List,
	Linux PM, Greg Kroah-Hartman, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Vincent Guittot, Thara Gopinath,
	Viresh Kumar, Todd Kjos, Joel Fernandes, Steve Muckle, adharmap,
	skannan, Pavan Kondeti, Juri Lelli, Eduardo Valentin,
	Srinivas Pandruvada, currojerez, Javi Merino, linux-pm-owner

On Tue, Jul 31, 2018 at 9:31 PM,  <skannan@codeaurora.org> wrote:
> On 2018-07-31 00:59, Quentin Perret wrote:
>>
>> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
>> [...]
>>>
>>> If it's going to be a different aggregation from what's done for frequency
>>> guidance, I don't see the point of having this inside schedutil. Why not
>>> keep it inside the scheduler files?
>>
>>
>> This code basically results from a discussion we had with Peter on v4.
>> Keeping everything centralized can make sense from a maintenance
>> perspective, I think. That makes it easy to see the impact of any change
>> to utilization signals for both EAS and schedutil.
>
>
> In that case, I'd argue it makes more sense to keep the code centralized in
> the scheduler. The scheduler can let schedutil know about the utilization
> after it aggregates them. There's no need for a cpufreq governor to know
> that there are scheduling classes or how many there are. And the scheduler
> can then choose to aggregate one way for task packing and another way for
> frequency guidance.

Also the aggregate utilization may be used by cpuidle governors in
principle to decide how deep they can go with idle state selection.

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-08-01  7:32         ` Rafael J. Wysocki
@ 2018-08-01  8:23           ` Quentin Perret
  2018-08-01  8:35             ` Rafael J. Wysocki
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-01  8:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Saravana Kannan, Peter Zijlstra, Rafael J. Wysocki,
	Linux Kernel Mailing List, Linux PM, Greg Kroah-Hartman,
	Ingo Molnar, Dietmar Eggemann, Morten Rasmussen, Chris Redpath,
	Patrick Bellasi, Valentin Schneider, Vincent Guittot,
	Thara Gopinath, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Steve Muckle, adharmap, skannan, Pavan Kondeti, Juri Lelli,
	Eduardo Valentin, Srinivas Pandruvada, currojerez, Javi Merino,
	linux-pm-owner

On Wednesday 01 Aug 2018 at 09:32:49 (+0200), Rafael J. Wysocki wrote:
> On Tue, Jul 31, 2018 at 9:31 PM,  <skannan@codeaurora.org> wrote:
> > On 2018-07-31 00:59, Quentin Perret wrote:
> >>
> >> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
> >> [...]
> >>>
> >>> If it's going to be a different aggregation from what's done for frequency
> >>> guidance, I don't see the point of having this inside schedutil. Why not
> >>> keep it inside the scheduler files?
> >>
> >>
> >> This code basically results from a discussion we had with Peter on v4.
> >> Keeping everything centralized can make sense from a maintenance
> >> perspective, I think. That makes it easy to see the impact of any change
> >> to utilization signals for both EAS and schedutil.
> >
> >
> > In that case, I'd argue it makes more sense to keep the code centralized in
> > the scheduler. The scheduler can let schedutil know about the utilization
> > after it aggregates them. There's no need for a cpufreq governor to know
> > that there are scheduling classes or how many there are. And the scheduler
> > can then choose to aggregate one way for task packing and another way for
> > frequency guidance.
> 
> Also the aggregate utilization may be used by cpuidle governors in
> principle to decide how deep they can go with idle state selection.

The only issue I see with this right now is that some of the things done
in this function are policy decisions which really belong to the governor,
I think. The RT-go-to-max-freq thing in particular. And I really don't
think EAS should cope with that, at least for now.

But if this specific bit is factored out of the aggregation function, I
suppose we could move it somewhere else. Maybe pelt.c ?

How ugly is something like the below (totally untested) code ? It would
change slightly how we deal with DL utilization in EAS but I don't think
this is an issue.

diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index af86050edcf5..51c9ac9f30e8 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -178,121 +178,17 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
        return cpufreq_driver_resolve_freq(policy, freq);
 }

-/*
- * This function computes an effective utilization for the given CPU, to be
- * used for frequency selection given the linear relation: f = u * f_max.
- *
- * The scheduler tracks the following metrics:
- *
- *   cpu_util_{cfs,rt,dl,irq}()
- *   cpu_bw_dl()
- *
- * Where the cfs,rt and dl util numbers are tracked with the same metric and
- * synchronized windows and are thus directly comparable.
- *
- * The cfs,rt,dl utilization are the running times measured with rq->clock_task
- * which excludes things like IRQ and steal-time. These latter are then accrued
- * in the irq utilization.
- *
- * The DL bandwidth number otoh is not a measured metric but a value computed
- * based on the task model parameters and gives the minimal utilization
- * required to meet deadlines.
- */
-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
-                                 enum schedutil_type type)
-{
-       struct rq *rq = cpu_rq(cpu);
-       unsigned long util, irq, max;
-
-       max = arch_scale_cpu_capacity(NULL, cpu);
-
-       if (type == frequency_util && rt_rq_is_runnable(&rq->rt))
-               return max;
-
-       /*
-        * Early check to see if IRQ/steal time saturates the CPU, can be
-        * because of inaccuracies in how we track these -- see
-        * update_irq_load_avg().
-        */
-       irq = cpu_util_irq(rq);
-       if (unlikely(irq >= max))
-               return max;
-
-       /*
-        * Because the time spend on RT/DL tasks is visible as 'lost' time to
-        * CFS tasks and we use the same metric to track the effective
-        * utilization (PELT windows are synchronized) we can directly add them
-        * to obtain the CPU's actual utilization.
-        */
-       util = util_cfs;
-       util += cpu_util_rt(rq);
-
-       if (type == frequency_util) {
-               /*
-                * For frequency selection we do not make cpu_util_dl() a
-                * permanent part of this sum because we want to use
-                * cpu_bw_dl() later on, but we need to check if the
-                * CFS+RT+DL sum is saturated (ie. no idle time) such
-                * that we select f_max when there is no idle time.
-                *
-                * NOTE: numerical errors or stop class might cause us
-                * to not quite hit saturation when we should --
-                * something for later.
-                */
-
-               if ((util + cpu_util_dl(rq)) >= max)
-                       return max;
-       } else {
-               /*
-                * OTOH, for energy computation we need the estimated
-                * running time, so include util_dl and ignore dl_bw.
-                */
-               util += cpu_util_dl(rq);
-               if (util >= max)
-                       return max;
-       }
-
-       /*
-        * There is still idle time; further improve the number by using the
-        * irq metric. Because IRQ/steal time is hidden from the task clock we
-        * need to scale the task numbers:
-        *
-        *              1 - irq
-        *   U' = irq + ------- * U
-        *                max
-        */
-       util *= (max - irq);
-       util /= max;
-       util += irq;
-
-       if (type == frequency_util) {
-               /*
-                * Bandwidth required by DEADLINE must always be granted
-                * while, for FAIR and RT, we use blocked utilization of
-                * IDLE CPUs as a mechanism to gracefully reduce the
-                * frequency when no tasks show up for longer periods of
-                * time.
-                *
-                * Ideally we would like to set bw_dl as min/guaranteed
-                * freq and util + bw_dl as requested freq. However,
-                * cpufreq is not yet ready for such an interface. So,
-                * we only do the latter for now.
-                */
-               util += cpu_bw_dl(rq);
-       }
-
-       return min(max, util);
-}
-
 static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
 {
        struct rq *rq = cpu_rq(sg_cpu->cpu);
-       unsigned long util = cpu_util_cfs(rq);

        sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
        sg_cpu->bw_dl = cpu_bw_dl(rq);

-       return schedutil_freq_util(sg_cpu->cpu, util, frequency_util);
+       if (rt_rq_is_runnable(&rq->rt))
+               return sg_cpu->max;
+
+       return cpu_util_total(sg_cpu->cpu, cpu_util_cfs(rq));
 }

 /**
diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
index 35475c0c5419..5f99bd564dfc 100644
--- a/kernel/sched/pelt.c
+++ b/kernel/sched/pelt.c
@@ -397,3 +397,77 @@ int update_irq_load_avg(struct rq *rq, u64 running)
        return ret;
 }
 #endif
+
+/*
+ * This function computes an effective utilization for the given CPU, to be
+ * used for frequency selection given the linear relation: f = u * f_max.
+ *
+ * The scheduler tracks the following metrics:
+ *
+ *   cpu_util_{cfs,rt,dl,irq}()
+ *   cpu_bw_dl()
+ *
+ * Where the cfs,rt and dl util numbers are tracked with the same metric and
+ * synchronized windows and are thus directly comparable.
+ *
+ * The cfs,rt,dl utilization are the running times measured with rq->clock_task
+ * which excludes things like IRQ and steal-time. These latter are then accrued
+ * in the irq utilization.
+ *
+ * The DL bandwidth number otoh is not a measured metric but a value computed
+ * based on the task model parameters and gives the minimal utilization
+ * required to meet deadlines.
+ */
+unsigned long cpu_util_total(int cpu, unsigned long util_cfs)
+{
+       struct rq *rq = cpu_rq(cpu);
+       unsigned long util, irq, max;
+
+       max = arch_scale_cpu_capacity(NULL, cpu);
+
+       /*
+        * Early check to see if IRQ/steal time saturates the CPU, can be
+        * because of inaccuracies in how we track these -- see
+        * update_irq_load_avg().
+        */
+       irq = cpu_util_irq(rq);
+       if (unlikely(irq >= max))
+               return max;
+
+       /*
+        * Because the time spend on RT/DL tasks is visible as 'lost' time to
+        * CFS tasks and we use the same metric to track the effective
+        * utilization (PELT windows are synchronized) we can directly add them
+        * to obtain the CPU's actual utilization.
+        */
+       util = util_cfs;
+       util += cpu_util_rt(rq);
+
+       if ((util + cpu_util_dl(rq)) >= max)
+               return max;
+
+       /*
+        * There is still idle time; further improve the number by using the
+        * irq metric. Because IRQ/steal time is hidden from the task clock we
+        * need to scale the task numbers:
+        *
+        *              1 - irq
+        *   U' = irq + ------- * U
+        *                max
+        */
+       util *= (max - irq);
+       util /= max;
+       util += irq;
+
+       /*
+        * Bandwidth required by DEADLINE must always be granted while, for
+        * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
+        * to gracefully reduce the frequency when no tasks show up for longer
+        * periods of time.
+        *
+        * Ideally we would like to set bw_dl as min/guaranteed freq and util +
+        * bw_dl as requested freq. However, cpufreq is not yet ready for such
+        * an interface. So, we only do the latter for now.
+        */
+       return min(max, util + cpu_bw_dl(rq));
+}
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 51e7f113ee23..7ad037bb653e 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2185,14 +2185,9 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
 # define arch_scale_freq_invariant()   false
 #endif

-enum schedutil_type {
-       frequency_util,
-       energy_util,
-};

-#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
-unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
-                                 enum schedutil_type type);
+#ifdef CONFIG_SMP
+unsigned long cpu_util_total(int cpu, unsigned long cfs_util);

 static inline unsigned long cpu_bw_dl(struct rq *rq)
 {
@@ -2233,12 +2228,6 @@ static inline unsigned long cpu_util_irq(struct rq *rq)
 }

 #endif
-#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
-static inline unsigned long schedutil_freq_util(int cpu, unsigned long util,
-                                 enum schedutil_type type)
-{
-       return util;
-}
 #endif

 #ifdef CONFIG_SMP


* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-08-01  8:23           ` Quentin Perret
@ 2018-08-01  8:35             ` Rafael J. Wysocki
  2018-08-01  9:23               ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Rafael J. Wysocki @ 2018-08-01  8:35 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Rafael J. Wysocki, Saravana Kannan, Peter Zijlstra,
	Rafael J. Wysocki, Linux Kernel Mailing List, Linux PM,
	Greg Kroah-Hartman, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Vincent Guittot, Thara Gopinath,
	Viresh Kumar, Todd Kjos, Joel Fernandes, Steve Muckle, adharmap,
	skannan, Pavan Kondeti, Juri Lelli, Eduardo Valentin,
	Srinivas Pandruvada, currojerez, Javi Merino, linux-pm-owner

On Wed, Aug 1, 2018 at 10:23 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> On Wednesday 01 Aug 2018 at 09:32:49 (+0200), Rafael J. Wysocki wrote:
>> On Tue, Jul 31, 2018 at 9:31 PM,  <skannan@codeaurora.org> wrote:
>> > On 2018-07-31 00:59, Quentin Perret wrote:
>> >>
>> >> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
>> >> [...]
>> >>>
>> >>> If it's going to be a different aggregation from what's done for
>> >>> frequency
>> >>> guidance, I don't see the point of having this inside schedutil. Why not
>> >>> keep it inside the scheduler files?
>> >>
>> >>
>> >> This code basically results from a discussion we had with Peter on v4.
>> >> Keeping everything centralized can make sense from a maintenance
>> >> perspective, I think. That makes it easy to see the impact of any change
>> >> to utilization signals for both EAS and schedutil.
>> >
>> >
>> > In that case, I'd argue it makes more sense to keep the code centralized in
>> > the scheduler. The scheduler can let schedutil know about the utilization
>> > after it aggregates them. There's no need for a cpufreq governor to know
>> > that there are scheduling classes or how many there are. And the scheduler
>> > can then choose to aggregate one way for task packing and another way for
>> > frequency guidance.
>>
>> Also the aggregate utilization may be used by cpuidle governors in
>> principle to decide how deep they can go with idle state selection.
>
> The only issue I see with this right now is that some of the things done
> in this function are policy decisions which really belong to the governor,
> I think.

Well, the scheduler makes policy decisions too, in quite a few places. :-)

The really important consideration here is whether or not there may be
multiple governors making different policy decisions in that respect.
If not, then where exactly the single policy decision is made doesn't
particularly matter IMO.

> The RT-go-to-max-freq thing in particular. And I really don't
> think EAS should cope with that, at least for now.
>
> But if this specific bit is factored out of the aggregation function, I
> suppose we could move it somewhere else. Maybe pelt.c ?
>
> How ugly is something like the below (totally untested) code ? It would
> change slightly how we deal with DL utilization in EAS but I don't think
> this is an issue.
>
> diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
> index af86050edcf5..51c9ac9f30e8 100644
> --- a/kernel/sched/cpufreq_schedutil.c
> +++ b/kernel/sched/cpufreq_schedutil.c
> @@ -178,121 +178,17 @@ static unsigned int get_next_freq(struct sugov_policy *sg_policy,
>         return cpufreq_driver_resolve_freq(policy, freq);
>  }
>
> -/*
> - * This function computes an effective utilization for the given CPU, to be
> - * used for frequency selection given the linear relation: f = u * f_max.
> - *
> - * The scheduler tracks the following metrics:
> - *
> - *   cpu_util_{cfs,rt,dl,irq}()
> - *   cpu_bw_dl()
> - *
> - * Where the cfs,rt and dl util numbers are tracked with the same metric and
> - * synchronized windows and are thus directly comparable.
> - *
> - * The cfs,rt,dl utilization are the running times measured with rq->clock_task
> - * which excludes things like IRQ and steal-time. These latter are then accrued
> - * in the irq utilization.
> - *
> - * The DL bandwidth number otoh is not a measured metric but a value computed
> - * based on the task model parameters and gives the minimal utilization
> - * required to meet deadlines.
> - */
> -unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> -                                 enum schedutil_type type)
> -{
> -       struct rq *rq = cpu_rq(cpu);
> -       unsigned long util, irq, max;
> -
> -       max = arch_scale_cpu_capacity(NULL, cpu);
> -
> -       if (type == frequency_util && rt_rq_is_runnable(&rq->rt))
> -               return max;
> -
> -       /*
> -        * Early check to see if IRQ/steal time saturates the CPU, can be
> -        * because of inaccuracies in how we track these -- see
> -        * update_irq_load_avg().
> -        */
> -       irq = cpu_util_irq(rq);
> -       if (unlikely(irq >= max))
> -               return max;
> -
> -       /*
> -        * Because the time spend on RT/DL tasks is visible as 'lost' time to
> -        * CFS tasks and we use the same metric to track the effective
> -        * utilization (PELT windows are synchronized) we can directly add them
> -        * to obtain the CPU's actual utilization.
> -        */
> -       util = util_cfs;
> -       util += cpu_util_rt(rq);
> -
> -       if (type == frequency_util) {
> -               /*
> -                * For frequency selection we do not make cpu_util_dl() a
> -                * permanent part of this sum because we want to use
> -                * cpu_bw_dl() later on, but we need to check if the
> -                * CFS+RT+DL sum is saturated (ie. no idle time) such
> -                * that we select f_max when there is no idle time.
> -                *
> -                * NOTE: numerical errors or stop class might cause us
> -                * to not quite hit saturation when we should --
> -                * something for later.
> -                */
> -
> -               if ((util + cpu_util_dl(rq)) >= max)
> -                       return max;
> -       } else {
> -               /*
> -                * OTOH, for energy computation we need the estimated
> -                * running time, so include util_dl and ignore dl_bw.
> -                */
> -               util += cpu_util_dl(rq);
> -               if (util >= max)
> -                       return max;
> -       }
> -
> -       /*
> -        * There is still idle time; further improve the number by using the
> -        * irq metric. Because IRQ/steal time is hidden from the task clock we
> -        * need to scale the task numbers:
> -        *
> -        *              1 - irq
> -        *   U' = irq + ------- * U
> -        *                max
> -        */
> -       util *= (max - irq);
> -       util /= max;
> -       util += irq;
> -
> -       if (type == frequency_util) {
> -               /*
> -                * Bandwidth required by DEADLINE must always be granted
> -                * while, for FAIR and RT, we use blocked utilization of
> -                * IDLE CPUs as a mechanism to gracefully reduce the
> -                * frequency when no tasks show up for longer periods of
> -                * time.
> -                *
> -                * Ideally we would like to set bw_dl as min/guaranteed
> -                * freq and util + bw_dl as requested freq. However,
> -                * cpufreq is not yet ready for such an interface. So,
> -                * we only do the latter for now.
> -                */
> -               util += cpu_bw_dl(rq);
> -       }
> -
> -       return min(max, util);
> -}
> -
>  static unsigned long sugov_get_util(struct sugov_cpu *sg_cpu)
>  {
>         struct rq *rq = cpu_rq(sg_cpu->cpu);
> -       unsigned long util = cpu_util_cfs(rq);
>
>         sg_cpu->max = arch_scale_cpu_capacity(NULL, sg_cpu->cpu);
>         sg_cpu->bw_dl = cpu_bw_dl(rq);
>
> -       return schedutil_freq_util(sg_cpu->cpu, util, frequency_util);
> +       if (rt_rq_is_runnable(&rq->rt))
> +               return sg_cpu->max;
> +
> +       return cpu_util_total(sg_cpu->cpu, cpu_util_cfs(rq));
>  }
>
>  /**
> diff --git a/kernel/sched/pelt.c b/kernel/sched/pelt.c
> index 35475c0c5419..5f99bd564dfc 100644
> --- a/kernel/sched/pelt.c
> +++ b/kernel/sched/pelt.c
> @@ -397,3 +397,77 @@ int update_irq_load_avg(struct rq *rq, u64 running)
>         return ret;
>  }
>  #endif
> +
> +/*
> + * This function computes an effective utilization for the given CPU, to be
> + * used for frequency selection given the linear relation: f = u * f_max.
> + *
> + * The scheduler tracks the following metrics:
> + *
> + *   cpu_util_{cfs,rt,dl,irq}()
> + *   cpu_bw_dl()
> + *
> + * Where the cfs,rt and dl util numbers are tracked with the same metric and
> + * synchronized windows and are thus directly comparable.
> + *
> + * The cfs,rt,dl utilization are the running times measured with rq->clock_task
> + * which excludes things like IRQ and steal-time. These latter are then accrued
> + * in the irq utilization.
> + *
> + * The DL bandwidth number otoh is not a measured metric but a value computed
> + * based on the task model parameters and gives the minimal utilization
> + * required to meet deadlines.
> + */
> +unsigned long cpu_util_total(int cpu, unsigned long util_cfs)
> +{
> +       struct rq *rq = cpu_rq(cpu);
> +       unsigned long util, irq, max;
> +
> +       max = arch_scale_cpu_capacity(NULL, cpu);
> +
> +       /*
> +        * Early check to see if IRQ/steal time saturates the CPU, can be
> +        * because of inaccuracies in how we track these -- see
> +        * update_irq_load_avg().
> +        */
> +       irq = cpu_util_irq(rq);
> +       if (unlikely(irq >= max))
> +               return max;
> +
> +       /*
> +        * Because the time spend on RT/DL tasks is visible as 'lost' time to
> +        * CFS tasks and we use the same metric to track the effective
> +        * utilization (PELT windows are synchronized) we can directly add them
> +        * to obtain the CPU's actual utilization.
> +        */
> +       util = util_cfs;
> +       util += cpu_util_rt(rq);
> +
> +       if ((util + cpu_util_dl(rq)) >= max)
> +               return max;
> +
> +       /*
> +        * There is still idle time; further improve the number by using the
> +        * irq metric. Because IRQ/steal time is hidden from the task clock we
> +        * need to scale the task numbers:
> +        *
> +        *              1 - irq
> +        *   U' = irq + ------- * U
> +        *                max
> +        */
> +       util *= (max - irq);
> +       util /= max;
> +       util += irq;
> +
> +       /*
> +        * Bandwidth required by DEADLINE must always be granted while, for
> +        * FAIR and RT, we use blocked utilization of IDLE CPUs as a mechanism
> +        * to gracefully reduce the frequency when no tasks show up for longer
> +        * periods of time.
> +        *
> +        * Ideally we would like to set bw_dl as min/guaranteed freq and util +
> +        * bw_dl as requested freq. However, cpufreq is not yet ready for such
> +        * an interface. So, we only do the latter for now.
> +        */
> +       return min(max, util + cpu_bw_dl(rq));
> +}
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index 51e7f113ee23..7ad037bb653e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -2185,14 +2185,9 @@ static inline void cpufreq_update_util(struct rq *rq, unsigned int flags) {}
>  # define arch_scale_freq_invariant()   false
>  #endif
>
> -enum schedutil_type {
> -       frequency_util,
> -       energy_util,
> -};
>
> -#ifdef CONFIG_CPU_FREQ_GOV_SCHEDUTIL
> -unsigned long schedutil_freq_util(int cpu, unsigned long util_cfs,
> -                                 enum schedutil_type type);
> +#ifdef CONFIG_SMP
> +unsigned long cpu_util_total(int cpu, unsigned long cfs_util);
>
>  static inline unsigned long cpu_bw_dl(struct rq *rq)
>  {
> @@ -2233,12 +2228,6 @@ static inline unsigned long cpu_util_irq(struct rq *rq)
>  }
>
>  #endif
> -#else /* CONFIG_CPU_FREQ_GOV_SCHEDUTIL */
> -static inline unsigned long schedutil_freq_util(int cpu, unsigned long util,
> -                                 enum schedutil_type type)
> -{
> -       return util;
> -}
>  #endif
>
>  #ifdef CONFIG_SMP

This doesn't look objectionable to me.
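
For readers who want to check the arithmetic of the cpu_util_total()
proposed in the quoted patch, here is a minimal userspace sketch of the
same aggregation. It is only a model: struct util_snapshot and
util_total_model() are hypothetical names, and the fields stand in for
what cpu_util_{cfs,rt,dl,irq}() and cpu_bw_dl() would return in the
kernel.

```c
#include <assert.h>

/*
 * Hypothetical snapshot of the per-CPU signals. In the patch these values
 * come from cpu_util_cfs(), cpu_util_rt(), cpu_util_dl(), cpu_util_irq()
 * and cpu_bw_dl(); here they are plain numbers chosen for illustration.
 */
struct util_snapshot {
	unsigned long cfs;
	unsigned long rt;
	unsigned long dl;
	unsigned long irq;
	unsigned long bw_dl;
};

/* Model of the aggregation steps, with 'max' as the CPU capacity. */
static unsigned long util_total_model(const struct util_snapshot *s,
				      unsigned long max)
{
	unsigned long util;

	assert(max > 0);	/* the scaling step divides by max */

	/* IRQ/steal time alone saturates the CPU. */
	if (s->irq >= max)
		return max;

	/* CFS and RT are tracked with the same metric, so they add up. */
	util = s->cfs + s->rt;

	/* No idle time left once measured DL utilization is counted. */
	if (util + s->dl >= max)
		return max;

	/* Scale the task numbers: U' = irq + (max - irq) / max * U */
	util *= (max - s->irq);
	util /= max;
	util += s->irq;

	/* DL contributes its bandwidth, not its measured utilization. */
	util += s->bw_dl;

	return util < max ? util : max;
}
```

With cfs=300, rt=100, dl=50, irq=128, bw_dl=100 and max=1024, the model
gives 400 * 896 / 1024 = 350, plus 128 for irq, plus 100 for the DL
bandwidth, so 578.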


* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-08-01  8:35             ` Rafael J. Wysocki
@ 2018-08-01  9:23               ` Quentin Perret
  2018-08-01  9:40                 ` Rafael J. Wysocki
  2018-08-02 13:04                 ` Peter Zijlstra
  0 siblings, 2 replies; 72+ messages in thread
From: Quentin Perret @ 2018-08-01  9:23 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Saravana Kannan, Peter Zijlstra, Rafael J. Wysocki,
	Linux Kernel Mailing List, Linux PM, Greg Kroah-Hartman,
	Ingo Molnar, Dietmar Eggemann, Morten Rasmussen, Chris Redpath,
	Patrick Bellasi, Valentin Schneider, Vincent Guittot,
	Thara Gopinath, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Steve Muckle, adharmap, skannan, Pavan Kondeti, Juri Lelli,
	Eduardo Valentin, Srinivas Pandruvada, currojerez, Javi Merino,
	linux-pm-owner

On Wednesday 01 Aug 2018 at 10:35:32 (+0200), Rafael J. Wysocki wrote:
> On Wed, Aug 1, 2018 at 10:23 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> > On Wednesday 01 Aug 2018 at 09:32:49 (+0200), Rafael J. Wysocki wrote:
> >> On Tue, Jul 31, 2018 at 9:31 PM,  <skannan@codeaurora.org> wrote:
> >> >> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
> >> >>> If it's going to be a different aggregation from what's done for
> >> >>> frequency
> >> >>> guidance, I don't see the point of having this inside schedutil. Why not
> >> >>> keep it inside the scheduler files?
> >> >>
> >> >> This code basically results from a discussion we had with Peter on v4.
> >> >> Keeping everything centralized can make sense from a maintenance
> >> >> perspective, I think. That makes it easy to see the impact of any change
> >> >> to utilization signals for both EAS and schedutil.
> >> >
> >> > In that case, I'd argue it makes more sense to keep the code centralized in
> >> > the scheduler. The scheduler can let schedutil know about the utilization
> >> > after it aggregates them. There's no need for a cpufreq governor to know
> >> > that there are scheduling classes or how many there are. And the scheduler
> >> > can then choose to aggregate one way for task packing and another way for
> >> > frequency guidance.
> >>
> >> Also the aggregate utilization may be used by cpuidle governors in
> >> principle to decide how deep they can go with idle state selection.
> >
> > The only issue I see with this right now is that some of the things done
> > in this function are policy decisions which really belong to the governor,
> > I think.
> 
> Well, the scheduler makes policy decisions too, in quite a few places. :-)

That is true ... ;-) But not so much about frequency selection yet, I guess.

> The really important consideration here is whether or not there may be
> multiple governors making different policy decisions in that respect.
> If not, then where exactly the single policy decision is made doesn't
> particularly matter IMO.

I think some users of the aggregated utilization signal do want to make
slightly different decisions (I'm thinking about the RT-go-to-max thing
again which makes perfect sense in sugov, but could possibly hurt EAS).

So the "hard" part of this work is to figure out what really is a
governor-specific policy decision, and what is common between all users.
I put "hard" between quotes because I only see the case of RT as truly
sugov-specific for now.

If we also want a special case for DL, Peter's enum should work OK, and
allow adding more special cases for new users (cpuidle ?) if needed.
But maybe that is something for later ?
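
To make the split concrete, a rough sketch of what I have in mind is
below. Nothing here is kernel code: struct cpu_signals, freq_util_model()
and energy_util_model() are made-up names, and common_util stands for
whatever the shared aggregation function would return.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical inputs; common_util is the shared aggregated signal. */
struct cpu_signals {
	unsigned long common_util;
	unsigned long max;
	bool rt_runnable;
};

/* Clamp the shared signal to the CPU capacity. */
static unsigned long clamp_to_max(const struct cpu_signals *c)
{
	assert(c->max > 0);
	return c->common_util < c->max ? c->common_util : c->max;
}

/* sugov-style user: the RT-go-to-max policy lives in the governor. */
static unsigned long freq_util_model(const struct cpu_signals *c)
{
	if (c->rt_runnable)
		return c->max;

	return clamp_to_max(c);
}

/* EAS-style user: same signal, no RT special case. */
static unsigned long energy_util_model(const struct cpu_signals *c)
{
	return clamp_to_max(c);
}
```

With common_util=200 on a CPU of capacity 1024 and an RT task runnable,
the frequency side asks for 1024 while the energy side still sees 200,
which is the difference I would like to keep out of the common code.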

Thanks,
Quentin


* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-08-01  9:23               ` Quentin Perret
@ 2018-08-01  9:40                 ` Rafael J. Wysocki
  2018-08-02 13:04                 ` Peter Zijlstra
  1 sibling, 0 replies; 72+ messages in thread
From: Rafael J. Wysocki @ 2018-08-01  9:40 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Rafael J. Wysocki, Saravana Kannan, Peter Zijlstra,
	Rafael J. Wysocki, Linux Kernel Mailing List, Linux PM,
	Greg Kroah-Hartman, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Vincent Guittot, Thara Gopinath,
	Viresh Kumar, Todd Kjos, Joel Fernandes, Steve Muckle, adharmap,
	skannan, Pavan Kondeti, Juri Lelli, Eduardo Valentin,
	Srinivas Pandruvada, currojerez, Javi Merino, linux-pm-owner

On Wed, Aug 1, 2018 at 11:23 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> On Wednesday 01 Aug 2018 at 10:35:32 (+0200), Rafael J. Wysocki wrote:
>> On Wed, Aug 1, 2018 at 10:23 AM, Quentin Perret <quentin.perret@arm.com> wrote:
>> > On Wednesday 01 Aug 2018 at 09:32:49 (+0200), Rafael J. Wysocki wrote:
>> >> On Tue, Jul 31, 2018 at 9:31 PM,  <skannan@codeaurora.org> wrote:
>> >> >> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
>> >> >>> If it's going to be a different aggregation from what's done for
>> >> >>> frequency
>> >> >>> guidance, I don't see the point of having this inside schedutil. Why not
>> >> >>> keep it inside the scheduler files?
>> >> >>
>> >> >> This code basically results from a discussion we had with Peter on v4.
>> >> >> Keeping everything centralized can make sense from a maintenance
>> >> >> perspective, I think. That makes it easy to see the impact of any change
>> >> >> to utilization signals for both EAS and schedutil.
>> >> >
>> >> > In that case, I'd argue it makes more sense to keep the code centralized in
>> >> > the scheduler. The scheduler can let schedutil know about the utilization
>> >> > after it aggregates them. There's no need for a cpufreq governor to know
>> >> > that there are scheduling classes or how many there are. And the scheduler
>> >> > can then choose to aggregate one way for task packing and another way for
>> >> > frequency guidance.
>> >>
>> >> Also the aggregate utilization may be used by cpuidle governors in
>> >> principle to decide how deep they can go with idle state selection.
>> >
>> > The only issue I see with this right now is that some of the things done
>> > in this function are policy decisions which really belong to the governor,
>> > I think.
>>
>> Well, the scheduler makes policy decisions too, in quite a few places. :-)
>
> That is true ... ;-) But not so much about frequency selection yet, I guess.
>
>> The really important consideration here is whether or not there may be
>> multiple governors making different policy decisions in that respect.
>> If not, then where exactly the single policy decision is made doesn't
>> particularly matter IMO.
>
> I think some users of the aggregated utilization signal do want to make
> slightly different decisions (I'm thinking about the RT-go-to-max thing
> again which makes perfect sense in sugov, but could possibly hurt EAS).

Fair enough.

> So the "hard" part of this work is to figure out what really is a
> governor-specific policy decision, and what is common between all users.
> I put "hard" between quotes because I only see the case of RT as truly
> sugov-specific for now.

OK

> If we also want a special case for DL, Peter's enum should work OK, and
> allow adding more special cases for new users (cpuidle ?) if needed.
> But maybe that is something for later ?

I agree.  That can be done later if need be.


* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-07-24 12:25 ` [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator Quentin Perret
@ 2018-08-02 12:26   ` Peter Zijlstra
  2018-08-02 13:03     ` Quentin Perret
  2018-08-09  9:30   ` Vincent Guittot
  1 sibling, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2018-08-02 12:26 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>  		update_cfs_group(se);
>  	}
>  
> -	if (!se)
> +	if (!se) {
>  		add_nr_running(rq, 1);
> +		/*
> +		 * The utilization of a new task is 'wrong' so wait for it
> +		 * to build some utilization history before trying to detect
> +		 * the overutilized flag.
> +		 */
> +		if (flags & ENQUEUE_WAKEUP)
> +			update_overutilized_status(rq);
> +
> +	}
>  
>  	hrtick_update(rq);
>  }

That is a somewhat dodgy hack. There is no guarantee whatsoever that
when the task wakes next its history is any better. The comment doesn't
reflect this I feel.
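
A toy model of the concern, under loud assumptions: the rq_model struct
and the MODEL_* constants below are invented for illustration, the flag
value only stands in for ENQUEUE_WAKEUP, the 1280 margin (roughly 20%
headroom) is assumed, and utilization is added at enqueue as a plain
number rather than tracked by PELT.

```c
#include <stdbool.h>

#define MODEL_ENQUEUE_WAKEUP	0x01	/* hypothetical stand-in flag */
#define MODEL_CAPACITY_MARGIN	1280	/* assumed ~20% headroom */

/* Hypothetical runqueue state: summed utilization and CPU capacity. */
struct rq_model {
	unsigned long util;
	unsigned long capacity;
	bool overutilized;
};

/* Over-utilized when utilization plus margin exceeds capacity. */
static bool cpu_overutilized_model(const struct rq_model *rq)
{
	return rq->capacity * 1024 < rq->util * MODEL_CAPACITY_MARGIN;
}

/* Enqueue a task, checking the tipping point only on wakeups. */
static void enqueue_model(struct rq_model *rq, unsigned long task_util,
			  int flags)
{
	rq->util += task_util;

	if (flags & MODEL_ENQUEUE_WAKEUP) {
		if (cpu_overutilized_model(rq))
			rq->overutilized = true;
	}
}
```

Start with util=400 on a CPU of capacity 1024 and enqueue a new task
with an inflated util of 512 and no wakeup flag: the check is skipped.
If any task then wakes on that CPU before the inflated number has
decayed, the flag is set anyway, so the check was deferred, not fixed.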


* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-07-30 19:35   ` skannan
  2018-07-31  7:59     ` Quentin Perret
@ 2018-08-02 12:33     ` Peter Zijlstra
  2018-08-02 12:45       ` Peter Zijlstra
  1 sibling, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2018-08-02 12:33 UTC (permalink / raw)
  To: skannan
  Cc: Quentin Perret, rjw, linux-kernel, linux-pm, gregkh, mingo,
	dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, linux-pm-owner

On Mon, Jul 30, 2018 at 12:35:27PM -0700, skannan@codeaurora.org wrote:
> On 2018-07-24 05:25, Quentin Perret wrote:
> If it's going to be a different aggregation from what's done for frequency
> guidance, I don't see the point of having this inside schedutil. Why not
> keep it inside the scheduler files? Also, it seems weird to use a governor's
> code when it might not actually be in use. What if someone is using
> ondemand, conservative, performance, etc?

EAS hard relies on schedutil -- I suppose we need a check for that
somewhere and maybe some infrastructure to pin the cpufreq governor.

We're simply not going to support it for anything else.

> > +enum schedutil_type {
> > +	frequency_util,
> > +	energy_util,
> > +};
> 
> Please don't use lower case for enums. It's extremely confusing.

How is that confusing?


* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-08-02 12:33     ` Peter Zijlstra
@ 2018-08-02 12:45       ` Peter Zijlstra
  2018-08-02 15:21         ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2018-08-02 12:45 UTC (permalink / raw)
  To: skannan
  Cc: Quentin Perret, rjw, linux-kernel, linux-pm, gregkh, mingo,
	dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, linux-pm-owner

On Thu, Aug 02, 2018 at 02:33:15PM +0200, Peter Zijlstra wrote:
> On Mon, Jul 30, 2018 at 12:35:27PM -0700, skannan@codeaurora.org wrote:
> > On 2018-07-24 05:25, Quentin Perret wrote:
> > If it's going to be a different aggregation from what's done for frequency
> > guidance, I don't see the point of having this inside schedutil. Why not
> > keep it inside the scheduler files? Also, it seems weird to use a governor's
> > code when it might not actually be in use. What if someone is using
> > ondemand, conservative, performance, etc?
> 
> EAS hard relies on schedutil -- I suppose we need a check for that
> somewhere and maybe some infrastructure to pin the cpufreq governor.

Either that or disable EAS when another governor is selected.

> We're simply not going to support it for anything else.

To clarify, it makes absolutely no sense whatsoever to attempt EAS
when the DVFS control is not coordinated.


* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 12:26   ` Peter Zijlstra
@ 2018-08-02 13:03     ` Quentin Perret
  2018-08-02 13:08       ` Peter Zijlstra
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 13:03 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thursday 02 Aug 2018 at 14:26:29 (+0200), Peter Zijlstra wrote:
> On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> > @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  		update_cfs_group(se);
> >  	}
> >  
> > -	if (!se)
> > +	if (!se) {
> >  		add_nr_running(rq, 1);
> > +		/*
> > +		 * The utilization of a new task is 'wrong' so wait for it
> > +		 * to build some utilization history before trying to detect
> > +		 * the overutilized flag.
> > +		 */
> > +		if (flags & ENQUEUE_WAKEUP)
> > +			update_overutilized_status(rq);
> > +
> > +	}
> >  
> >  	hrtick_update(rq);
> >  }
> 
> That is a somewhat dodgy hack. There is no guarantee whatsoever that
> when the task wakes next its history is any better. The comment doesn't
> reflect this I feel.

AFAICT the main use-case here is to avoid re-enabling the load balance
and ruining all the task placement because of a tiny task. I don't
really see how we can do that differently ...

Or am I missing something Morten ?

In the meantime, I can try to improve the comment at least :-)

Thanks,
Quentin


* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-08-01  9:23               ` Quentin Perret
  2018-08-01  9:40                 ` Rafael J. Wysocki
@ 2018-08-02 13:04                 ` Peter Zijlstra
  2018-08-02 15:39                   ` Quentin Perret
  1 sibling, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2018-08-02 13:04 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Rafael J. Wysocki, Saravana Kannan, Rafael J. Wysocki,
	Linux Kernel Mailing List, Linux PM, Greg Kroah-Hartman,
	Ingo Molnar, Dietmar Eggemann, Morten Rasmussen, Chris Redpath,
	Patrick Bellasi, Valentin Schneider, Vincent Guittot,
	Thara Gopinath, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Steve Muckle, adharmap, skannan, Pavan Kondeti, Juri Lelli,
	Eduardo Valentin, Srinivas Pandruvada, currojerez, Javi Merino,
	linux-pm-owner

On Wed, Aug 01, 2018 at 10:23:27AM +0100, Quentin Perret wrote:
> On Wednesday 01 Aug 2018 at 10:35:32 (+0200), Rafael J. Wysocki wrote:
> > On Wed, Aug 1, 2018 at 10:23 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> > > On Wednesday 01 Aug 2018 at 09:32:49 (+0200), Rafael J. Wysocki wrote:
> > >> On Tue, Jul 31, 2018 at 9:31 PM,  <skannan@codeaurora.org> wrote:
> > >> >> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
> > >> >>> If it's going to be a different aggregation from what's done for
> > >> >>> frequency
> > >> >>> guidance, I don't see the point of having this inside schedutil. Why not
> > >> >>> keep it inside the scheduler files?
> > >> >>
> > >> >> This code basically results from a discussion we had with Peter on v4.
> > >> >> Keeping everything centralized can make sense from a maintenance
> > >> >> perspective, I think. That makes it easy to see the impact of any change
> > >> >> to utilization signals for both EAS and schedutil.
> > >> >
> > >> > In that case, I'd argue it makes more sense to keep the code centralized in
> > >> > the scheduler. The scheduler can let schedutil know about the utilization
> > >> > after it aggregates them. There's no need for a cpufreq governor to know
> > >> > that there are scheduling classes or how many there are. And the scheduler
> > >> > can then choose to aggregate one way for task packing and another way for
> > >> > frequency guidance.
> > >>
> > >> Also the aggregate utilization may be used by cpuidle governors in
> > >> principle to decide how deep they can go with idle state selection.
> > >
> > > The only issue I see with this right now is that some of the things done
> > > in this function are policy decisions which really belong to the governor,
> > > I think.
> > 
> > Well, the scheduler makes policy decisions too, in quite a few places. :-)
> 
> That is true ... ;-) But not so much about frequency selection yet I guess

Well, sugov is part of the scheduler :-) It being so allows for the
co-ordinated decision making required for EAS.

> > The really important consideration here is whether or not there may be
> > multiple governors making different policy decisions in that respect.
> > If not, then where exactly the single policy decision is made doesn't
> > particularly matter IMO.
> 
> I think some users of the aggregated utilization signal do want to make
> slightly different decisions (I'm thinking about the RT-go-to-max thing
> again which makes perfect sense in sugov, but could possibly hurt EAS).
> 
> So the "hard" part of this work is to figure out what really is a
> governor-specific policy decision, and what is common between all users.
> I put "hard" between quotes because I only see the case of RT as truly
> sugov-specific for now.
> 
> If we also want a special case for DL, Peter's enum should work OK, and
> enable to add more special cases for new users (cpuidle ?) if needed.
> But maybe that is something for later ?

Right, I don't mind moving the function. What I do oppose is having two
very similar functions in different translation units -- because then
they _will_ diverge and result in 'funny' things.
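
To illustrate the "one function, several users" idea, here is a minimal
userspace sketch in C. It is not the code from the series: the names
(enum util_type, struct cpu_util, aggregate_util) are hypothetical, and
the only caller-specific policy modelled is the RT-go-to-max rule
mentioned earlier in the thread.

```c
#include <assert.h>

/* Hypothetical names for this sketch; not the kernel's actual API. */
enum util_type {
	UTIL_FREQUENCY,	/* caller selects a frequency (sugov-like) */
	UTIL_ENERGY,	/* caller estimates energy (EAS-like) */
};

struct cpu_util {
	unsigned long cfs;	/* utilization of fair tasks */
	unsigned long rt;	/* utilization of RT tasks */
	unsigned long dl;	/* utilization of deadline tasks */
	int rt_runnable;	/* non-zero if an RT task is runnable */
};

/*
 * Single aggregation helper shared by every user. The caller-specific
 * part is confined to one branch: a runnable RT task requests the
 * maximum capacity for frequency selection, but not for energy
 * estimation.
 */
static unsigned long aggregate_util(const struct cpu_util *u,
				    unsigned long max_cap,
				    enum util_type type)
{
	unsigned long util;

	assert(max_cap > 0);

	if (type == UTIL_FREQUENCY && u->rt_runnable)
		return max_cap;

	/* Common path: sum the per-class signals and clamp to capacity. */
	util = u->cfs + u->rt + u->dl;
	return util < max_cap ? util : max_cap;
}
```

Keeping the policy difference behind an enum argument means any change
to the utilization signals is made once and is visible to both callers,
which is the divergence concern raised above.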


* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 13:03     ` Quentin Perret
@ 2018-08-02 13:08       ` Peter Zijlstra
  2018-08-02 13:18         ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2018-08-02 13:08 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thu, Aug 02, 2018 at 02:03:38PM +0100, Quentin Perret wrote:
> On Thursday 02 Aug 2018 at 14:26:29 (+0200), Peter Zijlstra wrote:
> > On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> > > @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > >  		update_cfs_group(se);
> > >  	}
> > >  
> > > -	if (!se)
> > > +	if (!se) {
> > >  		add_nr_running(rq, 1);
> > > +		/*
> > > +		 * The utilization of a new task is 'wrong' so wait for it
> > > +		 * to build some utilization history before trying to detect
> > > +		 * the overutilized flag.
> > > +		 */
> > > +		if (flags & ENQUEUE_WAKEUP)
> > > +			update_overutilized_status(rq);
> > > +
> > > +	}
> > >  
> > >  	hrtick_update(rq);
> > >  }
> > 
> > That is a somewhat dodgy hack. There is no guarantee whatsoever that
> > when the task wakes next its history is any better. The comment doesn't
> > reflect this I feel.
> 
> AFAICT the main use-case here is to avoid re-enabling the load balance
> and ruining all the task placement because of a tiny task. I don't
> really see how we can do that differently ...

Sure I realize that.. but it doesn't completely avoid it. Suppose this
new task instantly blocks and wakes up again. Then its util signal will
be exactly what you didn't want but we'll account it and cause the above
scenario you wanted to avoid.

Now, I suppose in practice it works well enough.

The alternative is trying to track when a running signal has converged,
but that's not a simple problem either I suppose.
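
The block-and-wake scenario above can be reproduced with a toy model.
This is only a sketch under stated assumptions: struct toy_rq and the
toy_* helpers are made up for illustration, the flag value is arbitrary,
and the margin of 1280 (about 20% of headroom) is an assumed constant.

```c
#include <stdbool.h>

#define ENQUEUE_WAKEUP	0x01	/* flag value chosen for this sketch */
#define CAPACITY	1024UL	/* capacity of the toy CPU */
#define MARGIN		1280UL	/* ~20% headroom; an assumed value */

struct toy_rq {
	unsigned long util;	/* summed utilization of enqueued tasks */
	bool overutilized;	/* sticky flag, only ever set here */
};

/* True once utilization exceeds the capacity minus the headroom. */
static bool cpu_overutilized(const struct toy_rq *rq)
{
	return CAPACITY * 1024 < rq->util * MARGIN;
}

/* Same shape as the quoted hunk: only wake-ups update the flag. */
static void toy_enqueue(struct toy_rq *rq, unsigned long task_util, int flags)
{
	rq->util += task_util;
	if (flags & ENQUEUE_WAKEUP)
		rq->overutilized = rq->overutilized || cpu_overutilized(rq);
}

static void toy_dequeue(struct toy_rq *rq, unsigned long task_util)
{
	rq->util -= task_util;
}
```

With a runqueue at 500 and a new task whose initial utilization is 400,
the first enqueue (flags == 0) leaves the flag clear. If the task then
blocks and wakes immediately, its utilization is still 400, and the
second enqueue with ENQUEUE_WAKEUP sets the flag.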


* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 13:08       ` Peter Zijlstra
@ 2018-08-02 13:18         ` Quentin Perret
  2018-08-02 13:48           ` Vincent Guittot
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 13:18 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thursday 02 Aug 2018 at 15:08:01 (+0200), Peter Zijlstra wrote:
> On Thu, Aug 02, 2018 at 02:03:38PM +0100, Quentin Perret wrote:
> > On Thursday 02 Aug 2018 at 14:26:29 (+0200), Peter Zijlstra wrote:
> > > On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> > > > @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > >  		update_cfs_group(se);
> > > >  	}
> > > >  
> > > > -	if (!se)
> > > > +	if (!se) {
> > > >  		add_nr_running(rq, 1);
> > > > +		/*
> > > > +		 * The utilization of a new task is 'wrong' so wait for it
> > > > +		 * to build some utilization history before trying to detect
> > > > +		 * the overutilized flag.
> > > > +		 */
> > > > +		if (flags & ENQUEUE_WAKEUP)
> > > > +			update_overutilized_status(rq);
> > > > +
> > > > +	}
> > > >  
> > > >  	hrtick_update(rq);
> > > >  }
> > > 
> > > That is a somewhat dodgy hack. There is no guarantee whatsoever that
> > > when the task wakes next its history is any better. The comment doesn't
> > > reflect this I feel.
> > 
> > AFAICT the main use-case here is to avoid re-enabling the load balance
> > and ruining all the task placement because of a tiny task. I don't
> > really see how we can do that differently ...
> 
> Sure I realize that.. but it doesn't completely avoid it. Suppose this
> new task instantly blocks and wakes up again. Then its util signal will
> be exactly what you didn't want but we'll account it and cause the above
> scenario you wanted to avoid.

That is true. ... I also realize now that this patch was written long
before util_est, and that also has an impact here, especially in the
scenario you described where the task blocks. So any wake-up after the
first enqueue risks marking the system overutilized, even if the task
blocked for ages.

Hmm ...


* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 13:18         ` Quentin Perret
@ 2018-08-02 13:48           ` Vincent Guittot
  2018-08-02 14:14             ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-02 13:48 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thu, 2 Aug 2018 at 15:19, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 15:08:01 (+0200), Peter Zijlstra wrote:
> > On Thu, Aug 02, 2018 at 02:03:38PM +0100, Quentin Perret wrote:
> > > On Thursday 02 Aug 2018 at 14:26:29 (+0200), Peter Zijlstra wrote:
> > > > On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> > > > > @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > >                 update_cfs_group(se);
> > > > >         }
> > > > >
> > > > > -       if (!se)
> > > > > +       if (!se) {
> > > > >                 add_nr_running(rq, 1);
> > > > > +               /*
> > > > > +                * The utilization of a new task is 'wrong' so wait for it
> > > > > +                * to build some utilization history before trying to detect
> > > > > +                * the overutilized flag.
> > > > > +                */
> > > > > +               if (flags & ENQUEUE_WAKEUP)
> > > > > +                       update_overutilized_status(rq);
> > > > > +
> > > > > +       }
> > > > >
> > > > >         hrtick_update(rq);
> > > > >  }
> > > >
> > > > That is a somewhat dodgy hack. There is no guarantee whatsoever that
> > > > when the task wakes next its history is any better. The comment doesn't
> > > > reflect this I feel.
> > >
> > > AFAICT the main use-case here is to avoid re-enabling the load balance
> > > and ruining all the task placement because of a tiny task. I don't
> > > really see how we can do that differently ...
> >
> > Sure I realize that.. but it doesn't completely avoid it. Suppose this
> > new task instantly blocks and wakes up again. Then its util signal will
> > be exactly what you didn't want but we'll account it and cause the above
> > scenario you wanted to avoid.
>
> That is true. ... I also realize now that this patch was written long
> before util_est, and that also has an impact here, especially in the
> scenario you described where the task blocks. So any wake-up after the
> first enqueue risks marking the system overutilized, even if the task
> blocked for ages.
>
> Hmm ...

Would setting the initial value of util_avg to 0 for newly created
tasks help EAS in this case?
The current initial value is computed to prevent packing newly created
tasks on the same CPUs, because that hurts the performance of some
benchmarks. In fact, it somehow assumes that a newly created task will
use a significant part of the remaining capacity of a CPU, and so wants
to spread tasks. In the EAS case, it seems preferable to assume that
newly created tasks are small, so we can pack them and wait a bit to
see whether a new task turns out to be a big task that will overload
the CPU.
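
The two assumptions can be contrasted in a few lines of C. This is a
simplified model, not the kernel code: init_task_util() and enum
init_policy are hypothetical, and "half of the remaining capacity" is an
assumed stand-in for "a significant part of the remaining capacity".

```c
#define SCHED_CAPACITY_SCALE	1024L

/* Hypothetical policies for this sketch. */
enum init_policy {
	INIT_SPREAD,	/* assume the new task is big: favours spreading */
	INIT_PACK,	/* assume the new task is small: favours packing */
};

/*
 * Initial utilization given to a newly created task, as a function of
 * the utilization already present on the runqueue it is attached to.
 */
static long init_task_util(long rq_util, enum init_policy policy)
{
	long cap;

	if (policy == INIT_PACK)
		return 0;

	/* Assumed heuristic: half of whatever capacity is left. */
	cap = (SCHED_CAPACITY_SCALE - rq_util) / 2;
	return cap > 0 ? cap : 0;
}
```

On an idle CPU the spreading policy starts a task at 512, which alone
is a large contribution towards any overutilization threshold, whereas
the packing policy starts it at 0 and lets the signal build up from
actual run time.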


* Re: [PATCH v5 12/14] sched/fair: Select an energy-efficient CPU on task wake-up
  2018-07-24 12:25 ` [PATCH v5 12/14] sched/fair: Select an energy-efficient CPU on task wake-up Quentin Perret
@ 2018-08-02 13:54   ` Peter Zijlstra
  2018-08-02 16:21     ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2018-08-02 13:54 UTC (permalink / raw)
  To: Quentin Perret
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Tue, Jul 24, 2018 at 01:25:19PM +0100, Quentin Perret wrote:
> @@ -6385,18 +6492,26 @@ static int
>  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
>  {
>  	struct sched_domain *tmp, *sd = NULL;
> +	struct freq_domain *fd;
>  	int cpu = smp_processor_id();
>  	int new_cpu = prev_cpu;
> -	int want_affine = 0;
> +	int want_affine = 0, want_energy = 0;
>  	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
>  
> +	rcu_read_lock();
>  	if (sd_flag & SD_BALANCE_WAKE) {
>  		record_wakee(p);
> -		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> -			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
> +		fd = rd_freq_domain(cpu_rq(cpu)->rd);
> +		want_energy = fd && !READ_ONCE(cpu_rq(cpu)->rd->overutilized);
> +		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
> +			      cpumask_test_cpu(cpu, &p->cpus_allowed);
> +	}
> +
> +	if (want_energy) {
> +		new_cpu = find_energy_efficient_cpu(p, prev_cpu, fd);
> +		goto unlock;
>  	}
>  

And I suppose you rely on the compiler to optimize that for the static
key inside rd_freq_domain()... Does it do a good job of that?

That is, would not something like:


	rcu_read_lock();
	if (sd_flag & SD_BALANCE_WAKE) {
		record_wakee(p);

		if (static_branch_unlikely(&sched_energy_present)) {
			struct root_domain *rd = cpu_rq(cpu)->rd;
			struct freq_domain *fd = rd_freq_domain(rd);

			if (fd && !READ_ONCE(rd->overutilized)) {
				new_cpu = find_energy_efficient_cpu(p, prev_cpu, fd);
				goto unlock;
			}
		}

		/* ... */
	}


Be far more clear ?


* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 13:48           ` Vincent Guittot
@ 2018-08-02 14:14             ` Quentin Perret
  2018-08-02 15:14               ` Vincent Guittot
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 14:14 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thursday 02 Aug 2018 at 15:48:01 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 15:19, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Thursday 02 Aug 2018 at 15:08:01 (+0200), Peter Zijlstra wrote:
> > > On Thu, Aug 02, 2018 at 02:03:38PM +0100, Quentin Perret wrote:
> > > > On Thursday 02 Aug 2018 at 14:26:29 (+0200), Peter Zijlstra wrote:
> > > > > On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> > > > > > @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > > >                 update_cfs_group(se);
> > > > > >         }
> > > > > >
> > > > > > -       if (!se)
> > > > > > +       if (!se) {
> > > > > >                 add_nr_running(rq, 1);
> > > > > > +               /*
> > > > > > +                * The utilization of a new task is 'wrong' so wait for it
> > > > > > +                * to build some utilization history before trying to detect
> > > > > > +                * the overutilized flag.
> > > > > > +                */
> > > > > > +               if (flags & ENQUEUE_WAKEUP)
> > > > > > +                       update_overutilized_status(rq);
> > > > > > +
> > > > > > +       }
> > > > > >
> > > > > >         hrtick_update(rq);
> > > > > >  }
> > > > >
> > > > > That is a somewhat dodgy hack. There is no guarantee whatsoever that
> > > > > when the task wakes next its history is any better. The comment doesn't
> > > > > reflect this I feel.
> > > >
> > > > AFAICT the main use-case here is to avoid re-enabling the load balance
> > > > and ruining all the task placement because of a tiny task. I don't
> > > > really see how we can do that differently ...
> > >
> > > Sure I realize that.. but it doesn't completely avoid it. Suppose this
> > > new task instantly blocks and wakes up again. Then its util signal will
> > > be exactly what you didn't want but we'll account it and cause the above
> > > scenario you wanted to avoid.
> >
> > That is true. ... I also realize now that this patch was written long
> > before util_est, and that also has an impact here, especially in the
> > scenario you described where the task blocks. So any wake-up after the
> > first enqueue risks marking the system overutilized, even if the task
> > blocked for ages.
> >
> > Hmm ...
> 
> Would setting the initial value of util_avg to 0 for newly created
> tasks help EAS in this case?
> The current initial value is computed to prevent packing newly created
> tasks on the same CPUs, because that hurts the performance of some
> benchmarks. In fact, it somehow assumes that a newly created task will
> use a significant part of the remaining capacity of a CPU, and so wants
> to spread tasks. In the EAS case, it seems preferable to assume that
> newly created tasks are small, so we can pack them and wait a bit to
> see whether a new task turns out to be a big task that will overload
> the CPU.

Good point, setting the util_avg to 0 for new tasks should help
filtering out those tiny tasks too. And that would match with the idea
of letting tasks build their history before looking at their util_avg ...

But there is one difference w.r.t frequency selection. The current code
won't mark the system overutilized, but will let sugov raise the
frequency when a new task is enqueued. So in case of a fork bomb, we
sort of fall back on the existing mainline strategy for both task
placement (because forkees don't go in find_energy_efficient_cpu) and
frequency selection. And I would argue this is the right thing to do
since EAS can't really help in this case.

Thoughts ?

Thanks,
Quentin


* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 14:14             ` Quentin Perret
@ 2018-08-02 15:14               ` Vincent Guittot
  2018-08-02 15:30                 ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-02 15:14 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 15:48:01 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 15:19, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Thursday 02 Aug 2018 at 15:08:01 (+0200), Peter Zijlstra wrote:
> > > > On Thu, Aug 02, 2018 at 02:03:38PM +0100, Quentin Perret wrote:
> > > > > On Thursday 02 Aug 2018 at 14:26:29 (+0200), Peter Zijlstra wrote:
> > > > > > On Tue, Jul 24, 2018 at 01:25:16PM +0100, Quentin Perret wrote:
> > > > > > > @@ -5100,8 +5118,17 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> > > > > > >                 update_cfs_group(se);
> > > > > > >         }
> > > > > > >
> > > > > > > -       if (!se)
> > > > > > > +       if (!se) {
> > > > > > >                 add_nr_running(rq, 1);
> > > > > > > +               /*
> > > > > > > +                * The utilization of a new task is 'wrong' so wait for it
> > > > > > > +                * to build some utilization history before trying to detect
> > > > > > > +                * the overutilized flag.
> > > > > > > +                */
> > > > > > > +               if (flags & ENQUEUE_WAKEUP)
> > > > > > > +                       update_overutilized_status(rq);
> > > > > > > +
> > > > > > > +       }
> > > > > > >
> > > > > > >         hrtick_update(rq);
> > > > > > >  }
> > > > > >
> > > > > > That is a somewhat dodgy hack. There is no guarantee whatsoever that
> > > > > > when the task wakes next its history is any better. The comment doesn't
> > > > > > reflect this I feel.
> > > > >
> > > > > AFAICT the main use-case here is to avoid re-enabling the load balance
> > > > > and ruining all the task placement because of a tiny task. I don't
> > > > > really see how we can do that differently ...
> > > >
> > > > Sure I realize that.. but it doesn't completely avoid it. Suppose this
> > > > new task instantly blocks and wakes up again. Then its util signal will
> > > > be exactly what you didn't want but we'll account it and cause the above
> > > > scenario you wanted to avoid.
> > >
> > > That is true. ... I also realize now that this patch was written long
> > > before util_est, and that also has an impact here, especially in the
> > > scenario you described where the task blocks. So any wake-up after the
> > > first enqueue risks marking the system overutilized, even if the task
> > > blocked for ages.
> > >
> > > Hmm ...
> >
> > Would setting the initial value of util_avg to 0 for newly created
> > tasks help EAS in this case?
> > The current initial value is computed to prevent packing newly created
> > tasks on the same CPUs, because that hurts the performance of some
> > benchmarks. In fact, it somehow assumes that a newly created task will
> > use a significant part of the remaining capacity of a CPU, and so wants
> > to spread tasks. In the EAS case, it seems preferable to assume that
> > newly created tasks are small, so we can pack them and wait a bit to
> > see whether a new task turns out to be a big task that will overload
> > the CPU.
>
> Good point, setting the util_avg to 0 for new tasks should help
> filtering out those tiny tasks too. And that would match with the idea
> of letting tasks build their history before looking at their util_avg ...
>
> But there is one difference w.r.t frequency selection. The current code
> won't mark the system overutilized, but will let sugov raise the
> frequency when a new task is enqueued. So in case of a fork bomb, we

If the initial value of util_avg is 0, it should not have any impact
on the util_avg of the cfs rq to which the task is attached, should it?
So this should impact neither the overutilization state nor the
frequency selected by sugov, or am I missing something?
Then, select_task_rq_fair is called for a new task, but util_avg is
still 0 at that time in the current code, so the util_avg of the new
task will be consistent before and after calling
find_energy_efficient_cpu.

> sort of fall back on the existing mainline strategy for both task
> placement (because forkees don't go in find_energy_efficient_cpu) and
> frequency selection. And I would argue this is the right thing to do
> since EAS can't really help in this case.
>
> Thoughts ?
>
> Thanks,
> Quentin


* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-08-02 12:45       ` Peter Zijlstra
@ 2018-08-02 15:21         ` Quentin Perret
  2018-08-02 17:36           ` Peter Zijlstra
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 15:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: skannan, rjw, linux-kernel, linux-pm, gregkh, mingo,
	dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, linux-pm-owner

On Thursday 02 Aug 2018 at 14:45:11 (+0200), Peter Zijlstra wrote:
> On Thu, Aug 02, 2018 at 02:33:15PM +0200, Peter Zijlstra wrote:
> > On Mon, Jul 30, 2018 at 12:35:27PM -0700, skannan@codeaurora.org wrote:
> > > On 2018-07-24 05:25, Quentin Perret wrote:
> > > If it's going to be a different aggregation from what's done for frequency
> > > guidance, I don't see the point of having this inside schedutil. Why not
> > > keep it inside the scheduler files? Also, it seems weird to use a governor's
> > > code when it might not actually be in use. What if someone is using
> > > ondemand, conservative, performance, etc?
> > 
> > EAS hard relies on schedutil -- I suppose we need a check for that
> > somewhere and maybe some infrastructure to pin the cpufreq governor.
> 
> Either that or disable EAS when another governor is selected.
> 
> > We're simply not going to support it for anything else.
> 
> To clarify, it makes absolutely no sense whatsoever to attempt EAS
> when the DVFS control is not coordinated.

I tend to agree with that, but at the same time even if we create a very
strong dependency on schedutil, we will have no guarantee that the actual
frequencies used on the platform are the ones we predicted in EAS.

There are a number of reasons why a frequency request might not be served
(throttling, thermal capping, something HW-related, ...), so it's hard
to enforce the EAS model in practice.

The way I see things, EAS needs to assume that OPPs follow utilization.
Sugov does something that looks like that too, and it's also in the
scheduler, so it makes sense to try and factorize things, especially
for maintenance purposes. But I feel like the correlation between the two
could stop here.

If you use some sort of HW governor that tries to always have some idle time
on the CPUs, then the assumption that OPPs follow utilization isn't _totally_
wrong. There should be a (loose) relation between what EAS 'thinks'
and the reality. And if this isn't true, then you might make slightly
sub-optimal decisions, but I'm not sure if there is anything we can do
about it :/

The scheduler works with various models which, by definition, don't
always perfectly reflect the reality. But those models are useful
because they enable us to reason about things and make decisions. EAS uses
a model where OPPs follow utilization. I think it's just another model
to add to the list, and we can't really enforce it strictly in practice
anyway, so we will have to live with its inaccuracies I suppose ...

I hope that makes sense :-)

Thanks,
Quentin
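
As a concrete reading of "OPPs follow utilization", the sketch below
picks the lowest operating point able to serve a given utilization and
derives an energy estimate from it. Everything here is an assumption
made for illustration: struct opp, opp_for_util() and estimate_energy()
are hypothetical names, the table must be sorted by increasing capacity,
and the energy formula (power scaled by the busy ratio) is a simplified
model rather than the one used by the series.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical operating performance point for this sketch. */
struct opp {
	unsigned long capacity;	/* compute capacity at this OPP */
	unsigned long power;	/* power when fully busy at this OPP */
};

/*
 * Model assumption: the platform runs at the lowest OPP whose capacity
 * covers the utilization, or at the highest OPP if none does.
 * The table must be sorted by increasing capacity.
 */
static const struct opp *opp_for_util(const struct opp *table, size_t n,
				      unsigned long util)
{
	size_t i;

	assert(n > 0);

	for (i = 0; i < n; i++)
		if (table[i].capacity >= util)
			return &table[i];

	return &table[n - 1];
}

/* Energy estimate: power at the predicted OPP scaled by the busy ratio. */
static unsigned long estimate_energy(const struct opp *table, size_t n,
				     unsigned long util)
{
	const struct opp *o = opp_for_util(table, n, util);

	return o->power * util / o->capacity;
}
```

If the hardware serves a different frequency than the one predicted
(thermal capping, for instance), the estimate is off by the difference
between the two OPPs, which is the inaccuracy discussed above.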


* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 15:14               ` Vincent Guittot
@ 2018-08-02 15:30                 ` Quentin Perret
  2018-08-02 15:55                   ` Vincent Guittot
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 15:30 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > Good point, setting the util_avg to 0 for new tasks should help
> > filtering out those tiny tasks too. And that would match with the idea
> > of letting tasks build their history before looking at their util_avg ...
> >
> > But there is one difference w.r.t frequency selection. The current code
> > won't mark the system overutilized, but will let sugov raise the
> > frequency when a new task is enqueued. So in case of a fork bomb, we
> 
> If the initial value of util_avg is 0, it should not have any impact
> on the util_avg of the cfs rq to which the task is attached, should it?
> So this should impact neither the overutilization state nor the
> frequency selected by sugov, or am I missing something?

What I tried to say is that setting util_avg to 0 for new tasks will
prevent schedutil from raising the frequency in case of a fork bomb, and
I think that could be an issue. And I think this isn't an issue with the
patch as-is ...

Sorry if that wasn't clear

> Then, select_task_rq_fair is called for a new task, but util_avg is
> still 0 at that time in the current code, so the util_avg of the new
> task will be consistent before and after calling
> find_energy_efficient_cpu.

New tasks don't go in find_energy_efficient_cpu(), because, as you said,
they have no consistent util_avg yet when select_task_rq_fair() is called
for the first time.

Thanks,
Quentin


* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-08-02 13:04                 ` Peter Zijlstra
@ 2018-08-02 15:39                   ` Quentin Perret
  2018-08-03 13:04                     ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 15:39 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Saravana Kannan, Rafael J. Wysocki,
	Linux Kernel Mailing List, Linux PM, Greg Kroah-Hartman,
	Ingo Molnar, Dietmar Eggemann, Morten Rasmussen, Chris Redpath,
	Patrick Bellasi, Valentin Schneider, Vincent Guittot,
	Thara Gopinath, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Steve Muckle, adharmap, skannan, Pavan Kondeti, Juri Lelli,
	Eduardo Valentin, Srinivas Pandruvada, currojerez, Javi Merino,
	linux-pm-owner

On Thursday 02 Aug 2018 at 15:04:40 (+0200), Peter Zijlstra wrote:
> On Wed, Aug 01, 2018 at 10:23:27AM +0100, Quentin Perret wrote:
> > On Wednesday 01 Aug 2018 at 10:35:32 (+0200), Rafael J. Wysocki wrote:
> > > On Wed, Aug 1, 2018 at 10:23 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > On Wednesday 01 Aug 2018 at 09:32:49 (+0200), Rafael J. Wysocki wrote:
> > > >> On Tue, Jul 31, 2018 at 9:31 PM,  <skannan@codeaurora.org> wrote:
> > > >> >> On Monday 30 Jul 2018 at 12:35:27 (-0700), skannan@codeaurora.org wrote:
> > > >> >>> If it's going to be a different aggregation from what's done for
> > > >> >>> frequency
> > > >> >>> guidance, I don't see the point of having this inside schedutil. Why not
> > > >> >>> keep it inside the scheduler files?
> > > >> >>
> > > >> >> This code basically results from a discussion we had with Peter on v4.
> > > >> >> Keeping everything centralized can make sense from a maintenance
> > > >> >> perspective, I think. That makes it easy to see the impact of any change
> > > >> >> to utilization signals for both EAS and schedutil.
> > > >> >
> > > >> > In that case, I'd argue it makes more sense to keep the code centralized in
> > > >> > the scheduler. The scheduler can let schedutil know about the utilization
> > > >> > after it aggregates them. There's no need for a cpufreq governor to know
> > > >> > that there are scheduling classes or how many there are. And the scheduler
> > > >> > can then choose to aggregate one way for task packing and another way for
> > > >> > frequency guidance.
> > > >>
> > > >> Also the aggregate utilization may be used by cpuidle governors in
> > > >> principle to decide how deep they can go with idle state selection.
> > > >
> > > > The only issue I see with this right now is that some of the things done
> > > > in this function are policy decisions which really belong to the governor,
> > > > I think.
> > > 
> > > Well, the scheduler makes policy decisions too, in quite a few places. :-)
> > 
> > That is true ... ;-) But not so much about frequency selection yet I guess
> 
> Well, sugov is part of the scheduler :-) It being so allows for the
> co-ordinated decision making required for EAS.
> 
> > > The really important consideration here is whether or not there may be
> > > multiple governors making different policy decisions in that respect.
> > > If not, then where exactly the single policy decision is made doesn't
> > > particularly matter IMO.
> > 
> > I think some users of the aggregated utilization signal do want to make
> > slightly different decisions (I'm thinking about the RT-go-to-max thing
> > again which makes perfect sense in sugov, but could possibly hurt EAS).
> > 
> > So the "hard" part of this work is to figure out what really is a
> > governor-specific policy decision, and what is common between all users.
> > I put "hard" between quotes because I only see the case of RT as truly
> > sugov-specific for now.
> > 
> > If we also want a special case for DL, Peter's enum should work OK, and
> > enable to add more special cases for new users (cpuidle ?) if needed.
> > But maybe that is something for later ?
> 
> Right, I don't mind moving the function. What I do oppose is having two
> very similar functions in different translation units -- because then
> they _will_ diverge and result in 'funny' things.

Sounds good :-) Would kernel/sched/pelt.c be the right place then ? It's
cross-class and kinda pelt-related I guess
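
For illustration, the "one function, one enum" idea discussed above could
look roughly like the sketch below. All names here (enum util_user,
struct cpu_util_signals, aggregate_cpu_util) are hypothetical and are not
taken from the series; the only policy shown is the RT-go-to-max case,
which is assumed to be the single governor-specific decision.

```c
#include <stdbool.h>

/* Hypothetical names throughout; this is not the code from the series. */
enum util_user {
	UTIL_FOR_FREQUENCY,	/* cpufreq governor: RT tasks request max */
	UTIL_FOR_ENERGY,	/* EAS: wants the estimated busy time */
};

struct cpu_util_signals {
	unsigned long cfs;	/* utilization of CFS tasks */
	unsigned long rt;	/* utilization of RT tasks */
	unsigned long dl;	/* utilization of DL tasks */
	bool rt_runnable;	/* at least one RT task is runnable */
};

static unsigned long aggregate_cpu_util(const struct cpu_util_signals *s,
					unsigned long max_cap,
					enum util_user user)
{
	unsigned long util;

	/* Governor-specific policy: go to max when RT is runnable. */
	if (user == UTIL_FOR_FREQUENCY && s->rt_runnable)
		return max_cap;

	/* Common part shared by all users: sum the per-class signals. */
	util = s->cfs + s->rt + s->dl;

	/* Never report more than the CPU can deliver. */
	return util < max_cap ? util : max_cap;
}
```

With a single function, a change to the common part is seen by both
users at once, and each special case stays visible as a branch on the
enum rather than as a second, slowly diverging copy.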

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 15:30                 ` Quentin Perret
@ 2018-08-02 15:55                   ` Vincent Guittot
  2018-08-02 16:00                     ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-02 15:55 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > Good point, setting the util_avg to 0 for new tasks should help
> > > filtering out those tiny tasks too. And that would match with the idea
> > > of letting tasks build their history before looking at their util_avg ...
> > >
> > > But there is one difference w.r.t frequency selection. The current code
> > > won't mark the system overutilized, but will let sugov raise the
> > > frequency when a new task is enqueued. So in case of a fork bomb, we
> >
> > If the initial value of util_avg is 0, we should not have any impact
> > on the util_avg of the cfs rq on which the task is attached, isn't it
> > ? so this should not impact both the over utilization state and the
> > frequency selected by sugov or I'm missing something ?
>
> What I tried to say is that setting util_avg to 0 for new tasks will
> prevent schedutil from raising the frequency in case of a fork bomb, and
> I think that could be an issue. And I think this isn't an issue with the
> patch as-is ...

OK, so you also want to deal with fork bombs.
I'm not sure the current proposal is free of problems either, because
select_task_rq_fair() will always return prev_cpu, since util_avg and
util_est are 0 at that time.

>
> Sorry if that wasn't clear
>
> > Then, select_task_rq_fair is called for a new task but util_avg is
> > still 0 at that time in the current code so you will have consistent
> > util_avg of the new task before and after calling
> > find_energy_efficient_cpu
>
> New tasks don't go in find_energy_efficient_cpu(), because, as you said,
> they have no consistent util_avg yet when select_task_rq_fair() is called
> for the first time.
>
> Thanks,
> Quentin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 15:55                   ` Vincent Guittot
@ 2018-08-02 16:00                     ` Quentin Perret
  2018-08-02 16:07                       ` Vincent Guittot
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 16:00 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > filtering out those tiny tasks too. And that would match with the idea
> > > > of letting tasks build their history before looking at their util_avg ...
> > > >
> > > > But there is one difference w.r.t frequency selection. The current code
> > > > won't mark the system overutilized, but will let sugov raise the
> > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > >
> > > If the initial value of util_avg is 0, we should not have any impact
> > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > ? so this should not impact both the over utilization state and the
> > > frequency selected by sugov or I'm missing something ?
> >
> > What I tried to say is that setting util_avg to 0 for new tasks will
> > prevent schedutil from raising the frequency in case of a fork bomb, and
> > I think that could be an issue. And I think this isn't an issue with the
> > patch as-is ...
> 
> ok. So you also want to deal with fork bomb
> Not sure that you don't have some problem with current proposal too
> because select_task_rq_fair will always return prev_cpu because
> util_avg and util_est are 0 at that time

But find_idlest_cpu() should select a CPU using load in case of a forkee
no ?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 16:00                     ` Quentin Perret
@ 2018-08-02 16:07                       ` Vincent Guittot
  2018-08-02 16:10                         ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-02 16:07 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > of letting tasks build their history before looking at their util_avg ...
> > > > >
> > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > >
> > > > If the initial value of util_avg is 0, we should not have any impact
> > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > ? so this should not impact both the over utilization state and the
> > > > frequency selected by sugov or I'm missing something ?
> > >
> > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > I think that could be an issue. And I think this isn't an issue with the
> > > patch as-is ...
> >
> > ok. So you also want to deal with fork bomb
> > Not sure that you don't have some problem with current proposal too
> > because select_task_rq_fair will always return prev_cpu because
> > util_avg and util_est are 0 at that time
>
> But find_idlest_cpu() should select a CPU using load in case of a forkee
> no ?

So you have to wait for the next tick, which will set the overutilized
flag and disable want_energy. Until that point, all new tasks will be
put on the current CPU.

>
> Thanks,
> Quentin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 16:07                       ` Vincent Guittot
@ 2018-08-02 16:10                         ` Quentin Perret
  2018-08-02 16:38                           ` Vincent Guittot
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 16:10 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thursday 02 Aug 2018 at 18:07:49 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > > >
> > > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > > of letting tasks build their history before looking at their util_avg ...
> > > > > >
> > > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > > >
> > > > > If the initial value of util_avg is 0, we should not have any impact
> > > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > > ? so this should not impact both the over utilization state and the
> > > > > frequency selected by sugov or I'm missing something ?
> > > >
> > > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > > I think that could be an issue. And I think this isn't an issue with the
> > > > patch as-is ...
> > >
> > > ok. So you also want to deal with fork bomb
> > > Not sure that you don't have some problem with current proposal too
> > > because select_task_rq_fair will always return prev_cpu because
> > > util_avg and util_est are 0 at that time
> >
> > But find_idlest_cpu() should select a CPU using load in case of a forkee
> > no ?
> 
> So you have to wait for the next tick that will set the overutilized
> and disable the want_energy. Until this point, all new tasks will be
> put on the current cpu

want_energy should always be false for forkees, because we set it only
for SD_BALANCE_WAKE.
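
A minimal userspace sketch of that decision, with hypothetical names
(SK_BALANCE_*, pick_path) standing in for the kernel flags and for the
logic at the top of select_task_rq_fair() in the patch:

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the SD_BALANCE_* flags; values are arbitrary. */
#define SK_BALANCE_FORK	0x1
#define SK_BALANCE_WAKE	0x2

enum placement_path {
	PATH_ENERGY_AWARE,	/* find_energy_efficient_cpu() in the patch */
	PATH_LOAD_BASED,	/* find_idlest_cpu() and friends */
};

static enum placement_path pick_path(int sd_flag, bool has_energy_model,
				     bool overutilized)
{
	bool want_energy = false;

	/* want_energy can only become true on the wake-up path. */
	if (sd_flag & SK_BALANCE_WAKE)
		want_energy = has_energy_model && !overutilized;

	return want_energy ? PATH_ENERGY_AWARE : PATH_LOAD_BASED;
}
```

Calling pick_path(SK_BALANCE_FORK, true, false) yields PATH_LOAD_BASED
whatever the overutilized state is, which is why a forkee does not need
to wait for the next tick to leave the energy-aware path.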

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 12/14] sched/fair: Select an energy-efficient CPU on task wake-up
  2018-08-02 13:54   ` Peter Zijlstra
@ 2018-08-02 16:21     ` Quentin Perret
  0 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 16:21 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: rjw, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Thursday 02 Aug 2018 at 15:54:26 (+0200), Peter Zijlstra wrote:
> On Tue, Jul 24, 2018 at 01:25:19PM +0100, Quentin Perret wrote:
> > @@ -6385,18 +6492,26 @@ static int
> >  select_task_rq_fair(struct task_struct *p, int prev_cpu, int sd_flag, int wake_flags)
> >  {
> >  	struct sched_domain *tmp, *sd = NULL;
> > +	struct freq_domain *fd;
> >  	int cpu = smp_processor_id();
> >  	int new_cpu = prev_cpu;
> > -	int want_affine = 0;
> > +	int want_affine = 0, want_energy = 0;
> >  	int sync = (wake_flags & WF_SYNC) && !(current->flags & PF_EXITING);
> >  
> > +	rcu_read_lock();
> >  	if (sd_flag & SD_BALANCE_WAKE) {
> >  		record_wakee(p);
> > -		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu)
> > -			      && cpumask_test_cpu(cpu, &p->cpus_allowed);
> > +		fd = rd_freq_domain(cpu_rq(cpu)->rd);
> > +		want_energy = fd && !READ_ONCE(cpu_rq(cpu)->rd->overutilized);
> > +		want_affine = !wake_wide(p) && !wake_cap(p, cpu, prev_cpu) &&
> > +			      cpumask_test_cpu(cpu, &p->cpus_allowed);
> > +	}
> > +
> > +	if (want_energy) {
> > +		new_cpu = find_energy_efficient_cpu(p, prev_cpu, fd);
> > +		goto unlock;
> >  	}
> >  
> 
> And I suppose you rely on the compiler to optimize that for the static
> key inside rd_freq_domain()... Does it do a good job of that?

It does for sure when CONFIG_ENERGY_MODEL=n since rd_freq_domain() is
stubbed to false, but that's an easy one ;-)

> 
> That is, would not something like:
> 
> 
> 	rcu_read_lock();
> 	if (sd_flag & SD_BALANCE_WAKE) {
> 		record_wakee(p);
> 
> 		if (static_branch_unlikely(&sched_energy_present)) {
> 			struct root_domain *rd = cpu_rq(cpu)->rd;
> 			struct freq_domain *fd = rd_freq_domain(rd);
> 
> 			if (fd && !READ_ONCE(rd->overutilized)) {
> 				new_cpu = find_energy_efficient_cpu(p, prev_cpu, fd);
> 				goto unlock;
> 			}
> 		}
> 
> 		/* ... */
> 	}
> 
> 
> Be far more clear ?

It is clearer. Having the static key check in rd_freq_domain() makes
the change to find_busiest_group() smaller, but I can totally change it
to something like the above.

I'll do that in v6.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 16:10                         ` Quentin Perret
@ 2018-08-02 16:38                           ` Vincent Guittot
  2018-08-02 16:59                             ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-02 16:38 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thu, 2 Aug 2018 at 18:10, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 18:07:49 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > > > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > >
> > > > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > > > of letting tasks build their history before looking at their util_avg ...
> > > > > > >
> > > > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > > > >
> > > > > > If the initial value of util_avg is 0, we should not have any impact
> > > > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > > > ? so this should not impact both the over utilization state and the
> > > > > > frequency selected by sugov or I'm missing something ?
> > > > >
> > > > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > > > I think that could be an issue. And I think this isn't an issue with the
> > > > > patch as-is ...
> > > >
> > > > ok. So you also want to deal with fork bomb
> > > > Not sure that you don't have some problem with current proposal too
> > > > because select_task_rq_fair will always return prev_cpu because
> > > > util_avg and util_est are 0 at that time
> > >
> > > But find_idlest_cpu() should select a CPU using load in case of a forkee
> > > no ?
> >
> > So you have to wait for the next tick that will set the overutilized
> > and disable the want_energy. Until this point, all new tasks will be
> > put on the current cpu
>
> want_energy should always be false for forkees, because we set it only
> for SD_BALANCE_WAKE.

Ah yes I forgot that point.
But doesn't this break the EAS policy ? I mean each time a new task is
created, we use the load to select the best CPU

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 16:38                           ` Vincent Guittot
@ 2018-08-02 16:59                             ` Quentin Perret
  2018-08-03  7:48                               ` Vincent Guittot
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-02 16:59 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thursday 02 Aug 2018 at 18:38:01 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 18:10, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Thursday 02 Aug 2018 at 18:07:49 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
> > > >
> > > > On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > > > > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > >
> > > > > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > > > > of letting tasks build their history before looking at their util_avg ...
> > > > > > > >
> > > > > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > > > > >
> > > > > > > If the initial value of util_avg is 0, we should not have any impact
> > > > > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > > > > ? so this should not impact both the over utilization state and the
> > > > > > > frequency selected by sugov or I'm missing something ?
> > > > > >
> > > > > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > > > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > > > > I think that could be an issue. And I think this isn't an issue with the
> > > > > > patch as-is ...
> > > > >
> > > > > ok. So you also want to deal with fork bomb
> > > > > Not sure that you don't have some problem with current proposal too
> > > > > because select_task_rq_fair will always return prev_cpu because
> > > > > util_avg and util_est are 0 at that time
> > > >
> > > > But find_idlest_cpu() should select a CPU using load in case of a forkee
> > > > no ?
> > >
> > > So you have to wait for the next tick that will set the overutilized
> > > and disable the want_energy. Until this point, all new tasks will be
> > > put on the current cpu
> >
> > want_energy should always be false for forkees, because we set it only
> > for SD_BALANCE_WAKE.
> 
> Ah yes I forgot that point.
> But doesn't this break the EAS policy ? I mean each time a new task is
> created, we use the load to select the best CPU

If you really keep spawning new tasks all the time, yes EAS won't help
you, but there isn't a lot we can do :/. We need to have an idea of how
big a task is for EAS, and we obviously don't know that for new tasks, so
it's hard/dangerous to make assumptions.

So the proposal here is that if you only have forkees once in a while,
then those new tasks (and those new tasks only) will be placed using load
the first time, and then they'll fall under EAS control as soon as they
have at least a little bit of history. This _should_ happen without
re-enabling load balance spuriously too often, and that _should_ prevent
it from ruining the placement of existing tasks ...

As Peter already mentioned, a better way of solving this issue would be
to try to find the moment when the utilization signal has converged to
something stable (assuming that it converges), but that, I think, isn't
straightforward at all ...
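
To give an idea of the time scales involved, here is an idealized
floating-point model of a PELT-like signal for a task that runs all the
time. It only assumes the usual geometric decay with a half-life of 32
periods and a 1024 scale; the function names are hypothetical and the
kernel's fixed-point implementation differs in the details.

```c
#define PELT_SCALE	1024.0
#define PELT_DECAY	0.97857206	/* y, chosen so that y^32 ~= 0.5 */

/* Utilization of an always-running task after 'periods' periods. */
static double pelt_util_after(double start, unsigned int periods)
{
	double util = start;
	unsigned int i;

	for (i = 0; i < periods; i++)
		util = util * PELT_DECAY + PELT_SCALE * (1.0 - PELT_DECAY);

	return util;
}

/* Periods needed to get within 'tolerance' of full scale (capped). */
static unsigned int periods_to_converge(double start, double tolerance)
{
	double util = start;
	unsigned int n = 0;

	while (PELT_SCALE - util > tolerance && n < 10000) {
		util = util * PELT_DECAY + PELT_SCALE * (1.0 - PELT_DECAY);
		n++;
	}

	return n;
}
```

Starting from 0, such a task needs 32 periods to reach half scale, and
well over a hundred periods to get within 5% of its final value, so "a
little bit of history" and "converged" are quite different thresholds.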

Does that make any sense ?

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-08-02 15:21         ` Quentin Perret
@ 2018-08-02 17:36           ` Peter Zijlstra
  2018-08-03 12:42             ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Peter Zijlstra @ 2018-08-02 17:36 UTC (permalink / raw)
  To: Quentin Perret
  Cc: skannan, rjw, linux-kernel, linux-pm, gregkh, mingo,
	dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, linux-pm-owner

On Thu, Aug 02, 2018 at 04:21:11PM +0100, Quentin Perret wrote:
> On Thursday 02 Aug 2018 at 14:45:11 (+0200), Peter Zijlstra wrote:

> > To clarify, it makes absolutely no sense what so ever to attempt EAS
> > when the DVFS control is not coordinated.
> 
> I tend to agree with that, but at the same time even if we create a very
> strong dependency on schedutil, we will have no guarantee that the actual
> frequencies used on the platform are the ones we predicted in EAS.

Sure; on x86 for example our micro-code does whatever. But using
schedutil we at least 'guide' it in the general direction we'd expect
with the control that is available.

Using a !schedutil governor doesn't even get us that and we're basically
running on random input without any feedback to close the loop. Not
something I feel we should support or care for.
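
As a rough illustration of what 'guiding' means here: schedutil maps
utilization to a frequency request with some headroom, so a scheduler
that knows the utilization can predict the request. The sketch below
assumes the 1.25 margin schedutil applies and a sorted frequency table;
guided_freq() and resolve_freq() are hypothetical names, not kernel
functions, and the hardware remains free to do something else.

```c
#include <stddef.h>

/* Frequency request for a given utilization, with 25% headroom. */
static unsigned long guided_freq(unsigned long max_freq, unsigned long util,
				 unsigned long max_cap)
{
	unsigned long freq = (max_freq + (max_freq >> 2)) * util / max_cap;

	return freq < max_freq ? freq : max_freq;
}

/* Lowest entry of an ascending table that satisfies the request. */
static unsigned long resolve_freq(const unsigned long *table, size_t nr,
				  unsigned long target)
{
	size_t i;

	if (nr == 0)
		return 0;

	for (i = 0; i < nr - 1; i++)
		if (table[i] >= target)
			break;

	return table[i];
}
```

With any other governor the first function has no counterpart, so the
frequency assumed by the energy estimation is not tied to anything the
platform actually does.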

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-02 16:59                             ` Quentin Perret
@ 2018-08-03  7:48                               ` Vincent Guittot
  2018-08-03  8:18                                 ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-03  7:48 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Thursday 02 Aug 2018 at 18:38:01 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 18:10, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Thursday 02 Aug 2018 at 18:07:49 (+0200), Vincent Guittot wrote:
> > > > On Thu, 2 Aug 2018 at 18:00, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > >
> > > > > On Thursday 02 Aug 2018 at 17:55:24 (+0200), Vincent Guittot wrote:
> > > > > > On Thu, 2 Aug 2018 at 17:30, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > >
> > > > > > > On Thursday 02 Aug 2018 at 17:14:15 (+0200), Vincent Guittot wrote:
> > > > > > > > On Thu, 2 Aug 2018 at 16:14, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > > > > > > Good point, setting the util_avg to 0 for new tasks should help
> > > > > > > > > filtering out those tiny tasks too. And that would match with the idea
> > > > > > > > > of letting tasks build their history before looking at their util_avg ...
> > > > > > > > >
> > > > > > > > > But there is one difference w.r.t frequency selection. The current code
> > > > > > > > > won't mark the system overutilized, but will let sugov raise the
> > > > > > > > > frequency when a new task is enqueued. So in case of a fork bomb, we
> > > > > > > >
> > > > > > > > If the initial value of util_avg is 0, we should not have any impact
> > > > > > > > on the util_avg of the cfs rq on which the task is attached, isn't it
> > > > > > > > ? so this should not impact both the over utilization state and the
> > > > > > > > frequency selected by sugov or I'm missing something ?
> > > > > > >
> > > > > > > What I tried to say is that setting util_avg to 0 for new tasks will
> > > > > > > prevent schedutil from raising the frequency in case of a fork bomb, and
> > > > > > > I think that could be an issue. And I think this isn't an issue with the
> > > > > > > patch as-is ...
> > > > > >
> > > > > > ok. So you also want to deal with fork bomb
> > > > > > Not sure that you don't have some problem with current proposal too
> > > > > > because select_task_rq_fair will always return prev_cpu because
> > > > > > util_avg and util_est are 0 at that time
> > > > >
> > > > > But find_idlest_cpu() should select a CPU using load in case of a forkee
> > > > > no ?
> > > >
> > > > So you have to wait for the next tick that will set the overutilized
> > > > and disable the want_energy. Until this point, all new tasks will be
> > > > put on the current cpu
> > >
> > > want_energy should always be false for forkees, because we set it only
> > > for SD_BALANCE_WAKE.
> >
> > Ah yes I forgot that point.
> > But doesn't this break the EAS policy ? I mean each time a new task is
> > created, we use the load to select the best CPU
>
> If you really keep spawning new tasks all the time, yes EAS won't help
> you, but there isn't a lot we can do :/. We need to have an idea of how

My point was more that it also happens for every single new task, and
not only with a fork bomb.

> big a task is for EAS, and we obviously don't know that for new tasks, so
> it's hard/dangerous to make assumptions.

But by not making any assumption, the new tasks are placed outside EAS
control and can easily break what EAS tries to achieve, because the
load-based path looks for the idlest CPU, which is unfortunately most
probably a CPU that EAS doesn't want to use.

>
> So the proposal here is that if you only have forkees once in a while,
> then those new tasks (and those new tasks only) will be placed using load
> the first time, and then they'll fall under EAS control as soon as they
> have at least a little bit of history. This _should_ happen without
> re-enabling load balance spuriously too often, and that _should_ prevent

I'm not really concerned about re-enabling load balance, but more that
the effort EAS makes to pack tasks on a few CPUs/clusters can be broken
by every new task.
So I wonder what is better for EAS: making sure to efficiently spread
newly created tasks in case of a fork bomb, or trying not to break EAS
task placement with every newly created task.

Vincent

> it from ruining the placement of existing tasks ...
>
> As Peter already mentioned, a better way of solving this issue would be
> to try to find the moment when the utilization signal has converged to
> something stable (assuming that it converges), but that, I think, isn't
> straightforward at all ...
>
> Does that make any sense ?
>
> Thanks,
> Quentin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-03  7:48                               ` Vincent Guittot
@ 2018-08-03  8:18                                 ` Quentin Perret
  2018-08-03 13:49                                   ` Vincent Guittot
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-03  8:18 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> I'm not really concerned about re-enabling load balance but more that
> the effort of packing of tasks in few cpus/clusters that EAS tries to
> do can be broken for every new task.

Well, re-enabling load balance immediately would break the nice placement
that EAS found, because it would shuffle all tasks around and break the
packing strategy. Letting that sole new task go in find_idlest_cpu()
shouldn't impact the placement of existing tasks. That might have an energy
cost for that one task, yes, but it's really hard to do anything smarter
with new tasks IMO ... EAS simply can't work without a utilization value.

> So I wonder what is better for EAS : Make sure to efficiently spread
> newly created tasks in case of a fork bomb, or try not to break EAS task
> placement with every newly created task

That shouldn't break the placement per se, we're just making one
temporary exception for new tasks. What do you think 'the right thing'
to do is ? To just put new tasks on prev_cpu or something like that ?

That might help some use-cases I suppose, but will probably harm others ...
I'm just not too keen on making assumptions about the size of new tasks,
that's all. But I'm definitely open to ideas if there is something
smarter we can do.

Thanks,
Quentin

* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-08-02 17:36           ` Peter Zijlstra
@ 2018-08-03 12:42             ` Quentin Perret
  0 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-08-03 12:42 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: skannan, rjw, linux-kernel, linux-pm, gregkh, mingo,
	dietmar.eggemann, morten.rasmussen, chris.redpath,
	patrick.bellasi, valentin.schneider, vincent.guittot,
	thara.gopinath, viresh.kumar, tkjos, joel, smuckle, adharmap,
	skannan, pkondeti, juri.lelli, edubezval, srinivas.pandruvada,
	currojerez, javi.merino, linux-pm-owner

On Thursday 02 Aug 2018 at 19:36:01 (+0200), Peter Zijlstra wrote:
> Using a !schedutil governor doesn't even get us that and we're basically
> running on random input without any feedback to close the loop. Not
> something I feel we should support or care for.

Fair enough, supporting users using something else than sugov doesn't
sound like a good idea indeed ...

Creating the dependency between sugov and EAS requires a bit of work. I
assume we don't want to check the current CPUFreq policy of all CPUs of
the current rd in the wakeup path ... So one possibility is to check that
from the topology code (build_freq_domains(), ...) and get a notification
from CPUFreq on governor change to call rebuild_sched_domains().

The following seems to do the trick. How ugly does that look ?

------------------>8------------------
diff --git a/drivers/cpufreq/cpufreq.c b/drivers/cpufreq/cpufreq.c
index b0dfd3222013..bed0a511c504 100644
--- a/drivers/cpufreq/cpufreq.c
+++ b/drivers/cpufreq/cpufreq.c
@@ -2271,6 +2271,10 @@ static int cpufreq_set_policy(struct cpufreq_policy *policy,
 		ret = cpufreq_start_governor(policy);
 		if (!ret) {
 			pr_debug("cpufreq: governor change\n");
+			/* Notification of the new governor */
+			blocking_notifier_call_chain(
+					&cpufreq_policy_notifier_list,
+					CPUFREQ_GOVERNOR, policy);
 			return 0;
 		}
 		cpufreq_exit_governor(policy);
diff --git a/include/linux/cpufreq.h b/include/linux/cpufreq.h
index 882a9b9e34bc..a4435b5ef3f9 100644
--- a/include/linux/cpufreq.h
+++ b/include/linux/cpufreq.h
@@ -437,6 +437,7 @@ static inline void cpufreq_resume(void) {}
 /* Policy Notifiers  */
 #define CPUFREQ_ADJUST			(0)
 #define CPUFREQ_NOTIFY			(1)
+#define CPUFREQ_GOVERNOR		(2)
 
 #ifdef CONFIG_CPU_FREQ
 int cpufreq_register_notifier(struct notifier_block *nb, unsigned int list);
diff --git a/kernel/sched/cpufreq_schedutil.c b/kernel/sched/cpufreq_schedutil.c
index d23a480bac7b..16c7a4ad1a77 100644
--- a/kernel/sched/cpufreq_schedutil.c
+++ b/kernel/sched/cpufreq_schedutil.c
@@ -632,7 +632,7 @@ static struct kobj_type sugov_tunables_ktype = {
 
 /********************** cpufreq governor interface *********************/
 
-static struct cpufreq_governor schedutil_gov;
+struct cpufreq_governor schedutil_gov;
 
 static struct sugov_policy *sugov_policy_alloc(struct cpufreq_policy *policy)
 {
@@ -891,7 +891,7 @@ static void sugov_limits(struct cpufreq_policy *policy)
 	sg_policy->need_freq_update = true;
 }
 
-static struct cpufreq_governor schedutil_gov = {
+struct cpufreq_governor schedutil_gov = {
 	.name			= "schedutil",
 	.owner			= THIS_MODULE,
 	.dynamic_switching	= true,
@@ -914,3 +914,46 @@ static int __init sugov_register(void)
 	return cpufreq_register_governor(&schedutil_gov);
 }
 fs_initcall(sugov_register);
+
+#ifdef CONFIG_ENERGY_MODEL
+extern bool sched_energy_update;
+static DEFINE_MUTEX(rebuild_sd_mutex);
+/*
+ * EAS shouldn't be attempted without sugov, so rebuild the sched_domains
+ * on governor changes to make sure the scheduler knows about it.
+ */
+static void rebuild_sd_workfn(struct work_struct *work)
+{
+	mutex_lock(&rebuild_sd_mutex);
+	sched_energy_update = true;
+	rebuild_sched_domains();
+	sched_energy_update = false;
+	mutex_unlock(&rebuild_sd_mutex);
+}
+static DECLARE_WORK(rebuild_sd_work, rebuild_sd_workfn);
+
+static int rebuild_sd_callback(struct notifier_block *nb, unsigned long val,
+			       void *data)
+{
+	if (val != CPUFREQ_GOVERNOR)
+		return 0;
+	/*
+	 * Sched_domains cannot be rebuilt from a notifier context, so use a
+	 * workqueue.
+	 */
+	schedule_work(&rebuild_sd_work);
+
+	return 0;
+}
+
+static struct notifier_block rebuild_sd_notifier = {
+	.notifier_call = rebuild_sd_callback,
+};
+
+static int register_cpufreq_notifier(void)
+{
+	return cpufreq_register_notifier(&rebuild_sd_notifier,
+					 CPUFREQ_POLICY_NOTIFIER);
+}
+core_initcall(register_cpufreq_notifier);
+#endif
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index e224e90a36c3..861be053f2d2 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -2267,8 +2267,7 @@ unsigned long scale_irq_capacity(unsigned long util, unsigned long irq, unsigned
 }
 #endif
 
-#ifdef CONFIG_SMP
-#ifdef CONFIG_ENERGY_MODEL
+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
 extern struct static_key_false sched_energy_present;
 /**
  * rd_freq_domain - Get the frequency domains of a root domain.
@@ -2290,4 +2289,3 @@ static inline struct freq_domain *rd_freq_domain(struct root_domain *rd)
 }
 #define freq_domain_span(fd) NULL
 #endif
-#endif
diff --git a/kernel/sched/topology.c b/kernel/sched/topology.c
index e95223b7a7f6..35246707a8e0 100644
--- a/kernel/sched/topology.c
+++ b/kernel/sched/topology.c
@@ -201,7 +201,7 @@ sd_parent_degenerate(struct sched_domain *sd, struct sched_domain *parent)
 	return 1;
 }
 
-#ifdef CONFIG_ENERGY_MODEL
+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
 static void free_fd(struct freq_domain *fd)
 {
 	struct freq_domain *tmp;
@@ -287,13 +287,15 @@ static void destroy_freq_domain_rcu(struct rcu_head *rp)
  */
 #define EM_MAX_COMPLEXITY 2048
 
-
+extern struct cpufreq_governor schedutil_gov;
 static void build_freq_domains(const struct cpumask *cpu_map)
 {
 	int i, nr_fd = 0, nr_cs = 0, nr_cpus = cpumask_weight(cpu_map);
 	struct freq_domain *fd = NULL, *tmp;
 	int cpu = cpumask_first(cpu_map);
 	struct root_domain *rd = cpu_rq(cpu)->rd;
+	struct cpufreq_policy *policy;
+	struct cpufreq_governor *gov;
 
 	/* EAS is enabled for asymmetric CPU capacity topologies. */
 	if (!per_cpu(sd_asym_cpucapacity, cpu)) {
@@ -309,6 +311,15 @@ static void build_freq_domains(const struct cpumask *cpu_map)
 		if (find_fd(fd, i))
 			continue;
 
+		/* Do not attempt EAS if schedutil is not being used. */
+		policy = cpufreq_cpu_get(i);
+		if (!policy)
+			goto free;
+		gov = policy->governor;
+		cpufreq_cpu_put(policy);
+		if (gov != &schedutil_gov)
+			goto free;
+
 		/* Create the new fd and add it to the local list. */
 		tmp = fd_init(i);
 		if (!tmp)
@@ -355,6 +366,7 @@ static void build_freq_domains(const struct cpumask *cpu_map)
  *    3. the SD_ASYM_CPUCAPACITY flag is set in the sched_domain hierarchy.
  */
 DEFINE_STATIC_KEY_FALSE(sched_energy_present);
+bool sched_energy_update = false;
 
 static void sched_energy_start(int ndoms_new, cpumask_var_t doms_new[])
 {
@@ -2187,10 +2199,10 @@ void partition_sched_domains(int ndoms_new, cpumask_var_t doms_new[],
 		;
 	}
 
-#ifdef CONFIG_ENERGY_MODEL
+#if defined(CONFIG_ENERGY_MODEL) && defined(CONFIG_CPU_FREQ_GOV_SCHEDUTIL)
 	/* Build freq domains: */
 	for (i = 0; i < ndoms_new; i++) {
-		for (j = 0; j < n; j++) {
+		for (j = 0; j < n && !sched_energy_update; j++) {
 			if (cpumask_equal(doms_new[i], doms_cur[j]) &&
 			    cpu_rq(cpumask_first(doms_cur[j]))->rd->fd)
 				goto match3;


* Re: [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method
  2018-08-02 15:39                   ` Quentin Perret
@ 2018-08-03 13:04                     ` Quentin Perret
  0 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-08-03 13:04 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Saravana Kannan, Rafael J. Wysocki,
	Linux Kernel Mailing List, Linux PM, Greg Kroah-Hartman,
	Ingo Molnar, Dietmar Eggemann, Morten Rasmussen, Chris Redpath,
	Patrick Bellasi, Valentin Schneider, Vincent Guittot,
	Thara Gopinath, Viresh Kumar, Todd Kjos, Joel Fernandes,
	Steve Muckle, adharmap, skannan, Pavan Kondeti, Juri Lelli,
	Eduardo Valentin, Srinivas Pandruvada, currojerez, Javi Merino,
	linux-pm-owner

On Thursday 02 Aug 2018 at 16:39:18 (+0100), Quentin Perret wrote:
> Sounds good :-) Would kernel/sched/pelt.c be the right place then ? It's
> cross-class and kinda pelt-related I guess

Since we agreed to create a dependency between EAS and sugov, I don't think
there is a lot of value in relocating that function for now.

I'll leave it in cpufreq_schedutil.c for v6 until there is a good reason
to move it.

Thanks,
Quentin

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-03  8:18                                 ` Quentin Perret
@ 2018-08-03 13:49                                   ` Vincent Guittot
  2018-08-03 14:21                                     ` Vincent Guittot
  2018-08-03 15:55                                     ` Quentin Perret
  0 siblings, 2 replies; 72+ messages in thread
From: Vincent Guittot @ 2018-08-03 13:49 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> > On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> > I'm not really concerned about re-enabling load balance but more that
> > the effort of packing of tasks in few cpus/clusters that EAS tries to
> > do can be broken for every new task.
>
> Well, re-enabling load balance immediately would break the nice placement
> that EAS found, because it would shuffle all tasks around and break the
> packing strategy. Letting that sole new task go in find_idlest_cpu()

Sorry, I was not clear in my explanation. Re-enabling load balance
would be a problem, of course. I wanted to say that there is little chance
that this will re-enable load balance immediately and break EAS,
so I'm not worried about that case. I'm only concerned about the new
task being placed outside the EAS policy.

For example, if you run on hikey960 the simple script below, which
can't really be seen as a fork bomb IMHO, you will see threads
scheduled on big cores every 0.5 seconds, whereas everything should be
packed on the little cores:
for i in {0..10}; do
    echo "t$i"
    sleep 0.5
done

> shouldn't impact the placement of existing tasks. That might have an energy
> cost for that one task, yes, but it's really hard to do anything smarter
> with new tasks IMO ... EAS simply can't work without a utilization value.
>
> > So I wonder what is better for EAS : Make sure to efficiently spread
> > newly created tasks in case of a fork bomb, or try not to break EAS task
> > placement with every newly created task
>
> That shouldn't break the placement per se, we're just making one
> temporary exception for new tasks. What do you think 'the right thing'
> to do is ? To just put new tasks on prev_cpu or something like that ?

I think that EAS, which is about saving power, could be a bit more power
friendly when it has to make assumptions about a new task.

>
> That might help some use-cases I suppose, but will probably harm others ...
> I'm just not too keen on making assumptions about the size of new tasks,

But you are already making some assumptions by having the default
mode, which uses load_avg, selecting the task for you. The comment in
the init function of load_avg states:

void init_entity_runnable_average()
{
...
/*
 * Tasks are initialized with full load to be seen as heavy tasks until
 * they get a chance to stabilize to their real load level.
 * Group entities are initialized with zero load to reflect the fact that
 * nothing has been attached to the task group yet.
 */

So it means that EAS makes the assumption that new tasks are heavy
tasks until they get a chance to stabilize.

Regards,
Vincent

> that's all. But I'm definitely open to ideas if there is something
> smarter we can do.
>
> Thanks,
> Quentin

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-03 13:49                                   ` Vincent Guittot
@ 2018-08-03 14:21                                     ` Vincent Guittot
  2018-08-03 15:55                                     ` Quentin Perret
  1 sibling, 0 replies; 72+ messages in thread
From: Vincent Guittot @ 2018-08-03 14:21 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Fri, 3 Aug 2018 at 15:49, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>
> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> > > I'm not really concerned about re-enabling load balance but more that
> > > the effort of packing of tasks in few cpus/clusters that EAS tries to
> > > do can be broken for every new task.
> >
> > Well, re-enabling load balance immediately would break the nice placement
> > that EAS found, because it would shuffle all tasks around and break the
> > packing strategy. Letting that sole new task go in find_idlest_cpu()
>
> Sorry I was not clear in my explanation. Re enabling load balance
> would be a problem of course. I wanted to say that there is few chance
> that this will re-enable the load balance immediately and break EAS
> and I'm not worried by this case. But i'm only concerned by the new
> task being put outside EAS policy.
>
> For example, if you run on hikey960 the simple script below, which
> can't really be seen as a fork bomb IMHO, you will see threads
> scheduled on big cores every 0.5 seconds whereas everything should be
> packed on little core  :
> for i in {0..10}; do
> echo "t"$i
> sleep 0.5
> done
>
> > shouldn't impact the placement of existing tasks. That might have an energy
> > cost for that one task, yes, but it's really hard to do anything smarter
> > with new tasks IMO ... EAS simply can't work without a utilization value.
> >
> > > So I wonder what is better for EAS : Make sure to efficiently spread
> > > newly created tasks in case of a fork bomb, or try not to break EAS task
> > > placement with every newly created task
> >
> > That shouldn't break the placement per se, we're just making one
> > temporary exception for new tasks. What do you think 'the right thing'
> > to do is ? To just put new tasks on prev_cpu or something like that ?
>
> I think that EAS, which is about saving power, could be a bit power
> friendly when it has to make some assumptions about new task.
>
> >
> > That might help some use-cases I suppose, but will probably harm others ...
> > I'm just not too keen on making assumptions about the size of new tasks,
>
> But you are already doing some assumptions by letting the default
> mode, which use load_avg, selecting the task for you. The comment of

s/selecting the task/selecting the cpu/

> the init function of load_avg states:
>
> void init_entity_runnable_average()
> {
> ...
> /*
>  * Tasks are initialized with full load to be seen as heavy tasks until
>  * they get a chance to stabilize to their real load level.
>  * Group entities are initialized with zero load to reflect the fact that
>  * nothing has been attached to the task group yet.
>  */
>
> So it means that EAS makes the assumption that new task are heavy
> tasks until they get a chance to stabilize
>
> Regards,
> Vincent
>
> > that's all. But I'm definitely open to ideas if there is something
> > smarter we can do.
> >
> > Thanks,
> > Quentin

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-03 13:49                                   ` Vincent Guittot
  2018-08-03 14:21                                     ` Vincent Guittot
@ 2018-08-03 15:55                                     ` Quentin Perret
  2018-08-06  8:40                                       ` Vincent Guittot
  1 sibling, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-03 15:55 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> > > On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> > > I'm not really concerned about re-enabling load balance but more that
> > > the effort of packing of tasks in few cpus/clusters that EAS tries to
> > > do can be broken for every new task.
> >
> > Well, re-enabling load balance immediately would break the nice placement
> > that EAS found, because it would shuffle all tasks around and break the
> > packing strategy. Letting that sole new task go in find_idlest_cpu()
> 
> Sorry I was not clear in my explanation. Re enabling load balance
> would be a problem of course. I wanted to say that there is few chance
> that this will re-enable the load balance immediately and break EAS
> and I'm not worried by this case. But i'm only concerned by the new
> task being put outside EAS policy.
> 
> For example, if you run on hikey960 the simple script below, which
> can't really be seen as a fork bomb IMHO, you will see threads
> scheduled on big cores every 0.5 seconds whereas everything should be
> packed on little core

I guess that also depends on what's running on the little cores, but I
see your point.

I think we're discussing two different things right now:
    1. Should forkees go in find_energy_efficient_cpu() ?
    2. Should forkees have an initial util_avg of 0 when EAS is enabled ?

For 1, that would mean all forkees go on prev_cpu. I can see how that
can be more energy-efficient in some use-cases (the one you described
for example), but that also has drawbacks. Placing the task on a big
CPU can have an energy cost, but that should also help the task build
its utilization faster, which is what we want to make smart decisions
with EAS. Also, it isn't always true that going on the little CPUs is
more energy efficient, only the Energy Model can tell. There is just no
perfect solution, so I'm still not fully decided on that one ...

For 2, I'm a little bit more reluctant, because that has more
implications ... That could probably harm some fairly standard use
cases (a simple app-launch for example). Enqueueing something new on a
CPU would go unnoticed, which might be fine for a very small task, but
probably a major issue if the task is actually big. I'd be more
comfortable with 2 only if we also speed-up the PELT half-life TBH ...
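
To give an idea of the time scales involved in option 2, here is a rough
userspace sketch of how long an always-running task needs to build up
its utilization. It assumes the usual PELT geometric series with a
half-life of 32 periods of about 1 ms each, and ignores everything else
the real implementation does (util_est, frequency and CPU invariance).
pelt_ramp_periods() is a made-up name, not a kernel function.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024.0

/* Decay factor per period, chosen so that PELT_Y^32 is (about) 0.5. */
#define PELT_Y 0.97857206

/*
 * Hypothetical helper: number of periods (about 1 ms each) an
 * always-running task needs for its utilization to grow from 'start'
 * to at least 'target'.
 */
int pelt_ramp_periods(double start, double target)
{
	double util = start;
	int periods = 0;

	/* 1024 is only reached asymptotically, so refuse such targets. */
	assert(target < SCHED_CAPACITY_SCALE);

	while (util < target) {
		/* Old contribution decays, one fully-busy period is added. */
		util = util * PELT_Y + SCHED_CAPACITY_SCALE * (1.0 - PELT_Y);
		periods++;
	}

	return periods;
}
```

Under these assumptions a task starting at 0 needs 32 periods to be seen
at 512, and 75 periods to be seen at 80% of 1024 (819.2), which is the
window during which a big forkee would go mostly unnoticed.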

Is there a 3 that I missed ?

Thanks,
Quentin

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-03 15:55                                     ` Quentin Perret
@ 2018-08-06  8:40                                       ` Vincent Guittot
  2018-08-06  9:43                                         ` Quentin Perret
  2018-08-06 10:08                                         ` Dietmar Eggemann
  0 siblings, 2 replies; 72+ messages in thread
From: Vincent Guittot @ 2018-08-06  8:40 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
> > On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> > >
> > > On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> > > > On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> > > > I'm not really concerned about re-enabling load balance but more that
> > > > the effort of packing of tasks in few cpus/clusters that EAS tries to
> > > > do can be broken for every new task.
> > >
> > > Well, re-enabling load balance immediately would break the nice placement
> > > that EAS found, because it would shuffle all tasks around and break the
> > > packing strategy. Letting that sole new task go in find_idlest_cpu()
> >
> > Sorry I was not clear in my explanation. Re enabling load balance
> > would be a problem of course. I wanted to say that there is few chance
> > that this will re-enable the load balance immediately and break EAS
> > and I'm not worried by this case. But i'm only concerned by the new
> > task being put outside EAS policy.
> >
> > For example, if you run on hikey960 the simple script below, which
> > can't really be seen as a fork bomb IMHO, you will see threads
> > scheduled on big cores every 0.5 seconds whereas everything should be
> > packed on little core
>
> I guess that also depends on what's running on the little cores, but I
> see your point.

In my case, the system was idle and nothing other than this script was running.

>
> I think we're discussing two different things right now:
>     1. Should forkees go in find_energy_efficient_cpu() ?
>     2. Should forkees have 0 of initial util_avg when EAS is enabled ?

It's the same topic: how should EAS consider a newly created task ?

For now, we let the "performance" mode select a CPU. This CPU will
most probably be the worst CPU from an EAS pov because it's the idlest CPU
in the idlest group, which is the opposite of what EAS tries to do.

The current behavior is:
For every new task, the cpu selection is done assuming it's a heavy
task with the max possible load_avg, and it looks for the idlest cpu.
This means that if the system is lightly loaded, the scheduler will most
probably select an idle big core.
The utilization of this new task is then set to half of the remaining
capacity of the selected CPU, which means that the idler the CPU is, the
bigger the task's initial utilization will be. This can easily be half a
big core, which can be bigger than the max capacity of a little, like on
hikey960. Then, util_est will keep track of this value for a while,
which will make this task look like a big one.
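
To put numbers on the paragraph above, here is a minimal userspace
sketch of the "half of the remaining capacity" rule. It is not the
kernel code: the real logic lives in post_init_entity_util_avg() and
handles more cases, the helper names below are made up for illustration,
and the little CPU capacity of 462 used for hikey960 is an assumed value.

```c
#include <assert.h>

#define SCHED_CAPACITY_SCALE 1024L

/*
 * Hypothetical helper (not a kernel function): initial utilization
 * given to a new task, taken as half of the capacity that is still
 * unused on the CPU selected for it.
 */
long initial_task_util(long cpu_capacity, long cpu_util)
{
	assert(cpu_capacity > 0 && cpu_capacity <= SCHED_CAPACITY_SCALE);

	if (cpu_util >= cpu_capacity)
		return 0;	/* no spare capacity left */

	return (cpu_capacity - cpu_util) / 2;
}

/* Non-zero if a task of that size exceeds the capacity of a little CPU. */
int looks_too_big_for(long util, long little_capacity)
{
	return util > little_capacity;
}
```

With an idle big CPU (capacity 1024, utilization 0) this gives 512,
which is above the assumed little capacity of 462, so the new task
cannot look like a little task at first. On a big CPU that already has
200 of utilization it gives 412.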

>
> For 1, that would mean all forkees go on prev_cpu. I can see how that
> can be more energy-efficient in some use-cases (the one you described
> for example), but that also has drawbacks. Placing the task on a big
> CPU can have an energy cost, but that should also help the task build
> its utilization faster, which is what we want to make smart decisions

With the current behavior, little tasks are seen as big for a long time,
which does not really help the task build its utilization faster,
IMHO.

> with EAS. Also, it isn't always true that going on the little CPUs is
> more energy efficient, only the Energy Model can tell. There is just no

Selecting big or little is not the problem here. The problem is that
we don't use the Energy Model, so we will most probably make the wrong
choice. Nevertheless, putting a task on big is clearly the wrong
choice in the case I mentioned above ("shell script on hikey960").

> perfect solution, so I'm still not fully decided on that one ...
>
> For 2, I'm a little bit more reluctant, because that has more
> implications ... That could probably harm some fairly standard use
> cases (a simple app-launch for example). Enqueueing something new on a
> CPU would go unnoticed, which might be fine for a very small task, but
> probably a major issue if the task is actually big. I'd be more
> comfortable with 2 only if we also speed-up the PELT half-life TBH ...
>
> Is there a 3 that I missed ?

Having something in the middle, like taking into account the load and/or
utilization of the parent, in order to mitigate big tasks starting with
small utilization and small tasks starting with big utilization.
It's probably not perfect because big tasks can create small ones and
vice versa, but if there are already big tasks, assuming that the new
one is also big should have less power impact as we are already
consuming power for the current bigs. Conversely, if little tasks are
running, assuming that the new task is little will not harm the power
consumption unnecessarily.

My main concern is that by making no choice, you clearly make the most
power-hungry choice, which is a bit awkward for a policy that
wants to minimize power consumption.

Regards,
Vincent
>
> Thanks,
> Quentin

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-06  8:40                                       ` Vincent Guittot
@ 2018-08-06  9:43                                         ` Quentin Perret
  2018-08-06 10:45                                           ` Vincent Guittot
  2018-08-06 10:08                                         ` Dietmar Eggemann
  1 sibling, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-06  9:43 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Monday 06 Aug 2018 at 10:40:46 (+0200), Vincent Guittot wrote:
> On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
> For every new task, the cpu selection is done assuming it's a heavy
> task with the max possible load_avg,  and it looks for the idlest cpu.
> This means that if the system is lightly loaded, scheduler will select
> most probably an idle big core.

Agreed, that is what should happen if the system is lightly loaded.
However, I'm still not totally convinced this is wrong. It's
definitely not _always_ wrong, at least. Just like starting new tasks
on little CPUs isn't always wrong either.

> selecting big or Little is not the problem here. The problem is that
> we don't use Energy Model so we will most probably do the wrong
> choice. Nevertheless,  putting a task on big is clearly the wrong
> choice  in the case I mentioned above " shell script on hikey960".

_You_ can say it's wrong because _you_ know the task composition. The
scheduler has no way to tell. You could come up with a script that
spawns heavy tasks every once in a while, and in this case putting
those on big cores would be beneficial ...

> Having something in the middle like taking into account load and/or
> utilization of the parent in order to mitigate big task starting with
> small utilization and small task starting with big utilization.
> It's probably not perfect because big tasks can create small ones and
> the opposite but if there are already big tasks, assuming that the new
> one is also a big one should have less power impact as we are already
> consuming power for the current bigs. At the opposite, if little are
> running, assuming that new task is little will not harm the power
> consumption unnecessarily.

Right, we can definitely come up with something more conservative than
what I'm currently proposing. I had a quick chat with Morten about this
the other day and one suggestion he had was to pick the CPU with the max
spare cap in the frequency domain in which the parent task is running ...
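
As a rough illustration of that suggestion (a userspace sketch only, not
a proposed patch), the selection could look like the code below. struct
cpu_info and pick_cpu_for_forkee() are made-up names, and the frequency
domain is reduced to a plain integer id shared by the CPUs of a domain.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified, made-up view of a CPU, for illustration purposes only. */
struct cpu_info {
	int freq_domain;	/* CPUs sharing a clock have the same id */
	long capacity;		/* max capacity, 1024 for the biggest CPUs */
	long util;		/* current utilization */
};

/*
 * Hypothetical helper: return the index of the CPU with the most spare
 * capacity among the CPUs in the frequency domain of parent_cpu. Falls
 * back to parent_cpu if no CPU of the domain has spare capacity.
 */
int pick_cpu_for_forkee(const struct cpu_info *cpus, size_t nr_cpus,
			size_t parent_cpu)
{
	long spare, max_spare = -1;
	int best = (int)parent_cpu;
	size_t i;

	assert(parent_cpu < nr_cpus);

	for (i = 0; i < nr_cpus; i++) {
		/* Only look at CPUs sharing the parent's frequency domain. */
		if (cpus[i].freq_domain != cpus[parent_cpu].freq_domain)
			continue;

		spare = cpus[i].capacity - cpus[i].util;
		if (spare > max_spare) {
			max_spare = spare;
			best = (int)i;
		}
	}

	return best;
}
```

With something like this, a forkee whose parent runs on a little CPU
stays in the little cluster even when a big CPU is completely idle, and
a forkee of a task already running on a big CPU stays on the bigs.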

In any case, I really feel like there isn't an obvious right decision
here, so I'd prefer to keep things simple for now. This patch-set is a
first step, and fine-grained tuning for new tasks is probably something
that can be done later, if need be. What do you think ?

Thanks,
Quentin

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-06  8:40                                       ` Vincent Guittot
  2018-08-06  9:43                                         ` Quentin Perret
@ 2018-08-06 10:08                                         ` Dietmar Eggemann
  2018-08-06 10:33                                           ` Vincent Guittot
  1 sibling, 1 reply; 72+ messages in thread
From: Dietmar Eggemann @ 2018-08-06 10:08 UTC (permalink / raw)
  To: Vincent Guittot, Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Morten Rasmussen,
	Chris Redpath, Patrick Bellasi, Valentin Schneider,
	Thara Gopinath, viresh kumar, Todd Kjos, Joel Fernandes,
	Cc: Steve Muckle, adharmap, Kannan, Saravana, pkondeti,
	Juri Lelli, Eduardo Valentin, Srinivas Pandruvada, currojerez,
	Javi Merino

On 08/06/2018 10:40 AM, Vincent Guittot wrote:
> On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
>>
>> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
>>> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
>>>>
>>>> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
>>>>> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:

[...]

>> I think we're discussing two different things right now:
>>      1. Should forkees go in find_energy_efficient_cpu() ?
>>      2. Should forkees have 0 of initial util_avg when EAS is enabled ?
> 
> It's the same topic: How EAS should consider a newly created task ?
> 
> For now, we let the "performance" mode selects a CPU. This CPU will
> most probably be worst CPU from a EAS pov because it's the idlest CPU
> in the idlest group which is the opposite of what EAS tries to do
> 
> The current behavior is :
> For every new task, the cpu selection is done assuming it's a heavy
> task with the max possible load_avg,  and it looks for the idlest cpu.
> This means that if the system is lightly loaded, scheduler will select
> most probably a idle big core.

AFAICS, task load doesn't seem to be used in find_idlest_cpu()
(find_idlest_group() and find_idlest_group_cpu()). So the forkee
(SD_BALANCE_FORK) is placed independently of its task load.
Task load (task_h_load(p)) is used in
wake_affine()->wake_affine_weight(), but for this to be called it has to
be a wakeup (SD_BALANCE_WAKE).

> The utilization of this new task is then set to half of the remaining
> capacity of the selected CPU which means that the idlest you are, the
> biggest the task will be initialized to. This can easily be half a big
> core which can be bigger than the max capacity of a little like on
> hikey960. Then, util_est  will keep track of this value for a while
> which will make this task like a big one.
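
The initialization rule quoted above can be sketched as follows. This is a
simplified, standalone model and not the kernel's code: the function name,
its parameters and the LITTLE capacity value are assumptions made for the
example. It only shows why a forkee landing on an idle big CPU can start
with a utilization above the max capacity of a little CPU.

```c
/* Assumed capacity values, for illustration only. */
#define BIG_CAPACITY    1024L
#define LITTLE_CAPACITY  462L

/*
 * Simplified model of the initial utilization of a new task: it is
 * capped to half of the spare capacity of the CPU the task lands on.
 * On an idle CPU (cpu_util == 0) the task gets exactly that cap.
 * 'proportional_share' stands in for the share the task would get on
 * a CPU that is already busy (hypothetical parameter).
 */
static long forkee_initial_util(long cpu_capacity, long cpu_util,
				long proportional_share)
{
	long cap = (cpu_capacity - cpu_util) / 2;

	if (cap <= 0)
		return 0;
	if (cpu_util == 0)
		return cap;
	return proportional_share < cap ? proportional_share : cap;
}
```

With these assumed numbers, a forkee placed on an idle big CPU starts at
512, which is above the assumed LITTLE_CAPACITY of 462.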

[...]


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-06 10:08                                         ` Dietmar Eggemann
@ 2018-08-06 10:33                                           ` Vincent Guittot
  2018-08-06 12:29                                             ` Dietmar Eggemann
  0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-06 10:33 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Quentin Perret, Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Morten Rasmussen,
	Chris Redpath, Patrick Bellasi, Valentin Schneider,
	Thara Gopinath, viresh kumar, Todd Kjos, Joel Fernandes,
	Cc: Steve Muckle, adharmap, Kannan, Saravana, pkondeti,
	Juri Lelli, Eduardo Valentin, Srinivas Pandruvada, currojerez,
	Javi Merino

On Mon, 6 Aug 2018 at 12:08, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/06/2018 10:40 AM, Vincent Guittot wrote:
> > On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
> >>
> >> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
> >>> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> >>>>
> >>>> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> >>>>> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
>
> [...]
>
> >> I think we're discussing two different things right now:
> >>      1. Should forkees go in find_energy_efficient_cpu() ?
> >>      2. Should forkees have 0 of initial util_avg when EAS is enabled ?
> >
> > It's the same topic: How EAS should consider a newly created task ?
> >
> > For now, we let the "performance" mode selects a CPU. This CPU will
> > most probably be worst CPU from a EAS pov because it's the idlest CPU
> > in the idlest group which is the opposite of what EAS tries to do
> >
> > The current behavior is :
> > For every new task, the cpu selection is done assuming it's a heavy
> > task with the max possible load_avg,  and it looks for the idlest cpu.
> > This means that if the system is lightly loaded, scheduler will select
> > most probably a idle big core.
>
> AFAICS, task load doesn't seem to be used for find_idlest_cpu() (
> find_idlest_group() and find_idlest_group_cpu()). So the forkee
> (SD_BALANCE_FORK) is placed independently of his task load.

hmm ... so what is used if load or runnable load are not used?
find_idlest_group() uses load and runnable load but skips spare
capacity in case of fork

> Task load (task_h_load(p)) is used in
> wake_affine()->wake_affine_weight() but for this to be called it has to
> be a wakeup (SD_BALANCE_WAKE).
>
> > The utilization of this new task is then set to half of the remaining
> > capacity of the selected CPU which means that the idlest you are, the
> > biggest the task will be initialized to. This can easily be half a big
> > core which can be bigger than the max capacity of a little like on
> > hikey960. Then, util_est  will keep track of this value for a while
> > which will make this task like a big one.
>
> [...]
>

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-06  9:43                                         ` Quentin Perret
@ 2018-08-06 10:45                                           ` Vincent Guittot
  2018-08-06 11:02                                             ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-06 10:45 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Mon, 6 Aug 2018 at 11:43, Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Monday 06 Aug 2018 at 10:40:46 (+0200), Vincent Guittot wrote:
> > On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
> > For every new task, the cpu selection is done assuming it's a heavy
> > task with the max possible load_avg,  and it looks for the idlest cpu.
> > This means that if the system is lightly loaded, scheduler will select
> > most probably a idle big core.
>
> Agreed, that is what should happen if the system is lightly loaded.
> However, I'm still not totally convinced this is wrong. It's
> definitely not _always_ wrong, at least. Just like starting new tasks
> on little CPUs isn't always wrong either.

As explained before, IMHO, this is not wrong if you look for
performance, but it is wrong if you look for power saving.

>
> > selecting big or Little is not the problem here. The problem is that
> > we don't use Energy Model so we will most probably do the wrong
> > choice. Nevertheless,  putting a task on big is clearly the wrong
> > choice  in the case I mentioned above " shell script on hikey960".
>
> _You_ can say it's wrong because _you_ know the task composition. The
> scheduler has no way to tell. You could come up with a script that
> spawns heavy tasks every once in a while, and in this case putting
> those on big cores would be beneficial ...
>
> > Having something in the middle like taking into account load and/org
> > utilization of the parent in order to mitigate big task starting with
> > small utilization and small task starting with big utilization.
> > It's probably not perfect because big tasks can create small ones and
> > the opposite but if there are already big tasks, assuming that the new
> > one is also a big one should have less power impact as we are already
> > consuming power for the current bigs. At the opposite, if little are
> > running, assuming that new task is little will not harm the power
> > consumption unnecessarily.
>
> Right, we can definitely come up with something more conservative than
> what I'm currently proposing. I had a quick chat with Morten about this
> the other day and one suggestion he had was to pick the CPU with the max
> spare cap in the frequency domain in which the parent task is running ...
>
> In any case, I really feel like there isn't an obvious right decision
> here, so I'd prefer to keep things simple for now. This patch-set is a
> first step, and fine-grained tuning for new tasks is probably something
> that can be done later, if need be. What do you think ?

I would have preferred to have a full power policy for all tasks when
EAS is in use by default, and then see if there is any performance
problem, instead of leaving some use cases unclear, but that's a
personal opinion.
So IMO, the minimum is to add a comment in the code that describes
this behavior for forked tasks, so that people looking at the code will
understand why EAS puts newly created tasks on CPUs that are not
"EAS friendly".

>
> Thanks,
> Quentin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-06 10:45                                           ` Vincent Guittot
@ 2018-08-06 11:02                                             ` Quentin Perret
  0 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-08-06 11:02 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Monday 06 Aug 2018 at 12:45:44 (+0200), Vincent Guittot wrote:
> On Mon, 6 Aug 2018 at 11:43, Quentin Perret <quentin.perret@arm.com> wrote:
> I would have preferred to have a full power policy for all task when
> EAS is in used by default and then see if there is any performance
> problem instead of letting some UC unclear but that's a personal
> opinion.

Understood. I'd say let's keep things simple for now unless there is a
consensus that this is a must-have from the start.

> so IMO, the minimum is to add a comment in the code that describes
> this behavior for fork tasks so people will understand why EAS puts
> newly created task on not "EAS friendly" cpus when they will look at
> the code trying to understand the behavior

Agreed, that really needs to be documented. I'll add a comment somewhere
in v6.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-06 10:33                                           ` Vincent Guittot
@ 2018-08-06 12:29                                             ` Dietmar Eggemann
  2018-08-06 12:37                                               ` Vincent Guittot
  0 siblings, 1 reply; 72+ messages in thread
From: Dietmar Eggemann @ 2018-08-06 12:29 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Quentin Perret, Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Morten Rasmussen,
	Chris Redpath, Patrick Bellasi, Valentin Schneider,
	Thara Gopinath, viresh kumar, Todd Kjos, Joel Fernandes,
	Cc: Steve Muckle, adharmap, Kannan, Saravana, pkondeti,
	Juri Lelli, Eduardo Valentin, Srinivas Pandruvada, currojerez,
	Javi Merino

On 08/06/2018 12:33 PM, Vincent Guittot wrote:
> On Mon, 6 Aug 2018 at 12:08, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 08/06/2018 10:40 AM, Vincent Guittot wrote:
>>> On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
>>>>
>>>> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
>>>>> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
>>>>>>
>>>>>> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
>>>>>>> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
>>
>> [...]
>>
>>>> I think we're discussing two different things right now:
>>>>       1. Should forkees go in find_energy_efficient_cpu() ?
>>>>       2. Should forkees have 0 of initial util_avg when EAS is enabled ?
>>>
>>> It's the same topic: How EAS should consider a newly created task ?
>>>
>>> For now, we let the "performance" mode selects a CPU. This CPU will
>>> most probably be worst CPU from a EAS pov because it's the idlest CPU
>>> in the idlest group which is the opposite of what EAS tries to do
>>>
>>> The current behavior is :
>>> For every new task, the cpu selection is done assuming it's a heavy
>>> task with the max possible load_avg,  and it looks for the idlest cpu.
>>> This means that if the system is lightly loaded, scheduler will select
>>> most probably a idle big core.
>>
>> AFAICS, task load doesn't seem to be used for find_idlest_cpu() (
>> find_idlest_group() and find_idlest_group_cpu()). So the forkee
>> (SD_BALANCE_FORK) is placed independently of his task load.
> 
> hmm ... so what is used if load or runnable load are not used ?
> find_idlest_group() uses load and runnable load but skip spare
> capacity in case of fork

Yes, runnable load and load are used, but from the cpus, not from the task.

[...]

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-06 12:29                                             ` Dietmar Eggemann
@ 2018-08-06 12:37                                               ` Vincent Guittot
  2018-08-06 13:20                                                 ` Dietmar Eggemann
  0 siblings, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-06 12:37 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: Quentin Perret, Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Morten Rasmussen,
	Chris Redpath, Patrick Bellasi, Valentin Schneider,
	Thara Gopinath, viresh kumar, Todd Kjos, Joel Fernandes,
	Cc: Steve Muckle, adharmap, Kannan, Saravana, pkondeti,
	Juri Lelli, Eduardo Valentin, Srinivas Pandruvada, currojerez,
	Javi Merino

On Mon, 6 Aug 2018 at 14:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>
> On 08/06/2018 12:33 PM, Vincent Guittot wrote:
> > On Mon, 6 Aug 2018 at 12:08, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
> >>
> >> On 08/06/2018 10:40 AM, Vincent Guittot wrote:
> >>> On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
> >>>>
> >>>> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
> >>>>> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
> >>>>>>
> >>>>>> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
> >>>>>>> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
> >>
> >> [...]
> >>
> >>>> I think we're discussing two different things right now:
> >>>>       1. Should forkees go in find_energy_efficient_cpu() ?
> >>>>       2. Should forkees have 0 of initial util_avg when EAS is enabled ?
> >>>
> >>> It's the same topic: How EAS should consider a newly created task ?
> >>>
> >>> For now, we let the "performance" mode selects a CPU. This CPU will
> >>> most probably be worst CPU from a EAS pov because it's the idlest CPU
> >>> in the idlest group which is the opposite of what EAS tries to do
> >>>
> >>> The current behavior is :
> >>> For every new task, the cpu selection is done assuming it's a heavy
> >>> task with the max possible load_avg,  and it looks for the idlest cpu.
> >>> This means that if the system is lightly loaded, scheduler will select
> >>> most probably a idle big core.
> >>
> >> AFAICS, task load doesn't seem to be used for find_idlest_cpu() (
> >> find_idlest_group() and find_idlest_group_cpu()). So the forkee
> >> (SD_BALANCE_FORK) is placed independently of his task load.
> >
> > hmm ... so what is used if load or runnable load are not used ?
> > find_idlest_group() uses load and runnable load but skip spare
> > capacity in case of fork
>
> Yes, runnable load and load are used, but from the cpus, not from the task.

Yes, that's right, I skipped the word "task" when reading.
So the scheduler looks for the idlest CPU taking into account only CPU
loads. Then the task load starts at the highest value until it gets a
chance to decrease and stabilize at its final value.
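
To make the point concrete, here is a much simplified sketch of a selection
that depends on CPU loads only. It is not find_idlest_group_cpu(); the
struct and function names are hypothetical, and the real code looks at more
than a single load value. What matters is that the load of the new task is
not an input.

```c
#include <stddef.h>

/* Hypothetical per-CPU snapshot; the field name is illustrative. */
struct cpu_stat {
	unsigned long load;	/* load of the CPU's runqueue */
};

/*
 * Return the index of the least loaded CPU, or -1 if there are none.
 * The load of the task being placed is deliberately not a parameter.
 */
static int pick_idlest_cpu(const struct cpu_stat *cpus, size_t nr)
{
	unsigned long min_load = 0;
	int best = -1;
	size_t i;

	for (i = 0; i < nr; i++) {
		if (best < 0 || cpus[i].load < min_load) {
			min_load = cpus[i].load;
			best = (int)i;
		}
	}
	return best;
}
```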

>
> [...]

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-06 12:37                                               ` Vincent Guittot
@ 2018-08-06 13:20                                                 ` Dietmar Eggemann
  0 siblings, 0 replies; 72+ messages in thread
From: Dietmar Eggemann @ 2018-08-06 13:20 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Quentin Perret, Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Morten Rasmussen,
	Chris Redpath, Patrick Bellasi, Valentin Schneider,
	Thara Gopinath, viresh kumar, Todd Kjos, Joel Fernandes,
	Cc: Steve Muckle, adharmap, Kannan, Saravana, pkondeti,
	Juri Lelli, Eduardo Valentin, Srinivas Pandruvada, currojerez,
	Javi Merino

On 08/06/2018 02:37 PM, Vincent Guittot wrote:
> On Mon, 6 Aug 2018 at 14:29, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>
>> On 08/06/2018 12:33 PM, Vincent Guittot wrote:
>>> On Mon, 6 Aug 2018 at 12:08, Dietmar Eggemann <dietmar.eggemann@arm.com> wrote:
>>>>
>>>> On 08/06/2018 10:40 AM, Vincent Guittot wrote:
>>>>> On Fri, 3 Aug 2018 at 17:55, Quentin Perret <quentin.perret@arm.com> wrote:
>>>>>>
>>>>>> On Friday 03 Aug 2018 at 15:49:24 (+0200), Vincent Guittot wrote:
>>>>>>> On Fri, 3 Aug 2018 at 10:18, Quentin Perret <quentin.perret@arm.com> wrote:
>>>>>>>>
>>>>>>>> On Friday 03 Aug 2018 at 09:48:47 (+0200), Vincent Guittot wrote:
>>>>>>>>> On Thu, 2 Aug 2018 at 18:59, Quentin Perret <quentin.perret@arm.com> wrote:
>>>>
>>>> [...]
>>>>
>>>>>> I think we're discussing two different things right now:
>>>>>>        1. Should forkees go in find_energy_efficient_cpu() ?
>>>>>>        2. Should forkees have 0 of initial util_avg when EAS is enabled ?
>>>>>
>>>>> It's the same topic: How EAS should consider a newly created task ?
>>>>>
>>>>> For now, we let the "performance" mode selects a CPU. This CPU will
>>>>> most probably be worst CPU from a EAS pov because it's the idlest CPU
>>>>> in the idlest group which is the opposite of what EAS tries to do
>>>>>
>>>>> The current behavior is :
>>>>> For every new task, the cpu selection is done assuming it's a heavy
>>>>> task with the max possible load_avg,  and it looks for the idlest cpu.
>>>>> This means that if the system is lightly loaded, scheduler will select
>>>>> most probably a idle big core.
>>>>
>>>> AFAICS, task load doesn't seem to be used for find_idlest_cpu() (
>>>> find_idlest_group() and find_idlest_group_cpu()). So the forkee
>>>> (SD_BALANCE_FORK) is placed independently of his task load.
>>>
>>> hmm ... so what is used if load or runnable load are not used ?
>>> find_idlest_group() uses load and runnable load but skip spare
>>> capacity in case of fork
>>
>> Yes, runnable load and load are used, but from the cpus, not from the task.
> 
> yes that's right, I have skipped the "task" word when reading.
> So scheduler looks for the idlest CPU taking into account only CPU
> loads. Then the task load starts to highest value until it get a
> chance to reduce and stabilize to its final value

This could potentially allow us to find a better init value for 
sa->[runnable]_load_avg. At least we could use the information of the 
initial task rq.

> 
>>
>> [...]


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-07-24 12:25 ` [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator Quentin Perret
  2018-08-02 12:26   ` Peter Zijlstra
@ 2018-08-09  9:30   ` Vincent Guittot
  2018-08-09  9:38     ` Quentin Perret
  1 sibling, 1 reply; 72+ messages in thread
From: Vincent Guittot @ 2018-08-09  9:30 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Tue, 24 Jul 2018 at 14:26, Quentin Perret <quentin.perret@arm.com> wrote:
>
> From: Morten Rasmussen <morten.rasmussen@arm.com>
>
> Energy-aware scheduling is only meant to be active while the system is
> _not_ over-utilized. That is, there are spare cycles available to shift
> tasks around based on their actual utilization to get a more
> energy-efficient task distribution without depriving any tasks. When
> above the tipping point task placement is done the traditional way based
> on load_avg, spreading the tasks across as many cpus as possible based
> on priority scaled load to preserve smp_nice. Below the tipping point we
> want to use util_avg instead. We need to define a criteria for when we
> make the switch.
>
> The util_avg for each cpu converges towards 100% (1024) regardless of

remove the "(1024)" because util_avg converges to max cpu capacity
which can be different from 1024

> how many task additional task we may put on it. If we define
> over-utilized as:
>
> sum_{cpus}(rq.cfs.avg.util_avg) + margin > sum_{cpus}(rq.capacity)
>
> some individual cpus may be over-utilized running multiple tasks even
> when the above condition is false. That should be okay as long as we try
> to spread the tasks out to avoid per-cpu over-utilization as much as
> possible and if all tasks have the _same_ priority. If the latter isn't
> true, we have to consider priority to preserve smp_nice.
>
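
For reference, the system-wide criterion quoted above can be written as the
small standalone sketch below. The struct, the function names and the margin
value are assumptions for the example, not the code of the patch. Capacity
is kept per CPU since it is not necessarily 1024.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-CPU view; names are illustrative. */
struct cpu_state {
	unsigned long util_avg;
	unsigned long capacity;	/* max capacity of this CPU */
};

/* sum_{cpus}(util_avg) + margin > sum_{cpus}(capacity) */
static bool system_overutilized(const struct cpu_state *cpus, size_t nr,
				unsigned long margin)
{
	unsigned long sum_util = 0, sum_cap = 0;
	size_t i;

	for (i = 0; i < nr; i++) {
		sum_util += cpus[i].util_avg;
		sum_cap += cpus[i].capacity;
	}
	return sum_util + margin > sum_cap;
}

/* A single CPU whose utilization has reached its own capacity. */
static bool cpu_at_capacity(const struct cpu_state *cpu)
{
	return cpu->util_avg >= cpu->capacity;
}
```

With one CPU at util_avg == capacity and the others idle, cpu_at_capacity()
is true for that CPU while system_overutilized() can still be false, which
is the situation the commit message describes.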

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator
  2018-08-09  9:30   ` Vincent Guittot
@ 2018-08-09  9:38     ` Quentin Perret
  0 siblings, 0 replies; 72+ messages in thread
From: Quentin Perret @ 2018-08-09  9:38 UTC (permalink / raw)
  To: Vincent Guittot
  Cc: Peter Zijlstra, Rafael J. Wysocki, linux-kernel,
	open list:THERMAL, gregkh, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Thara Gopinath, viresh kumar, Todd Kjos,
	Joel Fernandes, Cc: Steve Muckle, adharmap, Kannan, Saravana,
	pkondeti, Juri Lelli, Eduardo Valentin, Srinivas Pandruvada,
	currojerez, Javi Merino

On Thursday 09 Aug 2018 at 11:30:57 (+0200), Vincent Guittot wrote:
> On Tue, 24 Jul 2018 at 14:26, Quentin Perret <quentin.perret@arm.com> wrote:
> >
> > From: Morten Rasmussen <morten.rasmussen@arm.com>
> >
> > Energy-aware scheduling is only meant to be active while the system is
> > _not_ over-utilized. That is, there are spare cycles available to shift
> > tasks around based on their actual utilization to get a more
> > energy-efficient task distribution without depriving any tasks. When
> > above the tipping point task placement is done the traditional way based
> > on load_avg, spreading the tasks across as many cpus as possible based
> > on priority scaled load to preserve smp_nice. Below the tipping point we
> > want to use util_avg instead. We need to define a criteria for when we
> > make the switch.
> >
> > The util_avg for each cpu converges towards 100% (1024) regardless of
> 
> remove the "(1024)" because util_avg converges to max cpu capacity
> which can be different from 1024

Good point, will be fixed in v6.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 03/14] PM: Introduce an Energy Model management framework
  2018-07-24 12:25 ` [PATCH v5 03/14] PM: Introduce an Energy Model management framework Quentin Perret
@ 2018-08-09 21:52   ` Rafael J. Wysocki
  2018-08-10  8:15     ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Rafael J. Wysocki @ 2018-08-09 21:52 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Tuesday, July 24, 2018 2:25:10 PM CEST Quentin Perret wrote:
> Several subsystems in the kernel (task scheduler and/or thermal at the
> time of writing) can benefit from knowing about the energy consumed by
> CPUs. Yet, this information can come from different sources (DT or
> firmware for example), in different formats, hence making it hard to
> exploit without a standard API.
> 
> As an attempt to address this, introduce a centralized Energy Model
> (EM) management framework which aggregates the power values provided
> by drivers into a table for each frequency domain in the system. The
> power cost tables are made available to interested clients (e.g. task
> scheduler or thermal) via platform-agnostic APIs. The overall design
> is represented by the diagram below (focused on Arm-related drivers as
> an example, but applicable to any architecture):
> 
>      +---------------+  +-----------------+  +-------------+
>      | Thermal (IPA) |  | Scheduler (EAS) |  |    Other    |
>      +---------------+  +-----------------+  +-------------+
>              |                 | em_fd_energy()     |
>              |                 | em_cpu_get()       |
>              +-----------+     |         +----------+
>                          |     |         |
>                          v     v         v
>                       +---------------------+
>                       |                     |
>                       |    Energy Model     |
>                       |                     |
>                       |     Framework       |
>                       |                     |
>                       +---------------------+
>                          ^       ^       ^
>                          |       |       | em_register_freq_domain()
>               +----------+       |       +---------+
>               |                  |                 |
>       +---------------+  +---------------+  +--------------+
>       |  cpufreq-dt   |  |   arm_scmi    |  |    Other     |
>       +---------------+  +---------------+  +--------------+
>               ^                  ^                 ^
>               |                  |                 |
>       +--------------+   +---------------+  +--------------+
>       | Device Tree  |   |   Firmware    |  |      ?       |
>       +--------------+   +---------------+  +--------------+
> 
> Drivers (typically, but not limited to, CPUFreq drivers) can register
> data in the EM framework using the em_register_freq_domain() API. The
> calling driver must provide a callback function with a standardized
> signature that will be used by the EM framework to build the power
> cost tables of the frequency domain. This design should offer a lot of
> flexibility to calling drivers which are free of reading information
> from any location and to use any technique to compute power costs.
> Moreover, the capacity states registered by drivers in the EM framework
> are not required to match real performance states of the target. This
> is particularly important on targets where the performance states are
> not known by the OS.
> 
> On the client side, the EM framework offers APIs to access the power
> cost tables of a CPU (em_cpu_get()), and to estimate the energy
> consumed by the CPUs of a frequency domain (em_fd_energy()). Clients
> such as the task scheduler can then use these APIs to access the shared
> data structures holding the Energy Model of CPUs.
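
As a rough illustration of the callback-based registration described in the
quoted text, here is a standalone model in plain C. Every name, type and
number in it is hypothetical; it does not reproduce the signature of
em_register_freq_domain() or of its callback. It only shows the idea that
the framework builds the table by calling into the driver, so the driver is
free to get its numbers from anywhere.

```c
#include <stddef.h>

#define MAX_STATES 8

/* Hypothetical capacity state: one row of a power cost table. */
struct cap_state {
	unsigned long frequency;	/* kHz */
	unsigned long power;		/* mW */
};

struct freq_domain_table {
	size_t nr_states;
	struct cap_state states[MAX_STATES];
};

/* Driver-provided callback: fill in state 'idx', return 0 on success. */
typedef int (*power_cb_t)(size_t idx, struct cap_state *out);

/* Framework side: build the table by querying the driver callback. */
static int build_table(struct freq_domain_table *t, size_t nr_states,
		       power_cb_t cb)
{
	size_t i;

	if (!t || !cb || nr_states == 0 || nr_states > MAX_STATES)
		return -1;

	for (i = 0; i < nr_states; i++) {
		if (cb(i, &t->states[i]))
			return -1;
		/* Require strictly increasing frequencies. */
		if (i > 0 &&
		    t->states[i].frequency <= t->states[i - 1].frequency)
			return -1;
	}
	t->nr_states = nr_states;
	return 0;
}

/* Example driver callback with made-up numbers. */
static int example_driver_cb(size_t idx, struct cap_state *out)
{
	static const struct cap_state data[] = {
		{  500000, 100 },
		{ 1000000, 300 },
		{ 1500000, 700 },
	};

	if (idx >= sizeof(data) / sizeof(data[0]))
		return -1;
	*out = data[idx];
	return 0;
}
```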

I'm a bit concerned that the code here appears to be designed around the
frequency domains concept which seems to be a limitation and which probably
is related to the properties of the current generation of hardware.

Assumptions like that tend to get tangled into the code tightly over time
and they may be hard to untangle from it when new use cases arise later.

For example, there probably will be more firmware involvement in future
systems and the firmware may not be willing to expose "raw" frequency
domains to the OS.  That already is the case with P-states on Intel HW and
with ACPI CPPC in general.

IMO, frequency domains in your current code could be replaced with something
more general, like "performance domains" providing the scheduler with the
(relative) cost of running a task on a busy (non-idle) CPU (and, analogously,
"idle domains" that would provide the scheduler with the - relative - cost
of waking up an idle CPU to run a task on it or, the other way around, the
possible relative gain from taking all tasks away from a CPU in order to make
it go idle).

Also bear in mind that the CPUs the scheduler deals with are logical ones,
so they may be like hardware threads within a single core, for example.

Thanks,
Rafael


^ permalink raw reply	[flat|nested] 72+ messages in thread

* Re: [PATCH v5 03/14] PM: Introduce an Energy Model management framework
  2018-08-09 21:52   ` Rafael J. Wysocki
@ 2018-08-10  8:15     ` Quentin Perret
  2018-08-10  8:41       ` Rafael J. Wysocki
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-10  8:15 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: peterz, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

Hi Rafael,

On Thursday 09 Aug 2018 at 23:52:29 (+0200), Rafael J. Wysocki wrote:
> I'm a bit concerned that the code here appears to be designed around the
> frequency domains concept which seems to be a limitation and which probably
> is related to the properties of the current generation of hardware.

That's correct. I went for 'frequency domains' only because this is what
EAS and IPA are interested in, as of today at least. And both of them
are somewhat dependent on CPU-Freq, which is called CPU-*Freq*, not
CPU-Perf after all :-)

> Assumptions like that tend to get tangled into the code tightly over time
> and they may be hard to untangle from it when new use cases arise later.
> 
> For example, there probably will be more firmware involvement in future
> systems and the firmware may not be willing to expose "raw" frequency
> domains to the OS.  That already is the case with P-states on Intel HW and
> with ACPI CPPC in general.

Agreed, and embedded/mobile systems are going in that direction too ...

> IMO, frequency domains in your current code could be replaced with something
> more general, like "performance domains"

I don't mind using a more abstract name as long as we keep the same
assumptions, and especially that all CPUs in a perf. domain *must* have
the same micro-architecture. From that assumption result several
properties that EAS (in its current form) needs. The first one is that
all CPUs of a performance domain have the same capacity at any possible
performance state. And the second is that they all consume the same
amount of (active) power.
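
To show why these two properties matter, here is a minimal sketch of an
energy estimate for one domain. It is not em_fd_energy() from this series;
the struct, the function name and the formula are simplifying assumptions.
Because all CPUs of the domain share one capacity/power table, one
performance state can be chosen for the whole domain from the most utilized
CPU, and one power value applies to the summed utilization.

```c
#include <stddef.h>

/* Hypothetical performance state shared by all CPUs of a domain. */
struct perf_state {
	unsigned long capacity;	/* capacity of a CPU at this state */
	unsigned long power;	/* active power of a CPU at this state */
};

/*
 * Estimate the energy of a domain, in abstract units:
 *  - pick the lowest state whose capacity fits the busiest CPU
 *    (or the highest state if none fits),
 *  - scale the power of that state by sum_util / capacity.
 * 'ps' must be sorted by increasing capacity.
 */
static unsigned long domain_energy(const struct perf_state *ps, size_t nr,
				   unsigned long max_util,
				   unsigned long sum_util)
{
	size_t i;

	if (nr == 0)
		return 0;

	for (i = 0; i < nr - 1; i++) {
		if (ps[i].capacity >= max_util)
			break;
	}
	if (ps[i].capacity == 0)
		return 0;

	return ps[i].power * sum_util / ps[i].capacity;
}
```

If CPUs of different types shared a domain, neither the state selection nor
the single power value would hold for all of them.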

I know it is theoretically possible to mix CPU types in the same perf
domain, but that becomes nightmare-ish to manage in EAS, and I don't
think there is a single platform like this on the market today. And I
hope nobody will build one. Peter wanted to mandate that too, I think.

> providing the scheduler with the (relative) cost of running a task

What do you mean by relative ? That we should normalize the power costs ?
Or just use an abstract scale, without specifying the unit ?

The main reason I'm a bit reluctant to do that just now is because IPA
needs to compare the power of CPUs with the power of other components
(GPUs, for example). And the power of those other components is specified
with a specific unit too. So, not imposing a comparable unit for the
power of CPUs will result in an unspecified behaviour in IPA, and that
will break things for sure. I would very much like to avoid that, of
course.

What I am currently proposing is to keep the unit (mW) in the EM
framework so that migrating IPA to using it can be done in a (relatively)
painless way. On a system where drivers don't know the exact wattage,
they should just 'lie' to the EM framework, but it's their job to
lie coherently to all subsystems and keep things consistent, because all
subsystems have specified power in comparable units.

Another solution to solve this problem could be to extend the EM
framework introduced by this patch and make it manage the EM of any
device, not just CPUs. Then we could just specify that all power costs
must be in the same scale, regardless of the actual unit, and register
the EM of CPUs, GPUs, ...
However, I was hoping that this patch as-is was enough for a first step,
and that this extension of the framework could be done in a second step ?
Thoughts ?

In any case, if we decide to keep the mW unit for now, I should at least
explain clearly why in the commit message.

> on a busy (non-idle) CPU (and, analogously,
> "idle domains" that would provide the scheduler with the - relative - cost
> of waking up an idle CPU to run a task on it or, the other way around, the
> possible relative gain from taking all tasks away from a CPU in order to make
> it go idle).

+1 for idle costs as a second type of 'domains' which could be managed
by the EM framework, alongside the 'perf' domains. I don't think we have
users of that just now (or providers of idle costs ?) so maybe that is
for later too ?

What do you think ?

Thank you very much for the feedback,
Quentin

* Re: [PATCH v5 03/14] PM: Introduce an Energy Model management framework
  2018-08-10  8:15     ` Quentin Perret
@ 2018-08-10  8:41       ` Rafael J. Wysocki
  2018-08-10  9:12         ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Rafael J. Wysocki @ 2018-08-10  8:41 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

Hi Quentin,

On Friday, August 10, 2018 10:15:39 AM CEST Quentin Perret wrote:
> Hi Rafael,
> 
> On Thursday 09 Aug 2018 at 23:52:29 (+0200), Rafael J. Wysocki wrote:
> > I'm a bit concerned that the code here appears to be designed around the
> > frequency domains concept which seems to be a limitation and which probably
> > is related to the properties of the current generation of hardware.
> 
> That's correct. I went for 'frequency domains' only because this is what
> EAS and IPA are interested in, as of today at least. And both of them
> are somewhat dependent on CPU-Freq, which is called CPU-*Freq*, not
> CPU-Perf after all :-)

Still, cpufreq manages CPU performance scaling really.

A cpufreq policy may represent a frequency domain or generally a group of
related CPUs and what matters is that there is a coordination between them
and not how that coordination happens at the hardware/firmware level etc.

> > Assumptions like that tend to get tangled into the code tightly over time
> > and they may be hard to untangle from it when new use cases arise later.
> > 
> > For example, there probably will be more firmware involvement in future
> > systems and the firmware may not be willing to expose "raw" frequency
> > domains to the OS.  That already is the case with P-states on Intel HW and
> > with ACPI CPPC in general.
> 
> Agreed, and embedded/mobile systems are going in that direction too ...
> 
> > IMO, frequency domains in your current code could be replaced with something
> > more general, like "performance domains"
> 
> I don't mind using a more abstract name as long as we keep the same
> assumptions, and especially that all CPUs in a perf. domain *must* have
> the same micro-architecture.

That's fair enough I think.

> From that assumption result several
> properties that EAS (in its current form) needs. The first one is that
> all CPUs of a performance domain have the same capacity at any possible
> performance state. And the second is that they all consume the same
> amount of (active) power.
> 
> I know it is theoretically possible to mix CPU types in the same perf
> domain, but that becomes nightmare-ish to manage in EAS, and I don't
> think there is a single platform like this on the market today. And I
> hope nobody will build one. Peter wanted to mandate that too, I think.

There are departures, say, at least as far as the capacity is concerned.

The uarch is the same for all of them, but the max capacity may vary
between them.

> > providing the scheduler with the (relative) cost of running a task
> 
> What do you mean by relative ? That we should normalize the power costs ?
> Or just use an abstract scale, without specifying the unit ?

I mean "relative with respect to the other choices"; not absolute.

> The main reason I'm a bit reluctant to do that just now is because IPA
> needs to compare the power of CPUs with the power of other components
> (GPUs, for example). And the power of those other components is specified
> with a specific unit too. So, not imposing a comparable unit for the
> power of CPUs will result in an unspecified behaviour in IPA, and that
> will break things for sure. I would very much like to avoid that, of
> course.

The absolute power numbers are generally hard to get though.

In the majority of cases you can figure out what the extra cost of X with
respect to (alternative) Y is (in certain units), but you may not be able
to say what X and Y are equal to in absolute terms (for example, there
may be an unknown component in both X and Y that you cannot measure, but
it may not be relevant for the outcome of the computation).
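
To make that concrete, here is a minimal sketch in plain C (not kernel
code). The helper name cheapest_option() and its parameters are made up
for this example. It assumes every option carries the same unmeasurable
component on top of its known cost:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical helper: return the index of the cheapest option when
 * every option carries the same unmeasurable component 'common' on top
 * of its known cost. Returns -1 if there are no options.
 */
static int cheapest_option(const long *known, size_t nr, long common)
{
	long best_cost = 0;
	int best = -1;
	size_t i;

	assert(known || !nr);

	for (i = 0; i < nr; i++) {
		long total = known[i] + common;

		/* 'common' is added to both sides of this comparison. */
		if (best < 0 || total < best_cost) {
			best = (int)i;
			best_cost = total;
		}
	}

	return best;
}
```

Whatever value is passed for 'common', the selected index is the same,
because the term cancels out of every comparison.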

> What I am currently proposing is to keep the unit (mW) in the EM
> framework so that migrating IPA to using it can be done in a (relatively)
> painless way. On a system where drivers don't know the exact wattage,
> then they should just 'lie' to the EM framework, but it's their job to
> lie coherently to all subsystems and keep things consistent, because all
> subsystems have specified power in comparable units.

Alternatively, there could be a translation layer between EM and IPA.

From my experience, if you want people to come up with some numbers,
they will just choose them to game the system this way or another
unless those numbers can be measured directly or are clearly documented.

And if that happens and then you want to make any significant changes,
you'll need to deal with "regressions" occurring because someone chose
the numbers to make the system behave in a specific way and your changes
break that.

As a rule, I'd rather avoid requesting unknown numbers from people. :-)

> Another solution to solve this problem could be to extend the EM
> framework introduced by this patch and make it manage the EM of any
> device, not just CPUs. Then we could just specify that all power costs
> must be in the same scale, regardless of the actual unit, and register
> the EM of CPUs, GPUs, ...
> However, I was hoping that this patch as-is was enough for a first step,
> and that this extension of the framework could be done in a second step ?
> Thoughts ?
> 
> In any case, if we decide to keep the mW unit for now, I should at least
> explain clearly why in the commit message.

Right.

Actually, the unit is as good as any other, but you need to bear in mind that
the numbers provided may not be realistic.

> > on a busy (non-idle) CPU (and, analogously,
> > "idle domains" that would provide the scheduler with the - relative - cost
> > of waking up an idle CPU to run a task on it or, the other way around, the
> > possible relative gain from taking all tasks away from a CPU in order to make
> > it go idle).
> 
> +1 for idle costs as a second type of 'domains' which could be managed
> by the EM framework, alongside the 'perf' domains. I don't think we have
> users of that just now (or providers of idle costs ?) so maybe that is
> for later too ?

Yes, this is for later IMO.

Thanks,
Rafael


* Re: [PATCH v5 03/14] PM: Introduce an Energy Model management framework
  2018-08-10  8:41       ` Rafael J. Wysocki
@ 2018-08-10  9:12         ` Quentin Perret
  2018-08-10 11:13           ` Rafael J. Wysocki
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-10  9:12 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: peterz, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Friday 10 Aug 2018 at 10:41:56 (+0200), Rafael J. Wysocki wrote:
> On Friday, August 10, 2018 10:15:39 AM CEST Quentin Perret wrote:
> > On Thursday 09 Aug 2018 at 23:52:29 (+0200), Rafael J. Wysocki wrote:
> > > I'm a bit concerned that the code here appears to be designed around the
> > > frequency domains concept which seems to be a limitation and which probably
> > > is related to the properties of the current generation of hardware.
> > 
> > That's correct. I went for 'frequency domains' only because this is what
> > EAS and IPA are interested in, as of today at least. And both of them
> > are somewhat dependent on CPU-Freq, which is called CPU-*Freq*, not
> > CPU-Perf after all :-)
> 
> Still, cpufreq manages CPU performance scaling really.
> 
> A cpufreq policy may represent a frequency domain or generally a group of
> related CPUs and what matters is that there is a coordination between them
> and not how that coordination happens at the hardware/firmware level etc.

Fair enough :-)

> 
> > > Assumptions like that tend to get tangled into the code tightly over time
> > > and they may be hard to untangle from it when new use cases arise later.
> > > 
> > > For example, there probably will be more firmware involvement in future
> > > systems and the firmware may not be willing to expose "raw" frequency
> > > domains to the OS.  That already is the case with P-states on Intel HW and
> > > with ACPI CPPC in general.
> > 
> > Agreed, and embedded/mobile systems are going in that direction too ...
> > 
> > > IMO, frequency domains in your current code could be replaced with something
> > > more general, like "performance domains"
> > 
> > I don't mind using a more abstract name as long as we keep the same
> > assumptions, and especially that all CPUs in a perf. domain *must* have
> > the same micro-architecture.
> 
> That's fair enough I think.

Good.

> > From that assumption result several
> > properties that EAS (in its current form) needs. The first one is that
> > all CPUs of a performance domain have the same capacity at any possible
> > performance state. And the second is that they all consume the same
> > amount of (active) power.
> > 
> > I know it is theoretically possible to mix CPU types in the same perf
> > domain, but that becomes nightmare-ish to manage in EAS, and I don't
> > think there is a single platform like this on the market today. And I
> > hope nobody will build one. Peter wanted to mandate that too, I think.
> 
> There are departures, say, at least as far as the capacity is concerned.
> 
> The uarch is the same for all of them, but the max capacity may vary
> between them.

I assume you're thinking about ITMT and things like that for example ?
That's an interesting case indeed, but yes, being able to reach higher
freqs for single-threaded workloads shouldn't violate the assumption, I
think.

> > > providing the scheduler with the (relative) cost of running a task
> > 
> > What do you mean by relative ? That we should normalize the power costs ?
> > Or just use an abstract scale, without specifying the unit ?
> 
> I mean "relative with respect to the other choices"; not absolute.
> 
> > The main reason I'm a bit reluctant to do that just now is because IPA
> > needs to compare the power of CPUs with the power of other components
> > (GPUs, for example). And the power of those other components is specified
> > with a specific unit too. So, not imposing a comparable unit for the
> > power of CPUs will result in an unspecified behaviour in IPA, and that
> > will break things for sure. I would very much like to avoid that, of
> > course.
> 
> The absolute power numbers are generally hard to get though.
> 
> In the majority of cases you can figure out what the extra cost of X with
> respect to (alternative) Y is (in certain units), but you may not be able
> to say what X and Y are equal to in absolute terms (for example, there
> may be an unknown component in both X and Y that you cannot measure, but
> it may not be relevant for the outcome of the computation).

Agreed. EAS and IPA don't care about the absolute real power values, all
they care about is relative correctness. But what I really want to avoid
is having IPA getting the power of the GPUs in mW, and the power of CPUs
in an abstract scale without unit. That _will_ create problems eventually
IMO, because the behaviour is undefined. Specifying a unit everywhere is
an easy way to enforce a consistent design across sub-systems, that's
all.
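
As a rough illustration of the problem, here is a simplified sketch in
plain C. It assumes a power budget is shared between components in
proportion to what each one requests. That is a hypothetical model for
the sake of the example, not IPA's actual code, and split_budget() is a
made-up name:

```c
#include <assert.h>
#include <stddef.h>

/*
 * Hypothetical, simplified model: each actor is granted a share of
 * 'budget' proportional to its requested power. All requests are
 * assumed to be expressed in the same scale.
 */
static void split_budget(const unsigned long *req, unsigned long *granted,
			 size_t nr, unsigned long budget)
{
	unsigned long total = 0;
	size_t i;

	assert(req && granted);

	for (i = 0; i < nr; i++)
		total += req[i];

	for (i = 0; i < nr; i++)
		granted[i] = total ? budget * req[i] / total : 0;
}
```

With a CPU and a GPU both requesting 1000 mW out of a 1000 mW budget,
each is granted 500. If the same CPU request is reported as 100 on some
abstract scale while the GPU stays in mW, the CPU is granted 90 and the
GPU 909, although nothing changed physically.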

> > What I am currently proposing is to keep the unit (mW) in the EM
> > framework so that migrating IPA to using it can be done in a (relatively)
> > painless way. On a system where drivers don't know the exact wattage,
> > then they should just 'lie' to the EM framework, but it's their job to
> > lie coherently to all subsystems and keep things consistent, because all
> > subsystems have specified power in comparable units.
> 
> Alternatively, there could be a translation layer between EM and IPA.

Hmm, interesting... What do you have in mind exactly ? What would you
put in that layer ?

> From my experience, if you want people to come up with some numbers,
> they will just choose them to game the system this way or another
> unless those numbers can be measured directly or are clearly documented.
> 
> And if that happens and then you want to make any significant changes,
> you'll need to deal with "regressions" occurring because someone chose
> the numbers to make the system behave in a specific way and your changes
> break that.
> 
> As a rule, I'd rather avoid requesting unknown numbers from people. :-)
> 
> > Another solution to solve this problem could be to extend the EM
> > framework introduced by this patch and make it manage the EM of any
> > device, not just CPUs. Then we could just specify that all power costs
> > must be in the same scale, regardless of the actual unit, and register
> > the EM of CPUs, GPUs, ...
> > However, I was hoping that this patch as-is was enough for a first step,
> > and that this extension of the framework could be done in a second step ?
> > Thoughts ?
> > 
> > In any case, if we decide to keep the mW unit for now, I should at least
> > explain clearly why in the commit message.
> 
> Right.
> 
> Actually, the unit is as good as any other, but you need to bear in mind that
> the numbers provided may not be realistic.

As long as they're all correct in a relative way, that's fine by me :-)

Thanks,
Quentin

* Re: [PATCH v5 03/14] PM: Introduce an Energy Model management framework
  2018-08-10  9:12         ` Quentin Perret
@ 2018-08-10 11:13           ` Rafael J. Wysocki
  2018-08-10 12:30             ` Quentin Perret
  0 siblings, 1 reply; 72+ messages in thread
From: Rafael J. Wysocki @ 2018-08-10 11:13 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Friday, August 10, 2018 11:12:18 AM CEST Quentin Perret wrote:
> On Friday 10 Aug 2018 at 10:41:56 (+0200), Rafael J. Wysocki wrote:
> > On Friday, August 10, 2018 10:15:39 AM CEST Quentin Perret wrote:

[cut]

> Agreed. EAS and IPA don't care about the absolute real power values, all
> they care about is relative correctness. But what I really want to avoid
> is having IPA getting the power of the GPUs in mW, and the power of CPUs
> in an abstract scale without unit. That _will_ create problems eventually
> IMO, because the behaviour is undefined. Specifying a unit everywhere is
> an easy way to enforce a consistent design across sub-systems, that's
> all.

OK

> > > What I am currently proposing is to keep the unit (mW) in the EM
> > > framework so that migrating IPA to using it can be done in a (relatively)
> > > painless way. On a system where drivers don't know the exact wattage,
> > > then they should just 'lie' to the EM framework, but it's their job to
> > > lie coherently to all subsystems and keep things consistent, because all
> > > subsystems have specified power in comparable units.
> > 
> > Alternatively, there could be a translation layer between EM and IPA.
> 
> Hmm, interesting... What do you have in mind exactly ? What would you
> put in that layer ?

Something able to say how the numbers used by EM and IPA are related. :-)

Do you think that IPA and EM will always need to use the same set of data for
the CPU?

> > From my experience, if you want people to come up with some numbers,
> > they will just choose them to game the system this way or another
> > unless those numbers can be measured directly or are clearly documented.
> > 
> > And if that happens and then you want to make any significant changes,
> > you'll need to deal with "regressions" occurring because someone chose
> > the numbers to make the system behave in a specific way and your changes
> > break that.
> > 
> > As a rule, I'd rather avoid requesting unknown numbers from people. :-)
> > 
> > > Another solution to solve this problem could be to extend the EM
> > > framework introduced by this patch and make it manage the EM of any
> > > device, not just CPUs. Then we could just specify that all power costs
> > > must be in the same scale, regardless of the actual unit, and register
> > > the EM of CPUs, GPUs, ...
> > > However, I was hoping that this patch as-is was enough for a first step,
> > > and that this extension of the framework could be done in a second step ?
> > > Thoughts ?
> > > 
> > > In any case, if we decide to keep the mW unit for now, I should at least
> > > explain clearly why in the commit message.
> > 
> > Right.
> > 
> > Actually, the unit is as good as any other, but you need to bear in mind that
> > the numbers provided may not be realistic.
> 
> As long as they're all correct in a relative way, that's fine by me :-)

OK

Thanks,
Rafael


* Re: [PATCH v5 03/14] PM: Introduce an Energy Model management framework
  2018-08-10 11:13           ` Rafael J. Wysocki
@ 2018-08-10 12:30             ` Quentin Perret
  2018-08-12  9:49               ` Rafael J. Wysocki
  0 siblings, 1 reply; 72+ messages in thread
From: Quentin Perret @ 2018-08-10 12:30 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: peterz, linux-kernel, linux-pm, gregkh, mingo, dietmar.eggemann,
	morten.rasmussen, chris.redpath, patrick.bellasi,
	valentin.schneider, vincent.guittot, thara.gopinath,
	viresh.kumar, tkjos, joel, smuckle, adharmap, skannan, pkondeti,
	juri.lelli, edubezval, srinivas.pandruvada, currojerez,
	javi.merino

On Friday 10 Aug 2018 at 13:13:22 (+0200), Rafael J. Wysocki wrote:
> Something able to say how the numbers used by EM and IPA are related. :-)
> 
> Do you think that IPA and EM will always need to use the same set of data for
> the CPU?

Hmm, I would say yes. Both EAS and IPA basically need some sort of
<frequency, power> table, so hopefully the EM framework can provide them
with that :-).
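
For illustration only, here is a sketch in plain C of how one shared
table could serve both kinds of client. The struct and function names
are hypothetical and are not the interface proposed in this patch. The
table is assumed to be sorted by ascending frequency, with power
increasing along with frequency:

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical <frequency, power> entry; names are illustrative only. */
struct fp_entry {
	unsigned long freq_khz;
	unsigned long power_mw;
};

/*
 * Thermal-style query: highest frequency whose power fits in the
 * budget. Returns 0 if even the lowest state exceeds the budget.
 */
static unsigned long max_freq_for_power(const struct fp_entry *t, size_t nr,
					unsigned long budget_mw)
{
	unsigned long freq = 0;
	size_t i;

	assert(t || !nr);

	for (i = 0; i < nr && t[i].power_mw <= budget_mw; i++)
		freq = t[i].freq_khz;

	return freq;
}

/*
 * Scheduler-style query: power of the lowest state able to provide the
 * requested frequency, falling back to the highest state.
 */
static unsigned long power_at_freq(const struct fp_entry *t, size_t nr,
				   unsigned long freq_khz)
{
	size_t i;

	assert(t && nr > 0);

	for (i = 0; i < nr - 1; i++)
		if (t[i].freq_khz >= freq_khz)
			break;

	return t[i].power_mw;
}
```

Both helpers read the same entries, so as long as the registered values
are consistent with each other, the two clients see the same picture of
the CPU.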

If the EM framework doesn't accommodate the needs of its two users, then
something needs fixing ... Did you have something specific in mind ?

Thanks,
Quentin

* Re: [PATCH v5 03/14] PM: Introduce an Energy Model management framework
  2018-08-10 12:30             ` Quentin Perret
@ 2018-08-12  9:49               ` Rafael J. Wysocki
  0 siblings, 0 replies; 72+ messages in thread
From: Rafael J. Wysocki @ 2018-08-12  9:49 UTC (permalink / raw)
  To: Quentin Perret
  Cc: Rafael J. Wysocki, Peter Zijlstra, Linux Kernel Mailing List,
	Linux PM, Greg Kroah-Hartman, Ingo Molnar, Dietmar Eggemann,
	Morten Rasmussen, Chris Redpath, Patrick Bellasi,
	Valentin Schneider, Vincent Guittot, Thara Gopinath,
	Viresh Kumar, Todd Kjos, Joel Fernandes, Steve Muckle, adharmap,
	skannan, Pavan Kondeti, Juri Lelli, Eduardo Valentin,
	Srinivas Pandruvada, currojerez, Javi Merino

On Fri, Aug 10, 2018 at 2:30 PM Quentin Perret <quentin.perret@arm.com> wrote:
>
> On Friday 10 Aug 2018 at 13:13:22 (+0200), Rafael J. Wysocki wrote:
> > Something able to say how the numbers used by EM and IPA are related. :-)
> >
> > Do you think that IPA and EM will always need to use the same set of data for
> > the CPU?
>
> Hmm, I would say yes. Both EAS and IPA basically need some sort of
> <frequency, power> table, so hopefully the EM framework can provide them
> with that :-).
>
> If the EM framework doesn't accommodate the needs of its two users, then
> something needs fixing ... Did you have something specific in mind ?

Not really. :-)


Thread overview: 72+ messages
2018-07-24 12:25 [PATCH v5 00/14] Energy Aware Scheduling Quentin Perret
2018-07-24 12:25 ` [PATCH v5 01/14] sched: Relocate arch_scale_cpu_capacity Quentin Perret
2018-07-24 12:25 ` [PATCH v5 02/14] sched/cpufreq: Factor out utilization to frequency mapping Quentin Perret
2018-07-24 12:25 ` [PATCH v5 03/14] PM: Introduce an Energy Model management framework Quentin Perret
2018-08-09 21:52   ` Rafael J. Wysocki
2018-08-10  8:15     ` Quentin Perret
2018-08-10  8:41       ` Rafael J. Wysocki
2018-08-10  9:12         ` Quentin Perret
2018-08-10 11:13           ` Rafael J. Wysocki
2018-08-10 12:30             ` Quentin Perret
2018-08-12  9:49               ` Rafael J. Wysocki
2018-07-24 12:25 ` [PATCH v5 04/14] PM / EM: Expose the Energy Model in sysfs Quentin Perret
2018-07-24 12:25 ` [PATCH v5 05/14] sched/topology: Reference the Energy Model of CPUs when available Quentin Perret
2018-07-24 12:25 ` [PATCH v5 06/14] sched/topology: Lowest energy aware balancing sched_domain level pointer Quentin Perret
2018-07-26 16:00   ` Valentin Schneider
2018-07-26 17:01     ` Quentin Perret
2018-07-24 12:25 ` [PATCH v5 07/14] sched/topology: Introduce sched_energy_present static key Quentin Perret
2018-07-24 12:25 ` [PATCH v5 08/14] sched/fair: Clean-up update_sg_lb_stats parameters Quentin Perret
2018-07-24 12:25 ` [PATCH v5 09/14] sched: Add over-utilization/tipping point indicator Quentin Perret
2018-08-02 12:26   ` Peter Zijlstra
2018-08-02 13:03     ` Quentin Perret
2018-08-02 13:08       ` Peter Zijlstra
2018-08-02 13:18         ` Quentin Perret
2018-08-02 13:48           ` Vincent Guittot
2018-08-02 14:14             ` Quentin Perret
2018-08-02 15:14               ` Vincent Guittot
2018-08-02 15:30                 ` Quentin Perret
2018-08-02 15:55                   ` Vincent Guittot
2018-08-02 16:00                     ` Quentin Perret
2018-08-02 16:07                       ` Vincent Guittot
2018-08-02 16:10                         ` Quentin Perret
2018-08-02 16:38                           ` Vincent Guittot
2018-08-02 16:59                             ` Quentin Perret
2018-08-03  7:48                               ` Vincent Guittot
2018-08-03  8:18                                 ` Quentin Perret
2018-08-03 13:49                                   ` Vincent Guittot
2018-08-03 14:21                                     ` Vincent Guittot
2018-08-03 15:55                                     ` Quentin Perret
2018-08-06  8:40                                       ` Vincent Guittot
2018-08-06  9:43                                         ` Quentin Perret
2018-08-06 10:45                                           ` Vincent Guittot
2018-08-06 11:02                                             ` Quentin Perret
2018-08-06 10:08                                         ` Dietmar Eggemann
2018-08-06 10:33                                           ` Vincent Guittot
2018-08-06 12:29                                             ` Dietmar Eggemann
2018-08-06 12:37                                               ` Vincent Guittot
2018-08-06 13:20                                                 ` Dietmar Eggemann
2018-08-09  9:30   ` Vincent Guittot
2018-08-09  9:38     ` Quentin Perret
2018-07-24 12:25 ` [PATCH v5 10/14] sched/cpufreq: Refactor the utilization aggregation method Quentin Perret
2018-07-30 19:35   ` skannan
2018-07-31  7:59     ` Quentin Perret
2018-07-31 19:31       ` skannan
2018-08-01  7:32         ` Rafael J. Wysocki
2018-08-01  8:23           ` Quentin Perret
2018-08-01  8:35             ` Rafael J. Wysocki
2018-08-01  9:23               ` Quentin Perret
2018-08-01  9:40                 ` Rafael J. Wysocki
2018-08-02 13:04                 ` Peter Zijlstra
2018-08-02 15:39                   ` Quentin Perret
2018-08-03 13:04                     ` Quentin Perret
2018-08-02 12:33     ` Peter Zijlstra
2018-08-02 12:45       ` Peter Zijlstra
2018-08-02 15:21         ` Quentin Perret
2018-08-02 17:36           ` Peter Zijlstra
2018-08-03 12:42             ` Quentin Perret
2018-07-24 12:25 ` [PATCH v5 11/14] sched/fair: Introduce an energy estimation helper function Quentin Perret
2018-07-24 12:25 ` [PATCH v5 12/14] sched/fair: Select an energy-efficient CPU on task wake-up Quentin Perret
2018-08-02 13:54   ` Peter Zijlstra
2018-08-02 16:21     ` Quentin Perret
2018-07-24 12:25 ` [PATCH v5 13/14] OPTIONAL: arch_topology: Start Energy Aware Scheduling Quentin Perret
2018-07-24 12:25 ` [PATCH v5 14/14] OPTIONAL: cpufreq: dt: Register an Energy Model Quentin Perret
