* [PATCH] sched: Add schedutil overview
@ 2020-12-18 10:32 Peter Zijlstra
  2020-12-18 11:33 ` Valentin Schneider
  2021-01-14 11:29 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  0 siblings, 2 replies; 5+ messages in thread
From: Peter Zijlstra @ 2020-12-18 10:32 UTC (permalink / raw)
  To: Rafael J. Wysocki, Ingo Molnar
  Cc: Thomas Gleixner, Vincent Guittot, Morten Rasmussen,
	dietmar.eggemann, patrick.bellasi, lenb, linux-kernel,
	valentin.schneider, ionela.voinescu, qperret, viresh.kumar


Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
---
 Documentation/scheduler/schedutil.txt |  168 ++++++++++++++++++++++++++++++++++
 1 file changed, 168 insertions(+)

--- /dev/null
+++ b/Documentation/scheduler/schedutil.txt
@@ -0,0 +1,168 @@
+
+
+NOTE: all this assumes a linear relation between frequency and work capacity;
+we know this is flawed, but it is the best workable approximation.
+
+
+PELT (Per Entity Load Tracking)
+-------------------------------
+
+With PELT we track some metrics across the various scheduler entities, from
+individual tasks to task-group slices to CPU runqueues. As the basis for this
+we use an Exponentially Weighted Moving Average (EWMA); each period (1024us)
+is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
+half, while the rest of history contributes the other half.
+
+Specifically:
+
+  ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...
+
+  ewma(u) = ewma_sum(u) / ewma_sum(1)
+
+Since this is essentially an infinite geometric series, the results are
+composable, that is: ewma(A) + ewma(B) = ewma(A+B). This property
+is key, since it gives the ability to recompose the averages when tasks move
+around.
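+
+As a rough illustration only -- the kernel uses fixed-point arithmetic and a
+pre-computed lookup table rather than floating point, see kernel/sched/pelt.c
+-- the per-period update could be sketched as:
+
+  /* toy sketch, not kernel code */
+  static const double y = 0.97857206;		/* y^32 == 0.5 */
+
+  /* fold one 1024us period with utilization 'u' into the running sums */
+  static void ewma_accumulate(double *sum, double *weight, double u)
+  {
+	*sum    = u   + y * *sum;		/* ewma_sum(u) */
+	*weight = 1.0 + y * *weight;		/* ewma_sum(1) */
+  }
+
+  static double ewma(double sum, double weight)
+  {
+	return sum / weight;
+  }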
+
+Note that blocked tasks still contribute to the aggregates (task-group slices
+and CPU runqueues), which reflects their expected contribution when they
+resume running.
+
+Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
+reflects the time an entity spends on the CPU, while 'runnable' reflects the
+time an entity spends on the runqueue. When there is only a single task these
+two metrics are the same, but once there is contention for the CPU 'running'
+will decrease to reflect the fraction of time each task spends on the CPU
+while 'runnable' will increase to reflect the amount of contention.
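+
+For example, two always-runnable tasks sharing one CPU each settle at a
+'running' of ~0.5 (each gets half the CPU) but a 'runnable' of ~1.0 (each
+spends all its time on the runqueue); the runqueue aggregates then approach
+'running' ~1.0 and 'runnable' ~2.0.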
+
+For more detail see: kernel/sched/pelt.c
+
+
+Frequency- / CPU Invariance
+---------------------------
+
+Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
+for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
+a big CPU, we allow architectures to scale the time delta with two ratios, one
+Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.
+
+For simple DVFS architectures (where software is in full control) we trivially
+compute the ratio as:
+
+	    f_cur
+  r_dvfs := -----
+            f_max
+
+For more dynamic systems where the hardware is in control of DVFS (Intel,
+ARMv8.4-AMU) we use hardware counters to provide us this ratio. For Intel
+specifically, we use:
+
+	   APERF
+  f_cur := ----- * P0
+	   MPERF
+
+	     4C-turbo;	if available and turbo enabled
+  f_max := { 1C-turbo;	if turbo enabled
+	     P0;	otherwise
+
+                    f_cur
+  r_dvfs := min( 1, ----- )
+                    f_max
+
+We pick 4C turbo over 1C turbo to make it slightly more sustainable.
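+
+In terms of counter deltas over a sampling window this works out to the
+following sketch (floating point and illustrative names for clarity; the
+kernel keeps it in fixed point, see arch/x86/kernel/smpboot.c):
+
+  static double intel_r_dvfs(unsigned long long aperf_delta,
+			     unsigned long long mperf_delta,
+			     double P0, double f_max)
+  {
+	double f_cur = (double)aperf_delta / (double)mperf_delta * P0;
+
+	return f_cur < f_max ? f_cur / f_max : 1.0;	/* min(1, f_cur/f_max) */
+  }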
+
+r_cpu is determined as the ratio of the highest performance level of the
+current CPU vs the highest performance level of any other CPU in the system.
+
+  r_tot = r_dvfs * r_cpu
+
+The result is that the above 'running' and 'runnable' metrics become invariant
+of DVFS and CPU type. IOW, we can transfer and compare them between CPUs.
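+
+In (simplified) C, with illustrative names, the effect on a time delta is
+roughly the below; the real thing lives in update_rq_clock_pelt() and uses
+the arch-provided fixed-point ratios:
+
+  #define SCHED_CAPACITY_SCALE	1024
+
+  /* sketch: make a raw time delta frequency and CPU invariant */
+  static unsigned long scale_delta(unsigned long delta,
+				   unsigned long r_dvfs,	/* 0..1024 */
+				   unsigned long r_cpu)		/* 0..1024 */
+  {
+	delta = delta * r_dvfs / SCHED_CAPACITY_SCALE;
+	delta = delta * r_cpu  / SCHED_CAPACITY_SCALE;
+
+	return delta;					/* delta * r_tot */
+  }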
+
+For more detail see:
+
+ - kernel/sched/pelt.h:update_rq_clock_pelt()
+ - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
+ - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
+
+
+UTIL_EST / UTIL_EST_FASTUP
+--------------------------
+
+Because periodic tasks have their averages decayed while they sleep, even
+though their expected utilization will be the same when running, they suffer
+a (DVFS) ramp-up once they start running again.
+
+To alleviate this, UTIL_EST (a default-enabled option) drives an Infinite
+Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
+highest. A further default-enabled option, UTIL_EST_FASTUP, modifies the IIR
+filter to instantly increase and only decay on decrease.
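+
+Roughly, ignoring the fixed-point details and with an illustrative 1/4
+weight:
+
+  /* sketch of the dequeue-time update with UTIL_EST_FASTUP */
+  static void util_est_update(double *ewma, double running)
+  {
+	if (running >= *ewma)
+		*ewma = running;			/* increase instantly */
+	else
+		*ewma = 0.75 * *ewma + 0.25 * running;	/* decay slowly */
+  }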
+
+A further runqueue-wide sum (over the runnable tasks) is maintained:
+
+  util_est := \Sum_t max( t_running, t_util_est_ewma )
+
+For more detail see: kernel/sched/fair.c:util_est_dequeue()
+
+
+UCLAMP
+------
+
+It is possible to set effective u_min and u_max clamps on each CFS or RT task;
+the runqueue keeps a max aggregate of these clamps for all running tasks.
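+
+User space can request such clamps through the sched_setattr() syscall; a
+minimal sketch (error handling and the raw syscall(2) plumbing omitted; the
+flags and fields are from the UAPI headers):
+
+  /* clamp the current task's apparent utilization to [256, 512] of 1024 */
+  struct sched_attr attr = {
+	.size		= sizeof(attr),
+	.sched_flags	= SCHED_FLAG_KEEP_ALL |
+			  SCHED_FLAG_UTIL_CLAMP_MIN |
+			  SCHED_FLAG_UTIL_CLAMP_MAX,
+	.sched_util_min	= 256,
+	.sched_util_max	= 512,
+  };
+
+  sched_setattr(0, &attr, 0);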
+
+For more detail see: include/uapi/linux/sched/types.h
+
+
+Schedutil / DVFS
+----------------
+
+Every time the scheduler load tracking is updated (task wakeup, task
+migration, time progression) we call out to schedutil to update the hardware
+DVFS state.
+
+The basis is the CPU runqueue's 'running' metric, which per the above is the
+frequency invariant utilization estimate of the CPU. From this we compute
+a desired frequency like:
+
+             max( running, util_est );	if UTIL_EST
+  u_cfs := { running;			otherwise
+
+  u_clamp := clamp( u_cfs, u_min, u_max )
+
+  u := u_cfs + u_rt + u_irq + u_dl;	[approx. see source for more detail]
+
+  f_des := min( f_max, 1.25 u * f_max )
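+
+The last step, sketched in (simplified) C and ignoring the cpufreq policy
+limits applied on top (cf. get_next_freq() in
+kernel/sched/cpufreq_schedutil.c):
+
+  static unsigned long next_freq(unsigned long u,	/* 0..1024 */
+				 unsigned long f_max)
+  {
+	unsigned long f_des = (f_max + (f_max >> 2)) * u / 1024; /* 1.25 u f_max */
+
+	return f_des < f_max ? f_des : f_max;		/* min(f_max, .) */
+  }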
+
+XXX IO-wait; when the update is due to a task wakeup from IO-completion we
+boost 'u' above.
+
+This frequency is then used to select a P-state/OPP or directly munged into a
+CPPC style request to the hardware.
+
+XXX: deadline tasks (Sporadic Task Model) allow us to calculate a hard f_min
+required to satisfy the workload.
+
+Because these callbacks come directly from the scheduler, the DVFS hardware
+interaction should be 'fast' and non-blocking. Schedutil supports
+rate-limiting DVFS requests for when hardware interaction is slow and
+expensive; this reduces effectiveness.
+
+For more information see: kernel/sched/cpufreq_schedutil.c
+
+
+NOTES
+-----
+
+ - In low-load scenarios, where DVFS is most relevant, the 'running' numbers
+   will closely reflect utilization.
+
+ - In saturated scenarios task movement will cause some transient dips.
+   Suppose we have a CPU saturated with 4 tasks; when we then migrate a task
+   to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
+   new CPU will gain 0.25. This is inevitable and time progression will
+   correct it. XXX do we still guarantee f_max due to no idle-time?
+
+ - Much of the above is about avoiding DVFS dips, and independent DVFS domains
+   having to re-learn / ramp-up when load shifts.
+


* Re: [PATCH] sched: Add schedutil overview
  2020-12-18 10:32 [PATCH] sched: Add schedutil overview Peter Zijlstra
@ 2020-12-18 11:33 ` Valentin Schneider
  2020-12-18 13:40   ` Morten Rasmussen
  2021-01-14 11:29 ` [tip: sched/core] " tip-bot2 for Peter Zijlstra
  1 sibling, 1 reply; 5+ messages in thread
From: Valentin Schneider @ 2020-12-18 11:33 UTC (permalink / raw)
  To: Peter Zijlstra
  Cc: Rafael J. Wysocki, Ingo Molnar, Thomas Gleixner, Vincent Guittot,
	Morten Rasmussen, dietmar.eggemann, patrick.bellasi, lenb,
	linux-kernel, ionela.voinescu, qperret, viresh.kumar


Hi,

Have some more nits below

On 18/12/20 10:32, Peter Zijlstra wrote:
> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
> ---
>  Documentation/scheduler/schedutil.txt |  168 ++++++++++++++++++++++++++++++++++
>  1 file changed, 168 insertions(+)
>
> --- /dev/null
> +++ b/Documentation/scheduler/schedutil.txt
[...]
> +Frequency- / CPU Invariance
> +---------------------------
> +
> +Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
> +for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
> +a big CPU, we allow architectures to scale the time delta with two ratios, one
> +Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.
> +
> +For simple DVFS architectures (where software is in full control) we trivially
> +compute the ratio as:
> +
> +	    f_cur
> +  r_dvfs := -----
> +            f_max
> +
> +For more dynamic systems where the hardware is in control of DVFS (Intel,
> +ARMv8.4-AMU) we use hardware counters to provide us this ratio. For Intel

Nit: To me this reads as if the presence of AMUs entails 'hardware is in
control of DVFS', which doesn't seem right. How about:

  For more dynamic systems where the hardware is in control of DVFS we use
  hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this
  ratio.

> +Schedutil / DVFS
> +----------------
> +
> +Every time the scheduler load tracking is updated (task wakeup, task
> +migration, time progression) we call out to schedutil to update the hardware
> +DVFS state.
> +
> +The basis is the CPU runqueue's 'running' metric, which per the above it is
> +the frequency invariant utilization estimate of the CPU. From this we compute
> +a desired frequency like:
> +
> +             max( running, util_est );	if UTIL_EST
> +  u_cfs := { running;			otherwise
> +
> +  u_clamp := clamp( u_cfs, u_min, u_max )
> +
> +  u := u_cfs + u_rt + u_irq + u_dl;	[approx. see source for more detail]
> +
> +  f_des := min( f_max, 1.25 u * f_max )
> +

In schedutil_cpu_util(), uclamp clamps both u_cfs and u_rt. I'm afraid the
below might just bring more confusion; what do you think?

               clamp( u_cfs + u_rt, u_min, u_max );      if UCLAMP_TASK
  u_clamp := { u_cfs + u_rt;                             otherwise

  u := u_clamp + u_irq + u_dl;	    [approx. see source for more detail]

(also, does this need a word about runnable rt tasks => goto max?)

> +XXX IO-wait; when the update is due to a task wakeup from IO-completion we
> +boost 'u' above.
> +
> +This frequency is then used to select a P-state/OPP or directly munged into a
> +CPPC style request to the hardware.
> +
> +XXX: deadline tasks (Sporadic Task Model) allows us to calculate a hard f_min
> +required to satisfy the workload.
> +
> +Because these callbacks are directly from the scheduler, the DVFS hardware
> +interaction should be 'fast' and non-blocking. Schedutil supports
> +rate-limiting DVFS requests for when hardware interaction is slow and
> +expensive, this reduces effectiveness.
> +
> +For more information see: kernel/sched/cpufreq_schedutil.c
> +


* Re: [PATCH] sched: Add schedutil overview
  2020-12-18 11:33 ` Valentin Schneider
@ 2020-12-18 13:40   ` Morten Rasmussen
  2020-12-18 14:25     ` Valentin Schneider
  0 siblings, 1 reply; 5+ messages in thread
From: Morten Rasmussen @ 2020-12-18 13:40 UTC (permalink / raw)
  To: Valentin Schneider
  Cc: Peter Zijlstra, Rafael J. Wysocki, Ingo Molnar, Thomas Gleixner,
	Vincent Guittot, dietmar.eggemann, patrick.bellasi, lenb,
	linux-kernel, ionela.voinescu, qperret, viresh.kumar

On Fri, Dec 18, 2020 at 11:33:09AM +0000, Valentin Schneider wrote:
> On 18/12/20 10:32, Peter Zijlstra wrote:
> > +Schedutil / DVFS
> > +----------------
> > +
> > +Every time the scheduler load tracking is updated (task wakeup, task
> > +migration, time progression) we call out to schedutil to update the hardware
> > +DVFS state.
> > +
> > +The basis is the CPU runqueue's 'running' metric, which per the above it is
> > +the frequency invariant utilization estimate of the CPU. From this we compute
> > +a desired frequency like:
> > +
> > +             max( running, util_est );	if UTIL_EST
> > +  u_cfs := { running;			otherwise
> > +
> > +  u_clamp := clamp( u_cfs, u_min, u_max )
> > +
> > +  u := u_cfs + u_rt + u_irq + u_dl;	[approx. see source for more detail]
> > +
> > +  f_des := min( f_max, 1.25 u * f_max )
> > +
> 
> In schedutil_cpu_util(), uclamp clamps both u_cfs and u_rt. I'm afraid the
> below might just bring more confusion; what do you think?
> 
>                clamp( u_cfs + u_rt, u_min, u_max );      if UCLAMP_TASK
>   u_clamp := { u_cfs + u_rt;                             otherwise
> 
>   u := u_clamp + u_irq + u_dl;	    [approx. see source for more detail]

It is reflecting the code so I think it is worth it. It also fixes the
typo in the original sum (u_cfs -> u_clamp).

> (also, does this need a word about runnable rt tasks => goto max?)

What is actually the intended policy there? I thought it was goto max
unless rt was clamped, but if I read the code correctly in
schedutil_cpu_util() the current policy is only goto max if uclamp isn't
in use at all, including cfs.

The write-up looks good to me.

Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>

Morten


* Re: [PATCH] sched: Add schedutil overview
  2020-12-18 13:40   ` Morten Rasmussen
@ 2020-12-18 14:25     ` Valentin Schneider
  0 siblings, 0 replies; 5+ messages in thread
From: Valentin Schneider @ 2020-12-18 14:25 UTC (permalink / raw)
  To: Morten Rasmussen
  Cc: Peter Zijlstra, Rafael J. Wysocki, Ingo Molnar, Thomas Gleixner,
	Vincent Guittot, dietmar.eggemann, patrick.bellasi, lenb,
	linux-kernel, ionela.voinescu, qperret, viresh.kumar


On 18/12/20 13:40, Morten Rasmussen wrote:
> On Fri, Dec 18, 2020 at 11:33:09AM +0000, Valentin Schneider wrote:
>> (also, does this need a word about runnable rt tasks => goto max?)
>
> What is actually the intended policy there? I thought it was goto max
> unless rt was clamped, but if I read the code correctly in
> schedutil_cpu_util() the current policy is only goto max if uclamp isn't
> in use at all, including cfs.
>

Right, so the policy pretty much is: by default, if there are runnable rt
tasks, goto max freq.

When uclamp isn't used, that's hardcoded.

When uclamp is in use, the default RT uclamp.min is 1024, so it "naturally"
drives frequency selection to the max when there are runnable RT tasks
(rq-aggregated uclamp.min == 1024). That default
(uclamp_util_min_rt_default) can be tweaked.

> The write-up looks good to me.
>
> Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
>
> Morten


* [tip: sched/core] sched: Add schedutil overview
  2020-12-18 10:32 [PATCH] sched: Add schedutil overview Peter Zijlstra
  2020-12-18 11:33 ` Valentin Schneider
@ 2021-01-14 11:29 ` tip-bot2 for Peter Zijlstra
  1 sibling, 0 replies; 5+ messages in thread
From: tip-bot2 for Peter Zijlstra @ 2021-01-14 11:29 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Peter Zijlstra (Intel), Morten Rasmussen, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     0301925dd004539adbcf11f68a3a785472376e27
Gitweb:        https://git.kernel.org/tip/0301925dd004539adbcf11f68a3a785472376e27
Author:        Peter Zijlstra <peterz@infradead.org>
AuthorDate:    Fri, 18 Dec 2020 11:28:12 +01:00
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Thu, 14 Jan 2021 11:20:10 +01:00

sched: Add schedutil overview

Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Morten Rasmussen <morten.rasmussen@arm.com>
Link: https://lkml.kernel.org/r/20201218103258.GA3040@hirez.programming.kicks-ass.net
---
 Documentation/scheduler/schedutil.txt | 169 +++++++++++++++++++++++++-
 1 file changed, 169 insertions(+)
 create mode 100644 Documentation/scheduler/schedutil.txt

diff --git a/Documentation/scheduler/schedutil.txt b/Documentation/scheduler/schedutil.txt
new file mode 100644
index 0000000..78f6b91
--- /dev/null
+++ b/Documentation/scheduler/schedutil.txt
@@ -0,0 +1,169 @@
+
+
+NOTE: all this assumes a linear relation between frequency and work capacity;
+we know this is flawed, but it is the best workable approximation.
+
+
+PELT (Per Entity Load Tracking)
+-------------------------------
+
+With PELT we track some metrics across the various scheduler entities, from
+individual tasks to task-group slices to CPU runqueues. As the basis for this
+we use an Exponentially Weighted Moving Average (EWMA); each period (1024us)
+is decayed such that y^32 = 0.5. That is, the most recent 32ms contribute
+half, while the rest of history contributes the other half.
+
+Specifically:
+
+  ewma_sum(u) := u_0 + u_1*y + u_2*y^2 + ...
+
+  ewma(u) = ewma_sum(u) / ewma_sum(1)
+
+Since this is essentially an infinite geometric series, the results are
+composable, that is: ewma(A) + ewma(B) = ewma(A+B). This property
+is key, since it gives the ability to recompose the averages when tasks move
+around.
+
+Note that blocked tasks still contribute to the aggregates (task-group slices
+and CPU runqueues), which reflects their expected contribution when they
+resume running.
+
+Using this we track 2 key metrics: 'running' and 'runnable'. 'Running'
+reflects the time an entity spends on the CPU, while 'runnable' reflects the
+time an entity spends on the runqueue. When there is only a single task these
+two metrics are the same, but once there is contention for the CPU 'running'
+will decrease to reflect the fraction of time each task spends on the CPU
+while 'runnable' will increase to reflect the amount of contention.
+
+For more detail see: kernel/sched/pelt.c
+
+
+Frequency- / CPU Invariance
+---------------------------
+
+Because consuming the CPU for 50% at 1GHz is not the same as consuming the CPU
+for 50% at 2GHz, nor is running 50% on a LITTLE CPU the same as running 50% on
+a big CPU, we allow architectures to scale the time delta with two ratios, one
+Dynamic Voltage and Frequency Scaling (DVFS) ratio and one microarch ratio.
+
+For simple DVFS architectures (where software is in full control) we trivially
+compute the ratio as:
+
+	    f_cur
+  r_dvfs := -----
+            f_max
+
+For more dynamic systems where the hardware is in control of DVFS we use
+hardware counters (Intel APERF/MPERF, ARMv8.4-AMU) to provide us this ratio.
+For Intel specifically, we use:
+
+	   APERF
+  f_cur := ----- * P0
+	   MPERF
+
+	     4C-turbo;	if available and turbo enabled
+  f_max := { 1C-turbo;	if turbo enabled
+	     P0;	otherwise
+
+                    f_cur
+  r_dvfs := min( 1, ----- )
+                    f_max
+
+We pick 4C turbo over 1C turbo to make it slightly more sustainable.
+
+r_cpu is determined as the ratio of the highest performance level of the
+current CPU vs the highest performance level of any other CPU in the system.
+
+  r_tot = r_dvfs * r_cpu
+
+The result is that the above 'running' and 'runnable' metrics become invariant
+of DVFS and CPU type. IOW, we can transfer and compare them between CPUs.
+
+For more detail see:
+
+ - kernel/sched/pelt.h:update_rq_clock_pelt()
+ - arch/x86/kernel/smpboot.c:"APERF/MPERF frequency ratio computation."
+ - Documentation/scheduler/sched-capacity.rst:"1. CPU Capacity + 2. Task utilization"
+
+
+UTIL_EST / UTIL_EST_FASTUP
+--------------------------
+
+Because periodic tasks have their averages decayed while they sleep, even
+though their expected utilization will be the same when running, they suffer
+a (DVFS) ramp-up once they start running again.
+
+To alleviate this, UTIL_EST (a default-enabled option) drives an Infinite
+Impulse Response (IIR) EWMA with the 'running' value on dequeue -- when it is
+highest. A further default-enabled option, UTIL_EST_FASTUP, modifies the IIR
+filter to instantly increase and only decay on decrease.
+
+A further runqueue-wide sum (over the runnable tasks) is maintained:
+
+  util_est := \Sum_t max( t_running, t_util_est_ewma )
+
+For more detail see: kernel/sched/fair.c:util_est_dequeue()
+
+
+UCLAMP
+------
+
+It is possible to set effective u_min and u_max clamps on each CFS or RT task;
+the runqueue keeps a max aggregate of these clamps for all running tasks.
+
+For more detail see: include/uapi/linux/sched/types.h
+
+
+Schedutil / DVFS
+----------------
+
+Every time the scheduler load tracking is updated (task wakeup, task
+migration, time progression) we call out to schedutil to update the hardware
+DVFS state.
+
+The basis is the CPU runqueue's 'running' metric, which per the above is the
+frequency invariant utilization estimate of the CPU. From this we compute
+a desired frequency like:
+
+             max( running, util_est );	if UTIL_EST
+  u_cfs := { running;			otherwise
+
+               clamp( u_cfs + u_rt , u_min, u_max );	if UCLAMP_TASK
+  u_clamp := { u_cfs + u_rt;				otherwise
+
+  u := u_clamp + u_irq + u_dl;		[approx. see source for more detail]
+
+  f_des := min( f_max, 1.25 u * f_max )
+
+XXX IO-wait; when the update is due to a task wakeup from IO-completion we
+boost 'u' above.
+
+This frequency is then used to select a P-state/OPP or directly munged into a
+CPPC style request to the hardware.
+
+XXX: deadline tasks (Sporadic Task Model) allow us to calculate a hard f_min
+required to satisfy the workload.
+
+Because these callbacks come directly from the scheduler, the DVFS hardware
+interaction should be 'fast' and non-blocking. Schedutil supports
+rate-limiting DVFS requests for when hardware interaction is slow and
+expensive; this reduces effectiveness.
+
+For more information see: kernel/sched/cpufreq_schedutil.c
+
+
+NOTES
+-----
+
+ - In low-load scenarios, where DVFS is most relevant, the 'running' numbers
+   will closely reflect utilization.
+
+ - In saturated scenarios task movement will cause some transient dips.
+   Suppose we have a CPU saturated with 4 tasks; when we then migrate a task
+   to an idle CPU, the old CPU will have a 'running' value of 0.75 while the
+   new CPU will gain 0.25. This is inevitable and time progression will
+   correct it. XXX do we still guarantee f_max due to no idle-time?
+
+ - Much of the above is about avoiding DVFS dips, and independent DVFS domains
+   having to re-learn / ramp-up when load shifts.
+

