* Re: [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy()
2021-02-25 8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
@ 2021-02-25 8:52 ` Quentin Perret
2021-02-25 11:45 ` Dietmar Eggemann
` (3 subsequent siblings)
4 siblings, 0 replies; 16+ messages in thread
From: Quentin Perret @ 2021-02-25 8:52 UTC (permalink / raw)
To: vincent.donnefort
Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, linux-kernel,
patrick.bellasi, valentin.schneider
On Thursday 25 Feb 2021 at 08:36:11 (+0000), vincent.donnefort@arm.com wrote:
> From: Vincent Donnefort <vincent.donnefort@arm.com>
>
> find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
> energy delta as follows:
>
> feec(task)
> for_each_pd
> base_energy = compute_energy(task, -1, pd)
> -> for_each_cpu(pd)
> -> cpu_util_next(cpu, task, -1)
>
> energy_delta = compute_energy(task, dst_cpu, pd)
> -> for_each_cpu(pd)
> -> cpu_util_next(cpu, task, dst_cpu)
> energy_delta -= base_energy
>
> Then it picks the best CPU as being the one that minimizes energy_delta.
>
> cpu_util_next() estimates the CPU utilization that would happen if the
> task was placed on dst_cpu as follows:
>
> max(cpu_util + task_util, cpu_util_est + _task_util_est)
>
> The task contribution to the energy delta can then be either:
>
> (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
> and _task_util_est > cpu_util.
> (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
>
> (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
> otherwise must be small enough so that feec() takes the CPU as a
> potential target for the task placement)
>
> This is problematic for feec(), as cpu_util_next() might give an unfair
> advantage to a CPU which is mostly busy (2) compared to one which is
> mostly idle (1). _task_util_est being always bigger than task_util in
> feec() (as the task is waking up), the task contribution to the energy
> might look smaller on certain CPUs (2) and this breaks the energy
> comparison.
>
> This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
> their cpu_util < _task_util_est (1) while others will maintain cpu_util >
> _task_util_est (2).
>
> Fix this problem by always using max(task_util, _task_util_est) as a task
> contribution to the energy (ENERGY_UTIL). The new estimated CPU
> utilization for the energy would then be:
>
> max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)
>
> compute_energy() still needs to know which OPP would be selected if the
> task would be migrated in the perf_domain (FREQUENCY_UTIL). Hence,
> cpu_util_next() is still used to estimate the maximum util within the pd.
>
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Reviewed-by: Quentin Perret <qperret@google.com>
Thanks,
Quentin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy()
2021-02-25 8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
2021-02-25 8:52 ` Quentin Perret
@ 2021-02-25 11:45 ` Dietmar Eggemann
2021-02-25 11:58 ` Quentin Perret
2021-02-25 16:23 ` Vincent Donnefort
2021-03-02 9:01 ` [tip: sched/core] " tip-bot2 for Vincent Donnefort
` (2 subsequent siblings)
4 siblings, 2 replies; 16+ messages in thread
From: Dietmar Eggemann @ 2021-02-25 11:45 UTC (permalink / raw)
To: vincent.donnefort, peterz, mingo, vincent.guittot
Cc: linux-kernel, qperret, patrick.bellasi, valentin.schneider
On 25/02/2021 09:36, vincent.donnefort@arm.com wrote:
> From: Vincent Donnefort <vincent.donnefort@arm.com>
[...]
> cpu_util_next() estimates the CPU utilization that would happen if the
> task was placed on dst_cpu as follows:
>
> max(cpu_util + task_util, cpu_util_est + _task_util_est)
>
> The task contribution to the energy delta can then be either:
>
> (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
> and _task_util_est > cpu_util.
> (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
>
> (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
> otherwise must be small enough so that feec() takes the CPU as a
> potential target for the task placement)
I still don't quite get the reasoning for (2) why task_util is used as
task contribution.
So we use 'cpu_util + task_util' instead of 'cpu_util_est +
_task_util_est' in (2).
I.e. since _task_util_est is always >= task_util during wakeup, cpu_util
must be > cpu_util_est (by more than _task_util_est - task_util).
I can see it for a CPU whose cpu_util has a fair amount of contributions
from blocked tasks which cpu_util_est wouldn't have.
[...]
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7043bb0f2621..146ac9fec4b6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6573,8 +6573,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> * its pd list and will not be accounted by compute_energy().
> */
> for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> - unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
> - struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
> + unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> + unsigned long cpu_util, util_running = util_freq;
> + struct task_struct *tsk = NULL;
> +
> + /*
> + * When @p is placed on @cpu:
> + *
> + * util_running = max(cpu_util, cpu_util_est) +
> + * max(task_util, _task_util_est)
> + *
> + * while cpu_util_next is: max(cpu_util + task_util,
> + * cpu_util_est + _task_util_est)
> + */
Nit pick:
s/on @cpu/on @dst_cpu ?
s/while cpu_util_next is/while cpu_util_next(cpu, p, cpu) would be
If dst_cpu != cpu (including dst_cpu == -1) task_util and _task_util_est
are not added to util resp. util_est.
Not sure if this is clear from the source code here?
[...]
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy()
2021-02-25 11:45 ` Dietmar Eggemann
@ 2021-02-25 11:58 ` Quentin Perret
2021-02-25 16:23 ` Vincent Donnefort
1 sibling, 0 replies; 16+ messages in thread
From: Quentin Perret @ 2021-02-25 11:58 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: vincent.donnefort, peterz, mingo, vincent.guittot, linux-kernel,
patrick.bellasi, valentin.schneider
On Thursday 25 Feb 2021 at 12:45:06 (+0100), Dietmar Eggemann wrote:
> I.e. since _task_util_est is always >= task_util during wakeup, cpu_util
> must be > cpu_util_est (by more than _task_util_est - task_util).
>
> I can see it for a CPU whose cpu_util has a fair amount of contributions
> from blocked tasks which cpu_util_est wouldn't have.
Yes I think this can manifest itself if there are tasks changing
behaviour, or on an idle CPU as its ue.enqueued will be 0 and we'll be
left with blocked util only.
Thanks,
Quentin
^ permalink raw reply [flat|nested] 16+ messages in thread
* Re: [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy()
2021-02-25 11:45 ` Dietmar Eggemann
2021-02-25 11:58 ` Quentin Perret
@ 2021-02-25 16:23 ` Vincent Donnefort
1 sibling, 0 replies; 16+ messages in thread
From: Vincent Donnefort @ 2021-02-25 16:23 UTC (permalink / raw)
To: Dietmar Eggemann
Cc: peterz, mingo, vincent.guittot, linux-kernel, qperret,
patrick.bellasi, valentin.schneider
On Thu, Feb 25, 2021 at 12:45:06PM +0100, Dietmar Eggemann wrote:
> On 25/02/2021 09:36, vincent.donnefort@arm.com wrote:
> > From: Vincent Donnefort <vincent.donnefort@arm.com>
>
> [...]
>
> > cpu_util_next() estimates the CPU utilization that would happen if the
> > task was placed on dst_cpu as follows:
> >
> > max(cpu_util + task_util, cpu_util_est + _task_util_est)
> >
> > The task contribution to the energy delta can then be either:
> >
> > (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
> > and _task_util_est > cpu_util.
> > (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
> >
> > (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
> > otherwise must be small enough so that feec() takes the CPU as a
> > potential target for the task placement)
>
> I still don't quite get the reasoning for (2) why task_util is used as
> task contribution.
>
> So we use 'cpu_util + task_util' instead of 'cpu_util_est +
> _task_util_est' in (2).
>
> I.e. since _task_util_est is always >= task_util during wakeup, cpu_util
> must be > cpu_util_est (by more than _task_util_est - task_util).
>
> I can see it for a CPU whose cpu_util has a fair amount of contributions
> from blocked tasks which cpu_util_est wouldn't have.
>
> [...]
Yes exactly. I discovered this issue in a trace where an overutilized happened.
Many tasks were migrated to the biggest CPU, but once EAS was back on, it was too
late. The big CPU had a high util_avg and the task_util _task_util_est
unfairness kept placing tasks on that one, despite being inefficient. All the
tasks enqueued on that CPU were enough to keep util_avg high enough and that
situation wasn't resolving.
>
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7043bb0f2621..146ac9fec4b6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6573,8 +6573,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> > * its pd list and will not be accounted by compute_energy().
> > */
> > for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> > - unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
> > - struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
> > + unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> > + unsigned long cpu_util, util_running = util_freq;
> > + struct task_struct *tsk = NULL;
> > +
> > + /*
> > + * When @p is placed on @cpu:
> > + *
> > + * util_running = max(cpu_util, cpu_util_est) +
> > + * max(task_util, _task_util_est)
> > + *
> > + * while cpu_util_next is: max(cpu_util + task_util,
> > + * cpu_util_est + _task_util_est)
> > + */
>
> Nit pick:
>
> s/on @cpu/on @dst_cpu ?
I meant @cpu. When dst_cpu == cpu, it means that we simulate the task being
placed on cpu. That's what I wanted to highlight. But I can remove it if you
think this is not necessary.
>
> s/while cpu_util_next is/while cpu_util_next(cpu, p, cpu) would be
>
> If dst_cpu != cpu (including dst_cpu == -1) task_util and _task_util_est
> are not added to util resp. util_est.
>
> Not sure if this is clear from the source code here?
>
> [...]
>
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
^ permalink raw reply [flat|nested] 16+ messages in thread
* [tip: sched/core] sched/fair: Fix task utilization accountability in compute_energy()
2021-02-25 8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
2021-02-25 8:52 ` Quentin Perret
2021-02-25 11:45 ` Dietmar Eggemann
@ 2021-03-02 9:01 ` tip-bot2 for Vincent Donnefort
2021-03-03 9:49 ` tip-bot2 for Vincent Donnefort
2021-03-06 11:42 ` tip-bot2 for Vincent Donnefort
4 siblings, 0 replies; 16+ messages in thread
From: tip-bot2 for Vincent Donnefort @ 2021-03-02 9:01 UTC (permalink / raw)
To: linux-tip-commits
Cc: Vincent Donnefort, Peter Zijlstra (Intel),
Quentin Perret, Dietmar Eggemann, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 8af20c5fe756a9ff556c9b520201b2d158874481
Gitweb: https://git.kernel.org/tip/8af20c5fe756a9ff556c9b520201b2d158874481
Author: Vincent Donnefort <vincent.donnefort@arm.com>
AuthorDate: Thu, 25 Feb 2021 08:36:11
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 01 Mar 2021 18:17:25 +01:00
sched/fair: Fix task utilization accountability in compute_energy()
find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
energy delta as follows:
feec(task)
for_each_pd
base_energy = compute_energy(task, -1, pd)
-> for_each_cpu(pd)
-> cpu_util_next(cpu, task, -1)
energy_delta = compute_energy(task, dst_cpu, pd)
-> for_each_cpu(pd)
-> cpu_util_next(cpu, task, dst_cpu)
energy_delta -= base_energy
Then it picks the best CPU as being the one that minimizes energy_delta.
cpu_util_next() estimates the CPU utilization that would happen if the
task was placed on dst_cpu as follows:
max(cpu_util + task_util, cpu_util_est + _task_util_est)
The task contribution to the energy delta can then be either:
(1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
and _task_util_est > cpu_util.
(2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
(cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
otherwise must be small enough so that feec() takes the CPU as a
potential target for the task placement)
This is problematic for feec(), as cpu_util_next() might give an unfair
advantage to a CPU which is mostly busy (2) compared to one which is
mostly idle (1). _task_util_est being always bigger than task_util in
feec() (as the task is waking up), the task contribution to the energy
might look smaller on certain CPUs (2) and this breaks the energy
comparison.
This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
their cpu_util < _task_util_est (1) while others will maintain cpu_util >
_task_util_est (2).
Fix this problem by always using max(task_util, _task_util_est) as a task
contribution to the energy (ENERGY_UTIL). The new estimated CPU
utilization for the energy would then be:
max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)
compute_energy() still needs to know which OPP would be selected if the
task would be migrated in the perf_domain (FREQUENCY_UTIL). Hence,
cpu_util_next() is still used to estimate the maximum util within the pd.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20210225083612.1113823-2-vincent.donnefort@arm.com
---
kernel/sched/fair.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f1b55f9..b994db9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6518,8 +6518,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* its pd list and will not be accounted by compute_energy().
*/
for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
- unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
- struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
+ unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
+ unsigned long cpu_util, util_running = util_freq;
+ struct task_struct *tsk = NULL;
+
+ /*
+ * When @p is placed on @cpu:
+ *
+ * util_running = max(cpu_util, cpu_util_est) +
+ * max(task_util, _task_util_est)
+ *
+ * while cpu_util_next is: max(cpu_util + task_util,
+ * cpu_util_est + _task_util_est)
+ */
+ if (cpu == dst_cpu) {
+ tsk = p;
+ util_running =
+ cpu_util_next(cpu, p, -1) + task_util_est(p);
+ }
/*
* Busy time computation: utilization clamping is not
@@ -6527,7 +6543,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* is already enough to scale the EM reported power
* consumption at the (eventually clamped) cpu_capacity.
*/
- sum_util += effective_cpu_util(cpu, util_cfs, cpu_cap,
+ sum_util += effective_cpu_util(cpu, util_running, cpu_cap,
ENERGY_UTIL, NULL);
/*
@@ -6537,7 +6553,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
- cpu_util = effective_cpu_util(cpu, util_cfs, cpu_cap,
+ cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
FREQUENCY_UTIL, tsk);
max_util = max(max_util, cpu_util);
}
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [tip: sched/core] sched/fair: Fix task utilization accountability in compute_energy()
2021-02-25 8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
` (2 preceding siblings ...)
2021-03-02 9:01 ` [tip: sched/core] " tip-bot2 for Vincent Donnefort
@ 2021-03-03 9:49 ` tip-bot2 for Vincent Donnefort
2021-03-06 11:42 ` tip-bot2 for Vincent Donnefort
4 siblings, 0 replies; 16+ messages in thread
From: tip-bot2 for Vincent Donnefort @ 2021-03-03 9:49 UTC (permalink / raw)
To: linux-tip-commits
Cc: Vincent Donnefort, Peter Zijlstra (Intel),
Quentin Perret, Dietmar Eggemann, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 2d120f71df4baeb7694f513c86fe6f85940f6f76
Gitweb: https://git.kernel.org/tip/2d120f71df4baeb7694f513c86fe6f85940f6f76
Author: Vincent Donnefort <vincent.donnefort@arm.com>
AuthorDate: Thu, 25 Feb 2021 08:36:11
Committer: Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 03 Mar 2021 10:33:00 +01:00
sched/fair: Fix task utilization accountability in compute_energy()
find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
energy delta as follows:
feec(task)
for_each_pd
base_energy = compute_energy(task, -1, pd)
-> for_each_cpu(pd)
-> cpu_util_next(cpu, task, -1)
energy_delta = compute_energy(task, dst_cpu, pd)
-> for_each_cpu(pd)
-> cpu_util_next(cpu, task, dst_cpu)
energy_delta -= base_energy
Then it picks the best CPU as being the one that minimizes energy_delta.
cpu_util_next() estimates the CPU utilization that would happen if the
task was placed on dst_cpu as follows:
max(cpu_util + task_util, cpu_util_est + _task_util_est)
The task contribution to the energy delta can then be either:
(1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
and _task_util_est > cpu_util.
(2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
(cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
otherwise must be small enough so that feec() takes the CPU as a
potential target for the task placement)
This is problematic for feec(), as cpu_util_next() might give an unfair
advantage to a CPU which is mostly busy (2) compared to one which is
mostly idle (1). _task_util_est being always bigger than task_util in
feec() (as the task is waking up), the task contribution to the energy
might look smaller on certain CPUs (2) and this breaks the energy
comparison.
This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
their cpu_util < _task_util_est (1) while others will maintain cpu_util >
_task_util_est (2).
Fix this problem by always using max(task_util, _task_util_est) as a task
contribution to the energy (ENERGY_UTIL). The new estimated CPU
utilization for the energy would then be:
max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)
compute_energy() still needs to know which OPP would be selected if the
task would be migrated in the perf_domain (FREQUENCY_UTIL). Hence,
cpu_util_next() is still used to estimate the maximum util within the pd.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20210225083612.1113823-2-vincent.donnefort@arm.com
---
kernel/sched/fair.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f1b55f9..b994db9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6518,8 +6518,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* its pd list and will not be accounted by compute_energy().
*/
for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
- unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
- struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
+ unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
+ unsigned long cpu_util, util_running = util_freq;
+ struct task_struct *tsk = NULL;
+
+ /*
+ * When @p is placed on @cpu:
+ *
+ * util_running = max(cpu_util, cpu_util_est) +
+ * max(task_util, _task_util_est)
+ *
+ * while cpu_util_next is: max(cpu_util + task_util,
+ * cpu_util_est + _task_util_est)
+ */
+ if (cpu == dst_cpu) {
+ tsk = p;
+ util_running =
+ cpu_util_next(cpu, p, -1) + task_util_est(p);
+ }
/*
* Busy time computation: utilization clamping is not
@@ -6527,7 +6543,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* is already enough to scale the EM reported power
* consumption at the (eventually clamped) cpu_capacity.
*/
- sum_util += effective_cpu_util(cpu, util_cfs, cpu_cap,
+ sum_util += effective_cpu_util(cpu, util_running, cpu_cap,
ENERGY_UTIL, NULL);
/*
@@ -6537,7 +6553,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
- cpu_util = effective_cpu_util(cpu, util_cfs, cpu_cap,
+ cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
FREQUENCY_UTIL, tsk);
max_util = max(max_util, cpu_util);
}
^ permalink raw reply related [flat|nested] 16+ messages in thread
* [tip: sched/core] sched/fair: Fix task utilization accountability in compute_energy()
2021-02-25 8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
` (3 preceding siblings ...)
2021-03-03 9:49 ` tip-bot2 for Vincent Donnefort
@ 2021-03-06 11:42 ` tip-bot2 for Vincent Donnefort
4 siblings, 0 replies; 16+ messages in thread
From: tip-bot2 for Vincent Donnefort @ 2021-03-06 11:42 UTC (permalink / raw)
To: linux-tip-commits
Cc: Vincent Donnefort, Peter Zijlstra (Intel),
Ingo Molnar, Quentin Perret, Dietmar Eggemann, x86, linux-kernel
The following commit has been merged into the sched/core branch of tip:
Commit-ID: 0372e1cf70c28de6babcba38ef97b6ae3400b101
Gitweb: https://git.kernel.org/tip/0372e1cf70c28de6babcba38ef97b6ae3400b101
Author: Vincent Donnefort <vincent.donnefort@arm.com>
AuthorDate: Thu, 25 Feb 2021 08:36:11
Committer: Ingo Molnar <mingo@kernel.org>
CommitterDate: Sat, 06 Mar 2021 12:40:22 +01:00
sched/fair: Fix task utilization accountability in compute_energy()
find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
energy delta as follows:
feec(task)
for_each_pd
base_energy = compute_energy(task, -1, pd)
-> for_each_cpu(pd)
-> cpu_util_next(cpu, task, -1)
energy_delta = compute_energy(task, dst_cpu, pd)
-> for_each_cpu(pd)
-> cpu_util_next(cpu, task, dst_cpu)
energy_delta -= base_energy
Then it picks the best CPU as being the one that minimizes energy_delta.
cpu_util_next() estimates the CPU utilization that would happen if the
task was placed on dst_cpu as follows:
max(cpu_util + task_util, cpu_util_est + _task_util_est)
The task contribution to the energy delta can then be either:
(1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
and _task_util_est > cpu_util.
(2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
(cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
otherwise must be small enough so that feec() takes the CPU as a
potential target for the task placement)
This is problematic for feec(), as cpu_util_next() might give an unfair
advantage to a CPU which is mostly busy (2) compared to one which is
mostly idle (1). _task_util_est being always bigger than task_util in
feec() (as the task is waking up), the task contribution to the energy
might look smaller on certain CPUs (2) and this breaks the energy
comparison.
This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
their cpu_util < _task_util_est (1) while others will maintain cpu_util >
_task_util_est (2).
Fix this problem by always using max(task_util, _task_util_est) as a task
contribution to the energy (ENERGY_UTIL). The new estimated CPU
utilization for the energy would then be:
max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)
compute_energy() still needs to know which OPP would be selected if the
task would be migrated in the perf_domain (FREQUENCY_UTIL). Hence,
cpu_util_next() is still used to estimate the maximum util within the pd.
Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20210225083612.1113823-2-vincent.donnefort@arm.com
---
kernel/sched/fair.c | 24 ++++++++++++++++++++----
1 file changed, 20 insertions(+), 4 deletions(-)
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f1b55f9..b994db9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6518,8 +6518,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* its pd list and will not be accounted by compute_energy().
*/
for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
- unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
- struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
+ unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
+ unsigned long cpu_util, util_running = util_freq;
+ struct task_struct *tsk = NULL;
+
+ /*
+ * When @p is placed on @cpu:
+ *
+ * util_running = max(cpu_util, cpu_util_est) +
+ * max(task_util, _task_util_est)
+ *
+ * while cpu_util_next is: max(cpu_util + task_util,
+ * cpu_util_est + _task_util_est)
+ */
+ if (cpu == dst_cpu) {
+ tsk = p;
+ util_running =
+ cpu_util_next(cpu, p, -1) + task_util_est(p);
+ }
/*
* Busy time computation: utilization clamping is not
@@ -6527,7 +6543,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* is already enough to scale the EM reported power
* consumption at the (eventually clamped) cpu_capacity.
*/
- sum_util += effective_cpu_util(cpu, util_cfs, cpu_cap,
+ sum_util += effective_cpu_util(cpu, util_running, cpu_cap,
ENERGY_UTIL, NULL);
/*
@@ -6537,7 +6553,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
* NOTE: in case RT tasks are running, by default the
* FREQUENCY_UTIL's utilization can be max OPP.
*/
- cpu_util = effective_cpu_util(cpu, util_cfs, cpu_cap,
+ cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
FREQUENCY_UTIL, tsk);
max_util = max(max_util, cpu_util);
}
^ permalink raw reply related [flat|nested] 16+ messages in thread