linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v2 0/2] Fix task utilization accountability for EAS
@ 2021-02-25  8:36 vincent.donnefort
  2021-02-25  8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
                   ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: vincent.donnefort @ 2021-02-25  8:36 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: dietmar.eggemann, linux-kernel, qperret, patrick.bellasi,
	valentin.schneider, Vincent Donnefort

From: Vincent Donnefort <vincent.donnefort@arm.com>

Changelog since v1:
  - Fix the issue in compute_energy(), as a change in cpu_util_next() would
    break the OPP selection estimation.
  - Separate patch for lsub_positive usage in cpu_util_next()

Vincent Donnefort (2):
  sched/fair: Fix task utilization accountability in compute_energy()
  sched/fair: use lsub_positive in cpu_util_next()

 kernel/sched/fair.c | 26 +++++++++++++++++++++-----
 1 file changed, 21 insertions(+), 5 deletions(-)

-- 
2.25.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy()
  2021-02-25  8:36 [PATCH v2 0/2] Fix task utilization accountability for EAS vincent.donnefort
@ 2021-02-25  8:36 ` vincent.donnefort
  2021-02-25  8:52   ` Quentin Perret
                     ` (4 more replies)
  2021-02-25  8:36 ` [PATCH v2 2/2] sched/fair: use lsub_positive in cpu_util_next() vincent.donnefort
  2021-02-25 17:29 ` [PATCH v2 0/2] Fix task utilization accountability for EAS Peter Zijlstra
  2 siblings, 5 replies; 16+ messages in thread
From: vincent.donnefort @ 2021-02-25  8:36 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: dietmar.eggemann, linux-kernel, qperret, patrick.bellasi,
	valentin.schneider, Vincent Donnefort

From: Vincent Donnefort <vincent.donnefort@arm.com>

find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
energy delta as follows:

  feec(task)
    for_each_pd
      base_energy = compute_energy(task, -1, pd)
        -> for_each_cpu(pd)
           -> cpu_util_next(cpu, task, -1)

      energy_delta = compute_energy(task, dst_cpu, pd)
        -> for_each_cpu(pd)
           -> cpu_util_next(cpu, task, dst_cpu)
      energy_delta -= base_energy

Then it picks the best CPU as being the one that minimizes energy_delta.

cpu_util_next() estimates the CPU utilization that would happen if the
task was placed on dst_cpu as follows:

  max(cpu_util + task_util, cpu_util_est + _task_util_est)

The task contribution to the energy delta can then be either:

  (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
      and _task_util_est > cpu_util.
  (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.

  (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
   otherwise must be small enough so that feec() takes the CPU as a
   potential target for the task placement)

This is problematic for feec(), as cpu_util_next() might give an unfair
advantage to a CPU which is mostly busy (2) compared to one which is
mostly idle (1). _task_util_est being always bigger than task_util in
feec() (as the task is waking up), the task contribution to the energy
might look smaller on certain CPUs (2) and this breaks the energy
comparison.

This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
their cpu_util < _task_util_est (1) while others will maintain cpu_util >
_task_util_est (2).

Fix this problem by always using max(task_util, _task_util_est) as a task
contribution to the energy (ENERGY_UTIL). The new estimated CPU
utilization for the energy would then be:

  max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)

compute_energy() still needs to know which OPP would be selected if the
task would be migrated in the perf_domain (FREQUENCY_UTIL). Hence,
cpu_util_next() is still used to estimate the maximum util within the pd.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 7043bb0f2621..146ac9fec4b6 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6573,8 +6573,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 	 * its pd list and will not be accounted by compute_energy().
 	 */
 	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
-		unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
-		struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
+		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
+		unsigned long cpu_util, util_running = util_freq;
+		struct task_struct *tsk = NULL;
+
+		/*
+		 * When @p is placed on @cpu:
+		 *
+		 * util_running = max(cpu_util, cpu_util_est) +
+		 *		  max(task_util, _task_util_est)
+		 *
+		 * while cpu_util_next is: max(cpu_util + task_util,
+		 *			       cpu_util_est + _task_util_est)
+		 */
+		if (cpu == dst_cpu) {
+			tsk = p;
+			util_running =
+				cpu_util_next(cpu, p, -1) + task_util_est(p);
+		}
 
 		/*
 		 * Busy time computation: utilization clamping is not
@@ -6582,7 +6598,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * is already enough to scale the EM reported power
 		 * consumption at the (eventually clamped) cpu_capacity.
 		 */
-		sum_util += effective_cpu_util(cpu, util_cfs, cpu_cap,
+		sum_util += effective_cpu_util(cpu, util_running, cpu_cap,
 					       ENERGY_UTIL, NULL);
 
 		/*
@@ -6592,7 +6608,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		cpu_util = effective_cpu_util(cpu, util_cfs, cpu_cap,
+		cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
 					      FREQUENCY_UTIL, tsk);
 		max_util = max(max_util, cpu_util);
 	}
-- 
2.25.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* [PATCH v2 2/2] sched/fair: use lsub_positive in cpu_util_next()
  2021-02-25  8:36 [PATCH v2 0/2] Fix task utilization accountability for EAS vincent.donnefort
  2021-02-25  8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
@ 2021-02-25  8:36 ` vincent.donnefort
  2021-02-25  8:53   ` Quentin Perret
                     ` (4 more replies)
  2021-02-25 17:29 ` [PATCH v2 0/2] Fix task utilization accountability for EAS Peter Zijlstra
  2 siblings, 5 replies; 16+ messages in thread
From: vincent.donnefort @ 2021-02-25  8:36 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: dietmar.eggemann, linux-kernel, qperret, patrick.bellasi,
	valentin.schneider, Vincent Donnefort

From: Vincent Donnefort <vincent.donnefort@arm.com>

The sub_positive local version is saving an explicit load-store and is
enough for the cpu_util_next() usage.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 146ac9fec4b6..1364f8b95214 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6525,7 +6525,7 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
 	 * util_avg should already be correct.
 	 */
 	if (task_cpu(p) == cpu && dst_cpu != cpu)
-		sub_positive(&util, task_util(p));
+		lsub_positive(&util, task_util(p));
 	else if (task_cpu(p) != cpu && dst_cpu == cpu)
 		util += task_util(p);
 
-- 
2.25.1


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy()
  2021-02-25  8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
@ 2021-02-25  8:52   ` Quentin Perret
  2021-02-25 11:45   ` Dietmar Eggemann
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Quentin Perret @ 2021-02-25  8:52 UTC (permalink / raw)
  To: vincent.donnefort
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, linux-kernel,
	patrick.bellasi, valentin.schneider

On Thursday 25 Feb 2021 at 08:36:11 (+0000), vincent.donnefort@arm.com wrote:
> From: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
> energy delta as follows:
> 
>   feec(task)
>     for_each_pd
>       base_energy = compute_energy(task, -1, pd)
>         -> for_each_cpu(pd)
>            -> cpu_util_next(cpu, task, -1)
> 
>       energy_delta = compute_energy(task, dst_cpu, pd)
>         -> for_each_cpu(pd)
>            -> cpu_util_next(cpu, task, dst_cpu)
>       energy_delta -= base_energy
> 
> Then it picks the best CPU as being the one that minimizes energy_delta.
> 
> cpu_util_next() estimates the CPU utilization that would happen if the
> task was placed on dst_cpu as follows:
> 
>   max(cpu_util + task_util, cpu_util_est + _task_util_est)
> 
> The task contribution to the energy delta can then be either:
> 
>   (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
>       and _task_util_est > cpu_util.
>   (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
> 
>   (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
>    otherwise must be small enough so that feec() takes the CPU as a
>    potential target for the task placement)
> 
> This is problematic for feec(), as cpu_util_next() might give an unfair
> advantage to a CPU which is mostly busy (2) compared to one which is
> mostly idle (1). _task_util_est being always bigger than task_util in
> feec() (as the task is waking up), the task contribution to the energy
> might look smaller on certain CPUs (2) and this breaks the energy
> comparison.
> 
> This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
> their cpu_util < _task_util_est (1) while others will maintain cpu_util >
> _task_util_est (2).
> 
> Fix this problem by always using max(task_util, _task_util_est) as a task
> contribution to the energy (ENERGY_UTIL). The new estimated CPU
> utilization for the energy would then be:
> 
>   max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)
> 
> compute_energy() still needs to know which OPP would be selected if the
> task would be migrated in the perf_domain (FREQUENCY_UTIL). Hence,
> cpu_util_next() is still used to estimate the maximum util within the pd.
> 
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

Reviewed-by: Quentin Perret <qperret@google.com>

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: use lsub_positive in cpu_util_next()
  2021-02-25  8:36 ` [PATCH v2 2/2] sched/fair: use lsub_positive in cpu_util_next() vincent.donnefort
@ 2021-02-25  8:53   ` Quentin Perret
  2021-02-25 11:46   ` Dietmar Eggemann
                     ` (3 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Quentin Perret @ 2021-02-25  8:53 UTC (permalink / raw)
  To: vincent.donnefort
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, linux-kernel,
	patrick.bellasi, valentin.schneider

On Thursday 25 Feb 2021 at 08:36:12 (+0000), vincent.donnefort@arm.com wrote:
> From: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> The sub_positive local version is saving an explicit load-store and is
> enough for the cpu_util_next() usage.
> 
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

Reviewed-by: Quentin Perret <qperret@google.com>

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy()
  2021-02-25  8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
  2021-02-25  8:52   ` Quentin Perret
@ 2021-02-25 11:45   ` Dietmar Eggemann
  2021-02-25 11:58     ` Quentin Perret
  2021-02-25 16:23     ` Vincent Donnefort
  2021-03-02  9:01   ` [tip: sched/core] " tip-bot2 for Vincent Donnefort
                     ` (2 subsequent siblings)
  4 siblings, 2 replies; 16+ messages in thread
From: Dietmar Eggemann @ 2021-02-25 11:45 UTC (permalink / raw)
  To: vincent.donnefort, peterz, mingo, vincent.guittot
  Cc: linux-kernel, qperret, patrick.bellasi, valentin.schneider

On 25/02/2021 09:36, vincent.donnefort@arm.com wrote:
> From: Vincent Donnefort <vincent.donnefort@arm.com>

[...]

> cpu_util_next() estimates the CPU utilization that would happen if the
> task was placed on dst_cpu as follows:
> 
>   max(cpu_util + task_util, cpu_util_est + _task_util_est)
> 
> The task contribution to the energy delta can then be either:
> 
>   (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
>       and _task_util_est > cpu_util.
>   (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
> 
>   (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
>    otherwise must be small enough so that feec() takes the CPU as a
>    potential target for the task placement)

I still don't quite get the reasoning for (2) why task_util is used as
task contribution.

So we use 'cpu_util + task_util' instead of 'cpu_util_est +
_task_util_est' in (2).

I.e. since _task_util_est is always >= task_util during wakeup, cpu_util
must be > cpu_util_est (by more than _task_util_est - task_util).

I can see it for a CPU whose cpu_util has a fair amount of contributions
from blocked tasks which cpu_util_est wouldn't have.

[...]

> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 7043bb0f2621..146ac9fec4b6 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6573,8 +6573,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>  	 * its pd list and will not be accounted by compute_energy().
>  	 */
>  	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> -		unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
> -		struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
> +		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> +		unsigned long cpu_util, util_running = util_freq;
> +		struct task_struct *tsk = NULL;
> +
> +		/*
> +		 * When @p is placed on @cpu:
> +		 *
> +		 * util_running = max(cpu_util, cpu_util_est) +
> +		 *		  max(task_util, _task_util_est)
> +		 *
> +		 * while cpu_util_next is: max(cpu_util + task_util,
> +		 *			       cpu_util_est + _task_util_est)
> +		 */

Nit pick:

s/on @cpu/on @dst_cpu ?

s/while cpu_util_next is/while cpu_util_next(cpu, p, cpu) would be

If dst_cpu != cpu (including dst_cpu == -1) task_util and _task_util_est
are not added to util resp. util_est.

Not sure if this is clear from the source code here?

[...]

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 2/2] sched/fair: use lsub_positive in cpu_util_next()
  2021-02-25  8:36 ` [PATCH v2 2/2] sched/fair: use lsub_positive in cpu_util_next() vincent.donnefort
  2021-02-25  8:53   ` Quentin Perret
@ 2021-02-25 11:46   ` Dietmar Eggemann
  2021-03-02  9:01   ` [tip: sched/core] " tip-bot2 for Vincent Donnefort
                     ` (2 subsequent siblings)
  4 siblings, 0 replies; 16+ messages in thread
From: Dietmar Eggemann @ 2021-02-25 11:46 UTC (permalink / raw)
  To: vincent.donnefort, peterz, mingo, vincent.guittot
  Cc: linux-kernel, qperret, patrick.bellasi, valentin.schneider

On 25/02/2021 09:36, vincent.donnefort@arm.com wrote:
> From: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> The sub_positive local version is saving an explicit load-store and is
> enough for the cpu_util_next() usage.
> 
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 146ac9fec4b6..1364f8b95214 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6525,7 +6525,7 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
>  	 * util_avg should already be correct.
>  	 */
>  	if (task_cpu(p) == cpu && dst_cpu != cpu)
> -		sub_positive(&util, task_util(p));
> +		lsub_positive(&util, task_util(p));
>  	else if (task_cpu(p) != cpu && dst_cpu == cpu)
>  		util += task_util(p);

Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>


^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy()
  2021-02-25 11:45   ` Dietmar Eggemann
@ 2021-02-25 11:58     ` Quentin Perret
  2021-02-25 16:23     ` Vincent Donnefort
  1 sibling, 0 replies; 16+ messages in thread
From: Quentin Perret @ 2021-02-25 11:58 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: vincent.donnefort, peterz, mingo, vincent.guittot, linux-kernel,
	patrick.bellasi, valentin.schneider

On Thursday 25 Feb 2021 at 12:45:06 (+0100), Dietmar Eggemann wrote:
> I.e. since _task_util_est is always >= task_util during wakeup, cpu_util
> must be > cpu_util_est (by more than _task_util_est - task_util).
> 
> I can see it for a CPU whose cpu_util has a fair amount of contributions
> from blocked tasks which cpu_util_est wouldn't have.

Yes I think this can manifest itself if there are tasks changing
behaviour, or on an idle CPU as its ue.enqueued will be 0 and we'll be
left with blocked util only.

Thanks,
Quentin

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy()
  2021-02-25 11:45   ` Dietmar Eggemann
  2021-02-25 11:58     ` Quentin Perret
@ 2021-02-25 16:23     ` Vincent Donnefort
  1 sibling, 0 replies; 16+ messages in thread
From: Vincent Donnefort @ 2021-02-25 16:23 UTC (permalink / raw)
  To: Dietmar Eggemann
  Cc: peterz, mingo, vincent.guittot, linux-kernel, qperret,
	patrick.bellasi, valentin.schneider

On Thu, Feb 25, 2021 at 12:45:06PM +0100, Dietmar Eggemann wrote:
> On 25/02/2021 09:36, vincent.donnefort@arm.com wrote:
> > From: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> [...]
> 
> > cpu_util_next() estimates the CPU utilization that would happen if the
> > task was placed on dst_cpu as follows:
> > 
> >   max(cpu_util + task_util, cpu_util_est + _task_util_est)
> > 
> > The task contribution to the energy delta can then be either:
> > 
> >   (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
> >       and _task_util_est > cpu_util.
> >   (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.
> > 
> >   (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
> >    otherwise must be small enough so that feec() takes the CPU as a
> >    potential target for the task placement)
> 
> I still don't quite get the reasoning for (2) why task_util is used as
> task contribution.
> 
> So we use 'cpu_util + task_util' instead of 'cpu_util_est +
> _task_util_est' in (2).
> 
> I.e. since _task_util_est is always >= task_util during wakeup, cpu_util
> must be > cpu_util_est (by more than _task_util_est - task_util).
> 
> I can see it for a CPU whose cpu_util has a fair amount of contributions
> from blocked tasks which cpu_util_est wouldn't have.
> 
> [...]

Yes exactly. I discovered this issue in a trace where an overutilized happened.
Many tasks were migrated to the biggest CPU, but once EAS was back on, it was too
late. The big CPU had a high util_avg and the task_util _task_util_est
unfairness kept placing tasks on that one, despite being inefficient. All the
tasks enqueued on that CPU were enough to keep util_avg high enough and that
situation wasn't resolving.

> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 7043bb0f2621..146ac9fec4b6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -6573,8 +6573,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
> >  	 * its pd list and will not be accounted by compute_energy().
> >  	 */
> >  	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> > -		unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
> > -		struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
> > +		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> > +		unsigned long cpu_util, util_running = util_freq;
> > +		struct task_struct *tsk = NULL;
> > +
> > +		/*
> > +		 * When @p is placed on @cpu:
> > +		 *
> > +		 * util_running = max(cpu_util, cpu_util_est) +
> > +		 *		  max(task_util, _task_util_est)
> > +		 *
> > +		 * while cpu_util_next is: max(cpu_util + task_util,
> > +		 *			       cpu_util_est + _task_util_est)
> > +		 */
> 
> Nit pick:
> 
> s/on @cpu/on @dst_cpu ?

I meant @cpu. When dst_cpu == cpu, it means that we simulate the task being
placed on cpu. That's what I wanted to highlight. But I can remove it if you
think this is not necessary.

> 
> s/while cpu_util_next is/while cpu_util_next(cpu, p, cpu) would be
> 
> If dst_cpu != cpu (including dst_cpu == -1) task_util and _task_util_est
> are not added to util resp. util_est.
> 
> Not sure if this is clear from the source code here?
> 
> [...]
> 
> Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH v2 0/2] Fix task utilization accountability for EAS
  2021-02-25  8:36 [PATCH v2 0/2] Fix task utilization accountability for EAS vincent.donnefort
  2021-02-25  8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
  2021-02-25  8:36 ` [PATCH v2 2/2] sched/fair: use lsub_positive in cpu_util_next() vincent.donnefort
@ 2021-02-25 17:29 ` Peter Zijlstra
  2 siblings, 0 replies; 16+ messages in thread
From: Peter Zijlstra @ 2021-02-25 17:29 UTC (permalink / raw)
  To: vincent.donnefort
  Cc: mingo, vincent.guittot, dietmar.eggemann, linux-kernel, qperret,
	patrick.bellasi, valentin.schneider

On Thu, Feb 25, 2021 at 08:36:10AM +0000, vincent.donnefort@arm.com wrote:
> From: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> Changelog since v1:
>   - Fix the issue in compute_energy(), as a change in cpu_util_next() would
>     break the OPP selection estimation.
>   - Separate patch for lsub_positive usage in cpu_util_next()
> 
> Vincent Donnefort (2):
>   sched/fair: Fix task utilization accountability in compute_energy()
>   sched/fair: use lsub_positive in cpu_util_next()
> 
>  kernel/sched/fair.c | 26 +++++++++++++++++++++-----
>  1 file changed, 21 insertions(+), 5 deletions(-)

Thanks!

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [tip: sched/core] sched/fair: use lsub_positive in cpu_util_next()
  2021-02-25  8:36 ` [PATCH v2 2/2] sched/fair: use lsub_positive in cpu_util_next() vincent.donnefort
  2021-02-25  8:53   ` Quentin Perret
  2021-02-25 11:46   ` Dietmar Eggemann
@ 2021-03-02  9:01   ` tip-bot2 for Vincent Donnefort
  2021-03-03  9:49   ` tip-bot2 for Vincent Donnefort
  2021-03-06 11:42   ` tip-bot2 for Vincent Donnefort
  4 siblings, 0 replies; 16+ messages in thread
From: tip-bot2 for Vincent Donnefort @ 2021-03-02  9:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Donnefort, Peter Zijlstra (Intel),
	Quentin Perret, Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     4537b36ee8a290376174b297d6812cba1375ea79
Gitweb:        https://git.kernel.org/tip/4537b36ee8a290376174b297d6812cba1375ea79
Author:        Vincent Donnefort <vincent.donnefort@arm.com>
AuthorDate:    Thu, 25 Feb 2021 08:36:12 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 01 Mar 2021 18:17:25 +01:00

sched/fair: use lsub_positive in cpu_util_next()

The sub_positive local version is saving an explicit load-store and is
enough for the cpu_util_next() usage.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20210225083612.1113823-3-vincent.donnefort@arm.com
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b994db9..7b2fac0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6471,7 +6471,7 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
 	 * util_avg should already be correct.
 	 */
 	if (task_cpu(p) == cpu && dst_cpu != cpu)
-		sub_positive(&util, task_util(p));
+		lsub_positive(&util, task_util(p));
 	else if (task_cpu(p) != cpu && dst_cpu == cpu)
 		util += task_util(p);
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [tip: sched/core] sched/fair: Fix task utilization accountability in compute_energy()
  2021-02-25  8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
  2021-02-25  8:52   ` Quentin Perret
  2021-02-25 11:45   ` Dietmar Eggemann
@ 2021-03-02  9:01   ` tip-bot2 for Vincent Donnefort
  2021-03-03  9:49   ` tip-bot2 for Vincent Donnefort
  2021-03-06 11:42   ` tip-bot2 for Vincent Donnefort
  4 siblings, 0 replies; 16+ messages in thread
From: tip-bot2 for Vincent Donnefort @ 2021-03-02  9:01 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Donnefort, Peter Zijlstra (Intel),
	Quentin Perret, Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     8af20c5fe756a9ff556c9b520201b2d158874481
Gitweb:        https://git.kernel.org/tip/8af20c5fe756a9ff556c9b520201b2d158874481
Author:        Vincent Donnefort <vincent.donnefort@arm.com>
AuthorDate:    Thu, 25 Feb 2021 08:36:11 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Mon, 01 Mar 2021 18:17:25 +01:00

sched/fair: Fix task utilization accountability in compute_energy()

find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
energy delta as follows:

  feec(task)
    for_each_pd
      base_energy = compute_energy(task, -1, pd)
        -> for_each_cpu(pd)
           -> cpu_util_next(cpu, task, -1)

      energy_delta = compute_energy(task, dst_cpu, pd)
        -> for_each_cpu(pd)
           -> cpu_util_next(cpu, task, dst_cpu)
      energy_delta -= base_energy

Then it picks the best CPU as being the one that minimizes energy_delta.

cpu_util_next() estimates the CPU utilization that would happen if the
task was placed on dst_cpu as follows:

  max(cpu_util + task_util, cpu_util_est + _task_util_est)

The task contribution to the energy delta can then be either:

  (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
      and _task_util_est > cpu_util.
  (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.

  (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
   otherwise must be small enough so that feec() takes the CPU as a
   potential target for the task placement)

This is problematic for feec(), as cpu_util_next() might give an unfair
advantage to a CPU which is mostly busy (2) compared to one which is
mostly idle (1). _task_util_est being always bigger than task_util in
feec() (as the task is waking up), the task contribution to the energy
might look smaller on certain CPUs (2) and this breaks the energy
comparison.

This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
their cpu_util < _task_util_est (1) while others will maintain cpu_util >
_task_util_est (2).

Fix this problem by always using max(task_util, _task_util_est) as a task
contribution to the energy (ENERGY_UTIL). The new estimated CPU
utilization for the energy would then be:

  max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)

compute_energy() still needs to know which OPP would be selected if the
task would be migrated in the perf_domain (FREQUENCY_UTIL). Hence,
cpu_util_next() is still used to estimate the maximum util within the pd.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20210225083612.1113823-2-vincent.donnefort@arm.com
---
 kernel/sched/fair.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f1b55f9..b994db9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6518,8 +6518,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 	 * its pd list and will not be accounted by compute_energy().
 	 */
 	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
-		unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
-		struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
+		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
+		unsigned long cpu_util, util_running = util_freq;
+		struct task_struct *tsk = NULL;
+
+		/*
+		 * When @p is placed on @cpu:
+		 *
+		 * util_running = max(cpu_util, cpu_util_est) +
+		 *		  max(task_util, _task_util_est)
+		 *
+		 * while cpu_util_next is: max(cpu_util + task_util,
+		 *			       cpu_util_est + _task_util_est)
+		 */
+		if (cpu == dst_cpu) {
+			tsk = p;
+			util_running =
+				cpu_util_next(cpu, p, -1) + task_util_est(p);
+		}
 
 		/*
 		 * Busy time computation: utilization clamping is not
@@ -6527,7 +6543,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * is already enough to scale the EM reported power
 		 * consumption at the (eventually clamped) cpu_capacity.
 		 */
-		sum_util += effective_cpu_util(cpu, util_cfs, cpu_cap,
+		sum_util += effective_cpu_util(cpu, util_running, cpu_cap,
 					       ENERGY_UTIL, NULL);
 
 		/*
@@ -6537,7 +6553,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		cpu_util = effective_cpu_util(cpu, util_cfs, cpu_cap,
+		cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
 					      FREQUENCY_UTIL, tsk);
 		max_util = max(max_util, cpu_util);
 	}

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [tip: sched/core] sched/fair: Fix task utilization accountability in compute_energy()
  2021-02-25  8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
                     ` (2 preceding siblings ...)
  2021-03-02  9:01   ` [tip: sched/core] " tip-bot2 for Vincent Donnefort
@ 2021-03-03  9:49   ` tip-bot2 for Vincent Donnefort
  2021-03-06 11:42   ` tip-bot2 for Vincent Donnefort
  4 siblings, 0 replies; 16+ messages in thread
From: tip-bot2 for Vincent Donnefort @ 2021-03-03  9:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Donnefort, Peter Zijlstra (Intel),
	Quentin Perret, Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     2d120f71df4baeb7694f513c86fe6f85940f6f76
Gitweb:        https://git.kernel.org/tip/2d120f71df4baeb7694f513c86fe6f85940f6f76
Author:        Vincent Donnefort <vincent.donnefort@arm.com>
AuthorDate:    Thu, 25 Feb 2021 08:36:11 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 03 Mar 2021 10:33:00 +01:00

sched/fair: Fix task utilization accountability in compute_energy()

find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
energy delta as follows:

  feec(task)
    for_each_pd
      base_energy = compute_energy(task, -1, pd)
        -> for_each_cpu(pd)
           -> cpu_util_next(cpu, task, -1)

      energy_delta = compute_energy(task, dst_cpu, pd)
        -> for_each_cpu(pd)
           -> cpu_util_next(cpu, task, dst_cpu)
      energy_delta -= base_energy

Then it picks the best CPU as being the one that minimizes energy_delta.

cpu_util_next() estimates the CPU utilization that would happen if the
task was placed on dst_cpu as follows:

  max(cpu_util + task_util, cpu_util_est + _task_util_est)

The task contribution to the energy delta can then be either:

  (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
      and _task_util_est > cpu_util.
  (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.

  (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
   otherwise must be small enough so that feec() takes the CPU as a
   potential target for the task placement)

This is problematic for feec(), as cpu_util_next() might give an unfair
advantage to a CPU which is mostly busy (2) compared to one which is
mostly idle (1). _task_util_est being always bigger than task_util in
feec() (as the task is waking up), the task contribution to the energy
might look smaller on certain CPUs (2) and this breaks the energy
comparison.

This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
their cpu_util < _task_util_est (1) while others will maintain cpu_util >
_task_util_est (2).

Fix this problem by always using max(task_util, _task_util_est) as a task
contribution to the energy (ENERGY_UTIL). The new estimated CPU
utilization for the energy would then be:

  max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)

compute_energy() still needs to know which OPP would be selected if the
task would be migrated in the perf_domain (FREQUENCY_UTIL). Hence,
cpu_util_next() is still used to estimate the maximum util within the pd.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20210225083612.1113823-2-vincent.donnefort@arm.com
---
 kernel/sched/fair.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f1b55f9..b994db9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6518,8 +6518,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 	 * its pd list and will not be accounted by compute_energy().
 	 */
 	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
-		unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
-		struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
+		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
+		unsigned long cpu_util, util_running = util_freq;
+		struct task_struct *tsk = NULL;
+
+		/*
+		 * When @p is placed on @cpu:
+		 *
+		 * util_running = max(cpu_util, cpu_util_est) +
+		 *		  max(task_util, _task_util_est)
+		 *
+		 * while cpu_util_next is: max(cpu_util + task_util,
+		 *			       cpu_util_est + _task_util_est)
+		 */
+		if (cpu == dst_cpu) {
+			tsk = p;
+			util_running =
+				cpu_util_next(cpu, p, -1) + task_util_est(p);
+		}
 
 		/*
 		 * Busy time computation: utilization clamping is not
@@ -6527,7 +6543,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * is already enough to scale the EM reported power
 		 * consumption at the (eventually clamped) cpu_capacity.
 		 */
-		sum_util += effective_cpu_util(cpu, util_cfs, cpu_cap,
+		sum_util += effective_cpu_util(cpu, util_running, cpu_cap,
 					       ENERGY_UTIL, NULL);
 
 		/*
@@ -6537,7 +6553,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		cpu_util = effective_cpu_util(cpu, util_cfs, cpu_cap,
+		cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
 					      FREQUENCY_UTIL, tsk);
 		max_util = max(max_util, cpu_util);
 	}

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [tip: sched/core] sched/fair: use lsub_positive in cpu_util_next()
  2021-02-25  8:36 ` [PATCH v2 2/2] sched/fair: use lsub_positive in cpu_util_next() vincent.donnefort
                     ` (2 preceding siblings ...)
  2021-03-02  9:01   ` [tip: sched/core] " tip-bot2 for Vincent Donnefort
@ 2021-03-03  9:49   ` tip-bot2 for Vincent Donnefort
  2021-03-06 11:42   ` tip-bot2 for Vincent Donnefort
  4 siblings, 0 replies; 16+ messages in thread
From: tip-bot2 for Vincent Donnefort @ 2021-03-03  9:49 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Donnefort, Peter Zijlstra (Intel),
	Quentin Perret, Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     b641a8b52c6162172ca31590510569eaadcd5e49
Gitweb:        https://git.kernel.org/tip/b641a8b52c6162172ca31590510569eaadcd5e49
Author:        Vincent Donnefort <vincent.donnefort@arm.com>
AuthorDate:    Thu, 25 Feb 2021 08:36:12 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Wed, 03 Mar 2021 10:33:00 +01:00

sched/fair: use lsub_positive in cpu_util_next()

The sub_positive local version is saving an explicit load-store and is
enough for the cpu_util_next() usage.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20210225083612.1113823-3-vincent.donnefort@arm.com
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b994db9..7b2fac0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6471,7 +6471,7 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
 	 * util_avg should already be correct.
 	 */
 	if (task_cpu(p) == cpu && dst_cpu != cpu)
-		sub_positive(&util, task_util(p));
+		lsub_positive(&util, task_util(p));
 	else if (task_cpu(p) != cpu && dst_cpu == cpu)
 		util += task_util(p);
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [tip: sched/core] sched/fair: use lsub_positive in cpu_util_next()
  2021-02-25  8:36 ` [PATCH v2 2/2] sched/fair: use lsub_positive in cpu_util_next() vincent.donnefort
                     ` (3 preceding siblings ...)
  2021-03-03  9:49   ` tip-bot2 for Vincent Donnefort
@ 2021-03-06 11:42   ` tip-bot2 for Vincent Donnefort
  4 siblings, 0 replies; 16+ messages in thread
From: tip-bot2 for Vincent Donnefort @ 2021-03-06 11:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Donnefort, Peter Zijlstra (Intel),
	Ingo Molnar, Quentin Perret, Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     736cc6b31102236a55470c72523ed0a65eb3f804
Gitweb:        https://git.kernel.org/tip/736cc6b31102236a55470c72523ed0a65eb3f804
Author:        Vincent Donnefort <vincent.donnefort@arm.com>
AuthorDate:    Thu, 25 Feb 2021 08:36:12 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Sat, 06 Mar 2021 12:40:22 +01:00

sched/fair: use lsub_positive in cpu_util_next()

The sub_positive local version is saving an explicit load-store and is
enough for the cpu_util_next() usage.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20210225083612.1113823-3-vincent.donnefort@arm.com
---
 kernel/sched/fair.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index b994db9..7b2fac0 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6471,7 +6471,7 @@ static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
 	 * util_avg should already be correct.
 	 */
 	if (task_cpu(p) == cpu && dst_cpu != cpu)
-		sub_positive(&util, task_util(p));
+		lsub_positive(&util, task_util(p));
 	else if (task_cpu(p) != cpu && dst_cpu == cpu)
 		util += task_util(p);
 

^ permalink raw reply	[flat|nested] 16+ messages in thread

* [tip: sched/core] sched/fair: Fix task utilization accountability in compute_energy()
  2021-02-25  8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
                     ` (3 preceding siblings ...)
  2021-03-03  9:49   ` tip-bot2 for Vincent Donnefort
@ 2021-03-06 11:42   ` tip-bot2 for Vincent Donnefort
  4 siblings, 0 replies; 16+ messages in thread
From: tip-bot2 for Vincent Donnefort @ 2021-03-06 11:42 UTC (permalink / raw)
  To: linux-tip-commits
  Cc: Vincent Donnefort, Peter Zijlstra (Intel),
	Ingo Molnar, Quentin Perret, Dietmar Eggemann, x86, linux-kernel

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     0372e1cf70c28de6babcba38ef97b6ae3400b101
Gitweb:        https://git.kernel.org/tip/0372e1cf70c28de6babcba38ef97b6ae3400b101
Author:        Vincent Donnefort <vincent.donnefort@arm.com>
AuthorDate:    Thu, 25 Feb 2021 08:36:11 
Committer:     Ingo Molnar <mingo@kernel.org>
CommitterDate: Sat, 06 Mar 2021 12:40:22 +01:00

sched/fair: Fix task utilization accountability in compute_energy()

find_energy_efficient_cpu() (feec()) computes for each perf_domain (pd) an
energy delta as follows:

  feec(task)
    for_each_pd
      base_energy = compute_energy(task, -1, pd)
        -> for_each_cpu(pd)
           -> cpu_util_next(cpu, task, -1)

      energy_delta = compute_energy(task, dst_cpu, pd)
        -> for_each_cpu(pd)
           -> cpu_util_next(cpu, task, dst_cpu)
      energy_delta -= base_energy

Then it picks the best CPU as being the one that minimizes energy_delta.

cpu_util_next() estimates the CPU utilization that would happen if the
task was placed on dst_cpu as follows:

  max(cpu_util + task_util, cpu_util_est + _task_util_est)

The task contribution to the energy delta can then be either:

  (1) _task_util_est, on a mostly idle CPU, where cpu_util is close to 0
      and _task_util_est > cpu_util.
  (2) task_util, on a mostly busy CPU, where cpu_util > _task_util_est.

  (cpu_util_est doesn't appear here. It is 0 when a CPU is idle and
   otherwise must be small enough so that feec() takes the CPU as a
   potential target for the task placement)

This is problematic for feec(), as cpu_util_next() might give an unfair
advantage to a CPU which is mostly busy (2) compared to one which is
mostly idle (1). _task_util_est being always bigger than task_util in
feec() (as the task is waking up), the task contribution to the energy
might look smaller on certain CPUs (2) and this breaks the energy
comparison.

This issue is, moreover, not sporadic. By starving idle CPUs, it keeps
their cpu_util < _task_util_est (1) while others will maintain cpu_util >
_task_util_est (2).

Fix this problem by always using max(task_util, _task_util_est) as a task
contribution to the energy (ENERGY_UTIL). The new estimated CPU
utilization for the energy would then be:

  max(cpu_util, cpu_util_est) + max(task_util, _task_util_est)

compute_energy() still needs to know which OPP would be selected if the
task would be migrated in the perf_domain (FREQUENCY_UTIL). Hence,
cpu_util_next() is still used to estimate the maximum util within the pd.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Quentin Perret <qperret@google.com>
Reviewed-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
Link: https://lkml.kernel.org/r/20210225083612.1113823-2-vincent.donnefort@arm.com
---
 kernel/sched/fair.c | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index f1b55f9..b994db9 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6518,8 +6518,24 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 	 * its pd list and will not be accounted by compute_energy().
 	 */
 	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
-		unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
-		struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
+		unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
+		unsigned long cpu_util, util_running = util_freq;
+		struct task_struct *tsk = NULL;
+
+		/*
+		 * When @p is placed on @cpu:
+		 *
+		 * util_running = max(cpu_util, cpu_util_est) +
+		 *		  max(task_util, _task_util_est)
+		 *
+		 * while cpu_util_next is: max(cpu_util + task_util,
+		 *			       cpu_util_est + _task_util_est)
+		 */
+		if (cpu == dst_cpu) {
+			tsk = p;
+			util_running =
+				cpu_util_next(cpu, p, -1) + task_util_est(p);
+		}
 
 		/*
 		 * Busy time computation: utilization clamping is not
@@ -6527,7 +6543,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * is already enough to scale the EM reported power
 		 * consumption at the (eventually clamped) cpu_capacity.
 		 */
-		sum_util += effective_cpu_util(cpu, util_cfs, cpu_cap,
+		sum_util += effective_cpu_util(cpu, util_running, cpu_cap,
 					       ENERGY_UTIL, NULL);
 
 		/*
@@ -6537,7 +6553,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
 		 * NOTE: in case RT tasks are running, by default the
 		 * FREQUENCY_UTIL's utilization can be max OPP.
 		 */
-		cpu_util = effective_cpu_util(cpu, util_cfs, cpu_cap,
+		cpu_util = effective_cpu_util(cpu, util_freq, cpu_cap,
 					      FREQUENCY_UTIL, tsk);
 		max_util = max(max_util, cpu_util);
 	}

^ permalink raw reply	[flat|nested] 16+ messages in thread

end of thread, other threads:[~2021-03-06 11:43 UTC | newest]

Thread overview: 16+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-02-25  8:36 [PATCH v2 0/2] Fix task utilization accountability for EAS vincent.donnefort
2021-02-25  8:36 ` [PATCH v2 1/2] sched/fair: Fix task utilization accountability in compute_energy() vincent.donnefort
2021-02-25  8:52   ` Quentin Perret
2021-02-25 11:45   ` Dietmar Eggemann
2021-02-25 11:58     ` Quentin Perret
2021-02-25 16:23     ` Vincent Donnefort
2021-03-02  9:01   ` [tip: sched/core] " tip-bot2 for Vincent Donnefort
2021-03-03  9:49   ` tip-bot2 for Vincent Donnefort
2021-03-06 11:42   ` tip-bot2 for Vincent Donnefort
2021-02-25  8:36 ` [PATCH v2 2/2] sched/fair: use lsub_positive in cpu_util_next() vincent.donnefort
2021-02-25  8:53   ` Quentin Perret
2021-02-25 11:46   ` Dietmar Eggemann
2021-03-02  9:01   ` [tip: sched/core] " tip-bot2 for Vincent Donnefort
2021-03-03  9:49   ` tip-bot2 for Vincent Donnefort
2021-03-06 11:42   ` tip-bot2 for Vincent Donnefort
2021-02-25 17:29 ` [PATCH v2 0/2] Fix task utilization accountability for EAS Peter Zijlstra

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).