* [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()
@ 2021-02-22  9:54 vincent.donnefort
  2021-02-22 10:11 ` Quentin Perret
  2021-02-23 14:44 ` Dietmar Eggemann
  0 siblings, 2 replies; 13+ messages in thread
From: vincent.donnefort @ 2021-02-22  9:54 UTC (permalink / raw)
  To: peterz, mingo, vincent.guittot
  Cc: dietmar.eggemann, linux-kernel, qperret, patrick.bellasi,
	valentin.schneider, Vincent Donnefort

From: Vincent Donnefort <vincent.donnefort@arm.com>

Currently, cpu_util_next() estimates the CPU utilization as follows:

  max(cpu_util + task_util,
      cpu_util_est + task_util_est)

This is an issue when making a comparison between CPUs, as the task
contribution can be either:

  (1) task_util_est, on a mostly idle CPU, where cpu_util is close to 0
      and task_util_est > cpu_util.
  (2) task_util, on a mostly busy CPU, where cpu_util > task_util_est.

This gives an unfair advantage to some CPUs when comparing energy deltas
during task wake-up placement, where task_util is always smaller than
task_util_est. The energy delta is therefore likely to be bigger on a
mostly idle CPU (1) than on a mostly busy CPU (2).

Moreover, this issue is not sporadic: by starving idle CPUs, it keeps
their cpu_util < task_util_est (1), while the other CPUs maintain
cpu_util > task_util_est (2).

The new approach uses (if UTIL_EST is enabled) task_util_est() as the
task contribution, which ensures that all CPUs will use the same value:

  max(cpu_util + max(task_util, task_util_est),
      cpu_util_est + max(task_util, task_util_est))

This patch doesn't modify the !UTIL_EST behaviour.

Also, replace sub_positive() with lsub_positive(), which saves one
explicit load-store.

Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fb9f10d4312b..dd143aafaf97 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6516,32 +6516,42 @@ static unsigned long cpu_util_without(int cpu, struct task_struct *p)
 static unsigned long cpu_util_next(int cpu, struct task_struct *p, int dst_cpu)
 {
 	struct cfs_rq *cfs_rq = &cpu_rq(cpu)->cfs;
-	unsigned long util_est, util = READ_ONCE(cfs_rq->avg.util_avg);
+	unsigned long util = READ_ONCE(cfs_rq->avg.util_avg);
 
 	/*
-	 * If @p migrates from @cpu to another, remove its contribution. Or,
-	 * if @p migrates from another CPU to @cpu, add its contribution. In
-	 * the other cases, @cpu is not impacted by the migration, so the
-	 * util_avg should already be correct.
+	 * UTIL_EST case: hide the task_util contribution from util.
+	 * During wake-up, the task isn't enqueued yet and doesn't
+	 * appear in the util_est of any CPU, so no contribution
+	 * needs to be removed from util_est.
+	 *
+	 * If @p migrates to this CPU, add its contribution to util and
+	 * util_est.
 	 */
-	if (task_cpu(p) == cpu && dst_cpu != cpu)
-		sub_positive(&util, task_util(p));
-	else if (task_cpu(p) != cpu && dst_cpu == cpu)
-		util += task_util(p);
-
 	if (sched_feat(UTIL_EST)) {
+		unsigned long util_est;
+
 		util_est = READ_ONCE(cfs_rq->avg.util_est.enqueued);
 
-		/*
-		 * During wake-up, the task isn't enqueued yet and doesn't
-		 * appear in the cfs_rq->avg.util_est.enqueued of any rq,
-		 * so just add it (if needed) to "simulate" what will be
-		 * cpu_util() after the task has been enqueued.
-		 */
-		if (dst_cpu == cpu)
-			util_est += _task_util_est(p);
+		if (task_cpu(p) == cpu)
+			lsub_positive(&util, task_util(p));
+
+		if (dst_cpu == cpu) {
+			unsigned long task_util = task_util_est(p);
+
+			util += task_util;
+			util_est += task_util;
+		}
 
 		util = max(util, util_est);
+	/*
+	 * !UTIL_EST case: If @p migrates from @cpu to another, remove its
+	 * contribution. If @p migrates to @cpu, add it.
+	 */
+	} else {
+		if (task_cpu(p) == cpu && dst_cpu != cpu)
+			lsub_positive(&util, task_util(p));
+		else if (task_cpu(p) != cpu && dst_cpu == cpu)
+			util += task_util(p);
 	}
 
 	return min(util, arch_scale_cpu_capacity(cpu));
-- 
2.25.1



* Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()
  2021-02-22  9:54 [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next() vincent.donnefort
@ 2021-02-22 10:11 ` Quentin Perret
  2021-02-22 11:36   ` Vincent Donnefort
  2021-02-23 14:44 ` Dietmar Eggemann
  1 sibling, 1 reply; 13+ messages in thread
From: Quentin Perret @ 2021-02-22 10:11 UTC (permalink / raw)
  To: vincent.donnefort
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, linux-kernel,
	patrick.bellasi, valentin.schneider

Hey Vincent,

On Monday 22 Feb 2021 at 09:54:01 (+0000), vincent.donnefort@arm.com wrote:
> From: Vincent Donnefort <vincent.donnefort@arm.com>
> 
> Currently, cpu_util_next() estimates the CPU utilization as follows:
> 
>   max(cpu_util + task_util,
>       cpu_util_est + task_util_est)

s/task_util_est/_task_util_est

This is an important difference.

> 
> This is an issue when making a comparison between CPUs, as the task
> contribution can be either:
> 
>   (1) task_util_est, on a mostly idle CPU, where cpu_util is close to 0
>       and task_util_est > cpu_util.
>   (2) task_util, on a mostly busy CPU, where cpu_util > task_util_est.

I don't understand how this is an issue; this is by design with
util-est, no?

Note that cpu_util_next() tries to accurately predict what cpu_util(@cpu)
will be once @p is enqueued on @dst_cpu. There should be no policy
decision here; we just reproduce the enqueue aggregation -- see
util_est_enqueue() and cpu_util().
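
For reference, the enqueue side looks roughly like this (paraphrased
from memory, so treat it as an approximation of the kernel source
rather than an exact copy):

	static inline void util_est_enqueue(struct cfs_rq *cfs_rq,
					    struct task_struct *p)
	{
		unsigned int enqueued;

		if (!sched_feat(UTIL_EST))
			return;

		/* Update root cfs_rq's estimated utilization */
		enqueued  = cfs_rq->avg.util_est.enqueued;
		enqueued += _task_util_est(p);
		WRITE_ONCE(cfs_rq->avg.util_est.enqueued, enqueued);
	}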

Could you please give an example where you think cpu_util_next()
computes the wrong value?

Thanks,
Quentin


* Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()
  2021-02-22 10:11 ` Quentin Perret
@ 2021-02-22 11:36   ` Vincent Donnefort
  2021-02-22 12:23     ` Quentin Perret
  0 siblings, 1 reply; 13+ messages in thread
From: Vincent Donnefort @ 2021-02-22 11:36 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, linux-kernel,
	patrick.bellasi, valentin.schneider

Hi Quentin,

On Mon, Feb 22, 2021 at 10:11:03AM +0000, Quentin Perret wrote:
> Hey Vincent,
> 
> On Monday 22 Feb 2021 at 09:54:01 (+0000), vincent.donnefort@arm.com wrote:
> > From: Vincent Donnefort <vincent.donnefort@arm.com>
> > 
> > Currently, cpu_util_next() estimates the CPU utilization as follows:
> > 
> >   max(cpu_util + task_util,
> >       cpu_util_est + task_util_est)
> 
> s/task_util_est/_task_util_est
> 
> This is an important difference.
> 
> > 
> > This is an issue when making a comparison between CPUs, as the task
> > contribution can be either:
> > 
> >   (1) task_util_est, on a mostly idle CPU, where cpu_util is close to 0
> >       and task_util_est > cpu_util.
> >   (2) task_util, on a mostly busy CPU, where cpu_util > task_util_est.
> 
> I don't understand how this is an issue; this is by design with
> util-est, no?
> 
> Note that cpu_util_next() tries to accurately predict what cpu_util(@cpu)
> will be once @p is enqueued on @dst_cpu. There should be no policy
> decision here; we just reproduce the enqueue aggregation -- see
> util_est_enqueue() and cpu_util().
> 
> Could you please give an example where you think cpu_util_next()
> computes the wrong value?

Here it is with real-life numbers.

The task: util_avg=3 (1) util_est=11 (2)

pd0 (CPU-0, CPU-1, CPU-2)

 cpu_util_next(CPU-0, NULL): 7
 cpu_util_next(CPU-1, NULL): 3
 cpu_util_next(CPU-2, NULL): 0 <- Most capacity, try to place task here.

 cpu_util_next(CPU-2, task): 0 + 11 (2)


pd1 (CPU-3):

 cpu_util_next(CPU-3, NULL): 77

 cpu_util_next(CPU-3, task): 77 + 3 (1)


On pd0, the task contribution is 11. On pd1, it is 3. When computing the energy
deltas, pd0's is likely to be higher than pd1's, only because the task
contribution is higher for one comparison than the other.
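
To make the asymmetry concrete, here is a small userspace sketch (the
helper and the CPUs' util_est.enqueued values are made up for the
illustration, picked to match the numbers above):

	#include <stdio.h>

	#define MAX(a, b) ((a) > (b) ? (a) : (b))

	/*
	 * Current aggregation:
	 *   max(cpu_util + task_util, cpu_util_est + _task_util_est)
	 */
	static unsigned long util_next(unsigned long cpu_util,
				       unsigned long cpu_util_est,
				       unsigned long t_util,
				       unsigned long t_est)
	{
		return MAX(cpu_util + t_util, cpu_util_est + t_est);
	}

	int main(void)
	{
		/* CPU-2, mostly idle: util_est.enqueued assumed 0 */
		printf("CPU-2: %lu\n", util_next(0, 0, 3, 11));
		/* prints 11: the task contributes task_util_est = 11 */

		/* CPU-3, mostly busy: util_est.enqueued assumed 60 */
		printf("CPU-3: %lu\n", util_next(77, 60, 3, 11));
		/* prints 80: the task contributes task_util = 3 */
		return 0;
	}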

-- 
Vincent

> 
> Thanks,
> Quentin



* Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()
  2021-02-22 11:36   ` Vincent Donnefort
@ 2021-02-22 12:23     ` Quentin Perret
  2021-02-22 15:01       ` Vincent Donnefort
  0 siblings, 1 reply; 13+ messages in thread
From: Quentin Perret @ 2021-02-22 12:23 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, linux-kernel,
	patrick.bellasi, valentin.schneider

On Monday 22 Feb 2021 at 11:36:03 (+0000), Vincent Donnefort wrote:
> Here it is with real-life numbers.
> 
> The task: util_avg=3 (1) util_est=11 (2)
> 
> pd0 (CPU-0, CPU-1, CPU-2)
> 
>  cpu_util_next(CPU-0, NULL): 7
>  cpu_util_next(CPU-1, NULL): 3
>  cpu_util_next(CPU-2, NULL): 0 <- Most capacity, try to place task here.
> 
>  cpu_util_next(CPU-2, task): 0 + 11 (2)
> 
> 
> pd1 (CPU-3):
> 
>  cpu_util_next(CPU-3, NULL): 77
> 
>  cpu_util_next(CPU-3, task): 77 + 3 (1)
> 
> 
> On pd0, the task contribution is 11. On pd1, it is 3.

Yes, but that accurately reflects what the task's impact on the
frequency selection of those CPUs would be if it were enqueued there,
right?

This is an important property we should aim to keep, the frequency
prediction needs to be in sync with the actual frequency request, or
the energy estimate will be off.

> When computing the energy
> deltas, pd0's is likely to be higher than pd1's, only because the task
> contribution is higher for one comparison than the other.

You mean the contribution to sum_util right? I think I see what you mean
but I'm still not sure if this really is an issue. This is how util_est
works, and the EM stuff is just consistent with that.

The issue you describe can only happen (I think) when a rq's util_avg is
larger than its util-est ewma by some margin (that has to do with the
ewma-util_avg delta for the task?). But that means the ewma is not to be
trusted to begin with, so ...


* Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()
  2021-02-22 12:23     ` Quentin Perret
@ 2021-02-22 15:01       ` Vincent Donnefort
  2021-02-22 15:58         ` Quentin Perret
  0 siblings, 1 reply; 13+ messages in thread
From: Vincent Donnefort @ 2021-02-22 15:01 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, linux-kernel,
	patrick.bellasi, valentin.schneider

On Mon, Feb 22, 2021 at 12:23:04PM +0000, Quentin Perret wrote:
> On Monday 22 Feb 2021 at 11:36:03 (+0000), Vincent Donnefort wrote:
> > Here it is with real-life numbers.
> > 
> > The task: util_avg=3 (1) util_est=11 (2)
> > 
> > pd0 (CPU-0, CPU-1, CPU-2)
> > 
> >  cpu_util_next(CPU-0, NULL): 7
> >  cpu_util_next(CPU-1, NULL): 3
> >  cpu_util_next(CPU-2, NULL): 0 <- Most capacity, try to place task here.
> > 
> >  cpu_util_next(CPU-2, task): 0 + 11 (2)
> > 
> > 
> > pd1 (CPU-3):
> > 
> >  cpu_util_next(CPU-3, NULL): 77
> > 
> >  cpu_util_next(CPU-3, task): 77 + 3 (1)
> > 
> > 
> > On pd0, the task contribution is 11. On pd1, it is 3.
> 
> Yes, but that accurately reflects what the task's impact on the
> frequency selection of those CPUs would be if it were enqueued there,
> right?
> 
> This is an important property we should aim to keep, the frequency
> prediction needs to be in sync with the actual frequency request, or
> the energy estimate will be off.

You mean that it could lead to a wrong frequency estimation when doing
freq = map_util_freq() in em_cpu_energy()?

But in any case, the computed energy, being the product of sum_util and
the OPP's cost, is directly affected by this util_avg/util_est
difference.

In the case where the task placement doesn't change the OPP, which is often the
case, we can simplify the comparison and end up with the following:

  delta_energy(CPU-3): OPP3 cost * (cpu_util_avg + task_util_avg - cpu_util_avg)
  delta_energy(CPU-2): OPP2 cost * (cpu_util_est + task_util_est - cpu_util_est)

  => OPP3 cost * task_util_avg < task_util_est * OPP2 cost

With the same example I described previously, if you add the scaled OPP cost of
0.76 for CPU-3 and 0.65 for CPU-2 (real-life scaled OPP costs), we have:

  2.3 (CPU-3) < 7.15 (CPU-2)

The task is placed on CPU-3, while it would have been much more efficient to use
CPU-2.
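
The same arithmetic as a quick check (plain C, costs as above):

	#include <stdio.h>

	int main(void)
	{
		/* scaled OPP cost * task contribution, no OPP change */
		printf("CPU-3: %.2f\n", 0.76 * 3);  /* 2.28, ~2.3 */
		printf("CPU-2: %.2f\n", 0.65 * 11); /* 7.15       */
		return 0;
	}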

> 
> > When computing the energy
> > deltas, pd0's is likely to be higher than pd1's, only because the task
> > contribution is higher for one comparison than the other.
> 
> You mean the contribution to sum_util right? I think I see what you mean
> but I'm still not sure if this really is an issue. This is how util_est
> works, and the EM stuff is just consistent with that.
> 
> The issue you describe can only happen (I think) when a rq's util_avg is
> larger than its util-est ewma by some margin (that has to do with the
> ewma-util_avg delta for the task?). But that means the ewma is not to be
> trusted to begin with, so ...

cfs_rq->avg.util_est.ewma is not used. cpu_util() will only return the max
between ue.enqueued and util_avg.



* Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()
  2021-02-22 15:01       ` Vincent Donnefort
@ 2021-02-22 15:58         ` Quentin Perret
  2021-02-22 16:23           ` Quentin Perret
  2021-02-22 16:31           ` Vincent Donnefort
  0 siblings, 2 replies; 13+ messages in thread
From: Quentin Perret @ 2021-02-22 15:58 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, linux-kernel,
	patrick.bellasi, valentin.schneider

On Monday 22 Feb 2021 at 15:01:51 (+0000), Vincent Donnefort wrote:
> You mean that it could lead to a wrong frequency estimation when doing
> freq = map_util_freq() in em_cpu_energy()?

I'm not too worried about the map_util_freq() part; I'm worried about
the schedutil aggregation. Specifically, when a task is enqueued on a
rq, we add its util_avg to the rq's util_avg, and _task_util_est()
to the rq's util_est.enqueued member (as per util_est_enqueue()).

Now, in schedutil, sugov_get_util() calls cpu_util_cfs(), which does the
following:

	static inline unsigned long cpu_util_cfs(struct rq *rq)
	{
		unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);

		if (sched_feat(UTIL_EST)) {
			util = max_t(unsigned long, util,
				     READ_ONCE(rq->cfs.avg.util_est.enqueued));
		}

		return util;
	}

And that value will be the base for frequency selection. cpu_util_next()
tries to mimic this as accurately as possible, by doing the sums
independently, and then computing the max, exactly as we will do when
the task is enqueued and a freq update is generated.

> But in any case, the computed energy, being the product of sum_util and
> the OPP's cost, is directly affected by this util_avg/util_est
> difference.

Sure, but we're not going to fix it by messing up the OPP selection part ;-)

> In the case where the task placement doesn't change the OPP, which is often the
> case, we can simplify the comparison and end up with the following:
> 
>   delta_energy(CPU-3): OPP3 cost * (cpu_util_avg + task_util_avg - cpu_util_avg)
>   delta_energy(CPU-2): OPP2 cost * (cpu_util_est + task_util_est - cpu_util_est)
> 
>   => OPP3 cost * task_util_avg < task_util_est * OPP2 cost
> 
> With the same example I described previously, if you add the scaled OPP cost of
> 0.76 for CPU-3 and 0.65 for CPU-2 (real-life scaled OPP costs), we have:
> 
>   2.3 (CPU-3) < 7.15 (CPU-2)
> 
> The task is placed on CPU-3, while it would have been much more efficient to use
> CPU-2.

That should really be a transient state: having a util_avg much larger
than util_est.enqueued is indicative of either a new task or a workload
changing behaviour. And so, chances are all the estimations are wrong
anyway -- it's hard to do good estimations when the present doesn't
look like the recent past.

But in any case, if we're going to address this, I'm still not sure this
patch will be what we want. As per my first comment we need to keep the
frequency estimation right.

> > > When computing the energy
> > > deltas, pd0's is likely to be higher than pd1's, only because the task
> > > contribution is higher for one comparison than the other.
> > 
> > You mean the contribution to sum_util right? I think I see what you mean
> > but I'm still not sure if this really is an issue. This is how util_est
> > works, and the EM stuff is just consistent with that.
> > 
> > The issue you describe can only happen (I think) when a rq's util_avg is
> > larger than its util-est ewma by some margin (that has to do with the
> > ewma-util_avg delta for the task?). But that means the ewma is not to be
> > trusted to begin with, so ...
> 
> cfs_rq->avg.util_est.ewma is not used. cpu_util() will only return the max
> between ue.enqueued and util_avg.

Right, my bad, it was the 'enqueued' member. The rest of the argument
is still valid I think, but with s/ewma/enqueued :-)

Thanks,
Quentin


* Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()
  2021-02-22 15:58         ` Quentin Perret
@ 2021-02-22 16:23           ` Quentin Perret
  2021-02-22 16:39             ` Vincent Donnefort
  2021-02-23 14:47             ` Dietmar Eggemann
  2021-02-22 16:31           ` Vincent Donnefort
  1 sibling, 2 replies; 13+ messages in thread
From: Quentin Perret @ 2021-02-22 16:23 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, linux-kernel,
	patrick.bellasi, valentin.schneider

On Monday 22 Feb 2021 at 15:58:56 (+0000), Quentin Perret wrote:
> But in any case, if we're going to address this, I'm still not sure this
> patch will be what we want. As per my first comment we need to keep the
> frequency estimation right.

Totally untested, but I think in principle you would like something like
the snippet below. Would that work?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 04a3ce20da67..6594d875c6ac 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -6534,8 +6534,13 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
         * its pd list and will not be accounted by compute_energy().
         */
        for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
-               unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
+               unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
+               unsigned long util_running = cpu_util_without(cpu, p);
                struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
+               unsigned long cpu_util;
+
+               if (cpu == dst_cpu)
+                       util_running += task_util_est(p);

                /*
                 * Busy time computation: utilization clamping is not
@@ -6543,7 +6548,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
                 * is already enough to scale the EM reported power
                 * consumption at the (eventually clamped) cpu_capacity.
                 */
-               sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+               sum_util += schedutil_cpu_util(cpu, util_running, cpu_cap,
                                               ENERGY_UTIL, NULL);

                /*
@@ -6553,7 +6558,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
                 * NOTE: in case RT tasks are running, by default the
                 * FREQUENCY_UTIL's utilization can be max OPP.
                 */
-               cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
+               cpu_util = schedutil_cpu_util(cpu, util_freq, cpu_cap,
                                              FREQUENCY_UTIL, tsk);
                max_util = max(max_util, cpu_util);
        }



* Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()
  2021-02-22 15:58         ` Quentin Perret
  2021-02-22 16:23           ` Quentin Perret
@ 2021-02-22 16:31           ` Vincent Donnefort
  2021-02-22 16:35             ` Quentin Perret
  1 sibling, 1 reply; 13+ messages in thread
From: Vincent Donnefort @ 2021-02-22 16:31 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, linux-kernel,
	patrick.bellasi, valentin.schneider

On Mon, Feb 22, 2021 at 03:58:56PM +0000, Quentin Perret wrote:
> On Monday 22 Feb 2021 at 15:01:51 (+0000), Vincent Donnefort wrote:
> > You mean that it could lead to a wrong frequency estimation when doing
> > freq = map_util_freq() in em_cpu_energy()?
> 
> I'm not too worried about the map_util_freq() part; I'm worried about
> the schedutil aggregation. Specifically, when a task is enqueued on a
> rq, we add its util_avg to the rq's util_avg, and _task_util_est()
> to the rq's util_est.enqueued member (as per util_est_enqueue()).
> 
> Now, in schedutil, sugov_get_util() calls cpu_util_cfs(), which does the
> following:
> 
> 	static inline unsigned long cpu_util_cfs(struct rq *rq)
> 	{
> 		unsigned long util = READ_ONCE(rq->cfs.avg.util_avg);
> 
> 		if (sched_feat(UTIL_EST)) {
> 			util = max_t(unsigned long, util,
> 				     READ_ONCE(rq->cfs.avg.util_est.enqueued));
> 		}
> 
> 		return util;
> 	}
> 
> And that value will be the base for frequency selection. cpu_util_next()
> tries to mimic this as accurately as possible, by doing the sums
> independently, and then computing the max, exactly as we will do when
> the task is enqueued and a freq update is generated.
> 
> > But in any case, the computed energy, being the product of sum_util
> > and the OPP's cost, is directly affected by this util_avg/util_est
> > difference.
> 
> Sure, but we're not going to fix it by messing up the OPP selection part ;-)
> 
> > In the case where the task placement doesn't change the OPP, which is often the
> > case, we can simplify the comparison and end up with the following:
> > 
> >   delta_energy(CPU-3): OPP3 cost * (cpu_util_avg + task_util_avg - cpu_util_avg)
> >   delta_energy(CPU-2): OPP2 cost * (cpu_util_est + task_util_est - cpu_util_est)
> > 
> >   => OPP3 cost * task_util_avg < task_util_est * OPP2 cost
> > 
> > With the same example I described previously, if you add the scaled OPP cost of
> > 0.76 for CPU-3 and 0.65 for CPU-2 (real-life scaled OPP costs), we have:
> > 
> >   2.3 (CPU-3) < 7.15 (CPU-2)
> > 
> > The task is placed on CPU-3, while it would have been much more efficient to use
> > CPU-2.
> 
> That should really be a transient state: having a util_avg much larger
> than util_est.enqueued is indicative of either a new task or a workload
> changing behaviour. And so, chances are all the estimations are wrong
> anyway -- it's hard to do good estimations when the present doesn't
> look like the recent past.

Sadly, not really a transient state. This problem can happen with
several tasks: with all of them ending up on the same CPU, they'll keep
its util_avg high enough, while the other CPUs starve, stuck with the
task_util_est contribution.

> 
> But in any case, if we're going to address this, I'm still not sure this
> patch will be what we want. As per my first comment we need to keep the
> frequency estimation right.

No indeed, there's still a util_est/util_avg mix-up in this proposal too. For a
CPU with util_avg > util_est, we would use the CPU's util_avg and the task's
util_est, which doesn't reflect the "real" util.

I suppose a way of fixing this is to keep cpu_util_next() the way it is
for 'max_util', to get the appropriate frequency at which the CPU would
run once the task has been enqueued, and to have 'sum_util' be the sum
of the pd's util (without the task) + task_util_est().
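
Roughly, something like this in compute_energy() (an untested sketch of
the idea, leaving out the schedutil_cpu_util() wrapping for clarity):

	for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
		unsigned long util_running;

		/* Frequency estimation: keep the enqueue-accurate max. */
		max_util = max(max_util, cpu_util_next(cpu, p, dst_cpu));

		/* Busy time: pd's util without @p, plus a uniform term. */
		util_running = cpu_util_without(cpu, p);
		if (cpu == dst_cpu)
			util_running += task_util_est(p);
		sum_util += util_running;
	}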

Thoughts?

--
Vincent

> 
> > > > When computing the energy
> > > > deltas, pd0's is likely to be higher than pd1's, only because the task
> > > > contribution is higher for one comparison than the other.
> > > 
> > > You mean the contribution to sum_util right? I think I see what you mean
> > > but I'm still not sure if this really is an issue. This is how util_est
> > > works, and the EM stuff is just consistent with that.
> > > 
> > > The issue you describe can only happen (I think) when a rq's util_avg is
> > > larger than its util-est ewma by some margin (that has to do with the
> > > ewma-util_avg delta for the task?). But that means the ewma is not to be
> > > trusted to begin with, so ...
> > 
> > cfs_rq->avg.util_est.ewma is not used. cpu_util() will only return the max
> > between ue.enqueued and util_avg.
> 
> Right, my bad, it was the 'enqueued' member. The rest of the argument
> is still valid I think, but with s/ewma/enqueued :-)
> 
> Thanks,
> Quentin


* Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()
  2021-02-22 16:31           ` Vincent Donnefort
@ 2021-02-22 16:35             ` Quentin Perret
  0 siblings, 0 replies; 13+ messages in thread
From: Quentin Perret @ 2021-02-22 16:35 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, linux-kernel,
	patrick.bellasi, valentin.schneider

On Monday 22 Feb 2021 at 16:31:08 (+0000), Vincent Donnefort wrote:
> I suppose a way of fixing this is to keep cpu_util_next() the way it is
> for 'max_util', to get the appropriate frequency at which the CPU would
> run once the task has been enqueued, and to have 'sum_util' be the sum
> of the pd's util (without the task) + task_util_est().

You read my mind :)


* Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()
  2021-02-22 16:23           ` Quentin Perret
@ 2021-02-22 16:39             ` Vincent Donnefort
  2021-02-22 16:43               ` Quentin Perret
  2021-02-23 14:47             ` Dietmar Eggemann
  1 sibling, 1 reply; 13+ messages in thread
From: Vincent Donnefort @ 2021-02-22 16:39 UTC (permalink / raw)
  To: Quentin Perret
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, linux-kernel,
	patrick.bellasi, valentin.schneider

On Mon, Feb 22, 2021 at 04:23:42PM +0000, Quentin Perret wrote:
> On Monday 22 Feb 2021 at 15:58:56 (+0000), Quentin Perret wrote:
> > But in any case, if we're going to address this, I'm still not sure this
> > patch will be what we want. As per my first comment we need to keep the
> > frequency estimation right.
> 
> Totally untested, but I think in principle you would like something like
> the snippet below. Would that work?

You preempted my previous email :)

Yeah, that looks like what we want, I'll give it a try.

Thanks,

-- 
Vincent.

> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04a3ce20da67..6594d875c6ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6534,8 +6534,13 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>          * its pd list and will not be accounted by compute_energy().
>          */
>         for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> -               unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
> +               unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> +               unsigned long util_running = cpu_util_without(cpu, p);
>                 struct task_struct *tsk = cpu == dst_cpu ? p : NULL;
> +               unsigned long cpu_util;
> +
> +               if (cpu == dst_cpu)
> +                       util_running += task_util_est(p);
> 
>                 /*
>                  * Busy time computation: utilization clamping is not
> @@ -6543,7 +6548,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>                  * is already enough to scale the EM reported power
>                  * consumption at the (eventually clamped) cpu_capacity.
>                  */
> -               sum_util += schedutil_cpu_util(cpu, util_cfs, cpu_cap,
> +               sum_util += schedutil_cpu_util(cpu, util_running, cpu_cap,
>                                                ENERGY_UTIL, NULL);
> 
>                 /*
> @@ -6553,7 +6558,7 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>                  * NOTE: in case RT tasks are running, by default the
>                  * FREQUENCY_UTIL's utilization can be max OPP.
>                  */
> -               cpu_util = schedutil_cpu_util(cpu, util_cfs, cpu_cap,
> +               cpu_util = schedutil_cpu_util(cpu, util_freq, cpu_cap,
>                                               FREQUENCY_UTIL, tsk);
>                 max_util = max(max_util, cpu_util);
>         }
>





* Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()
  2021-02-22 16:39             ` Vincent Donnefort
@ 2021-02-22 16:43               ` Quentin Perret
  0 siblings, 0 replies; 13+ messages in thread
From: Quentin Perret @ 2021-02-22 16:43 UTC (permalink / raw)
  To: Vincent Donnefort
  Cc: peterz, mingo, vincent.guittot, dietmar.eggemann, linux-kernel,
	patrick.bellasi, valentin.schneider

On Monday 22 Feb 2021 at 16:39:47 (+0000), Vincent Donnefort wrote:
> On Mon, Feb 22, 2021 at 04:23:42PM +0000, Quentin Perret wrote:
> > On Monday 22 Feb 2021 at 15:58:56 (+0000), Quentin Perret wrote:
> > > But in any case, if we're going to address this, I'm still not sure this
> > > patch will be what we want. As per my first comment we need to keep the
> > > frequency estimation right.
> > 
> > Totally untested, but I think in principle you would like something like
> > the snippet below. Would that work?
> 
> You preempted my previous email :)
> 
> Yeah, that looks like what we want, I'll give it a try.

Cool, thanks.

And ofc no strong opinion about the implementation details, this can
most certainly be optimized in some way so have fun :)

Cheers,
Quentin


* Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()
  2021-02-22  9:54 [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next() vincent.donnefort
  2021-02-22 10:11 ` Quentin Perret
@ 2021-02-23 14:44 ` Dietmar Eggemann
  1 sibling, 0 replies; 13+ messages in thread
From: Dietmar Eggemann @ 2021-02-23 14:44 UTC (permalink / raw)
  To: vincent.donnefort, peterz, mingo, vincent.guittot
  Cc: linux-kernel, qperret, patrick.bellasi, valentin.schneider

On 22/02/2021 10:54, vincent.donnefort@arm.com wrote:
> From: Vincent Donnefort <vincent.donnefort@arm.com>

[...]

> Also, replace sub_positive() with lsub_positive(), which saves one
> explicit load-store.

IMHO, even if you're going to fix this in compute_energy() now, this
optimization could still be done. Maybe in an extra patch?
cpu_util_without() uses lsub_positive() to remove task util from cpu
util as well.

[...]


* Re: [PATCH] sched/fair: Fix task utilization accountability in cpu_util_next()
  2021-02-22 16:23           ` Quentin Perret
  2021-02-22 16:39             ` Vincent Donnefort
@ 2021-02-23 14:47             ` Dietmar Eggemann
  1 sibling, 0 replies; 13+ messages in thread
From: Dietmar Eggemann @ 2021-02-23 14:47 UTC (permalink / raw)
  To: Quentin Perret, Vincent Donnefort
  Cc: peterz, mingo, vincent.guittot, linux-kernel, patrick.bellasi,
	valentin.schneider

On 22/02/2021 17:23, Quentin Perret wrote:
> On Monday 22 Feb 2021 at 15:58:56 (+0000), Quentin Perret wrote:
>> But in any case, if we're going to address this, I'm still not sure this
>> patch will be what we want. As per my first comment we need to keep the
>> frequency estimation right.
> 
> Totally untested, but I think in principle you would like something like
> the snippet below. Would that work?
> 
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 04a3ce20da67..6594d875c6ac 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6534,8 +6534,13 @@ compute_energy(struct task_struct *p, int dst_cpu, struct perf_domain *pd)
>          * its pd list and will not be accounted by compute_energy().
>          */
>         for_each_cpu_and(cpu, pd_mask, cpu_online_mask) {
> -               unsigned long cpu_util, util_cfs = cpu_util_next(cpu, p, dst_cpu);
> +               unsigned long util_freq = cpu_util_next(cpu, p, dst_cpu);
> +               unsigned long util_running = cpu_util_without(cpu, p);

Wouldn't this be the same as:

                 unsigned long util_running = cpu_util_next(cpu, p, -1);

except for some different handling of !last_update_time and
'task_on_rq_queued(p) || current == p' in cpu_util_without(), which
shouldn't happen in EAS.
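
With dst_cpu == -1, neither branch that adds @p's contribution can
trigger, so today's cpu_util_next() reduces to something like this
(sketch):

	/* dst_cpu == -1: @p's contribution is only ever removed */
	if (task_cpu(p) == cpu)
		sub_positive(&util, task_util(p));
	if (sched_feat(UTIL_EST))
		util = max(util, util_est); /* util_est unchanged, @p not added */
	return min(util, arch_scale_cpu_capacity(cpu));

i.e. the CPU's utilization without @p, much like cpu_util_without().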

We have quite a lot of util-related functions!

[...]

