From: Vincent Guittot <vincent.guittot@linaro.org>
To: Vincent Donnefort <Vincent.Donnefort@arm.com>
Cc: peterz@infradead.org, mingo@redhat.com,
	linux-kernel@vger.kernel.org, dietmar.eggemann@arm.com,
	Valentin.Schneider@arm.com, Morten.Rasmussen@arm.com,
	Chris.Redpath@arm.com, qperret@google.com, Lukasz.Luba@arm.com
Subject: Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during migration
Date: Mon, 17 Jan 2022 18:31:25 +0100	[thread overview]
Message-ID: <CAKfTPtC2wCw4U9w=saW0dGYHfOKo42nBKU7oHcEM7KeDj7MzWA@mail.gmail.com> (raw)
In-Reply-To: <20220112161230.836326-3-vincent.donnefort@arm.com>

On Wed, 12 Jan 2022 at 17:14, Vincent Donnefort
<vincent.donnefort@arm.com> wrote:
>
> Before being migrated to a new CPU, a task sees its PELT values
> synchronized with rq last_update_time. Once done, that same task will also
> have its sched_avg last_update_time reset. This means the time between
> the migration and the last clock update (B) will not be accounted for in
> util_avg and a discontinuity will appear. This issue is amplified by the
> PELT clock scaling. If the clock hasn't been updated while the CPU is
> idle, clock_pelt will not be aligned with clock_task and that time (A)
> will also be lost.
>
>    ---------|----- A -----|-----------|------- B -----|>
>         clock_pelt   clock_task     clock            now
>
> This is especially problematic for asymmetric CPU capacity systems which
> need stable util_avg signals for task placement and energy estimation.
>
> Ideally, this problem would be solved by updating the runqueue clocks
> before the migration. But that would require taking the runqueue lock
> which is quite expensive [1]. Instead estimate the missing time and update
> the task util_avg with that value:
>
>   A + B = clock_task - clock_pelt + sched_clock_cpu() - clock
>
> Neither clock_task, clock_pelt nor clock can be accessed without the
> runqueue lock. The new runqueue clock_pelt_lag is therefore created and
> encodes those three values.
>
>   clock_pelt_lag = clock - clock_task + clock_pelt
>
> And we can then write the missing time as follows:
>
>   A + B = sched_clock_cpu() - clock_pelt_lag
>
> The B part of the missing time is however an estimation that doesn't take
> into account IRQ and paravirt time.
>
> Now that we have an estimation for A + B, we can create an estimator for the
> PELT value at the time of the migration. For this purpose, we need to
> inject last_update_time, which is a combination of both clock_pelt and
> lost_idle_time. The latter is a time value which is completely lost from a
> PELT point of view and must be ignored. And finally, we can write:
>
>   rq_clock_pelt_estimator() = last_update_time + A + B
>                             = last_update_time +
>                                    sched_clock_cpu() - clock_pelt_lag
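
Just to restate the arithmetic with made-up numbers: with, say,
clock_pelt = 100, clock_task = 130, clock = 150 and sched_clock_cpu() = 180,

  clock_pelt_lag = 150 - 130 + 100 = 120
  A + B = 180 - 120 = 60 = (130 - 100) + (180 - 150)

which matches the (A) and (B) segments of the diagram above.
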
>
> [1] https://lore.kernel.org/all/20190709115759.10451-1-chris.redpath@arm.com/
>
> Signed-off-by: Vincent Donnefort <vincent.donnefort@arm.com>
>
> diff --git a/kernel/sched/core.c b/kernel/sched/core.c
> index 06cf7620839a..11c6aeef4583 100644
> --- a/kernel/sched/core.c
> +++ b/kernel/sched/core.c
> @@ -618,6 +618,12 @@ struct rq *task_rq_lock(struct task_struct *p, struct rq_flags *rf)
>         }
>  }
>
> +static void update_rq_clock_pelt_lag(struct rq *rq)
> +{
> +       u64_u32_store(rq->clock_pelt_lag,
> +                     rq->clock - rq->clock_task + rq->clock_pelt);

This has several shortfalls:
- Have a look at cfs_rq_clock_pelt() and rq_clock_pelt() (quoted roughly
after this list). What you name clock_pelt in your commit message, and
what is actually used to update PELT and saved in
se->avg.last_update_time, is: rq->clock_pelt - rq->lost_idle_time -
cfs_rq->throttled_clock_task_time
- You are doing this whatever the state of the cpu, idle or not, but the
clock cycles are not accounted for in the same way in both cases.
- (B) doesn't seem to be accurate, as you skip irq and steal time
accounting, and you don't apply any scale invariance if the cpu is not
idle.
- IIUC your explanation in the commit message above, the (A) period is
only a problem when the cpu is idle, but you apply it unconditionally.
If the cpu is idle you can assume that clock_pelt should be equal to
clock_task, but you can't if the cpu is not idle, otherwise your sync
will be inaccurate and defeat the primary goal of this patch. If your
problem with clock_pelt is only that the pending idle time is not
accounted for when entering idle but only at the next update (update of
blocked load or wakeup of a thread), the patch below should fix this and
remove your (A).
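
For reference, here is roughly what those two helpers look like in the
current tree (quoted from memory, please double-check kernel/sched/pelt.h):

static inline u64 rq_clock_pelt(struct rq *rq)
{
	lockdep_assert_rq_held(rq);
	assert_clock_updated(rq);

	return rq->clock_pelt - rq->lost_idle_time;
}

#ifdef CONFIG_CFS_BANDWIDTH
/* rq_clock_pelt() normalized against the time this cfs_rq has spent throttled */
static inline u64 cfs_rq_clock_pelt(struct cfs_rq *cfs_rq)
{
	if (unlikely(cfs_rq->throttle_count))
		return cfs_rq->throttled_clock_task - cfs_rq->throttled_clock_task_time;

	return rq_clock_pelt(rq_of(cfs_rq)) - cfs_rq->throttled_clock_task_time;
}
#endif

So the value that ends up in se->avg.last_update_time already has
lost_idle_time and the throttled time removed, not just rq->clock_pelt.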


diff --git a/kernel/sched/pelt.h b/kernel/sched/pelt.h
index e06071bf3472..855877be4dd8 100644
--- a/kernel/sched/pelt.h
+++ b/kernel/sched/pelt.h
@@ -114,6 +114,7 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
 {
        u32 divider = ((LOAD_AVG_MAX - 1024) << SCHED_CAPACITY_SHIFT) - LOAD_AVG_MAX;
        u32 util_sum = rq->cfs.avg.util_sum;
+       u64 now = rq_clock_task(rq);
        util_sum += rq->avg_rt.util_sum;
        util_sum += rq->avg_dl.util_sum;

@@ -127,7 +128,10 @@ static inline void update_idle_rq_clock_pelt(struct rq *rq)
         * rq's clock_task.
         */
        if (util_sum >= divider)
-               rq->lost_idle_time += rq_clock_task(rq) - rq->clock_pelt;
+               rq->lost_idle_time += now - rq->clock_pelt;
+
+       /* The rq is idle, we can sync to clock_task */
+       rq->clock_pelt  = now;
 }

 static inline u64 rq_clock_pelt(struct rq *rq)

---


> +}
> +
>  /*
>   * RQ-clock updating methods:
>   */
> @@ -674,6 +680,7 @@ static void update_rq_clock_task(struct rq *rq, s64 delta)
>                 update_irq_load_avg(rq, irq_delta + steal);
>  #endif
>         update_rq_clock_pelt(rq, delta);
> +       update_rq_clock_pelt_lag(rq);
>  }
>
>  void update_rq_clock(struct rq *rq)
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 99ea9540ece4..046d5397eb8a 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -6852,6 +6852,14 @@ select_task_rq_fair(struct task_struct *p, int prev_cpu, int wake_flags)
>
>  static void detach_entity_cfs_rq(struct sched_entity *se);
>
> +static u64 rq_clock_pelt_estimator(struct rq *rq, u64 last_update_time)
> +{
> +       u64 pelt_lag = sched_clock_cpu(cpu_of(rq)) -
> +                      u64_u32_load(rq->clock_pelt_lag);

Have you evaluated the impact of calling sched_clock_cpu(cpu_of(rq)) for a
remote cpu? Especially with a huge number of migrations and concurrent
accesses from several cpus.
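
For the !sched_clock_stable() case that is not just a local clock read;
from memory, sched_clock_cpu() in kernel/sched/clock.c does roughly:

	preempt_disable_notrace();
	scd = cpu_sdc(cpu);

	if (cpu != smp_processor_id())
		clock = sched_clock_remote(scd);  /* cmpxchg loop on the remote cpu's data */
	else
		clock = sched_clock_local(scd);
	preempt_enable_notrace();

so a storm of migrations reading remote clocks can get expensive.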

> +
> +       return last_update_time + pelt_lag;
> +}
> +
>  /*
>   * Called immediately before a task is migrated to a new CPU; task_cpu(p) and
>   * cfs_rq_of(p) references at time of call are still valid and identify the
> @@ -6859,6 +6867,9 @@ static void detach_entity_cfs_rq(struct sched_entity *se);
>   */
>  static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>  {
> +       struct sched_entity *se = &p->se;
> +       struct rq *rq = task_rq(p);
> +
>         /*
>          * As blocked tasks retain absolute vruntime the migration needs to
>          * deal with this by subtracting the old and adding the new
> @@ -6866,7 +6877,6 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>          * the task on the new runqueue.
>          */
>         if (READ_ONCE(p->__state) == TASK_WAKING) {
> -               struct sched_entity *se = &p->se;
>                 struct cfs_rq *cfs_rq = cfs_rq_of(se);
>
>                 se->vruntime -= u64_u32_load(cfs_rq->min_vruntime);
> @@ -6877,26 +6887,32 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu)
>                  * In case of TASK_ON_RQ_MIGRATING we in fact hold the 'old'
>                  * rq->lock and can modify state directly.
>                  */
> -               lockdep_assert_rq_held(task_rq(p));
> -               detach_entity_cfs_rq(&p->se);
> +               lockdep_assert_rq_held(rq);
> +               detach_entity_cfs_rq(se);
>
>         } else {
> +               u64 now;
> +
> +               remove_entity_load_avg(se);
> +
>                 /*
> -                * We are supposed to update the task to "current" time, then
> -                * its up to date and ready to go to new CPU/cfs_rq. But we
> -                * have difficulty in getting what current time is, so simply
> -                * throw away the out-of-date time. This will result in the
> -                * wakee task is less decayed, but giving the wakee more load
> -                * sounds not bad.
> +                * Here, the task's PELT values have been updated according to
> +                * the current rq's clock. But if that clock hasn't been
> +                * updated in a while, a substantial idle time will be missed,
> +                * leading to an inflation after wake-up on the new rq.
> +                *
> +                * Estimate the PELT clock lag, and update sched_avg to ensure
> +                * PELT continuity after migration.
>                  */
> -               remove_entity_load_avg(&p->se);
> +               now = rq_clock_pelt_estimator(rq, se->avg.last_update_time);
> +               __update_load_avg_blocked_se(now, se);
>         }
>
>         /* Tell new CPU we are migrated */
> -       p->se.avg.last_update_time = 0;
> +       se->avg.last_update_time = 0;
>
>         /* We have migrated, no longer consider this task hot */
> -       p->se.exec_start = 0;
> +       se->exec_start = 0;
>
>         update_scan_period(p, new_cpu);
>  }
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index f1a445efdc63..fdf2a9e54c0e 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -1027,8 +1027,13 @@ struct rq {
>         /* Ensure that all clocks are in the same cache line */
>         u64                     clock_task ____cacheline_aligned;
>         u64                     clock_pelt;
> +       u64                     clock_pelt_lag;
>         unsigned long           lost_idle_time;
>
> +#ifndef CONFIG_64BIT
> +       u64                     clock_pelt_lag_copy;
> +#endif
> +
>         atomic_t                nr_iowait;
>
>  #ifdef CONFIG_SCHED_DEBUG
> --
> 2.25.1
>
