All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vincent Donnefort <vincent.donnefort@arm.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: peterz@infradead.org, mingo@redhat.com,
	linux-kernel@vger.kernel.org, dietmar.eggemann@arm.com,
	Valentin.Schneider@arm.com, Morten.Rasmussen@arm.com,
	Chris.Redpath@arm.com, qperret@google.com, Lukasz.Luba@arm.com
Subject: Re: [PATCH v2 2/7] sched/fair: Decay task PELT values during migration
Date: Thu, 20 Jan 2022 21:12:02 +0000	[thread overview]
Message-ID: <YenQIsHJhaEuJFQl@FVFF7649Q05P> (raw)
In-Reply-To: <CAKfTPtB=CJNFDrpXY9o8g5XfjBfnVTUgb2rWke1SyWMUxz0M+g@mail.gmail.com>

[...]

> > > And if cpu is not idle, you can't apply the diff between clk_pelt and clock_task
> > >
> > > >
> > > > > - (B) doesn't seem to be accurate as you skip irq and steal time
> > > > > accounting and you don't apply any scale invariance if the cpu is not
> > > > > idle
> > > >
> > > > The missing irq and paravirt time is the reason why it is called "estimator".
> > > > But maybe there's a chance of improving this part with a lockless version of
> > > > rq->prev_irq_time and rq->prev_steal_time_rq?
> > > >
> > > > > - IIUC your explanation in the commit message above, the (A) period
> > > > > seems to be a problem only when idle but you apply it unconditionally.
> > > >
> > > > If the CPU is idle (and clock_pelt == clock_task), only the B part would be
> > > > worth something:
> > > >
> > > >   A + B = [clock_task - clock_pelt] + [sched_clock_cpu() - clock]
> > > >                       A                            B
> > > >
> > > > > If cpu is idle you can assume that clock_pelt should be equal to
> > > > > clock_task but you can't if cpu is not idle otherwise your sync will
> > > > > be inaccurate and defeat the primary goal of this patch. If your
> > > > > problem with clock_pelt is that the pending idle time is not accounted
> > > > > for when entering idle but only at the next update (update blocked
> > > > > load or wakeup of a thread). This patch below should fix this and
> > > > > remove your A.
> > > >
> > > > That would help slightly the current situation, but this part is already
> > > > covered by the estimator.
> > >
> > > But the estimator, as you name it, is wrong beaus ethe A part can't be
> > > applied unconditionally
> >
> > Hum, it is used only in the !active migration. So we know the task was sleeping
> > before that migration. As a consequence, the time we need to account is "sleeping"
> > time from the task point of view, which is clock_pelt == clock_task (for
> > __update_load_avg_blocked_se()). Otherwise, we would only decay with the
> > "wallclock" idle time instead of the "scaled" one wouldn't we?
> 
> clock_pelt == clock_task only when cpu is idle and after updating
> lost_idle_time but you have no idea of the state of the cpu when
> migrating the task

I was just applying the time scaling at the task level. Why shall it depends on
the CPU state?

The situation would be as follows:

                    <--X--> <--Y-->
           +-------+-------+-------+
CPUX    ___|   B   |   A   |   B   |___
                                  ^
                               migrate A
                    
In a such scenario, CPUX's PELT clock is indeed scaled. The Task A running
time (X) has already been accounted, so what's left is to get an idle time (Y)
contribution accurate. We would usually rely on the CPU being idle for the
catch-up and that time would be Y + (X - scaled(X)). Without the catch-up, we
would account at the migration, for the sleeping time Y, only (scaled(Y)). Applied
to the same graph as for update_rq_clock_pelt()'s:

 clock_task    | 1| 2| 3| 4| 5| 6| 7| 8|
 clock_pelt    | 1   | 2   | 3   | 4   |  (CPU's running, clock_pelt is scaled)
 expected      | 1   | 2   | 5| 6| 7| 8|
               <---- X ---><--- Y ---->
 Task A -------************----------
                                   ^ 
                               migrate A

Contribution for Task A idle time at the migration (as we know we won't have
another chance to catch-up clock_task later) should be 6, not 2, regardless of
the CPU state.

_But_ indeed, there would be a risk of hitting the lost_idle_time threshold and
decay too much... (which is absolutely not handled in the current version). So
now, if we don't want to bother too much, we could simplify the problem and
say (which is true with NOHZ_IDLE) that if the CPU is running, the clock must
not be that old anyway. So we should only care of the idle case, which is
mitigated with your proposed snippet and I allow to get rid of the [A]
part (clock_task - clock_pelt).

As per sched_clock_cpu() usage, I haven't measured anything yet but notice it's
already used in the wakeup path in ttwu_queue_wakelist().


  reply	other threads:[~2022-01-20 21:12 UTC|newest]

Thread overview: 22+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-01-12 16:12 [PATCH v2 0/7] feec() energy margin removal Vincent Donnefort
2022-01-12 16:12 ` [PATCH v2 1/7] sched/fair: Provide u64 read for 32-bits arch helper Vincent Donnefort
2022-01-17 16:11   ` Tao Zhou
2022-01-17 19:42     ` Vincent Donnefort
2022-01-17 23:44       ` Tao Zhou
2022-01-12 16:12 ` [PATCH v2 2/7] sched/fair: Decay task PELT values during migration Vincent Donnefort
2022-01-17 17:31   ` Vincent Guittot
2022-01-18 10:56     ` Vincent Donnefort
2022-01-19  9:54       ` Vincent Guittot
2022-01-19 11:59         ` Vincent Donnefort
2022-01-19 13:22           ` Vincent Guittot
2022-01-20 21:12             ` Vincent Donnefort [this message]
2022-01-21 15:27               ` Vincent Guittot
2022-01-12 16:12 ` [PATCH v2 3/7] sched, drivers: Remove max param from effective_cpu_util()/sched_cpu_util() Vincent Donnefort
2022-01-12 16:12 ` [PATCH v2 4/7] sched/fair: Rename select_idle_mask to select_rq_mask Vincent Donnefort
2022-01-12 16:12 ` [PATCH v2 5/7] sched/fair: Use the same cpumask per-PD throughout find_energy_efficient_cpu() Vincent Donnefort
2022-01-12 16:12 ` [PATCH v2 6/7] sched/fair: Remove task_util from effective utilization in feec() Vincent Donnefort
2022-01-12 19:44   ` kernel test robot
2022-01-12 19:44     ` kernel test robot
2022-01-17 13:17   ` Dietmar Eggemann
2022-01-18  9:46     ` Vincent Donnefort
2022-01-12 16:12 ` [PATCH v2 7/7] sched/fair: Remove the energy margin " Vincent Donnefort

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=YenQIsHJhaEuJFQl@FVFF7649Q05P \
    --to=vincent.donnefort@arm.com \
    --cc=Chris.Redpath@arm.com \
    --cc=Lukasz.Luba@arm.com \
    --cc=Morten.Rasmussen@arm.com \
    --cc=Valentin.Schneider@arm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=qperret@google.com \
    --cc=vincent.guittot@linaro.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.