Re: [PATCH v2] sched/fair: update scale invariance of PELT

From: Peter Zijlstra <peterz@infradead.org>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: mingo@kernel.org, linux-kernel@vger.kernel.org,
	dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com,
	yuyang.du@intel.com, pjt@google.com, bsegall@google.com
Subject: Re: [PATCH v2] sched/fair: update scale invariance of PELT
Date: Wed, 12 Apr 2017 13:28:58 +0200	[thread overview]
Message-ID: <20170412112858.75hg75sd3clfxvvk@hirez.programming.kicks-ass.net> (raw)
In-Reply-To: <20170411130920.GB22895@linaro.org>

On Tue, Apr 11, 2017 at 03:09:20PM +0200, Vincent Guittot wrote:
> Le Tuesday 11 Apr 2017 à 12:49:49 (+0200), Peter Zijlstra a écrit :
> > 
> > Lets go back to the unscaled version:
> > 
> >     running   idle
> >    |*********|---------|
> > 
> > With the current code, that would effectively end up like (again
> > assuming 50%):
> > 
> >     running   idle
> >    |*_*_*_*_*|---------|
> > 
> > 
> > Time stays the same, but we add extra idle cycles.
> 
> In fact it's not really like that because this doesn't reflect the impact of
> the decay window which is not linear with time
> 
> For a task that run at max freq
> 
> (1) |-------|-------|-------|----
>        ****            ****
>     ---****------------****------
> 
> The same task running at half frequency, will run twice longer
>     |-------|-------|-------|----
>        ********        ********                          
>     ---********--------********--
> 
> With the current implementation, we are not inserting idle cycle but
> dividing the contribution by half in order to compensate the fact that the task
> will run twice longer:
> 
> (2) |-------|-------|-------|----
> 
>     ---********--------********--  
> 
> But the final result is neither equal to
> 
>     |-------|-------|-------|----
>        * * * *         * * * *                
>     ---*_*_*_*---------*_*_*_*---
> 
> nor to (1) as described below:

I'm not sure I get what you say; dividing the contribution in half is
exactly equal to the above picture. Because as you can see, there's half
the number of * per window.

> Let assume all durations are aligned with decay window. This will simplify the
> formula for the explanation and will not change the demonstration. We can come back
> on the full equation later on.
> 
> For (1), we have the equation that you write previously:
> 
>      util_sum' = utilsum * y^p +
>                                   p-1
>                d1 * y^p + 1024 * \Sum y^n + d3 * y^0
>  	                              n=1
> 
> Which becomes like below when we are aligned to decay window (d1 and d3 are null)

> (A)  util_sum' = utilsum * y^p +
>                        p-1
> 			   1024 * \Sum y^n
>  	                   n=1
> 
> The running time at max frequency is d and p is the number of decay window period
> In this case p = d/1024us and l = 0
> 
> For (2), task runs at half frequency so the duration d' is twice longer than
> d and p'=2*p. In current implementation, we compensate the increase of running
> duration (and lost idle time) by dividing the contribution by 2:

>      util_sum' = utilsum * y^p' +
>                        p'-1
>                 512 * \Sum y^n
>  	                   n=1
> 
> (B)  util_sum' = utilsum * y^(2*p) +
>                        2*p-1
>                 512 * \Sum y^n
>                        n=1

Hmmm.. but 2p is the actual wall-time of the window period.

> With the new scale invariance, we have the following pattern before scaling
> time:
> 
>     |-------|-------|-------|--
>        ********        ********
>     ---********--------********
> 
> Task still runs at half frequency so the duration d' is twice longer than
> d and p': p'=2*p just like for (2).

> But before computing util_sum', we change the temporal referencial to reflect
> what would have been done at max frequency: we scale the running time (divide it by 2)
> in order to have something like:
> 
>     |-------|-------|-------|--
>        ****            ****
>     ---****____--------****____

So you're moving the window edges away from wall-time.

> so we now have a duration d'' that is half d'and a number of period p'' that
> is half p' so p'' = 1/2 * p' == p

>      util_sum' = utilsum * y^p'' +
>                        p''-1
>                1024 * \Sum y^n
>  	                   n=1
>      util_sum' = utilsum * y^p +
>                        p-1
>                 512 * \Sum y^n
>  	                   n=1

  |---------|---------|          (wall-time)
  ----****------------- F=100%
  ----******----------- F= 66%
  |--------------|----|          (fudge-time)

(explicitly not used 50%, because then the second window would have
collapsed to 0, imagine the joy if you go lower still)

So in fudge-time the first window has 6/15 == 4/10 for the max-freq /
wall-time combo.

> 
> Then l = p' - p''. The lost idle time is tracked to apply the same amount of decay
> window when the task is sleeping
> 
> so at the end we have a number of decay window of p''+l = p'' so we still have
> the same amount of decay window than previously.

Now, we have to stretch time back to equal window size, and while you do
that for the active windows, we have to do manual compensation for idle
windows (which is somewhat ugleh) and is where the lost-time comes from.

Also, this all feels entirely yucky, because as per the above, if we'd
ran at 33%, we'd have ended up with a negative time window.

Not to mention that this only seems to work for low utilization. Once
you hit higher utilization scenarios, where there isn't much idle time
to compensate for the stretching, things go wobbly. Although both
scenarios might end up being the same.

And instead of resurrecting 0 sized windows, you throw them out, which
results in max-util at low F being reported as max-util at F=1, which I
suppose is a useful property and results in the increased ramp-up (which
is a desired property).

So not only do you have non-linear time, you also have non-continuous
time.

I still wonder about the whole !running vs !weight thing.. this all
needs more thinking.