Re: [PATCH v2] sched/fair: update scale invariance of PELT

From: Vincent Guittot <vincent.guittot@linaro.org>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Morten Rasmussen <Morten.Rasmussen@arm.com>,
	Yuyang Du <yuyang.du@intel.com>, Paul Turner <pjt@google.com>,
	Ben Segall <bsegall@google.com>
Subject: Re: [PATCH v2] sched/fair: update scale invariance of PELT
Date: Thu, 13 Apr 2017 11:42:29 +0200	[thread overview]
Message-ID: <CAKfTPtCH6h7N+c1JM0QhF7Byn-eWyG3gESWLWu8x=kVbKkmsYA@mail.gmail.com> (raw)
In-Reply-To: <20170412154447.coqnzhlhimz5pc3l@hirez.programming.kicks-ass.net>

On 12 April 2017 at 17:44, Peter Zijlstra <peterz@infradead.org> wrote:
> On Wed, Apr 12, 2017 at 04:50:47PM +0200, Vincent Guittot wrote:
>> Le Wednesday 12 Apr 2017 à 13:28:58 (+0200), Peter Zijlstra a écrit :
>
>> >
>> >   |---------|---------|          (wall-time)
>> >   ----****------------- F=100%
>> >   ----******----------- F= 66%
>> >   |--------------|----|          (fudge-time)
>>
>> It has been a bit hard for me to catch the diagram above because you scale the
>> idle time to get same ratio at 100% and 66% wherease I don't scale idle
>> time but only running time.
>
> Ah, so below I wrote that we then scale each window back to equal size,
> so the absolute size in wall-time becomes immaterial.
>
>> > (explicitly not used 50%, because then the second window would have
>> > collapsed to 0, imagine the joy if you go lower still)
>>
>> The second window can't collapse because we are working on delta time not
>> absolute wall-time and the delta is for only 1 type at a time: running or idle
>
> Right, but consider what happens when F drops too low, idle goes away
> from where there would've been some at F=1. At that point things become
> unrecoverable afaict.

Yes I agree that if the frequency is too low to handle the running
time in the period, there is no more idle time to recover lost idle
time.
In this case, either the frequency will be increase by cpufreq and we
will finally run fast enough to recover the lost idle time
or we stay at this frequency and the task is always running and we
can't do anything else than assuming that this is an always running
task and at this end (once task util_avg reach max value) discard the
lost idle time

>
>> > So in fudge-time the first window has 6/15 == 4/10 for the max-freq /
>> > wall-time combo.
>> >
>> > >
>> > > Then l = p' - p''. The lost idle time is tracked to apply the same amount of decay
>> > > window when the task is sleeping
>> > >
>> > > so at the end we have a number of decay window of p''+l = p'' so we still have
>> > > the same amount of decay window than previously.
>> >
>> > Now, we have to stretch time back to equal window size, and while you do
>> > that for the active windows, we have to do manual compensation for idle
>> > windows (which is somewhat ugleh) and is where the lost-time comes from.
>>
>> We can't stretch idle time because there is no relation between the idle time
>> and the current capacity.
>
> Brain melts..
>
>> > Also, this all feels entirely yucky, because as per the above, if we'd
>> > ran at 33%, we'd have ended up with a negative time window.
>>
>> Not sure to catch how we can end up with negative window. We are working with
>> delta time not absolute time.
>
>
>    |---------|---------|---------|  F=100%
>     --****------------------------
>
>    |--------------|----|---------|  F= 66%
>     --******----------------------
>
>    |-------------------|---------|  F= 50%
>     --********--------------------
>
>    |-----------------------------|  F= 33%
>     --************----------------
>
>
> So what happens is that when the (wall) time for a window goes negative
> it simply moves the next window along, until that too is compressed
> etc..
>
> So in the above figure, the right most edge of F=33% contains 2 whole
> periods of idle time, both contracted to measure 0 (wall) time.
>
> The only thing you have to recover them from is the lost idle time
> measure.
>
>> > Not to mention that this only seems to work for low utilization. Once
>> > you hit higher utilization scenarios, where there isn't much idle time
>> > to compensate for the stretching, things go wobbly. Although both
>> > scenarios might end up being the same.
>>
>> During the running phase, we calculate how much idle time has diseappeared
>> because we are running at lower frequency and we compensate it once back to
>> idle.
>>
>> >
>> > And instead of resurrecting 0 sized windows, you throw them out, which
>>
>> I don't catch point above
>
> It might've been slightly inaccurate. But the point remains that you
> destroy time. Not all accrued lost idle time is recovered.
>
> +               if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {
> +                       /*
> +                        * Add the idle time stolen by running at lower compute
> +                        * capacity
> +                        */
> +                       delta += sa->stolen_idle_time;
> +               }
> +               sa->stolen_idle_time = 0;
>
> See here, stolen_idle_time is reset regardless. Time is non-continuous
> at that point.

In fact, we are provisioning some possible stolen time when running at
lower frequency but it can happen that there is no idle time even if
the task runs at max frequency. In this case, there is no lost idle
time to recover.
I consider that a task that have a utilization at max value (minus a
threshold) is an always running time and in this case there is no lost
idle time to recover

>
>
> I still have to draw me more interesting cases, I'm not convinced I
> fully understand things.