From mboxrd@z Thu Jan  1 00:00:00 1970
Date: Wed, 12 Apr 2017 17:44:47 +0200
From: Peter Zijlstra <peterz@infradead.org>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: mingo@kernel.org, linux-kernel@vger.kernel.org,
        dietmar.eggemann@arm.com, Morten.Rasmussen@arm.com,
        yuyang.du@intel.com, pjt@google.com, bsegall@google.com
Subject: Re: [PATCH v2] sched/fair: update scale invariance of PELT
Message-ID: <20170412154447.coqnzhlhimz5pc3l@hirez.programming.kicks-ass.net>
References: <1491815909-13345-1-git-send-email-vincent.guittot@linaro.org>
 <20170410173802.orygigjbcpefqtdv@hirez.programming.kicks-ass.net>
 <20170411075221.GA30421@linaro.org>
 <20170411085305.aik6gdy6n3wa22ok@hirez.programming.kicks-ass.net>
 <20170411094021.GA17811@linaro.org>
 <20170411104136.33hkvzlmoa7zc72l@hirez.programming.kicks-ass.net>
 <20170411104949.eat4o37rlqiiobeu@hirez.programming.kicks-ass.net>
 <20170411130920.GB22895@linaro.org>
 <20170412112858.75hg75sd3clfxvvk@hirez.programming.kicks-ass.net>
 <20170412145047.GA19363@linaro.org>
MIME-Version: 1.0
Content-Type: text/plain; charset=iso-8859-1
Content-Disposition: inline
Content-Transfer-Encoding: 8bit
In-Reply-To: <20170412145047.GA19363@linaro.org>
User-Agent: NeoMutt/20170113 (1.7.2)
Sender: linux-kernel-owner@vger.kernel.org
List-ID: <linux-kernel.vger.kernel.org>
X-Mailing-List: linux-kernel@vger.kernel.org

On Wed, Apr 12, 2017 at 04:50:47PM +0200, Vincent Guittot wrote:
> On Wednesday 12 Apr 2017 at 13:28:58 (+0200), Peter Zijlstra wrote:

> > 
> >   |---------|---------|          (wall-time)
> >   ----****------------- F=100%
> >   ----******----------- F= 66%
> >   |--------------|----|          (fudge-time)
> 
> It was a bit hard for me to follow the diagram above, because you scale the
> idle time to get the same ratio at 100% and 66%, whereas I don't scale idle
> time, only running time.

Ah, so below I wrote that we then scale each window back to equal size,
so the absolute size in wall-time becomes immaterial.
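
As a rough illustration of that rescaling, here is a minimal sketch. It
assumes the 10-unit window and the 4 vs. 6 running units from the diagram;
the helper name is made up and nothing here is taken from the patch.

```c
#include <assert.h>

/*
 * Hypothetical helper (not from the patch): rescale a running time that was
 * measured inside a stretched (fudge-time) window back onto a window of the
 * nominal size.
 */
static int rescale_to_window(int run, int stretched, int window)
{
	assert(stretched > 0);
	return run * window / stretched;
}
```

With those numbers, 4 running units in a 10-unit window and 6 running units
in a 15-unit window both rescale to 4 units of a 10-unit window.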

> > (explicitly not used 50%, because then the second window would have
> > collapsed to 0, imagine the joy if you go lower still)
> 
> The second window can't collapse, because we are working on delta time, not
> absolute wall-time, and each delta covers only one type at a time: running or idle.

Right, but consider what happens when F drops too low: idle time disappears
from places where there would have been some at F=1. At that point things
become unrecoverable afaict.

> > So in fudge-time the first window has 6/15 == 4/10 for the max-freq /
> > wall-time combo.
> > 
> > > 
> > > Then l = p' - p''. The lost idle time is tracked so that the same number of
> > > decay windows is applied when the task is sleeping.
> > > 
> > > So at the end we have p'' + l = p' decay windows, i.e. still the same number
> > > of decay windows as previously.
> > 
> > Now, we have to stretch time back to equal window size, and while you do
> > that for the active windows, we have to do manual compensation for idle
> > windows (which is somewhat ugleh) and is where the lost-time comes from.
> 
> We can't stretch idle time because there is no relation between the idle time
> and the current capacity.

Brain melts..

> > Also, this all feels entirely yucky, because as per the above, if we'd
> > run at 33%, we'd have ended up with a negative time window.
> 
> I'm not sure I see how we can end up with a negative window. We are working
> with delta time, not absolute time.


   |---------|---------|---------|  F=100%
    --****------------------------

   |--------------|----|---------|  F= 66%
    --******----------------------

   |-------------------|---------|  F= 50%
    --********--------------------

   |-----------------------------|  F= 33%
    --************----------------


So what happens is that when the (wall) time for a window goes negative
it simply moves the next window along, until that too is compressed
etc..

So in the above figure, the right most edge of F=33% contains 2 whole
periods of idle time, both contracted to measure 0 (wall) time.

The only thing you have to recover them from is the lost idle time
measure.
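
To put numbers on the figure above, here is a minimal sketch. It assumes a
10-unit window and that the busy window is stretched by 1/F exactly as drawn;
F is passed as a fraction f_num/f_den to keep the arithmetic in integers. The
helper names are invented for illustration and are not part of the patch.

```c
#include <assert.h>

/* Wall-time length of the busy window once stretched by 1/F, F = f_num/f_den. */
static int stretched_window(int window, int f_num, int f_den)
{
	assert(window > 0 && f_num > 0 && f_den >= f_num);
	return window * f_den / f_num;
}

/*
 * Wall time left for the window right after the busy one, out of the two
 * nominal windows. Zero means it collapsed, negative means the stretch has
 * started eating into the window after that.
 */
static int next_window_wall(int window, int f_num, int f_den)
{
	return 2 * window - stretched_window(window, f_num, f_den);
}

/* Number of whole following windows squeezed down to zero wall time. */
static int collapsed_windows(int window, int f_num, int f_den)
{
	return (stretched_window(window, f_num, f_den) - window) / window;
}
```

For F = 1/1, 2/3, 1/2 and 1/3 the busy window measures 10, 15, 20 and 30 wall
units, the next window is left with 10, 5, 0 and -10, and the count of fully
collapsed windows is 0, 0, 1 and 2.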

> > Not to mention that this only seems to work for low utilization. Once
> > you hit higher utilization scenarios, where there isn't much idle time
> > to compensate for the stretching, things go wobbly. Although both
> > scenarios might end up being the same.
> 
> During the running phase, we calculate how much idle time has disappeared
> because we are running at a lower frequency, and we compensate for it once
> back to idle.
> 
> > 
> > And instead of resurrecting 0 sized windows, you throw them out, which
> 
> I don't get the point above.

It might've been slightly inaccurate. But the point remains that you
destroy time. Not all accrued lost idle time is recovered.

+               if (sa->util_sum < (LOAD_AVG_MAX * 1000)) {
+                       /*
+                        * Add the idle time stolen by running at lower compute
+                        * capacity
+                        */
+                       delta += sa->stolen_idle_time;
+               }
+               sa->stolen_idle_time = 0;

See here, stolen_idle_time is reset regardless. Time is non-continuous
at that point.
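
A standalone sketch of that hunk, to make the dropped time visible. The
struct, the function name and the 'dropped' out-parameter are hypothetical and
only mirror the quoted lines; the LOAD_AVG_MAX value of 47742 is assumed from
kernel/sched/fair.c of that era.

```c
#include <assert.h>
#include <stdint.h>

#define LOAD_AVG_MAX 47742 /* assumed value, see kernel/sched/fair.c */

/* Hypothetical stand-in for the two sched_avg fields the hunk touches. */
struct sched_avg_sketch {
	uint64_t util_sum;
	uint64_t stolen_idle_time;
};

/*
 * Mirrors the quoted hunk: the stolen idle time is only added to delta below
 * the util_sum threshold, but it is cleared either way. Whatever is cleared
 * without being added is reported through *dropped.
 */
static uint64_t add_stolen_idle(struct sched_avg_sketch *sa, uint64_t delta,
				uint64_t *dropped)
{
	assert(sa && dropped);
	*dropped = 0;
	if (sa->util_sum < (uint64_t)LOAD_AVG_MAX * 1000)
		delta += sa->stolen_idle_time;
	else
		*dropped = sa->stolen_idle_time;
	sa->stolen_idle_time = 0;
	return delta;
}
```

With util_sum at or above the threshold, a stolen_idle_time of 300 leaves
delta unchanged and ends up entirely in *dropped, which is the time that is
never recovered.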


I still have to draw myself some more interesting cases; I'm not convinced I
fully understand things.