From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path: 
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
	id S1752143AbdECHzX (ORCPT );
	Wed, 3 May 2017 03:55:23 -0400
Received: from mail-oi0-f53.google.com ([209.85.218.53]:33219 "EHLO
	mail-oi0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
	with ESMTP id S1751621AbdECHzQ (ORCPT );
	Wed, 3 May 2017 03:55:16 -0400
MIME-Version: 1.0
In-Reply-To: 
References: <20170425184941.GB15593@wtj.duckdns.org>
	<20170425210810.GB20255@wtj.duckdns.org>
	<20170427003020.GD11348@wtj.duckdns.org>
	<20170428161450.GA4628@htj.duckdns.org>
	<20170502205639.GB5335@htj.duckdns.org>
From: Vincent Guittot 
Date: Wed, 3 May 2017 09:54:54 +0200
Message-ID: 
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
To: Tejun Heo 
Cc: Ingo Molnar , Peter Zijlstra , linux-kernel ,
	Linus Torvalds , Mike Galbraith , Paul Turner , Chris Mason ,
	kernel-team@fb.com
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID: 
X-Mailing-List: linux-kernel@vger.kernel.org

On 3 May 2017 at 09:25, Vincent Guittot wrote:
> On 2 May 2017 at 22:56, Tejun Heo wrote:
>> Hello, Vincent.
>>
>> On Tue, May 02, 2017 at 08:56:52AM +0200, Vincent Guittot wrote:
>>> On 28 April 2017 at 18:14, Tejun Heo wrote:
>>> > I'll follow up in the other subthread but there really is a fundamental
>>> > difference in how we calculate runnable_avg w/ and w/o cgroups.
>>> > Independent of whether we can improve the load balancer further, it is
>>> > an existing bug.
>>>
>>> I'd like to weigh in on that a bit.
>>> runnable_load_avg works correctly as it is because it correctly reflects
>>> the load of the runnable entities at the root domain.
>>> If you start to propagate the runnable_load_avg onto the load_avg of the
>>> group entity, the load will become unstable.
>>> runnable_load_avg was added to fix load_balance being unable to
>>> select the right busiest rq. So the goal is to rely more and more on
>>> load_avg, not the opposite.
>>
>> I have a hard time understanding what you're trying to say here.
>>
>> Without cgroup, the load balancer uses the sum of load_avgs of the
>> running tasks on the queue. As shown by the debug trace, the load
>> balancer repeatedly ends up picking the wrong CPU when cgroup is
>> involved because it ends up including the load_avgs of nested blocked
>> tasks into runnable_load_avg of root - we lose the distinction between
>> running and blocked load_avgs when we pass through a nested cfs_rq.
>>
>> We can further improve the load balancer all we want; for example,
>> right now, we would end up picking a CPU with one task which has a
>> really high weight over another CPU with two normal weight tasks,
>> which isn't ideal. However, there is something obviously broken in the
>> existing mechanism and we want to fix that first, independent of
>> further improvements, and it won't be a good idea to paper over an
>> existing problem with a different mechanism either.
>
> That's probably where I disagree.
> IMO, runnable_load_avg is already a fix to compensate for the fact that,
> when PELT was rewritten, the load balancer was not updated accordingly
> and was unable to make the right choice between 2 groups: one group
> having only 1 task but a higher load, and another group with several
> tasks but a lower load (either because of the blocked load or because
> one runnable task has a far higher load than the others).
> runnable_load_avg was put back to quickly fix the 1st case but it
> doesn't fix the other.
> There were also some issues with propagating load updates on task
> migration, but those have been fixed now.
> So we have 2 use cases with the exact same behavior: a group with a
> higher load (runnable_load_avg or load_avg) but only 1 task, and
> another one with several tasks but a lower load. Right now,
> runnable_load_avg can only fix one of them, at the cost of breaking
> load_avg behavior, whereas one fix in load_balance can fix both
> without breaking load_avg behavior.
> We would be better off working on removing runnable_load_avg
> completely rather than adding more stuff on top of it.
>
>>
>>> >> I always leave time between 2 consecutive runs, and the 10 consecutive
>>> >> runs take around 2 min to execute.
>>> >>
>>> >> Then I have run these 10 consecutive tests several times and the
>>> >> results stay the same.
>>> >
>>> > Can you please try the patch at the end of this message? I'm
>>> > wondering whether you're seeing the errors accumulating from the wrong
>>> > min().
>>>
>>> I still have the regression with the patch below.
>>> The regression comes from using runnable_load_avg for the propagation.
>>> As load_avg becomes null at some point, it breaks the computation of
>>> the shares and the load_avg stays very low.
>>
>> That's surprising given that what the patch does is bringing the
>> cgroup behavior closer to !cgroup behavior. It'd be great to be able
>
> But it also breaks load_avg, which is used to calculate the per-cfs_rq
> share, and this change generates the regression on my board.
>
>> to reproduce the problem and trace it. It looks like the board is
>> pretty standardized. Would the following be equivalent to the one you
>> have?
>>
>> http://a.co/f3dD1lm
>
> Yes, it's this board. I have the previous version but it should not
> change the behavior.
>>
>> If so, I can just buy it, get your test image and repro it here and
>> trace why the regression is happening with the setup. We might be
>> hitting something else.
>
> I have pushed my branch which includes v4.11-rc8 + your patches + some
> debug trace here:
> https://git.linaro.org/people/vincent.guittot/kernel.git/log/?h=test-tejun-patches
>
> and uploaded a trace-cmd trace that shows the regression:
> http://people.linaro.org/~vincent.guittot/trace-tejun-patches

I have also uploaded the trace without your patches so you can compare both:
http://people.linaro.org/~vincent.guittot/trace-v4.8-rc11

>
> As an example, we can easily see around time stamp 94.826469 sec that
> cpu5 is idle while CPU2 and CPU3 share their time between several
> tasks.
> This doesn't happen without your patch.
>
>>
>> Thanks.
>>
>> --
>> tejun
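
To make the disagreement above concrete, here is a minimal standalone
sketch (plain C; the struct, field names and numbers are invented for
illustration and are not the kernel's cfs_rq code) of the two signals
being argued about: a load_avg-style sum that still counts blocked
contributions, versus a runnable_load_avg-style sum that only counts
tasks that are enqueued right now.

/*
 * Toy model only: "toy_cfs_rq" and its fields are illustrative, not
 * the kernel's data structures. It shows how folding a child group's
 * load_avg (runnable + blocked) into the root makes blocked load look
 * runnable, whereas propagating only the runnable part does not.
 */
#include <stdio.h>

struct toy_cfs_rq {
	long runnable_load;	/* load of tasks currently enqueued */
	long blocked_load;	/* load of recently-run, now sleeping tasks */
};

/* load_avg-style signal: blocked contributions are still included */
static long toy_load_avg(const struct toy_cfs_rq *rq)
{
	return rq->runnable_load + rq->blocked_load;
}

int main(void)
{
	/* a child group with one running task and a lot of blocked load */
	struct toy_cfs_rq child = { .runnable_load = 1024, .blocked_load = 3072 };

	/* propagate via load_avg: the root sees 4096 as if it were runnable */
	long root_via_load_avg = toy_load_avg(&child);

	/* propagate only the runnable part: the root sees 1024 */
	long root_via_runnable = child.runnable_load;

	printf("root sees %ld via load_avg, %ld via runnable-only propagation\n",
	       root_via_load_avg, root_via_runnable);
	return 0;
}

Run as-is, the sketch prints 4096 for the load_avg-style path and 1024
for the runnable-only path; that gap is the kind of difference the load
balancer ends up comparing when blocked load is folded into a group
entity's contribution at the root.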