From: Vincent Guittot
Date: Wed, 26 Apr 2017 20:12:09 +0200
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
To: Tejun Heo
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds, Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
In-Reply-To: <20170424201444.GC14169@wtj.duckdns.org>
References: <20170424201344.GA14169@wtj.duckdns.org> <20170424201444.GC14169@wtj.duckdns.org>
List-ID: linux-kernel.vger.kernel.org

On 24 April 2017 at 22:14, Tejun Heo wrote:
> We noticed that with cgroup CPU controller in use, the scheduling
>
> Note the drastic increase in p99 scheduling latency.  After
> investigation, it turned out that update_sd_lb_stats(), which is
> used by load_balance() to pick the most loaded group, was often
> picking the wrong group.  A CPU which has one schbench running and

Could the problem be on the load-balancing side instead, and more
precisely in the wakeup path?  Looking at the trace, task placement
happens in the wakeup path, and if it fails to select the right idle
CPU at wake-up, you then have to wait for a load balance, which is
already too late.

> another queued wouldn't report the correspondingly higher

It will, because load_avg includes runnable_load_avg, so whatever load
is in runnable_load_avg is in load_avg too.  Conversely,
runnable_load_avg will not include the blocked load of a task that is
about to wake up, which is exactly the schbench case.

One last thing: the load_avg of an idle CPU can keep stale blocked
load for a while (until a load balance happens and updates the blocked
load), so the CPU can be seen as "busy" when it is not.  Could that be
a cause of your problem?  I have an ongoing patch that at least partly
solves this, if it turns out to be the reason.

> weighted_cpuload() and get looked over as the target of load
> balancing.
>
> weighted_cpuload() is the root cfs_rq's runnable_load_avg, which is
> the sum of the load_avg of all queued sched_entities.  Without cgroups
> or at the root cgroup, each task's load_avg contributes directly to
> the sum.  When a task wakes up or goes to sleep, the change is
> immediately reflected in runnable_load_avg, which in turn affects
> load balancing.
>
> #else	/* CONFIG_FAIR_GROUP_SCHED */
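
To make the load_avg vs. runnable_load_avg distinction in the exchange
above concrete, here is a small stand-alone sketch.  It is plain C,
not kernel code, and the 1024 weights are only placeholders standing
in for a nice-0 task's contribution.  It mimics the bookkeeping being
discussed: load_avg keeps the contribution of blocked (sleeping)
entities, while runnable_load_avg only counts entities that are
actually queued, which is why a CPU whose schbench worker just went to
sleep can still look loaded through load_avg.

#include <stdio.h>
#include <stdbool.h>

/* Toy per-entity state: its load contribution and whether it is queued. */
struct entity {
        const char *name;
        unsigned long load;     /* per-entity load_avg contribution */
        bool queued;            /* currently on the runqueue? */
};

int main(void)
{
        /* One schbench worker running, one that just went to sleep. */
        struct entity cpu0[] = {
                { "schbench-running",  1024, true  },
                { "schbench-sleeping", 1024, false },
        };

        unsigned long load_avg = 0, runnable_load_avg = 0;

        for (unsigned int i = 0; i < sizeof(cpu0) / sizeof(cpu0[0]); i++) {
                /* load_avg keeps the blocked (sleeping) contribution... */
                load_avg += cpu0[i].load;
                /* ...runnable_load_avg only counts queued entities. */
                if (cpu0[i].queued)
                        runnable_load_avg += cpu0[i].load;
        }

        printf("load_avg          = %lu\n", load_avg);          /* 2048 */
        printf("runnable_load_avg = %lu\n", runnable_load_avg); /* 1024 */
        return 0;
}

Compiled with any C compiler, this prints load_avg = 2048 and
runnable_load_avg = 1024 for a CPU with one running and one sleeping
worker, matching the asymmetry between the two signals described in
the thread.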