Date: Mon, 24 Apr 2017 13:13:44 -0700
From: Tejun Heo
To: Ingo Molnar, Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Linus Torvalds, Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
Subject: [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use
Message-ID: <20170424201344.GA14169@wtj.duckdns.org>

Hello,

We've noticed a scheduling latency spike when cgroup is in use, even when
the machine is idle enough, with moderate scheduling frequency and a
single level of cgroup nesting.  More details are in the patch
descriptions, but here's a schbench run from the root cgroup.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
         50.0000th: 26
         75.0000th: 62
         90.0000th: 74
         95.0000th: 86
         *99.0000th: 887
         99.5000th: 3692
         99.9000th: 10832
         min=0, max=13374

And here's one from inside a first-level cgroup.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
         50.0000th: 31
         75.0000th: 65
         90.0000th: 71
         95.0000th: 91
         *99.0000th: 7288
         99.5000th: 10352
         99.9000th: 12496
         min=0, max=13023

The p99 latency spike was tracked down to runnable_load_avg not being
propagated through nested cfs_rqs, leaving load_balance() operating on
out-of-sync load information.  It ended up picking the wrong CPU as the
load-balance target often enough to significantly impact p99 latency.
This patchset fixes the issue by always propagating runnable_load_avg so
that, regardless of nesting, every cfs_rq's runnable_load_avg is the sum
of the scaled loads of all tasks queued below it.  As a side effect, this
changes the load_avg behavior of sched_entities associated with cfs_rq's.
It doesn't seem wrong to me and I can't think of a better / cleaner way,
but if there is, please let me know.

This patchset is on top of v4.11-rc8 and contains the following two
patches.

 sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
 sched/fair: Always propagate runnable_load_avg

diffstat follows.

 kernel/sched/fair.c | 46 +++++++++++++++++++---------------------------
 1 file changed, 19 insertions(+), 27 deletions(-)

Thanks.

--
tejun