Date: Mon, 24 Apr 2017 13:13:44 -0700
From: Tejun Heo
To: Ingo Molnar, Peter Zijlstra
Cc: linux-kernel@vger.kernel.org, Linus Torvalds, Vincent Guittot, Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
Subject: [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use
Message-ID: <20170424201344.GA14169@wtj.duckdns.org>

Hello,

We've noticed a scheduling latency spike when cgroup is in use, even when
the machine is idle enough, with moderate scheduling frequency and a
single level of cgroup nesting.  More details are in the patch
descriptions, but here's a schbench run from the root cgroup.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
         50.0000th: 26
         75.0000th: 62
         90.0000th: 74
         95.0000th: 86
         *99.0000th: 887
         99.5000th: 3692
         99.9000th: 10832
         min=0, max=13374

And here's one from inside a first-level cgroup.

 # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
 Latency percentiles (usec)
         50.0000th: 31
         75.0000th: 65
         90.0000th: 71
         95.0000th: 91
         *99.0000th: 7288
         99.5000th: 10352
         99.9000th: 12496
         min=0, max=13023

The p99 latency spike was tracked down to runnable_load_avg not being
propagated through nested cfs_rqs, leaving load_balance() operating on
out-of-sync load information.  It ended up picking the wrong CPU as the
load-balance target often enough to significantly impact p99 latency.
This patchset fixes the issue by always propagating runnable_load_avg so
that, regardless of nesting, every cfs_rq's runnable_load_avg is the sum
of the scaled loads of all tasks queued below it.  As a side effect, this
changes the load_avg behavior of sched_entities associated with cfs_rq's.
It doesn't seem wrong to me and I can't think of a better / cleaner way,
but if there is, please let me know.

This patchset is on top of v4.11-rc8 and contains the following two
patches.

 sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity
 sched/fair: Always propagate runnable_load_avg

diffstat follows.

 kernel/sched/fair.c | 46 +++++++++++++++++++---------------------------
 1 file changed, 19 insertions(+), 27 deletions(-)

Thanks.

--
tejun