From: Vincent Guittot
Date: Tue, 18 Oct 2016 10:43:24 +0200
Subject: Re: [v4.8-rc1 Regression] sched/fair: Apply more PELT fixes
To: Dietmar Eggemann
Cc: Peter Zijlstra, Joseph Salisbury, Ingo Molnar, Linus Torvalds,
 Thomas Gleixner, LKML, Mike Galbraith, omer.akram@canonical.com
In-Reply-To: <94cc6deb-f93e-60ec-5834-e84a8b98e73c@arm.com>
References: <57FFADC8.2020602@canonical.com>
 <43c59cba-2044-1de2-0f78-8f346bd1e3cb@arm.com>
 <20161014151827.GA10379@linaro.org>
 <2bb765e7-8a5f-c525-a6ae-fbec6fae6354@canonical.com>
 <20161017090903.GA11962@linaro.org>
 <4e15ad55-beeb-e860-0420-8f439d076758@arm.com>
 <20161017131952.GR3117@twins.programming.kicks-ass.net>
 <94cc6deb-f93e-60ec-5834-e84a8b98e73c@arm.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 18 October 2016 at 00:52, Dietmar Eggemann wrote:
> On 10/17/2016 02:54 PM, Vincent Guittot wrote:
>> On 17 October 2016 at 15:19, Peter Zijlstra wrote:
>>> On Mon, Oct 17, 2016 at 12:49:55PM +0100, Dietmar Eggemann wrote:
>
> [...]
>
>>>> BTW, I guess we can reach .tg_load_avg up to ~300000-400000 on such a system
>>>> initially because systemd will create all ~100 services (and therefore the
>>>> corresponding 2. level tg's) at once. In my previous example, there was 500ms
>>>> between the creation of 2 tg's so there was a lot of decaying going on in between.
>>>
>>> Cute...
>>> on current kernels that translates to simply removing the call
>>> to update_tg_load_avg(), lets see if we can figure out what goes
>>> sideways first though, because it _should_ decay back out. And if that
>>
>> yes, Reaching ~300000-400000 is not an issue in itself, the problem is
>> that load_avg has decayed but it has not been reflected in
>> tg->load_avg in the buggy case
>>
>>> can fail here, I'm not seeing why that wouldn't fail elsewhere either.
>>>
>>> I'll see if I can reproduce this with a script creating heaps of cgroups
>>> in a hurry, I have a total lack of system-disease on all my machines.
>

Hi Dietmar,

>
> Something looks weird related to the use of for_each_possible_cpu(i) in
> online_fair_sched_group() on my i5-3320M CPU (4 logical cpus).
>
> In case I print out cpu id and the cpu masks inside the for_each_possible_cpu(i)
> I get:
>
> [ 5.462368] cpu=0 cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462370] cpu=1 cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462370] cpu=2 cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462371] cpu=3 cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462372] *cpu=4* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462373] *cpu=5* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462374] *cpu=6* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
> [ 5.462375] *cpu=7* cpu_possible_mask=0-7 cpu_online_mask=0-3 cpu_present_mask=0-3 cpu_active_mask=0-3
>

Thanks to your description above, I have been able to reproduce the
issue on my ARM platform. The key point is to have cpu_possible_mask
different from cpu_present_mask in order to reproduce the problem.
When cpu_present_mask equals cpu_possible_mask, I can't reproduce the
problem.

I create a 1st level task group, tg-l1. Then, each time I create a new
task group in tg-l1, tg-l1.tg_load_avg increases by 1024 * the number
of CPUs that are possible but not present, like you described below.

Thanks,
Vincent

> T430:/sys/fs/cgroup/cpu,cpuacct/system.slice# ls -l | grep '^d' | wc -l
> 80
>
> /proc/sched_debug:
>
> cfs_rq[0]:/system.slice
> ...
> .tg_load_avg : 323584
> ...
>
> 80 * 1024 * 4 (not existent cpu4-cpu7) = 327680 (with a little bit of decay,
> this could be this extra load on the system.slice tg)
>
> Using for_each_online_cpu(i) instead of for_each_possible_cpu(i) in
> online_fair_sched_group() works on this machine, i.e. the .tg_load_avg
> of system.slice tg is 0 after startup.