From mboxrd@z Thu Jan 1 00:00:00 1970
Return-Path:
Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand
        id S1751980AbdECHfV (ORCPT ); Wed, 3 May 2017 03:35:21 -0400
Received: from mail-oi0-f53.google.com ([209.85.218.53]:34481 "EHLO
        mail-oi0-f53.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org
        with ESMTP id S1751014AbdECHfN (ORCPT );
        Wed, 3 May 2017 03:35:13 -0400
MIME-Version: 1.0
In-Reply-To: <20170502215054.GC5335@htj.duckdns.org>
References: <20170424201344.GA14169@wtj.duckdns.org>
        <20170424201444.GC14169@wtj.duckdns.org>
        <20170426225202.GC11348@wtj.duckdns.org>
        <20170428203347.GC19364@htj.duckdns.org>
        <20170502215054.GC5335@htj.duckdns.org>
From: Vincent Guittot
Date: Wed, 3 May 2017 09:34:51 +0200
Message-ID:
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
To: Tejun Heo
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
        Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
Content-Type: text/plain; charset=UTF-8
Sender: linux-kernel-owner@vger.kernel.org
List-ID:
X-Mailing-List: linux-kernel@vger.kernel.org

Hi Tejun,

On 2 May 2017 at 23:50, Tejun Heo wrote:
> Hello,
>
> On Tue, May 02, 2017 at 09:18:53AM +0200, Vincent Guittot wrote:
>> > dbg_odd: odd: dst=28 idle=2 brk=32 lbtgt=0-31 type=2
>> > dbg_odd_dump: A: grp=1,17 w=2 avg=7.247 grp=8.337 sum=8.337 pertask=2.779
>> > dbg_odd_dump: A: gcap=1.150 gutil=1.095 run=3 idle=0 gwt=2 type=2 nocap=1
>> > dbg_odd_dump: A: CPU001: run=1 schb=1
>> > dbg_odd_dump: A: Q001-asdf: w=1.000,l=0.525,u=0.513,r=0.527 run=1 hrun=1 tgs=100.000 tgw=17.266
>> > dbg_odd_dump: A: Q001-asdf: schbench(153757C):w=1.000,l=0.527,u=0.514
>> > dbg_odd_dump: A: Q001-/: w=5.744,l=2.522,u=0.520,r=3.067 run=1 hrun=1 tgs=1.000 tgw=0.000
>> > dbg_odd_dump: A: Q001-/: asdf(C):w=5.744,l=3.017,u=0.521
>> > dbg_odd_dump: A: CPU017: run=2 schb=2
>> > dbg_odd_dump: A: Q017-asdf: w=2.000,l=0.989,u=0.966,r=0.988 run=2 hrun=2 tgs=100.000 tgw=17.266
>> > dbg_odd_dump: A: Q017-asdf: schbench(153737C):w=1.000,l=0.493,u=0.482 schbench(153739):w=1.000,l=0.494,u=0.483
>> > dbg_odd_dump: A: Q017-/: w=10.653,l=7.888,u=0.973,r=5.270 run=1 hrun=2 tgs=1.000 tgw=0.000
>> > dbg_odd_dump: A: Q017-/: asdf(C):w=10.653,l=5.269,u=0.966
>> > dbg_odd_dump: B: grp=14,30 w=2 avg=7.666 grp=8.819 sum=8.819 pertask=4.409
>> > dbg_odd_dump: B: gcap=1.150 gutil=1.116 run=2 idle=0 gwt=2 type=2 nocap=1
>> > dbg_odd_dump: B: CPU014: run=1 schb=1
>> > dbg_odd_dump: B: Q014-asdf: w=1.000,l=1.004,u=0.970,r=0.492 run=1 hrun=1 tgs=100.000 tgw=17.266
>> > dbg_odd_dump: B: Q014-asdf: schbench(153760C):w=1.000,l=0.491,u=0.476
>> > dbg_odd_dump: B: Q014-/: w=5.605,l=11.146,u=0.970,r=5.774 run=1 hrun=1 tgs=1.000 tgw=0.000
>> > dbg_odd_dump: B: Q014-/: asdf(C):w=5.605,l=5.766,u=0.970
>> > dbg_odd_dump: B: CPU030: run=1 schb=1
>> > dbg_odd_dump: B: Q030-asdf: w=1.000,l=0.538,u=0.518,r=0.558 run=1 hrun=1 tgs=100.000 tgw=17.266
>> > dbg_odd_dump: B: Q030-asdf: schbench(153747C):w=1.000,l=0.537,u=0.516
>> > dbg_odd_dump: B: Q030-/: w=5.758,l=3.186,u=0.541,r=3.044 run=1 hrun=1 tgs=1.000 tgw=0.000
>> > dbg_odd_dump: B: Q030-/: asdf(C):w=5.758,l=3.092,u=0.516
>> >
>> > You can notice that B's pertask weight is 4.409 which is way higher
>> > than A's 2.779, and this is from Q014-asdf's contribution to Q014-/ is
>> > twice as high as it should be. The root queue's runnable avg should
>>
>> Are you sure that this is because of blocked load in group A ? it can
>> be that Q014-asdf has already have to wait before running and its load
>> still increase while runnable but not running .
>
> This is with propagation enabled, so the only thing contributing to
> the root queue's runnable_load_avg is the load being propagated from
> Q014-asdf, which has twice high load avg than runnable. The past
> history doesn't matter for load balancing and without cgroup this
> blocked load wouldn't have contributed to root's runnable load avg. I
> don't think it can get much clearer.
>
>> IIUC your trace, group A has 2 running tasks and group B only one but
>> load_balance selects B because of its sgs->avg_load being higher. But
>> this can also happen even if runnable_load_avg of child cfs_rq was
>> propagated correctly in group entity because we can have situation
>> where a group A has only 1 task with higher load than 2 tasks on
>> groupB and even if blocked load is not taken into account, and
>> load_balance will select A.
>
> Yes, it can happen with tasks w/ different weights. That's clearly
> not what's happening here. The load balancer is picking the wrong CPU
> far more frequently because the root queue's runnable load avg
> incorrectly includes blocked load avgs from nested cfs_rqs.
>
>> IMHO, we should better improve load balance selection. I'm going to
>> add smarter group selection in load_balance. that's something we
>> should have already done but it was difficult without load/util_avg
>> propagation. it should be doable now
>
> That's all well and great but let's fix a bug first; otherwise, we'd
> be papering over an existing issue with a new mechanism which is a bad
> idea for any code base which has to last.

runnable_load_avg is already a kind of fix, and breaking load_avg seems
worse than fixing load_balance.

>
>> > We can argue whether overriding a cfs_rq se's load_avg to the scaled
>> > runnable_load_avg of the cfs_rq is the right way to go or we should
>> > introduce a separate channel to propagate runnable_load_avg; however,
>> > it's clear that we need to fix runnable_load_avg propagation one way
>> > or another.
>>
>> The minimum would be to not break load_avg
>
> Oh yeah, this I can understand. The proposed change is icky in that
> it forces group se->load_avg.avg to be runnable_load_avg of the
> corresponding group cfs_rq. We *can* introduce a separate channel,
> say, se->group_runnable_load_avg which is used to propagate
> runnable_load_avg; however, the thing is that we don't really use
> group se->load_avg.avg anywhere, so we might as well just override it.

We use load_avg for calculating a stable share, and we want to use it
more and more. So breaking it because it's easier doesn't seem to be
the right way to go, IMHO.

>
> I have a preliminary patch to introduce a separate field but it looks
> sad too because we end up calculating the load_avg and
> runnable_load_avg to propagate separately without actually using the
> former value anywhere.
>
>> > The thing with cfs_rq se's load_avg is that, it isn't really used
>> > anywhere else AFAICS, so overriding it to the cfs_rq's
>> > runnable_load_avg isn't prettiest but doesn't really change anything.
>>
>> load_avg is used for defining the share of each cfs_rq.
>
> Each cfs_rq calculates its load_avg independently from the weight sum.
> The queued se's load_avgs don't affect cfs_rq's load_avg in any direct
> way. The only time the value is used is for propagation during
> migration; however, group se themselves never get migrated themselves
> and during propagation only deltas matter so the difference between
> load_avg and runnable_load_avg isn't gonna matter that much. In
> short, we don't really use group se's load_avg in any way significant.
>
> Thanks.
>
> --
> tejun
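
For reference, a minimal sketch of the arithmetic behind the two figures
debated in the quoted dump. It is an illustration under assumptions, not
kernel code: it assumes that "pertask" is the group's "sum" divided by
its "run" count, and that "l=" and "r=" on a queue line are load_avg and
runnable_load_avg. Both readings are inferred from the numbers in the
dump and from Tejun's description. The helper names per_task_load and
load_to_runnable_ratio are hypothetical.

```python
# Illustrative only: the values are copied from the dbg_odd_dump lines
# quoted in the thread. The formulas are assumptions inferred from those
# numbers, not taken from the kernel source or from the debug patch.


def per_task_load(group_sum: float, nr_running: int) -> float:
    """Assumed meaning of 'pertask': the group's 'sum' divided by 'run'."""
    if nr_running <= 0:
        raise ValueError("nr_running must be positive")
    return group_sum / nr_running


def load_to_runnable_ratio(load_avg: float, runnable_load_avg: float) -> float:
    """Ratio of a queue's 'l=' value to its 'r=' value (assumed meanings)."""
    if runnable_load_avg <= 0:
        raise ValueError("runnable_load_avg must be positive")
    return load_avg / runnable_load_avg


# Group A: sum=8.337 run=3. Group B: sum=8.819 run=2.
group_a = per_task_load(8.337, 3)
group_b = per_task_load(8.819, 2)

# Q014-asdf: l=1.004, r=0.492.
q014_ratio = load_to_runnable_ratio(1.004, 0.492)

print(f"A pertask={group_a:.3f}  B pertask={group_b:.3f}")
print(f"Q014-asdf load/runnable ratio={q014_ratio:.2f}")
```

Under these assumptions the per-task values agree with the dump to within
rounding, and the Q014-asdf ratio comes out at roughly two, which is the
"twice as high" contribution the quoted text refers to.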