Date: Fri, 28 Apr 2017 16:33:47 -0400
From: Tejun Heo
To: Vincent Guittot
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
	Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
Message-ID: <20170428203347.GC19364@htj.duckdns.org>
References: <20170424201344.GA14169@wtj.duckdns.org>
	<20170424201444.GC14169@wtj.duckdns.org>
	<20170426225202.GC11348@wtj.duckdns.org>
X-Mailing-List: linux-kernel@vger.kernel.org

Hello, Vincent.

On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
> On 27 April 2017 at 00:52, Tejun Heo wrote:
> > Hello,
> >
> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
> >> On 24 April 2017 at 22:14, Tejun Heo wrote:
> >> Can the problem be on the load balance side instead ? and more
> >> precisely in the wakeup path ?
> >> After looking at the trace, it seems that task placement happens at
> >> wake up path and if it fails to select the right idle cpu at wake up,
> >> you will have to wait for a load balance which is already too late
> >
> > Oh, I was tracing most of scheduler activities and the ratios of
> > wakeups picking idle CPUs were about the same regardless of cgroup
> > membership.
> > I can confidently say that the latency issue that I'm
> > seeing is from load balancer picking the wrong busiest CPU, which is
> > not to say that there can't be other problems.
>
> ok. Is there any trace that you can share ? your behavior seems
> different from mine

I'm attaching the debug patch.  With your change (avg instead of
runnable_avg), the following trace shows why it's wrong.

It's dumping a case where group A has a CPU w/ two or more schbench
threads and B doesn't, but the load balancer is determining that B is
loaded heavier.

  dbg_odd: odd: dst=28 idle=2 brk=32 lbtgt=0-31 type=2
  dbg_odd_dump: A: grp=1,17 w=2 avg=7.247 grp=8.337 sum=8.337 pertask=2.779
  dbg_odd_dump: A: gcap=1.150 gutil=1.095 run=3 idle=0 gwt=2 type=2 nocap=1
  dbg_odd_dump: A: CPU001: run=1 schb=1
  dbg_odd_dump: A: Q001-asdf: w=1.000,l=0.525,u=0.513,r=0.527 run=1 hrun=1 tgs=100.000 tgw=17.266
  dbg_odd_dump: A: Q001-asdf: schbench(153757C):w=1.000,l=0.527,u=0.514
  dbg_odd_dump: A: Q001-/: w=5.744,l=2.522,u=0.520,r=3.067 run=1 hrun=1 tgs=1.000 tgw=0.000
  dbg_odd_dump: A: Q001-/: asdf(C):w=5.744,l=3.017,u=0.521
  dbg_odd_dump: A: CPU017: run=2 schb=2
  dbg_odd_dump: A: Q017-asdf: w=2.000,l=0.989,u=0.966,r=0.988 run=2 hrun=2 tgs=100.000 tgw=17.266
  dbg_odd_dump: A: Q017-asdf: schbench(153737C):w=1.000,l=0.493,u=0.482 schbench(153739):w=1.000,l=0.494,u=0.483
  dbg_odd_dump: A: Q017-/: w=10.653,l=7.888,u=0.973,r=5.270 run=1 hrun=2 tgs=1.000 tgw=0.000
  dbg_odd_dump: A: Q017-/: asdf(C):w=10.653,l=5.269,u=0.966
  dbg_odd_dump: B: grp=14,30 w=2 avg=7.666 grp=8.819 sum=8.819 pertask=4.409
  dbg_odd_dump: B: gcap=1.150 gutil=1.116 run=2 idle=0 gwt=2 type=2 nocap=1
  dbg_odd_dump: B: CPU014: run=1 schb=1
  dbg_odd_dump: B: Q014-asdf: w=1.000,l=1.004,u=0.970,r=0.492 run=1 hrun=1 tgs=100.000 tgw=17.266
  dbg_odd_dump: B: Q014-asdf: schbench(153760C):w=1.000,l=0.491,u=0.476
  dbg_odd_dump: B: Q014-/: w=5.605,l=11.146,u=0.970,r=5.774 run=1 hrun=1 tgs=1.000 tgw=0.000
  dbg_odd_dump: B: Q014-/: asdf(C):w=5.605,l=5.766,u=0.970
  dbg_odd_dump: B: CPU030: run=1 schb=1
  dbg_odd_dump: B: Q030-asdf: w=1.000,l=0.538,u=0.518,r=0.558 run=1 hrun=1 tgs=100.000 tgw=17.266
  dbg_odd_dump: B: Q030-asdf: schbench(153747C):w=1.000,l=0.537,u=0.516
  dbg_odd_dump: B: Q030-/: w=5.758,l=3.186,u=0.541,r=3.044 run=1 hrun=1 tgs=1.000 tgw=0.000
  dbg_odd_dump: B: Q030-/: asdf(C):w=5.758,l=3.092,u=0.516

Note that B's pertask weight is 4.409, which is way higher than A's
2.779.  This is because Q014-asdf's contribution to Q014-/ is twice as
high as it should be: Q014-asdf has l=1.004 but r=0.492.  The root
queue's runnable avg should only contain what's currently active, but
because we're scaling load avg, which includes both active and blocked
load, we end up picking group B over A.

This shows up in the total number of times we pick the wrong queue and
thus in latency.  I'm running the following script with the debug patch
applied.

  #!/bin/bash

  # Record when and in which cgroup this run happened.
  date
  cat /proc/self/cgroup

  echo 1000 > /sys/module/fair/parameters/dbg_odd_nth
  echo 0 > /sys/module/fair/parameters/dbg_odd_cnt

  ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30

  # Printed as the last line of the output below.
  cat /sys/module/fair/parameters/dbg_odd_cnt

With your patch applied, in the root cgroup,

  Fri Apr 28 12:48:59 PDT 2017
  0::/
  Latency percentiles (usec)
          50.0000th: 26
          75.0000th: 63
          90.0000th: 78
          95.0000th: 88
          *99.0000th: 707
          99.5000th: 5096
          99.9000th: 10352
          min=0, max=13743
  577

In the /asdf cgroup,

  Fri Apr 28 13:19:53 PDT 2017
  0::/asdf
  Latency percentiles (usec)
          50.0000th: 35
          75.0000th: 67
          90.0000th: 81
          95.0000th: 98
          *99.0000th: 2212
          99.5000th: 4536
          99.9000th: 11024
          min=0, max=13026
  1708

The last line is the number of times the load balancer picked a group
w/o two or more schbench threads on a CPU over one w/.  Some number of
these are expected, as there are other threads and there is some play
in all the calculations, but propagating avg, or not propagating at
all, significantly increases the count and latency.
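As a rough cross-check of the dump (this is not part of the debug
patch), the group-level figures can be reproduced from the per-CPU root
queue r= values.  The sketch below assumes that sum is the sum of the
root queues' r values, that avg is sum divided by gcap, that pertask is
sum divided by run, and that the group with the higher avg is picked as
busiest.  Those rules are inferred from the numbers above, so treat
them as assumptions; avg comes out close to but not exactly the dumped
value because gcap is rounded.  corrected_groups() is a hypothetical
what-if that scales CPU014's root contribution by Q014-asdf's r/l
ratio, which is only an approximation of what propagating
runnable_load_avg would do.

```python
# Values copied from the dump: per-CPU root cfs_rq runnable load ("r=" on
# the Q0xx-/ lines), group run count ("run=") and group capacity ("gcap=").
GROUPS = {
    "A": {"root_runnable": {1: 3.067, 17: 5.270}, "nr_running": 3, "gcap": 1.150},
    "B": {"root_runnable": {14: 5.774, 30: 3.044}, "nr_running": 2, "gcap": 1.150},
}


def group_stats(group):
    """Return (sum, avg, pertask) the way the dump appears to derive them."""
    total = sum(group["root_runnable"].values())
    return total, total / group["gcap"], total / group["nr_running"]


def busiest(groups):
    """Assumed rule: the group with the highest capacity-scaled load wins."""
    return max(groups, key=lambda name: group_stats(groups[name])[1])


def corrected_groups():
    """Hypothetical what-if: drop the blocked part of CPU014's contribution.

    Scales the CPU014 root value by Q014-asdf's r/l ratio (0.492 / 1.004).
    This is an approximation for illustration, not a kernel computation.
    """
    fixed = {
        name: {**group, "root_runnable": dict(group["root_runnable"])}
        for name, group in GROUPS.items()
    }
    fixed["B"]["root_runnable"][14] *= 0.492 / 1.004
    return fixed


if __name__ == "__main__":
    for name, group in GROUPS.items():
        total, avg, pertask = group_stats(group)
        print(f"{name}: sum={total:.3f} avg={avg:.3f} pertask={pertask:.3f}")
    print("busiest as dumped:", busiest(GROUPS))
    print("busiest after hypothetical correction:", busiest(corrected_groups()))
```

Under these assumptions B wins on the dumped values, while the what-if
case puts A ahead, which matches the expectation that the group with
two schbench threads on one CPU should be the busier one.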
> > The issue isn't about whether runnable_load_avg or load_avg should be
> > used but the unexpected differences in the metrics that the load
>
> I think that's the root of the problem. I explain a bit more my view
> on the other thread

So, when picking the busiest group, the only thing which matters is the
queue's runnable_load_avg, which should approximate the sum of all
on-queue loads on that CPU.

If we don't propagate, or if we propagate load_avg, we're factoring the
blocked avg of descendant cgroups into the root's runnable_load_avg,
which is obviously wrong.

We can argue whether overriding a cfs_rq se's load_avg to the scaled
runnable_load_avg of the cfs_rq is the right way to go or we should
introduce a separate channel to propagate runnable_load_avg; however,
it's clear that we need to fix runnable_load_avg propagation one way
or another.

The thing with cfs_rq se's load_avg is that it isn't really used
anywhere else AFAICS, so overriding it to the cfs_rq's
runnable_load_avg isn't the prettiest but doesn't really change
anything.

Thanks.

-- 
tejun