From: Tejun Heo <tj@kernel.org>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mike Galbraith <efault@gmx.de>, Paul Turner <pjt@google.com>,
	Chris Mason <clm@fb.com>,
	kernel-team@fb.com
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
Date: Fri, 28 Apr 2017 16:33:47 -0400
Message-ID: <20170428203347.GC19364@htj.duckdns.org>
In-Reply-To: <CAKfTPtAasz-YAUGRKciY_Pab13n+2QGU6-xC=j7ef+jvqGwFtQ@mail.gmail.com>

Hello, Vincent.

On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
> On 27 April 2017 at 00:52, Tejun Heo <tj@kernel.org> wrote:
> > Hello,
> >
> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
> >> On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
> >> Can the problem be on the load balance side instead? And more
> >> precisely in the wakeup path?
> >> After looking at the trace, it seems that task placement happens in
> >> the wakeup path, and if it fails to select the right idle CPU at
> >> wakeup, you will have to wait for a load balance, which is already too late.
> >
> > Oh, I was tracing most scheduler activity and the ratios of
> > wakeups picking idle CPUs were about the same regardless of cgroup
> > membership.  I can confidently say that the latency issue I'm
> > seeing is from the load balancer picking the wrong busiest CPU,
> > which is not to say that there can't be other problems.
> 
> ok. Is there any trace that you can share? Your behavior seems
> different from mine.

I'm attaching the debug patch.  With your change (avg instead of
runnable_avg), the following trace shows why it's wrong.

It's dumping a case where group A has a CPU with two or more schbench
threads and B doesn't, but the load balancer determines that B is
more heavily loaded.

 dbg_odd: odd: dst=28 idle=2 brk=32 lbtgt=0-31 type=2
 dbg_odd_dump: A: grp=1,17 w=2 avg=7.247 grp=8.337 sum=8.337 pertask=2.779
 dbg_odd_dump: A: gcap=1.150 gutil=1.095 run=3 idle=0 gwt=2 type=2 nocap=1
 dbg_odd_dump: A: CPU001: run=1 schb=1
 dbg_odd_dump: A: Q001-asdf: w=1.000,l=0.525,u=0.513,r=0.527 run=1 hrun=1 tgs=100.000 tgw=17.266
 dbg_odd_dump: A: Q001-asdf:  schbench(153757C):w=1.000,l=0.527,u=0.514
 dbg_odd_dump: A: Q001-/: w=5.744,l=2.522,u=0.520,r=3.067 run=1 hrun=1 tgs=1.000 tgw=0.000
 dbg_odd_dump: A: Q001-/:  asdf(C):w=5.744,l=3.017,u=0.521
 dbg_odd_dump: A: CPU017: run=2 schb=2
 dbg_odd_dump: A: Q017-asdf: w=2.000,l=0.989,u=0.966,r=0.988 run=2 hrun=2 tgs=100.000 tgw=17.266
 dbg_odd_dump: A: Q017-asdf:  schbench(153737C):w=1.000,l=0.493,u=0.482 schbench(153739):w=1.000,l=0.494,u=0.483
 dbg_odd_dump: A: Q017-/: w=10.653,l=7.888,u=0.973,r=5.270 run=1 hrun=2 tgs=1.000 tgw=0.000
 dbg_odd_dump: A: Q017-/:  asdf(C):w=10.653,l=5.269,u=0.966
 dbg_odd_dump: B: grp=14,30 w=2 avg=7.666 grp=8.819 sum=8.819 pertask=4.409
 dbg_odd_dump: B: gcap=1.150 gutil=1.116 run=2 idle=0 gwt=2 type=2 nocap=1
 dbg_odd_dump: B: CPU014: run=1 schb=1
 dbg_odd_dump: B: Q014-asdf: w=1.000,l=1.004,u=0.970,r=0.492 run=1 hrun=1 tgs=100.000 tgw=17.266
 dbg_odd_dump: B: Q014-asdf:  schbench(153760C):w=1.000,l=0.491,u=0.476
 dbg_odd_dump: B: Q014-/: w=5.605,l=11.146,u=0.970,r=5.774 run=1 hrun=1 tgs=1.000 tgw=0.000
 dbg_odd_dump: B: Q014-/:  asdf(C):w=5.605,l=5.766,u=0.970
 dbg_odd_dump: B: CPU030: run=1 schb=1
 dbg_odd_dump: B: Q030-asdf: w=1.000,l=0.538,u=0.518,r=0.558 run=1 hrun=1 tgs=100.000 tgw=17.266
 dbg_odd_dump: B: Q030-asdf:  schbench(153747C):w=1.000,l=0.537,u=0.516
 dbg_odd_dump: B: Q030-/: w=5.758,l=3.186,u=0.541,r=3.044 run=1 hrun=1 tgs=1.000 tgw=0.000
 dbg_odd_dump: B: Q030-/:  asdf(C):w=5.758,l=3.092,u=0.516

You can notice that B's pertask weight, 4.409, is way higher than A's
2.779, and this is because Q014-asdf's contribution to Q014-/ is
twice as high as it should be.  The root queue's runnable avg should
only contain what's currently on the queue, but because we're scaling
load_avg, which includes both active and blocked contributions, we
end up picking group B over A.
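
To make the arithmetic concrete, here's a toy model of the two
propagation rules (ordinary C with doubles rather than the kernel's
fixed-point load units; the inputs are rounded from the Q014 dump
above, so this is an illustration, not kernel code):

  #include <stdio.h>

  int main(void)
  {
          /* Q014-asdf child queue, rounded from the dump above */
          double child_load_avg     = 1.004;  /* active + blocked */
          double child_runnable_avg = 0.492;  /* on-queue only */
          double group_se_weight    = 5.605;  /* Q014-/ w= */
          double child_weight_sum   = 1.000;  /* Q014-asdf w= */

          /* scale the child queue's contribution up to the parent */
          double via_load     = child_load_avg     * group_se_weight / child_weight_sum;
          double via_runnable = child_runnable_avg * group_se_weight / child_weight_sum;

          printf("propagating load_avg:          %.3f\n", via_load);     /* ~5.63 */
          printf("propagating runnable_load_avg: %.3f\n", via_runnable); /* ~2.76 */
          return 0;
  }

The first number is roughly twice the second, which is exactly the
doubled contribution showing up in B's pertask figure.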

This shows up in the total number of times we pick the wrong queue,
and thus in latency.  I'm running the following script with the debug
patch applied.

  #!/bin/bash

  date
  cat /proc/self/cgroup

  # dump details for every 1000th odd pick and reset the counter
  echo 1000 > /sys/module/fair/parameters/dbg_odd_nth
  echo 0 > /sys/module/fair/parameters/dbg_odd_cnt

  ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30

  # total number of odd picks seen during the run
  cat /sys/module/fair/parameters/dbg_odd_cnt


With your patch applied, in the root cgroup,

  Fri Apr 28 12:48:59 PDT 2017
  0::/
  Latency percentiles (usec)
	  50.0000th: 26
	  75.0000th: 63
	  90.0000th: 78
	  95.0000th: 88
	  *99.0000th: 707
	  99.5000th: 5096
	  99.9000th: 10352
	  min=0, max=13743
  577


In the /asdf cgroup,

  Fri Apr 28 13:19:53 PDT 2017
  0::/asdf
  Latency percentiles (usec)
	  50.0000th: 35
	  75.0000th: 67
	  90.0000th: 81
	  95.0000th: 98
	  *99.0000th: 2212
	  99.5000th: 4536
	  99.9000th: 11024
	  min=0, max=13026
  1708


The last line is the number of times the load balancer picked a group
without two or more schbench threads on a CPU over one with them.
Some number of these is expected, as there are other threads and
there is some play in all the calculations, but propagating load_avg,
or not propagating at all, significantly increases both the count and
the latency.

> > The issue isn't about whether runnable_load_avg or load_avg should be
> > used but the unexpected differences in the metrics that the load
> 
> I think that's the root of the problem. I explain a bit more my view
> on the other thread

So, when picking the busiest group, the only thing that matters is
the queue's runnable_load_avg, which should approximate the sum of
all on-queue loads on that CPU.
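
As a sketch of that invariant (plain C with invented scaffolding;
only the idea matches the kernel's PELT bookkeeping): an entity's
load joins runnable_load_avg at enqueue and leaves it at dequeue,
while load_avg keeps the blocked contribution until it decays:

  #include <stdio.h>

  struct cfs_rq_avg {
          double load_avg;           /* runnable + blocked */
          double runnable_load_avg;  /* on-queue only */
  };

  static void enqueue(struct cfs_rq_avg *rq, double se_load)
  {
          rq->load_avg          += se_load;
          rq->runnable_load_avg += se_load;
  }

  static void dequeue_sleep(struct cfs_rq_avg *rq, double se_load)
  {
          /* a sleeping task leaves runnable_load_avg immediately but
           * decays out of load_avg only over time (decay not modeled) */
          rq->runnable_load_avg -= se_load;
  }

  int main(void)
  {
          struct cfs_rq_avg rq = { 0, 0 };

          enqueue(&rq, 0.5);
          enqueue(&rq, 0.5);
          dequeue_sleep(&rq, 0.5);    /* one task goes to sleep */

          /* prints load_avg=1.0 runnable=0.5; the balancer should be
           * comparing the 0.5, since only one task is on the queue */
          printf("load_avg=%.1f runnable=%.1f\n",
                 rq.load_avg, rq.runnable_load_avg);
          return 0;
  }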

If we don't propagate, or propagate load_avg, we're factoring the
blocked avg of descendant cgroups into the root's runnable_load_avg,
which is obviously wrong.

We can argue whether overriding a cfs_rq se's load_avg with the
scaled runnable_load_avg of the cfs_rq is the right way to go, or
whether we should introduce a separate channel to propagate
runnable_load_avg; however, it's clear that we need to fix
runnable_load_avg propagation one way or another.
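
Shape-wise, the two options look something like this (the struct
layout here is invented for the sketch; only the field names echo
the kernel's):

  struct sched_avg    { double load_avg; };
  struct sched_entity { struct sched_avg avg; };
  struct cfs_rq       { double runnable_load_avg; double scale; };

  /* Option A (roughly what the patches do): reuse the group se's
   * load_avg as the propagation channel */
  static void propagate_override(struct sched_entity *se, struct cfs_rq *q)
  {
          se->avg.load_avg = q->runnable_load_avg * q->scale;
  }

  /* Option B would instead add a dedicated field, e.g.
   * se->avg.runnable_load_avg, and carry that up separately. */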

The thing with a cfs_rq se's load_avg is that it isn't really used
anywhere else AFAICS, so overriding it with the cfs_rq's
runnable_load_avg isn't the prettiest approach but doesn't really
change anything.

Thanks.

-- 
tejun
