linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Vincent Guittot <vincent.guittot@linaro.org>
To: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mike Galbraith <efault@gmx.de>, Paul Turner <pjt@google.com>,
	Chris Mason <clm@fb.com>,
	kernel-team@fb.com
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
Date: Tue, 2 May 2017 09:18:53 +0200	[thread overview]
Message-ID: <CAKfTPtB3PectZVm_z6NhSszXwG-G=y-J1Y5Rxk48bLxHQ8mx4g@mail.gmail.com> (raw)
In-Reply-To: <20170428203347.GC19364@htj.duckdns.org>

On 28 April 2017 at 22:33, Tejun Heo <tj@kernel.org> wrote:
> Hello, Vincent.
>
> On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
>> On 27 April 2017 at 00:52, Tejun Heo <tj@kernel.org> wrote:
>> > Hello,
>> >
>> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
>> >> On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
>> >> Can the problem be on the load balance side instead ?  and more
>> >> precisely in the wakeup path ?
>> >> After looking at the trace, it seems that task placement happens at
>> >> wake up path and if it fails to select the right idle cpu at wake up,
>> >> you will have to wait for a load balance which is alreayd too late
>> >
>> > Oh, I was tracing most of scheduler activities and the ratios of
>> > wakeups picking idle CPUs were about the same regardless of cgroup
>> > membership.  I can confidently say that the latency issue that I'm
>> > seeing is from load balancer picking the wrong busiest CPU, which is
>> > not to say that there can be other problems.
>>
>> ok. Is there any trace that you can share ? your behavior seems
>> different of mine
>
> I'm attaching the debug patch.  With your change (avg instead of
> runnable_avg), the following trace shows why it's wrong.
>
> It's dumping a case where group A has a CPU w/ more than two schbench
> threads and B doesn't, but the load balancer is determining that B is
> loaded heavier.
>
>  dbg_odd: odd: dst=28 idle=2 brk=32 lbtgt=0-31 type=2
>  dbg_odd_dump: A: grp=1,17 w=2 avg=7.247 grp=8.337 sum=8.337 pertask=2.779
>  dbg_odd_dump: A: gcap=1.150 gutil=1.095 run=3 idle=0 gwt=2 type=2 nocap=1
>  dbg_odd_dump: A: CPU001: run=1 schb=1
>  dbg_odd_dump: A: Q001-asdf: w=1.000,l=0.525,u=0.513,r=0.527 run=1 hrun=1 tgs=100.000 tgw=17.266
>  dbg_odd_dump: A: Q001-asdf:  schbench(153757C):w=1.000,l=0.527,u=0.514
>  dbg_odd_dump: A: Q001-/: w=5.744,l=2.522,u=0.520,r=3.067 run=1 hrun=1 tgs=1.000 tgw=0.000
>  dbg_odd_dump: A: Q001-/:  asdf(C):w=5.744,l=3.017,u=0.521
>  dbg_odd_dump: A: CPU017: run=2 schb=2
>  dbg_odd_dump: A: Q017-asdf: w=2.000,l=0.989,u=0.966,r=0.988 run=2 hrun=2 tgs=100.000 tgw=17.266
>  dbg_odd_dump: A: Q017-asdf:  schbench(153737C):w=1.000,l=0.493,u=0.482 schbench(153739):w=1.000,l=0.494,u=0.483
>  dbg_odd_dump: A: Q017-/: w=10.653,l=7.888,u=0.973,r=5.270 run=1 hrun=2 tgs=1.000 tgw=0.000
>  dbg_odd_dump: A: Q017-/:  asdf(C):w=10.653,l=5.269,u=0.966
>  dbg_odd_dump: B: grp=14,30 w=2 avg=7.666 grp=8.819 sum=8.819 pertask=4.409
>  dbg_odd_dump: B: gcap=1.150 gutil=1.116 run=2 idle=0 gwt=2 type=2 nocap=1
>  dbg_odd_dump: B: CPU014: run=1 schb=1
>  dbg_odd_dump: B: Q014-asdf: w=1.000,l=1.004,u=0.970,r=0.492 run=1 hrun=1 tgs=100.000 tgw=17.266
>  dbg_odd_dump: B: Q014-asdf:  schbench(153760C):w=1.000,l=0.491,u=0.476
>  dbg_odd_dump: B: Q014-/: w=5.605,l=11.146,u=0.970,r=5.774 run=1 hrun=1 tgs=1.000 tgw=0.000
>  dbg_odd_dump: B: Q014-/:  asdf(C):w=5.605,l=5.766,u=0.970
>  dbg_odd_dump: B: CPU030: run=1 schb=1
>  dbg_odd_dump: B: Q030-asdf: w=1.000,l=0.538,u=0.518,r=0.558 run=1 hrun=1 tgs=100.000 tgw=17.266
>  dbg_odd_dump: B: Q030-asdf:  schbench(153747C):w=1.000,l=0.537,u=0.516
>  dbg_odd_dump: B: Q030-/: w=5.758,l=3.186,u=0.541,r=3.044 run=1 hrun=1 tgs=1.000 tgw=0.000
>  dbg_odd_dump: B: Q030-/:  asdf(C):w=5.758,l=3.092,u=0.516
>
> You can notice that B's pertask weight is 4.409 which is way higher
> than A's 2.779, and this is from Q014-asdf's contribution to Q014-/ is
> twice as high as it should be.  The root queue's runnable avg should

Are you sure that this is because of blocked load in group A ? it can
be that Q014-asdf has already have to wait before running and its load
still increase while runnable but not running .
IIUC your trace, group A has 2 running tasks and group B only one but
load_balance selects B because of its sgs->avg_load being higher. But
this can also happen even if runnable_load_avg of child cfs_rq was
propagated correctly in group entity because we can have situation
where a group A has only 1 task with higher load than 2 tasks on
groupB and even if blocked load is not taken into account, and
load_balance will select A.

IMHO, we should better improve load balance selection. I'm going to
add smarter group selection in load_balance. that's something we
should have already done but it was difficult without load/util_avg
propagation. it should be doable now

> only contain what's currently active but because we're scaling load
> avg which includes both active and blocked, we're ending up picking
> group B over A.
>
> This shows up in the total number of times we pick the wrong queue and
> thus latency.  I'm running the following script with the debug patch
> applied.
>
>   #!/bin/bash
>
>   date
>   cat /proc/self/cgroup
>
>   echo 1000 > /sys/module/fair/parameters/dbg_odd_nth
>   echo 0 > /sys/module/fair/parameters/dbg_odd_cnt
>
>   ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>
>   cat /sys/module/fair/parameters/dbg_odd_cnt
>
>
> With your patch applied, in the root cgroup,
>
>   Fri Apr 28 12:48:59 PDT 2017
>   0::/
>   Latency percentiles (usec)
>           50.0000th: 26
>           75.0000th: 63
>           90.0000th: 78
>           95.0000th: 88
>           *99.0000th: 707
>           99.5000th: 5096
>           99.9000th: 10352
>           min=0, max=13743
>   577
>
>
> In the /asdf cgroup,
>
>   Fri Apr 28 13:19:53 PDT 2017
>   0::/asdf
>   Latency percentiles (usec)
>           50.0000th: 35
>           75.0000th: 67
>           90.0000th: 81
>           95.0000th: 98
>           *99.0000th: 2212
>           99.5000th: 4536
>           99.9000th: 11024
>           min=0, max=13026
>   1708
>
>
> The last line is the number of times the load balancer picked a group
> w/o more than two schbench threads on a CPU over one w/.  Some number
> of these are expected as there are other threads and there are some
> plays in all the calculations but propgating avg or not propgating at
> all significantly increases the count and latency.
>
>> > The issue isn't about whether runnable_load_avg or load_avg should be
>> > used but the unexpected differences in the metrics that the load
>>
>> I think that's the root of the problem. I explain a bit more my view
>> on the other thread
>
> So, when picking the busiest group, the only thing which matters is
> the queue's runnable_load_avg, which should approximate the sum of all
> on-queue loads on that CPU.
>
> If we don't propagate or propagate load_avg, we're factoring in
> blocked avg of descendent cgroups into the root's runnable_load_avg
> which is obviously wrong.
>
> We can argue whether overriding a cfs_rq se's load_avg to the scaled
> runnable_load_avg of the cfs_rq is the right way to go or we should
> introduce a separate channel to propagate runnable_load_avg; however,
> it's clear that we need to fix runnable_load_avg propagation one way
> or another.

The minimum would be to not break load_avg

>
> The thing with cfs_rq se's load_avg is that, it isn't really used
> anywhere else AFAICS, so overriding it to the cfs_rq's
> runnable_load_avg isn't prettiest but doesn't really change anything.

load_avg is used for defining the share of each cfs_rq.

>
> Thanks.
>
> --
> tejun

  parent reply	other threads:[~2017-05-02  7:19 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-24 20:13 [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
2017-04-24 20:14 ` [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity Tejun Heo
2017-04-24 21:33   ` [PATCH v2 " Tejun Heo
2017-05-03 18:00     ` Peter Zijlstra
2017-05-03 21:45       ` Tejun Heo
2017-05-04  5:51         ` Peter Zijlstra
2017-05-04  6:21           ` Peter Zijlstra
2017-05-04  9:49             ` Dietmar Eggemann
2017-05-04 10:57               ` Peter Zijlstra
2017-05-04 17:39               ` Tejun Heo
2017-05-05 10:36                 ` Dietmar Eggemann
2017-05-04 10:26       ` Vincent Guittot
2017-04-25  8:35   ` [PATCH " Vincent Guittot
2017-04-25 18:12     ` Tejun Heo
2017-04-26 16:51       ` Vincent Guittot
2017-04-26 22:40         ` Tejun Heo
2017-04-27  7:00           ` Vincent Guittot
2017-05-01 14:17         ` Peter Zijlstra
2017-05-01 14:52           ` Peter Zijlstra
2017-05-01 21:56           ` Tejun Heo
2017-05-02  8:19             ` Peter Zijlstra
2017-05-02  8:30               ` Peter Zijlstra
2017-05-02 20:00                 ` Tejun Heo
2017-05-03  9:10                   ` Peter Zijlstra
2017-04-26 16:14   ` Vincent Guittot
2017-04-26 22:27     ` Tejun Heo
2017-04-27  8:59       ` Vincent Guittot
2017-04-28 17:46         ` Tejun Heo
2017-05-02  7:20           ` Vincent Guittot
2017-04-24 20:14 ` [PATCH 2/2] sched/fair: Always propagate runnable_load_avg Tejun Heo
2017-04-25  8:46   ` Vincent Guittot
2017-04-25  9:05     ` Vincent Guittot
2017-04-25 12:59       ` Vincent Guittot
2017-04-25 18:49         ` Tejun Heo
2017-04-25 20:49           ` Tejun Heo
2017-04-25 21:15             ` Chris Mason
2017-04-25 21:08           ` Tejun Heo
2017-04-26 10:21             ` Vincent Guittot
2017-04-27  0:30               ` Tejun Heo
2017-04-27  8:28                 ` Vincent Guittot
2017-04-28 16:14                   ` Tejun Heo
2017-05-02  6:56                     ` Vincent Guittot
2017-05-02 20:56                       ` Tejun Heo
2017-05-03  7:25                         ` Vincent Guittot
2017-05-03  7:54                           ` Vincent Guittot
2017-04-26 18:12   ` Vincent Guittot
2017-04-26 22:52     ` Tejun Heo
2017-04-27  8:29       ` Vincent Guittot
2017-04-28 20:33         ` Tejun Heo
2017-04-28 20:38           ` Tejun Heo
2017-05-01 15:56           ` Peter Zijlstra
2017-05-02 22:01             ` Tejun Heo
2017-05-02  7:18           ` Vincent Guittot [this message]
2017-05-02 13:26             ` Vincent Guittot
2017-05-02 22:37               ` Tejun Heo
2017-05-02 21:50             ` Tejun Heo
2017-05-03  7:34               ` Vincent Guittot
2017-05-03  9:37                 ` Peter Zijlstra
2017-05-03 10:37                   ` Vincent Guittot
2017-05-03 13:09                     ` Peter Zijlstra
2017-05-03 21:49                       ` Tejun Heo
2017-05-04  8:19                         ` Vincent Guittot
2017-05-04 17:43                           ` Tejun Heo
2017-05-04 19:02                             ` Vincent Guittot
2017-05-04 19:04                               ` Tejun Heo
2017-04-24 21:35 ` [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities Tejun Heo
2017-04-24 21:48   ` Peter Zijlstra
2017-04-24 22:54     ` Tejun Heo
2017-04-25 21:09   ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAKfTPtB3PectZVm_z6NhSszXwG-G=y-J1Y5Rxk48bLxHQ8mx4g@mail.gmail.com' \
    --to=vincent.guittot@linaro.org \
    --cc=clm@fb.com \
    --cc=efault@gmx.de \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=tj@kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).