All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vincent Guittot <vincent.guittot@linaro.org>
To: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mike Galbraith <efault@gmx.de>, Paul Turner <pjt@google.com>,
	Chris Mason <clm@fb.com>,
	kernel-team@fb.com
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
Date: Tue, 2 May 2017 09:18:53 +0200	[thread overview]
Message-ID: <CAKfTPtB3PectZVm_z6NhSszXwG-G=y-J1Y5Rxk48bLxHQ8mx4g@mail.gmail.com> (raw)
In-Reply-To: <20170428203347.GC19364@htj.duckdns.org>

On 28 April 2017 at 22:33, Tejun Heo <tj@kernel.org> wrote:
> Hello, Vincent.
>
> On Thu, Apr 27, 2017 at 10:29:10AM +0200, Vincent Guittot wrote:
>> On 27 April 2017 at 00:52, Tejun Heo <tj@kernel.org> wrote:
>> > Hello,
>> >
>> > On Wed, Apr 26, 2017 at 08:12:09PM +0200, Vincent Guittot wrote:
>> >> On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
>> >> Can the problem be on the load balance side instead ?  and more
>> >> precisely in the wakeup path ?
>> >> After looking at the trace, it seems that task placement happens at
>> >> wake up path and if it fails to select the right idle cpu at wake up,
>> >> you will have to wait for a load balance which is alreayd too late
>> >
>> > Oh, I was tracing most of scheduler activities and the ratios of
>> > wakeups picking idle CPUs were about the same regardless of cgroup
>> > membership.  I can confidently say that the latency issue that I'm
>> > seeing is from load balancer picking the wrong busiest CPU, which is
>> > not to say that there can be other problems.
>>
>> ok. Is there any trace that you can share ? your behavior seems
>> different of mine
>
> I'm attaching the debug patch.  With your change (avg instead of
> runnable_avg), the following trace shows why it's wrong.
>
> It's dumping a case where group A has a CPU w/ more than two schbench
> threads and B doesn't, but the load balancer is determining that B is
> loaded heavier.
>
>  dbg_odd: odd: dst=28 idle=2 brk=32 lbtgt=0-31 type=2
>  dbg_odd_dump: A: grp=1,17 w=2 avg=7.247 grp=8.337 sum=8.337 pertask=2.779
>  dbg_odd_dump: A: gcap=1.150 gutil=1.095 run=3 idle=0 gwt=2 type=2 nocap=1
>  dbg_odd_dump: A: CPU001: run=1 schb=1
>  dbg_odd_dump: A: Q001-asdf: w=1.000,l=0.525,u=0.513,r=0.527 run=1 hrun=1 tgs=100.000 tgw=17.266
>  dbg_odd_dump: A: Q001-asdf:  schbench(153757C):w=1.000,l=0.527,u=0.514
>  dbg_odd_dump: A: Q001-/: w=5.744,l=2.522,u=0.520,r=3.067 run=1 hrun=1 tgs=1.000 tgw=0.000
>  dbg_odd_dump: A: Q001-/:  asdf(C):w=5.744,l=3.017,u=0.521
>  dbg_odd_dump: A: CPU017: run=2 schb=2
>  dbg_odd_dump: A: Q017-asdf: w=2.000,l=0.989,u=0.966,r=0.988 run=2 hrun=2 tgs=100.000 tgw=17.266
>  dbg_odd_dump: A: Q017-asdf:  schbench(153737C):w=1.000,l=0.493,u=0.482 schbench(153739):w=1.000,l=0.494,u=0.483
>  dbg_odd_dump: A: Q017-/: w=10.653,l=7.888,u=0.973,r=5.270 run=1 hrun=2 tgs=1.000 tgw=0.000
>  dbg_odd_dump: A: Q017-/:  asdf(C):w=10.653,l=5.269,u=0.966
>  dbg_odd_dump: B: grp=14,30 w=2 avg=7.666 grp=8.819 sum=8.819 pertask=4.409
>  dbg_odd_dump: B: gcap=1.150 gutil=1.116 run=2 idle=0 gwt=2 type=2 nocap=1
>  dbg_odd_dump: B: CPU014: run=1 schb=1
>  dbg_odd_dump: B: Q014-asdf: w=1.000,l=1.004,u=0.970,r=0.492 run=1 hrun=1 tgs=100.000 tgw=17.266
>  dbg_odd_dump: B: Q014-asdf:  schbench(153760C):w=1.000,l=0.491,u=0.476
>  dbg_odd_dump: B: Q014-/: w=5.605,l=11.146,u=0.970,r=5.774 run=1 hrun=1 tgs=1.000 tgw=0.000
>  dbg_odd_dump: B: Q014-/:  asdf(C):w=5.605,l=5.766,u=0.970
>  dbg_odd_dump: B: CPU030: run=1 schb=1
>  dbg_odd_dump: B: Q030-asdf: w=1.000,l=0.538,u=0.518,r=0.558 run=1 hrun=1 tgs=100.000 tgw=17.266
>  dbg_odd_dump: B: Q030-asdf:  schbench(153747C):w=1.000,l=0.537,u=0.516
>  dbg_odd_dump: B: Q030-/: w=5.758,l=3.186,u=0.541,r=3.044 run=1 hrun=1 tgs=1.000 tgw=0.000
>  dbg_odd_dump: B: Q030-/:  asdf(C):w=5.758,l=3.092,u=0.516
>
> You can notice that B's pertask weight is 4.409 which is way higher
> than A's 2.779, and this is from Q014-asdf's contribution to Q014-/ is
> twice as high as it should be.  The root queue's runnable avg should

Are you sure that this is because of blocked load in group A ? it can
be that Q014-asdf has already have to wait before running and its load
still increase while runnable but not running .
IIUC your trace, group A has 2 running tasks and group B only one but
load_balance selects B because of its sgs->avg_load being higher. But
this can also happen even if runnable_load_avg of child cfs_rq was
propagated correctly in group entity because we can have situation
where a group A has only 1 task with higher load than 2 tasks on
groupB and even if blocked load is not taken into account, and
load_balance will select A.

IMHO, we should better improve load balance selection. I'm going to
add smarter group selection in load_balance. that's something we
should have already done but it was difficult without load/util_avg
propagation. it should be doable now

> only contain what's currently active but because we're scaling load
> avg which includes both active and blocked, we're ending up picking
> group B over A.
>
> This shows up in the total number of times we pick the wrong queue and
> thus latency.  I'm running the following script with the debug patch
> applied.
>
>   #!/bin/bash
>
>   date
>   cat /proc/self/cgroup
>
>   echo 1000 > /sys/module/fair/parameters/dbg_odd_nth
>   echo 0 > /sys/module/fair/parameters/dbg_odd_cnt
>
>   ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>
>   cat /sys/module/fair/parameters/dbg_odd_cnt
>
>
> With your patch applied, in the root cgroup,
>
>   Fri Apr 28 12:48:59 PDT 2017
>   0::/
>   Latency percentiles (usec)
>           50.0000th: 26
>           75.0000th: 63
>           90.0000th: 78
>           95.0000th: 88
>           *99.0000th: 707
>           99.5000th: 5096
>           99.9000th: 10352
>           min=0, max=13743
>   577
>
>
> In the /asdf cgroup,
>
>   Fri Apr 28 13:19:53 PDT 2017
>   0::/asdf
>   Latency percentiles (usec)
>           50.0000th: 35
>           75.0000th: 67
>           90.0000th: 81
>           95.0000th: 98
>           *99.0000th: 2212
>           99.5000th: 4536
>           99.9000th: 11024
>           min=0, max=13026
>   1708
>
>
> The last line is the number of times the load balancer picked a group
> w/o more than two schbench threads on a CPU over one w/.  Some number
> of these are expected as there are other threads and there are some
> plays in all the calculations but propgating avg or not propgating at
> all significantly increases the count and latency.
>
>> > The issue isn't about whether runnable_load_avg or load_avg should be
>> > used but the unexpected differences in the metrics that the load
>>
>> I think that's the root of the problem. I explain a bit more my view
>> on the other thread
>
> So, when picking the busiest group, the only thing which matters is
> the queue's runnable_load_avg, which should approximate the sum of all
> on-queue loads on that CPU.
>
> If we don't propagate or propagate load_avg, we're factoring in
> blocked avg of descendent cgroups into the root's runnable_load_avg
> which is obviously wrong.
>
> We can argue whether overriding a cfs_rq se's load_avg to the scaled
> runnable_load_avg of the cfs_rq is the right way to go or we should
> introduce a separate channel to propagate runnable_load_avg; however,
> it's clear that we need to fix runnable_load_avg propagation one way
> or another.

The minimum would be to not break load_avg

>
> The thing with cfs_rq se's load_avg is that, it isn't really used
> anywhere else AFAICS, so overriding it to the cfs_rq's
> runnable_load_avg isn't prettiest but doesn't really change anything.

load_avg is used for defining the share of each cfs_rq.

>
> Thanks.
>
> --
> tejun

  parent reply	other threads:[~2017-05-02  7:19 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-24 20:13 [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
2017-04-24 20:14 ` [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity Tejun Heo
2017-04-24 21:33   ` [PATCH v2 " Tejun Heo
2017-05-03 18:00     ` Peter Zijlstra
2017-05-03 21:45       ` Tejun Heo
2017-05-04  5:51         ` Peter Zijlstra
2017-05-04  6:21           ` Peter Zijlstra
2017-05-04  9:49             ` Dietmar Eggemann
2017-05-04 10:57               ` Peter Zijlstra
2017-05-04 17:39               ` Tejun Heo
2017-05-05 10:36                 ` Dietmar Eggemann
2017-05-04 10:26       ` Vincent Guittot
2017-04-25  8:35   ` [PATCH " Vincent Guittot
2017-04-25 18:12     ` Tejun Heo
2017-04-26 16:51       ` Vincent Guittot
2017-04-26 22:40         ` Tejun Heo
2017-04-27  7:00           ` Vincent Guittot
2017-05-01 14:17         ` Peter Zijlstra
2017-05-01 14:52           ` Peter Zijlstra
2017-05-01 21:56           ` Tejun Heo
2017-05-02  8:19             ` Peter Zijlstra
2017-05-02  8:30               ` Peter Zijlstra
2017-05-02 20:00                 ` Tejun Heo
2017-05-03  9:10                   ` Peter Zijlstra
2017-04-26 16:14   ` Vincent Guittot
2017-04-26 22:27     ` Tejun Heo
2017-04-27  8:59       ` Vincent Guittot
2017-04-28 17:46         ` Tejun Heo
2017-05-02  7:20           ` Vincent Guittot
2017-04-24 20:14 ` [PATCH 2/2] sched/fair: Always propagate runnable_load_avg Tejun Heo
2017-04-25  8:46   ` Vincent Guittot
2017-04-25  9:05     ` Vincent Guittot
2017-04-25 12:59       ` Vincent Guittot
2017-04-25 18:49         ` Tejun Heo
2017-04-25 20:49           ` Tejun Heo
2017-04-25 21:15             ` Chris Mason
2017-04-25 21:08           ` Tejun Heo
2017-04-26 10:21             ` Vincent Guittot
2017-04-27  0:30               ` Tejun Heo
2017-04-27  8:28                 ` Vincent Guittot
2017-04-28 16:14                   ` Tejun Heo
2017-05-02  6:56                     ` Vincent Guittot
2017-05-02 20:56                       ` Tejun Heo
2017-05-03  7:25                         ` Vincent Guittot
2017-05-03  7:54                           ` Vincent Guittot
2017-04-26 18:12   ` Vincent Guittot
2017-04-26 22:52     ` Tejun Heo
2017-04-27  8:29       ` Vincent Guittot
2017-04-28 20:33         ` Tejun Heo
2017-04-28 20:38           ` Tejun Heo
2017-05-01 15:56           ` Peter Zijlstra
2017-05-02 22:01             ` Tejun Heo
2017-05-02  7:18           ` Vincent Guittot [this message]
2017-05-02 13:26             ` Vincent Guittot
2017-05-02 22:37               ` Tejun Heo
2017-05-02 21:50             ` Tejun Heo
2017-05-03  7:34               ` Vincent Guittot
2017-05-03  9:37                 ` Peter Zijlstra
2017-05-03 10:37                   ` Vincent Guittot
2017-05-03 13:09                     ` Peter Zijlstra
2017-05-03 21:49                       ` Tejun Heo
2017-05-04  8:19                         ` Vincent Guittot
2017-05-04 17:43                           ` Tejun Heo
2017-05-04 19:02                             ` Vincent Guittot
2017-05-04 19:04                               ` Tejun Heo
2017-04-24 21:35 ` [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities Tejun Heo
2017-04-24 21:48   ` Peter Zijlstra
2017-04-24 22:54     ` Tejun Heo
2017-04-25 21:09   ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAKfTPtB3PectZVm_z6NhSszXwG-G=y-J1Y5Rxk48bLxHQ8mx4g@mail.gmail.com' \
    --to=vincent.guittot@linaro.org \
    --cc=clm@fb.com \
    --cc=efault@gmx.de \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=tj@kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.