linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Vincent Guittot <vincent.guittot@linaro.org>
To: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mike Galbraith <efault@gmx.de>, Paul Turner <pjt@google.com>,
	Chris Mason <clm@fb.com>,
	kernel-team@fb.com
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
Date: Tue, 25 Apr 2017 11:05:05 +0200	[thread overview]
Message-ID: <CAKfTPtBLYKXyEYpyTWDRakP8zwe0z=_2HT3Lg7UM2PdQUF3kAA@mail.gmail.com> (raw)
In-Reply-To: <CAKfTPtCf0JUPubBtjY25Lr6J1aihUMjs3HEw+8MXcCwpuku7eQ@mail.gmail.com>

On 25 April 2017 at 10:46, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
>> We noticed that with cgroup CPU controller in use, the scheduling
>> latency gets wonky regardless of nesting level or weight
>> configuration.  This is easily reproducible with Chris Mason's
>> schbench[1].
>>
>> All tests are run on a single socket, 16 cores, 32 threads machine.
>> While the machine is mostly idle, it isn't completey.  There's always
>> some variable management load going on.  The command used is
>>
>>  schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>
>> which measures scheduling latency with 32 threads each of which
>> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
>> representative result when running from the root cgroup.
>>
>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>  Latency percentiles (usec)
>>          50.0000th: 26
>>          75.0000th: 62
>>          90.0000th: 74
>>          95.0000th: 86
>>          *99.0000th: 887
>>          99.5000th: 3692
>>          99.9000th: 10832
>>          min=0, max=13374
>>
>> The following is inside a first level CPU cgroup with the maximum
>> weight.
>>
>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>  Latency percentiles (usec)
>>          50.0000th: 31
>>          75.0000th: 65
>>          90.0000th: 71
>>          95.0000th: 91
>>          *99.0000th: 7288
>>          99.5000th: 10352
>>          99.9000th: 12496
>>          min=0, max=13023
>>
>> Note the drastic increase in p99 scheduling latency.  After
>> investigation, it turned out that the update_sd_lb_stats(), which is
>> used by load_balance() to pick the most loaded group, was often
>> picking the wrong group.  A CPU which has one schbench running and
>> another queued wouldn't report the correspondingly higher
>> weighted_cpuload() and get looked over as the target of load
>> balancing.
>>
>> weighted_cpuload() is the root cfs_rq's runnable_load_avg which is the
>> sum of the load_avg of all queued sched_entities.  Without cgroups or
>> at the root cgroup, each task's load_avg contributes directly to the
>> sum.  When a task wakes up or goes to sleep, the change is immediately
>> reflected on runnable_load_avg which in turn affects load balancing.
>>
>> However, when CPU cgroup is in use, a nesting cfs_rq blocks this
>> immediate reflection.  When a task wakes up inside a cgroup, the
>> nested cfs_rq's runnable_load_avg is updated but doesn't get
>> propagated to the parent.  It only affects the matching sched_entity's
>> load_avg over time which then gets propagated to the runnable_load_avg
>> at that level.  This makes weighted_cpuload() often temporarily out of
>> sync leading to suboptimal behavior of load_balance() and increase in
>> scheduling latencies as shown above.
>>
>> This patch fixes the issue by updating propagate_entity_load_avg() to
>> always propagate to the parent's runnable_load_avg.  Combined with the
>> previous patch, this keeps a cfs_rq's runnable_load_avg always the sum
>> of the scaled loads of all tasks queued below removing the artifacts
>> from nesting cfs_rqs.  The following is from inside three levels of
>> nesting with the patch applied.
>
> So you are changing the purpose of propagate_entity_load_avg which
> aims to propagate load_avg/util_avg changes only when a task migrate
> and you also want to propagate the enqueue/dequeue in the parent
> cfs_rq->runnable_load_avg

In fact you want that sched_entity load_avg reflects
cfs_rq->runnable_load_avg and not cfs_rq->avg.load_avg

>
>>
>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>  Latency percentiles (usec)
>>          50.0000th: 40
>>          75.0000th: 71
>>          90.0000th: 89
>>          95.0000th: 108
>>          *99.0000th: 679
>>          99.5000th: 3500
>>          99.9000th: 10960
>>          min=0, max=13790
>>
>> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>>
>> Signed-off-by: Tejun Heo <tj@kernel.org>
>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>> Cc: Ingo Molnar <mingo@redhat.com>
>> Cc: Peter Zijlstra <peterz@infradead.org>
>> Cc: Mike Galbraith <efault@gmx.de>
>> Cc: Paul Turner <pjt@google.com>
>> Cc: Chris Mason <clm@fb.com>
>> ---
>>  kernel/sched/fair.c |   34 +++++++++++++++++++++-------------
>>  1 file changed, 21 insertions(+), 13 deletions(-)
>>
>> --- a/kernel/sched/fair.c
>> +++ b/kernel/sched/fair.c
>> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>>
>>  /* Take into account change of load of a child task group */
>>  static inline void
>> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
>> +                  bool propagate_avg)
>>  {
>>         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>>         long load = 0, delta;
>> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>>         se->avg.load_avg = load;
>>         se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>>
>> -       /* Update parent cfs_rq load */
>> -       add_positive(&cfs_rq->avg.load_avg, delta);
>> -       cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>> +       if (propagate_avg) {
>> +               /* Update parent cfs_rq load */
>> +               add_positive(&cfs_rq->avg.load_avg, delta);
>> +               cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>> +       }
>>
>>         /*
>>          * If the sched_entity is already enqueued, we also have to update the
>> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>>  /* Update task and its cfs_rq load average */
>>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>>  {
>> -       struct cfs_rq *cfs_rq;
>> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>> +       bool propagate_avg;
>>
>>         if (entity_is_task(se))
>>                 return 0;
>>
>> -       if (!test_and_clear_tg_cfs_propagate(se))
>> -               return 0;
>> -
>> -       cfs_rq = cfs_rq_of(se);
>> +       propagate_avg = test_and_clear_tg_cfs_propagate(se);
>>
>> -       set_tg_cfs_propagate(cfs_rq);
>> +       /*
>> +        * We want to keep @cfs_rq->runnable_load_avg always in sync so
>> +        * that the load balancer can accurately determine the busiest CPU
>> +        * regardless of cfs_rq nesting.
>> +        */
>> +       update_tg_cfs_load(cfs_rq, se, propagate_avg);
>>
>> -       update_tg_cfs_util(cfs_rq, se);
>> -       update_tg_cfs_load(cfs_rq, se);
>> +       if (propagate_avg) {
>> +               set_tg_cfs_propagate(cfs_rq);
>> +               update_tg_cfs_util(cfs_rq, se);
>> +       }
>>
>> -       return 1;
>> +       return propagate_avg;
>>  }
>>
>>  #else /* CONFIG_FAIR_GROUP_SCHED */

  reply	other threads:[~2017-04-25  9:06 UTC|newest]

Thread overview: 69+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2017-04-24 20:13 [RFC PATCHSET] sched/fair: fix load balancer behavior when cgroup is in use Tejun Heo
2017-04-24 20:14 ` [PATCH 1/2] sched/fair: Fix how load gets propagated from cfs_rq to its sched_entity Tejun Heo
2017-04-24 21:33   ` [PATCH v2 " Tejun Heo
2017-05-03 18:00     ` Peter Zijlstra
2017-05-03 21:45       ` Tejun Heo
2017-05-04  5:51         ` Peter Zijlstra
2017-05-04  6:21           ` Peter Zijlstra
2017-05-04  9:49             ` Dietmar Eggemann
2017-05-04 10:57               ` Peter Zijlstra
2017-05-04 17:39               ` Tejun Heo
2017-05-05 10:36                 ` Dietmar Eggemann
2017-05-04 10:26       ` Vincent Guittot
2017-04-25  8:35   ` [PATCH " Vincent Guittot
2017-04-25 18:12     ` Tejun Heo
2017-04-26 16:51       ` Vincent Guittot
2017-04-26 22:40         ` Tejun Heo
2017-04-27  7:00           ` Vincent Guittot
2017-05-01 14:17         ` Peter Zijlstra
2017-05-01 14:52           ` Peter Zijlstra
2017-05-01 21:56           ` Tejun Heo
2017-05-02  8:19             ` Peter Zijlstra
2017-05-02  8:30               ` Peter Zijlstra
2017-05-02 20:00                 ` Tejun Heo
2017-05-03  9:10                   ` Peter Zijlstra
2017-04-26 16:14   ` Vincent Guittot
2017-04-26 22:27     ` Tejun Heo
2017-04-27  8:59       ` Vincent Guittot
2017-04-28 17:46         ` Tejun Heo
2017-05-02  7:20           ` Vincent Guittot
2017-04-24 20:14 ` [PATCH 2/2] sched/fair: Always propagate runnable_load_avg Tejun Heo
2017-04-25  8:46   ` Vincent Guittot
2017-04-25  9:05     ` Vincent Guittot [this message]
2017-04-25 12:59       ` Vincent Guittot
2017-04-25 18:49         ` Tejun Heo
2017-04-25 20:49           ` Tejun Heo
2017-04-25 21:15             ` Chris Mason
2017-04-25 21:08           ` Tejun Heo
2017-04-26 10:21             ` Vincent Guittot
2017-04-27  0:30               ` Tejun Heo
2017-04-27  8:28                 ` Vincent Guittot
2017-04-28 16:14                   ` Tejun Heo
2017-05-02  6:56                     ` Vincent Guittot
2017-05-02 20:56                       ` Tejun Heo
2017-05-03  7:25                         ` Vincent Guittot
2017-05-03  7:54                           ` Vincent Guittot
2017-04-26 18:12   ` Vincent Guittot
2017-04-26 22:52     ` Tejun Heo
2017-04-27  8:29       ` Vincent Guittot
2017-04-28 20:33         ` Tejun Heo
2017-04-28 20:38           ` Tejun Heo
2017-05-01 15:56           ` Peter Zijlstra
2017-05-02 22:01             ` Tejun Heo
2017-05-02  7:18           ` Vincent Guittot
2017-05-02 13:26             ` Vincent Guittot
2017-05-02 22:37               ` Tejun Heo
2017-05-02 21:50             ` Tejun Heo
2017-05-03  7:34               ` Vincent Guittot
2017-05-03  9:37                 ` Peter Zijlstra
2017-05-03 10:37                   ` Vincent Guittot
2017-05-03 13:09                     ` Peter Zijlstra
2017-05-03 21:49                       ` Tejun Heo
2017-05-04  8:19                         ` Vincent Guittot
2017-05-04 17:43                           ` Tejun Heo
2017-05-04 19:02                             ` Vincent Guittot
2017-05-04 19:04                               ` Tejun Heo
2017-04-24 21:35 ` [PATCH 3/2] sched/fair: Skip __update_load_avg() on cfs_rq sched_entities Tejun Heo
2017-04-24 21:48   ` Peter Zijlstra
2017-04-24 22:54     ` Tejun Heo
2017-04-25 21:09   ` Tejun Heo

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to='CAKfTPtBLYKXyEYpyTWDRakP8zwe0z=_2HT3Lg7UM2PdQUF3kAA@mail.gmail.com' \
    --to=vincent.guittot@linaro.org \
    --cc=clm@fb.com \
    --cc=efault@gmx.de \
    --cc=kernel-team@fb.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=peterz@infradead.org \
    --cc=pjt@google.com \
    --cc=tj@kernel.org \
    --cc=torvalds@linux-foundation.org \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).