From: Vincent Guittot <vincent.guittot@linaro.org>
To: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mike Galbraith <efault@gmx.de>, Paul Turner <pjt@google.com>,
	Chris Mason <clm@fb.com>,
	kernel-team@fb.com
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
Date: Tue, 25 Apr 2017 14:59:18 +0200
Message-ID: <CAKfTPtAJmJsT2=DbZFtK2aBVkNKbksueuDs_vCzsvWPR-_Aebg@mail.gmail.com>
In-Reply-To: <CAKfTPtBLYKXyEYpyTWDRakP8zwe0z=_2HT3Lg7UM2PdQUF3kAA@mail.gmail.com>

On 25 April 2017 at 11:05, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 25 April 2017 at 10:46, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>> On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
>>> We noticed that with cgroup CPU controller in use, the scheduling
>>> latency gets wonky regardless of nesting level or weight
>>> configuration.  This is easily reproducible with Chris Mason's
>>> schbench[1].
>>>
>>> All tests are run on a single-socket, 16-core, 32-thread machine.
>>> While the machine is mostly idle, it isn't completely idle; there's
>>> always some variable management load going on.  The command used is
>>>
>>>  schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>
>>> which measures scheduling latency with 32 threads each of which
>>> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
>>> representative result when running from the root cgroup.
>>>
>>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>  Latency percentiles (usec)
>>>          50.0000th: 26
>>>          75.0000th: 62
>>>          90.0000th: 74
>>>          95.0000th: 86
>>>          *99.0000th: 887
>>>          99.5000th: 3692
>>>          99.9000th: 10832
>>>          min=0, max=13374
>>>
>>> The following is inside a first level CPU cgroup with the maximum
>>> weight.
>>>
>>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>  Latency percentiles (usec)
>>>          50.0000th: 31
>>>          75.0000th: 65
>>>          90.0000th: 71
>>>          95.0000th: 91
>>>          *99.0000th: 7288
>>>          99.5000th: 10352
>>>          99.9000th: 12496
>>>          min=0, max=13023
>>>
>>> Note the drastic increase in p99 scheduling latency.  After
>>> investigation, it turned out that update_sd_lb_stats(), which is
>>> used by load_balance() to pick the most loaded group, was often
>>> picking the wrong group.  A CPU which has one schbench thread running
>>> and another queued wouldn't report the correspondingly higher
>>> weighted_cpuload() and so would get passed over as the target of
>>> load balancing.
>>>
>>> weighted_cpuload() is the root cfs_rq's runnable_load_avg, which is
>>> the sum of the load_avg of all queued sched_entities.  Without cgroups
>>> or at the root cgroup, each task's load_avg contributes directly to
>>> the sum.  When a task wakes up or goes to sleep, the change is
>>> immediately reflected in runnable_load_avg, which in turn affects
>>> load balancing.
>>>
>>> However, when the CPU cgroup controller is in use, a nested cfs_rq
>>> blocks this immediate reflection.  When a task wakes up inside a
>>> cgroup, the nested cfs_rq's runnable_load_avg is updated but doesn't
>>> get propagated to the parent.  It only affects the matching
>>> sched_entity's load_avg over time, which then gets propagated to the
>>> runnable_load_avg at that level.  This often leaves weighted_cpuload()
>>> temporarily out of sync, leading to suboptimal load_balance() decisions
>>> and the increase in scheduling latencies shown above.
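
To illustrate the effect being described, here is a toy userspace model
(the names loosely mirror kernel/sched/fair.c, but the structures, the
1024 task load and the per-tick 1/2 convergence step are crude
simplifications, not the real PELT math):

#include <stdio.h>

struct toy_cfs_rq {
	long runnable_load_avg;	/* what weighted_cpuload() reads at the root */
};

struct toy_group_se {
	long load_avg;		/* group entity's contribution to its parent */
};

int main(void)
{
	struct toy_cfs_rq root = { 0 }, nested = { 0 };
	struct toy_group_se gse = { 0 };  /* entity for "nested" inside "root" */
	const long task_load = 1024;      /* load_avg of a freshly woken task */
	int tick;

	/* Root cgroup case: the wakeup hits the root sum immediately. */
	root.runnable_load_avg += task_load;
	printf("root wakeup          : root runnable_load_avg = %ld\n",
	       root.runnable_load_avg);
	root.runnable_load_avg -= task_load;	/* reset for the nested case */

	/* Nested case: only the nested cfs_rq is updated at wakeup time. */
	nested.runnable_load_avg += task_load;

	/*
	 * The group entity's load_avg, and hence the root's sum, only
	 * catches up over successive update periods (a crude 1/2 step per
	 * tick here, purely for illustration).
	 */
	for (tick = 0; tick < 5; tick++) {
		long delta = (nested.runnable_load_avg - gse.load_avg) / 2;

		gse.load_avg += delta;
		root.runnable_load_avg += delta;
		printf("nested wakeup, tick %d: root runnable_load_avg = %ld\n",
		       tick, root.runnable_load_avg);
	}
	return 0;
}

The root sum jumps to 1024 at once in the first case but only creeps
towards it in the second, which is the window in which load_balance()
can pick the wrong group.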
>>>
>>> This patch fixes the issue by updating propagate_entity_load_avg() to
>>> always propagate to the parent's runnable_load_avg.  Combined with the
>>> previous patch, this keeps a cfs_rq's runnable_load_avg always equal to
>>> the sum of the scaled loads of all tasks queued below it, removing the
>>> artifacts from nested cfs_rqs.  The following is from inside three
>>> levels of nesting with the patch applied.
>>
>> So you are changing the purpose of propagate_entity_load_avg(), which
>> aims to propagate load_avg/util_avg changes only when a task migrates,
>> and you also want to propagate the enqueue/dequeue into the parent
>> cfs_rq->runnable_load_avg.
>
> In fact you want the sched_entity's load_avg to reflect
> cfs_rq->runnable_load_avg and not cfs_rq->avg.load_avg.

I have run a quick test with your patches and schbench on my platform.
I haven't been able to reproduce your regression, but my platform is
quite different from yours (only 8 cores, without SMT).
More importantly, the parent cfs_rq->runnable_load_avg never reaches 0
(or close to 0) when it is idle. Instead, it keeps a non-zero
runnable_load_avg (this is not due to rounding in the computation),
whereas runnable_load_avg should be 0.
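
The residue is easy to see in the per-cfs_rq stats. A minimal sketch of
how to dump them, assuming CONFIG_SCHED_DEBUG is enabled and that
/proc/sched_debug exposes the per-cfs_rq .runnable_load_avg fields (as
it does on kernels of this era):

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/sched_debug", "r");
	char line[512];

	if (!f) {
		perror("fopen /proc/sched_debug");
		return 1;
	}
	/* Print each cfs_rq header and its runnable_load_avg line. */
	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "cfs_rq[") || strstr(line, "runnable_load_avg"))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}

On an otherwise idle machine, every group's value would be expected to
decay to 0.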

Just out of curiosity, is your regression still there if you disable
SMT/hyperthreading on your platform?

Regards,
Vincent

>
>>
>>>
>>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>  Latency percentiles (usec)
>>>          50.0000th: 40
>>>          75.0000th: 71
>>>          90.0000th: 89
>>>          95.0000th: 108
>>>          *99.0000th: 679
>>>          99.5000th: 3500
>>>          99.9000th: 10960
>>>          min=0, max=13790
>>>
>>> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>>>
>>> Signed-off-by: Tejun Heo <tj@kernel.org>
>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Mike Galbraith <efault@gmx.de>
>>> Cc: Paul Turner <pjt@google.com>
>>> Cc: Chris Mason <clm@fb.com>
>>> ---
>>>  kernel/sched/fair.c |   34 +++++++++++++++++++++-------------
>>>  1 file changed, 21 insertions(+), 13 deletions(-)
>>>
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>>>
>>>  /* Take into account change of load of a child task group */
>>>  static inline void
>>> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>> +                  bool propagate_avg)
>>>  {
>>>         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>>>         long load = 0, delta;
>>> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>>>         se->avg.load_avg = load;
>>>         se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>>>
>>> -       /* Update parent cfs_rq load */
>>> -       add_positive(&cfs_rq->avg.load_avg, delta);
>>> -       cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>>> +       if (propagate_avg) {
>>> +               /* Update parent cfs_rq load */
>>> +               add_positive(&cfs_rq->avg.load_avg, delta);
>>> +               cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>>> +       }
>>>
>>>         /*
>>>          * If the sched_entity is already enqueued, we also have to update the
>>> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>>>  /* Update task and its cfs_rq load average */
>>>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>>>  {
>>> -       struct cfs_rq *cfs_rq;
>>> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>> +       bool propagate_avg;
>>>
>>>         if (entity_is_task(se))
>>>                 return 0;
>>>
>>> -       if (!test_and_clear_tg_cfs_propagate(se))
>>> -               return 0;
>>> -
>>> -       cfs_rq = cfs_rq_of(se);
>>> +       propagate_avg = test_and_clear_tg_cfs_propagate(se);
>>>
>>> -       set_tg_cfs_propagate(cfs_rq);
>>> +       /*
>>> +        * We want to keep @cfs_rq->runnable_load_avg always in sync so
>>> +        * that the load balancer can accurately determine the busiest CPU
>>> +        * regardless of cfs_rq nesting.
>>> +        */
>>> +       update_tg_cfs_load(cfs_rq, se, propagate_avg);
>>>
>>> -       update_tg_cfs_util(cfs_rq, se);
>>> -       update_tg_cfs_load(cfs_rq, se);
>>> +       if (propagate_avg) {
>>> +               set_tg_cfs_propagate(cfs_rq);
>>> +               update_tg_cfs_util(cfs_rq, se);
>>> +       }
>>>
>>> -       return 1;
>>> +       return propagate_avg;
>>>  }
>>>
>>>  #else /* CONFIG_FAIR_GROUP_SCHED */
