From: Vincent Guittot
Date: Tue, 25 Apr 2017 10:46:37 +0200
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
To: Tejun Heo
Cc: Ingo Molnar, Peter Zijlstra, linux-kernel, Linus Torvalds,
    Mike Galbraith, Paul Turner, Chris Mason, kernel-team@fb.com
In-Reply-To: <20170424201444.GC14169@wtj.duckdns.org>
References: <20170424201344.GA14169@wtj.duckdns.org> <20170424201444.GC14169@wtj.duckdns.org>

On 24 April 2017 at 22:14, Tejun Heo wrote:
> We noticed that with the cgroup CPU controller in use, the scheduling
> latency gets wonky regardless of nesting level or weight
> configuration.  This is easily reproducible with Chris Mason's
> schbench[1].
>
> All tests are run on a single-socket, 16-core, 32-thread machine.
> While the machine is mostly idle, it isn't completely idle; there's
> always some variable management load going on.  The command used is
>
>   schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>
> which measures scheduling latency with 32 threads, each of which
> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
> representative result when running from the root cgroup.
>
>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>   Latency percentiles (usec)
>           50.0000th: 26
>           75.0000th: 62
>           90.0000th: 74
>           95.0000th: 86
>           *99.0000th: 887
>           99.5000th: 3692
>           99.9000th: 10832
>           min=0, max=13374
>
> The following is from inside a first-level CPU cgroup with the maximum
> weight.
>
>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>   Latency percentiles (usec)
>           50.0000th: 31
>           75.0000th: 65
>           90.0000th: 71
>           95.0000th: 91
>           *99.0000th: 7288
>           99.5000th: 10352
>           99.9000th: 12496
>           min=0, max=13023
>
> Note the drastic increase in p99 scheduling latency.  After
> investigation, it turned out that update_sd_lb_stats(), which is used
> by load_balance() to pick the most loaded group, was often picking the
> wrong group.  A CPU which has one schbench thread running and another
> queued wouldn't report the correspondingly higher weighted_cpuload()
> and would get overlooked as the target of load balancing.
>
> weighted_cpuload() is the root cfs_rq's runnable_load_avg, which is the
> sum of the load_avg of all queued sched_entities.  Without cgroups, or
> at the root cgroup, each task's load_avg contributes directly to that
> sum.  When a task wakes up or goes to sleep, the change is immediately
> reflected in runnable_load_avg, which in turn affects load balancing.
>
> However, when the CPU cgroup is in use, a nested cfs_rq blocks this
> immediate reflection.  When a task wakes up inside a cgroup, the
> nested cfs_rq's runnable_load_avg is updated but doesn't get
> propagated to the parent.  It only affects the matching sched_entity's
> load_avg over time, which then gets propagated to the runnable_load_avg
> at that level.  This often leaves weighted_cpuload() temporarily out of
> sync, leading to suboptimal behavior of load_balance() and the increase
> in scheduling latencies shown above.
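To make the lag described above concrete, here is a minimal stand-alone
model, not kernel code: the struct, its field names and the 1024 value are
purely illustrative assumptions. It only shows that a wakeup inside a
nested cfs_rq is visible to that cfs_rq at once, while the root still sees
the group entity's pre-wakeup load_avg.

/* Illustration only -- not the kernel's implementation. */
#include <stdio.h>

struct model_cfs_rq {
        long runnable_load_avg;   /* sum of queued entities' load_avg */
        long group_se_load_avg;   /* load_avg of this group's entity in its parent */
};

int main(void)
{
        struct model_cfs_rq root  = { 0, 0 };
        struct model_cfs_rq child = { 0, 0 };
        long task_load_avg = 1024;        /* assumed contribution of one woken task */

        /* Wakeup inside the child cgroup: reflected there immediately. */
        child.runnable_load_avg += task_load_avg;

        /* The root only tracks the group entity's load_avg, still at its old value. */
        root.runnable_load_avg += child.group_se_load_avg;

        printf("child runnable_load_avg = %ld\n", child.runnable_load_avg); /* 1024 */
        printf("root  runnable_load_avg = %ld\n", root.runnable_load_avg);  /* still 0 */
        return 0;
}

In the real code the group entity's load_avg does catch up over time via
PELT; the point is only that the change is not visible at the root at
wakeup time, which is what load_balance() samples.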
>
> This patch fixes the issue by updating propagate_entity_load_avg() to
> always propagate to the parent's runnable_load_avg.  Combined with the
> previous patch, this keeps a cfs_rq's runnable_load_avg always equal to
> the sum of the scaled loads of all tasks queued below it, removing the
> artifacts from nested cfs_rqs.  The following is from inside three
> levels of nesting with the patch applied.

So you are changing the purpose of propagate_entity_load_avg(), which
aims to propagate load_avg/util_avg changes only when a task migrates,
and you also want to propagate the enqueue/dequeue into the parent
cfs_rq->runnable_load_avg.

>
>   # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>   Latency percentiles (usec)
>           50.0000th: 40
>           75.0000th: 71
>           90.0000th: 89
>           95.0000th: 108
>           *99.0000th: 679
>           99.5000th: 3500
>           99.9000th: 10960
>           min=0, max=13790
>
> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>
> Signed-off-by: Tejun Heo
> Cc: Vincent Guittot
> Cc: Ingo Molnar
> Cc: Peter Zijlstra
> Cc: Mike Galbraith
> Cc: Paul Turner
> Cc: Chris Mason
> ---
>  kernel/sched/fair.c | 34 +++++++++++++++++++++-------------
>  1 file changed, 21 insertions(+), 13 deletions(-)
>
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq
>
>  /* Take into account change of load of a child task group */
>  static inline void
> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
> +                   bool propagate_avg)
>  {
>          struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>          long load = 0, delta;
> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq
>          se->avg.load_avg = load;
>          se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>
> -        /* Update parent cfs_rq load */
> -        add_positive(&cfs_rq->avg.load_avg, delta);
> -        cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> +        if (propagate_avg) {
> +                /* Update parent cfs_rq load */
> +                add_positive(&cfs_rq->avg.load_avg, delta);
> +                cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
> +        }
>
>          /*
>           * If the sched_entity is already enqueued, we also have to update the
> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>  /* Update task and its cfs_rq load average */
>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>  {
> -        struct cfs_rq *cfs_rq;
> +        struct cfs_rq *cfs_rq = cfs_rq_of(se);
> +        bool propagate_avg;
>
>          if (entity_is_task(se))
>                  return 0;
>
> -        if (!test_and_clear_tg_cfs_propagate(se))
> -                return 0;
> -
> -        cfs_rq = cfs_rq_of(se);
> +        propagate_avg = test_and_clear_tg_cfs_propagate(se);
>
> -        set_tg_cfs_propagate(cfs_rq);
> +        /*
> +         * We want to keep @cfs_rq->runnable_load_avg always in sync so
> +         * that the load balancer can accurately determine the busiest CPU
> +         * regardless of cfs_rq nesting.
> +         */
> +        update_tg_cfs_load(cfs_rq, se, propagate_avg);
>
> -        update_tg_cfs_util(cfs_rq, se);
> -        update_tg_cfs_load(cfs_rq, se);
> +        if (propagate_avg) {
> +                set_tg_cfs_propagate(cfs_rq);
> +                update_tg_cfs_util(cfs_rq, se);
> +        }
>
> -        return 1;
> +        return propagate_avg;
>  }
>
>  #else /* CONFIG_FAIR_GROUP_SCHED */
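For contrast with the sketch earlier in this mail, here is the behaviour
the patch description argues for, again with assumed names rather than
the functions in the diff above: the enqueue/dequeue delta is applied at
every level up to the root, so the root's runnable_load_avg stays the sum
of the task loads queued below it.

/* Illustration only -- assumed names, not the kernel's implementation. */
#include <stdio.h>

struct model_cfs_rq {
        struct model_cfs_rq *parent;
        long runnable_load_avg;
};

/* Apply the same runnable delta at every level up to the root. */
static void propagate_runnable_delta(struct model_cfs_rq *cfs_rq, long delta)
{
        for (; cfs_rq; cfs_rq = cfs_rq->parent)
                cfs_rq->runnable_load_avg += delta;
}

int main(void)
{
        struct model_cfs_rq root  = { NULL, 0 };
        struct model_cfs_rq child = { &root, 0 };

        propagate_runnable_delta(&child, 1024);   /* a task is enqueued in the child */
        printf("child = %ld, root = %ld\n",
               child.runnable_load_avg, root.runnable_load_avg);  /* 1024, 1024 */

        propagate_runnable_delta(&child, -1024);  /* the task is dequeued again */
        printf("child = %ld, root = %ld\n",
               child.runnable_load_avg, root.runnable_load_avg);  /* 0, 0 */
        return 0;
}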