From: Vincent Guittot <vincent.guittot@linaro.org>
To: Tejun Heo <tj@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	linux-kernel <linux-kernel@vger.kernel.org>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Mike Galbraith <efault@gmx.de>, Paul Turner <pjt@google.com>,
	Chris Mason <clm@fb.com>,
	kernel-team@fb.com
Subject: Re: [PATCH 2/2] sched/fair: Always propagate runnable_load_avg
Date: Tue, 25 Apr 2017 14:59:18 +0200
Message-ID: <CAKfTPtAJmJsT2=DbZFtK2aBVkNKbksueuDs_vCzsvWPR-_Aebg@mail.gmail.com>
In-Reply-To: <CAKfTPtBLYKXyEYpyTWDRakP8zwe0z=_2HT3Lg7UM2PdQUF3kAA@mail.gmail.com>

On 25 April 2017 at 11:05, Vincent Guittot <vincent.guittot@linaro.org> wrote:
> On 25 April 2017 at 10:46, Vincent Guittot <vincent.guittot@linaro.org> wrote:
>> On 24 April 2017 at 22:14, Tejun Heo <tj@kernel.org> wrote:
>>> We noticed that with cgroup CPU controller in use, the scheduling
>>> latency gets wonky regardless of nesting level or weight
>>> configuration.  This is easily reproducible with Chris Mason's
>>> schbench[1].
>>>
>>> All tests are run on a single-socket, 16-core, 32-thread machine.
>>> While the machine is mostly idle, it isn't completely idle; there's
>>> always some variable management load going on.  The command used is
>>>
>>>  schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>
>>> which measures scheduling latency with 32 threads each of which
>>> repeatedly runs for 15ms and then sleeps for 10ms.  Here's a
>>> representative result when running from the root cgroup.
>>>
>>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>  Latency percentiles (usec)
>>>          50.0000th: 26
>>>          75.0000th: 62
>>>          90.0000th: 74
>>>          95.0000th: 86
>>>          *99.0000th: 887
>>>          99.5000th: 3692
>>>          99.9000th: 10832
>>>          min=0, max=13374
>>>
>>> The following is inside a first level CPU cgroup with the maximum
>>> weight.
>>>
>>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>  Latency percentiles (usec)
>>>          50.0000th: 31
>>>          75.0000th: 65
>>>          90.0000th: 71
>>>          95.0000th: 91
>>>          *99.0000th: 7288
>>>          99.5000th: 10352
>>>          99.9000th: 12496
>>>          min=0, max=13023
>>>
>>> Note the drastic increase in p99 scheduling latency.  After
>>> investigation, it turned out that update_sd_lb_stats(), which is
>>> used by load_balance() to pick the most loaded group, was often
>>> picking the wrong group.  A CPU which has one schbench thread running
>>> and another queued wouldn't report the correspondingly higher
>>> weighted_cpuload() and so would get passed over as the target of
>>> load balancing.
>>>
>>> weighted_cpuload() is the root cfs_rq's runnable_load_avg, which is
>>> the sum of the load_avg of all queued sched_entities.  Without cgroups
>>> or at the root cgroup, each task's load_avg contributes directly to
>>> the sum.  When a task wakes up or goes to sleep, the change is
>>> immediately reflected in runnable_load_avg, which in turn affects
>>> load balancing.
>>>
>>> However, when the CPU cgroup controller is in use, a nested cfs_rq
>>> blocks this immediate reflection.  When a task wakes up inside a
>>> cgroup, the nested cfs_rq's runnable_load_avg is updated but doesn't
>>> get propagated to the parent.  It only affects the matching
>>> sched_entity's load_avg over time, which then gets propagated to the
>>> runnable_load_avg at that level.  This often leaves weighted_cpuload()
>>> temporarily out of sync, leading to suboptimal load_balance() decisions
>>> and the increase in scheduling latencies shown above.
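
To illustrate the effect being described, here is a toy userspace model
(the names loosely mirror kernel/sched/fair.c, but the structures, the
1024 task load and the per-tick 1/2 convergence step are crude
simplifications, not the real PELT math):

#include <stdio.h>

struct toy_cfs_rq {
	long runnable_load_avg;	/* what weighted_cpuload() reads at the root */
};

struct toy_group_se {
	long load_avg;		/* group entity's contribution to its parent */
};

int main(void)
{
	struct toy_cfs_rq root = { 0 }, nested = { 0 };
	struct toy_group_se gse = { 0 };  /* entity for "nested" inside "root" */
	const long task_load = 1024;      /* load_avg of a freshly woken task */
	int tick;

	/* Root cgroup case: the wakeup hits the root sum immediately. */
	root.runnable_load_avg += task_load;
	printf("root wakeup          : root runnable_load_avg = %ld\n",
	       root.runnable_load_avg);
	root.runnable_load_avg -= task_load;	/* reset for the nested case */

	/* Nested case: only the nested cfs_rq is updated at wakeup time. */
	nested.runnable_load_avg += task_load;

	/*
	 * The group entity's load_avg, and hence the root's sum, only
	 * catches up over successive update periods (a crude 1/2 step per
	 * tick here, purely for illustration).
	 */
	for (tick = 0; tick < 5; tick++) {
		long delta = (nested.runnable_load_avg - gse.load_avg) / 2;

		gse.load_avg += delta;
		root.runnable_load_avg += delta;
		printf("nested wakeup, tick %d: root runnable_load_avg = %ld\n",
		       tick, root.runnable_load_avg);
	}
	return 0;
}

The root sum jumps to 1024 at once in the first case but only creeps
towards it in the second, which is the window in which load_balance()
can pick the wrong group.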
>>>
>>> This patch fixes the issue by updating propagate_entity_load_avg() to
>>> always propagate to the parent's runnable_load_avg.  Combined with the
>>> previous patch, this keeps a cfs_rq's runnable_load_avg always equal to
>>> the sum of the scaled loads of all tasks queued below it, removing the
>>> artifacts from nested cfs_rqs.  The following is from inside three
>>> levels of nesting with the patch applied.
>>
>> So you are changing the purpose of propagate_entity_load_avg(), which
>> aims to propagate load_avg/util_avg changes only when a task migrates,
>> and you also want to propagate the enqueue/dequeue into the parent
>> cfs_rq->runnable_load_avg.
>
> In fact you want the sched_entity's load_avg to reflect
> cfs_rq->runnable_load_avg and not cfs_rq->avg.load_avg.

I have run a quick test with your patches and schbench on my platform.
I haven't been able to reproduce your regression, but my platform is
quite different from yours (only 8 cores, without SMT).
More importantly, the parent cfs_rq->runnable_load_avg never reaches 0
(or close to 0) when it is idle. Instead, it keeps a non-zero
runnable_load_avg (this is not due to rounding in the computation),
whereas runnable_load_avg should be 0.
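
The residue is easy to see in the per-cfs_rq stats. A minimal sketch of
how to dump them, assuming CONFIG_SCHED_DEBUG is enabled and that
/proc/sched_debug exposes the per-cfs_rq .runnable_load_avg fields (as
it does on kernels of this era):

#include <stdio.h>
#include <string.h>

int main(void)
{
	FILE *f = fopen("/proc/sched_debug", "r");
	char line[512];

	if (!f) {
		perror("fopen /proc/sched_debug");
		return 1;
	}
	/* Print each cfs_rq header and its runnable_load_avg line. */
	while (fgets(line, sizeof(line), f)) {
		if (strstr(line, "cfs_rq[") || strstr(line, "runnable_load_avg"))
			fputs(line, stdout);
	}
	fclose(f);
	return 0;
}

On an otherwise idle machine, every group's value would be expected to
decay to 0.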

Just out of curiosity, is your regression still there if you disable
SMT/hyperthreading on your platform?

Regards,
Vincent

>
>>
>>>
>>>  # ~/schbench -m 2 -t 16 -s 10000 -c 15000 -r 30
>>>  Latency percentiles (usec)
>>>          50.0000th: 40
>>>          75.0000th: 71
>>>          90.0000th: 89
>>>          95.0000th: 108
>>>          *99.0000th: 679
>>>          99.5000th: 3500
>>>          99.9000th: 10960
>>>          min=0, max=13790
>>>
>>> [1] git://git.kernel.org/pub/scm/linux/kernel/git/mason/schbench.git
>>>
>>> Signed-off-by: Tejun Heo <tj@kernel.org>
>>> Cc: Vincent Guittot <vincent.guittot@linaro.org>
>>> Cc: Ingo Molnar <mingo@redhat.com>
>>> Cc: Peter Zijlstra <peterz@infradead.org>
>>> Cc: Mike Galbraith <efault@gmx.de>
>>> Cc: Paul Turner <pjt@google.com>
>>> Cc: Chris Mason <clm@fb.com>
>>> ---
>>>  kernel/sched/fair.c |   34 +++++++++++++++++++++-------------
>>>  1 file changed, 21 insertions(+), 13 deletions(-)
>>>
>>> --- a/kernel/sched/fair.c
>>> +++ b/kernel/sched/fair.c
>>> @@ -3075,7 +3075,8 @@ update_tg_cfs_util(struct cfs_rq *cfs_rq
>>>
>>>  /* Take into account change of load of a child task group */
>>>  static inline void
>>> -update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se)
>>> +update_tg_cfs_load(struct cfs_rq *cfs_rq, struct sched_entity *se,
>>> +                  bool propagate_avg)
>>>  {
>>>         struct cfs_rq *gcfs_rq = group_cfs_rq(se);
>>>         long load = 0, delta;
>>> @@ -3113,9 +3114,11 @@ update_tg_cfs_load(struct cfs_rq *cfs_rq
>>>         se->avg.load_avg = load;
>>>         se->avg.load_sum = se->avg.load_avg * LOAD_AVG_MAX;
>>>
>>> -       /* Update parent cfs_rq load */
>>> -       add_positive(&cfs_rq->avg.load_avg, delta);
>>> -       cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>>> +       if (propagate_avg) {
>>> +               /* Update parent cfs_rq load */
>>> +               add_positive(&cfs_rq->avg.load_avg, delta);
>>> +               cfs_rq->avg.load_sum = cfs_rq->avg.load_avg * LOAD_AVG_MAX;
>>> +       }
>>>
>>>         /*
>>>          * If the sched_entity is already enqueued, we also have to update the
>>> @@ -3147,22 +3150,27 @@ static inline int test_and_clear_tg_cfs_
>>>  /* Update task and its cfs_rq load average */
>>>  static inline int propagate_entity_load_avg(struct sched_entity *se)
>>>  {
>>> -       struct cfs_rq *cfs_rq;
>>> +       struct cfs_rq *cfs_rq = cfs_rq_of(se);
>>> +       bool propagate_avg;
>>>
>>>         if (entity_is_task(se))
>>>                 return 0;
>>>
>>> -       if (!test_and_clear_tg_cfs_propagate(se))
>>> -               return 0;
>>> -
>>> -       cfs_rq = cfs_rq_of(se);
>>> +       propagate_avg = test_and_clear_tg_cfs_propagate(se);
>>>
>>> -       set_tg_cfs_propagate(cfs_rq);
>>> +       /*
>>> +        * We want to keep @cfs_rq->runnable_load_avg always in sync so
>>> +        * that the load balancer can accurately determine the busiest CPU
>>> +        * regardless of cfs_rq nesting.
>>> +        */
>>> +       update_tg_cfs_load(cfs_rq, se, propagate_avg);
>>>
>>> -       update_tg_cfs_util(cfs_rq, se);
>>> -       update_tg_cfs_load(cfs_rq, se);
>>> +       if (propagate_avg) {
>>> +               set_tg_cfs_propagate(cfs_rq);
>>> +               update_tg_cfs_util(cfs_rq, se);
>>> +       }
>>>
>>> -       return 1;
>>> +       return propagate_avg;
>>>  }
>>>
>>>  #else /* CONFIG_FAIR_GROUP_SCHED */
