From: Vincent Guittot
Date: Fri, 16 Dec 2016 09:55:51 +0100
Subject: Re: [PATCH] sched: fix group_entity's share update
To: Peter Zijlstra
Cc: Ingo Molnar, linux-kernel, Paul Turner, stable@vger.kernel.org
In-Reply-To: <20161215214214.GC3124@twins.programming.kicks-ass.net>
References: <1480610333-23329-1-git-send-email-vincent.guittot@linaro.org>
 <20161215214214.GC3124@twins.programming.kicks-ass.net>

On 15 December 2016 at 22:42, Peter Zijlstra wrote:
>
> On Thu, Dec 01, 2016 at 05:38:53PM +0100, Vincent Guittot wrote:
> > The update of the share of a cfs_rq is done when its load_avg is updated
> > but before the group_entity's load_avg has been updated for the past time
> > slot. This generates wrong load_avg accounting, which can be significant
> > when small tasks are involved in the scheduling.
> >
> > Let's take the example of a task TA that is dequeued from its task group
> > TG1. TA was the only task in TG1, which becomes idle.
> >
> > We have the sequence:
> >
> > - dequeue_entity TA->se
> >     - update_load_avg(TA->se)
> >     - dequeue_entity_load_avg(TG1->cfs_rq, TA->se)
> >     - account_entity_dequeue(TG1->cfs_rq, TA->se)
> >           TG1->cfs_rq->load.weight = 0
> >     - update_cfs_shares(TG1->cfs_rq)
> >           TG1->se->load.weight is updated with the new share of the
> >           cfs_rq: TG1->se->load.weight = 0.
> > - dequeue_entity TG1->se
> >     - update_load_avg(TG1->se), but its weight is now zero, so the
> >       last time slot (up to a tick) will be accounted with its new
> >       weight (0 in our case) instead of its real weight. The last
> >       time slot is accounted as an idle one whereas it was a running
> >       one.
> >
> > If the running time of TA is short enough that no tick happens while it
> > runs, all the running time of TG1->se will be accounted as idle time.
> >
> > Instead, we should update the share of a cfs_rq (in fact the weight of
> > its group entity) only after having updated the load_avg of the
> > group_entity.
> >
> > update_cfs_shares() now takes the sched_entity as parameter instead of
> > the cfs_rq, and the weight of the group_entity is updated only once its
> > load_avg has been synced with the current time.
>
> Urgh, brain hurt, also those names don't help; s/TG1/A/ s/TA/a/
>
> So the problem is that in our for_each_sched_entity(se) loop we end up
> changing the next se before we get there.
>
>
>	root
>	(cfs_rq)
>	   \
>	   (se)
>	    A
>	    (cfs_rq)
>	       \
>	       (se)
>	        a
>
>
> Starting at a's se, we update_cfs_shares() on A's cfs_rq, which then
> updates A's se, which is the next se in our iteration and mucks with
> state before we get there.
>
> So you change update_cfs_shares() to go downward while we go upward,
> ensuring we only update things that we've finished with.

yes

>
> Makes sense..
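To make the ordering issue concrete, here is a toy model: plain
userspace C, nothing from the kernel. The names only mirror fair.c and
the PELT arithmetic is reduced to a weighted sum, so take it as an
illustration rather than the real signal:

	#include <stdio.h>

	/*
	 * Toy model of a group entity: a weight and a PELT-like
	 * running sum (illustration only, not the kernel's signal).
	 */
	struct entity {
		long weight;		/* models se->load.weight */
		long load_sum;		/* sum of weight * running time */
		long last_update;	/* last time load_sum was synced */
	};

	/* Models update_load_avg(): fold the elapsed time slot in
	 * at the *current* weight. */
	static void update_load_avg(struct entity *se, long now)
	{
		se->load_sum += (now - se->last_update) * se->weight;
		se->last_update = now;
	}

	/* Models update_cfs_shares(): the group se's weight follows
	 * its share of the group cfs_rq (0 once the last task left). */
	static void update_shares(struct entity *se, long new_weight)
	{
		se->weight = new_weight;
	}

	int main(void)
	{
		struct entity buggy = { .weight = 1024 };
		struct entity fixed = { .weight = 1024 };
		long now = 5;	/* A's se ran 5 units since the last sync */

		/* Old order: A's se is reweighted to 0 before its
		 * load_avg has been synced. */
		update_shares(&buggy, 0);
		update_load_avg(&buggy, now);

		/* New order: sync the load first, then reweight. */
		update_load_avg(&fixed, now);
		update_shares(&fixed, 0);

		/* Prints "buggy=0 fixed=5120": with the old order the
		 * whole running slot is accounted as idle. */
		printf("buggy=%ld fixed=%ld\n",
		       buggy.load_sum, fixed.load_sum);
		return 0;
	}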
> >
> >  kernel/sched/fair.c | 27 ++++++++++++++++-----------
> >  1 file changed, 16 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index 18d9e75..19092fa 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -2689,15 +2689,18 @@ static void reweight_entity(struct cfs_rq *cfs_rq, struct sched_entity *se,
> >
> >  static inline int throttled_hierarchy(struct cfs_rq *cfs_rq);
> >
> > -static void update_cfs_shares(struct cfs_rq *cfs_rq)
> > +static void update_cfs_shares(struct sched_entity *se)
> >  {
> >  	struct task_group *tg;
> > -	struct sched_entity *se;
> > +	struct cfs_rq *cfs_rq = group_cfs_rq(se);
> >  	long shares;
>
> please keep them ordered by length.

Ok

> >
> > +	if (entity_is_task(se))
>
> can be: !cfs_rq, which is the same and we've already done that load.

Yes. My goal was to keep the meaning of the test more readable, and I
expected the compiler to be smart enough to use a single load for both
cfs_rq = group_cfs_rq(se) and entity_is_task(se).
I can change it to !cfs_rq.

> > +		return;
> > +
> >  	tg = cfs_rq->tg;
>
> This load isn't needed here yet, can be moved down a bit.

Indeed

> > -	se = tg->se[cpu_of(rq_of(cfs_rq))];
> > -	if (!se || throttled_hierarchy(cfs_rq))
> > +
> > +	if (throttled_hierarchy(cfs_rq))
> >  		return;
> >  #ifndef CONFIG_SMP
> >  	if (likely(se->load.weight == tg->shares))
> >
> > @@ -3583,9 +3588,9 @@ enqueue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  		se->vruntime += cfs_rq->min_vruntime;
> >
> >  	update_load_avg(se, UPDATE_TG);
> > +	update_cfs_shares(se);
> >  	enqueue_entity_load_avg(cfs_rq, se);
> >  	account_entity_enqueue(cfs_rq, se);
> > -	update_cfs_shares(cfs_rq);
> >
> >  	if (flags & ENQUEUE_WAKEUP)
> >  		place_entity(cfs_rq, se, 0);
>
> So here we need to update_cfs_shares() _before_ enqueue_entity, because
> the update_cfs_shares() will affect this se's load, right?

exactly

> > @@ -3681,7 +3686,7 @@ dequeue_entity(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  		/* return excess runtime on last dequeue */
> >  		return_cfs_rq_runtime(cfs_rq);
> >
> > -	update_cfs_shares(cfs_rq);
> > +	update_cfs_shares(se);
> >
> >  	/*
> >  	 * Now advance min_vruntime if @se was the entity holding it back,
>
> But this one hurts my brain..
>
> It must be done after dequeue_entity_load_avg() such that we subtract
> the load as was seen until now.

update_cfs_shares(A's se) must be done after update_load_avg(A's se,
UPDATE_TG), so that A's se->load_avg accounts the previous time slot
with the previous weight.
update_cfs_shares(A's se) could be done before or after
dequeue_entity_load_avg(A's se), because the root's cfs_rq is kept in
sync during the reweight of A's se. Nevertheless, I see one advantage
of doing it after: reweight_entity() will be faster because A's
se->on_rq will have been cleared in the meantime.

> Could we please add comments explaining this ordering, because I forever
> need to think about this (both enqueue and dequeue).

OK (see the draft comments at the end of this mail)

> > @@ -3864,7 +3869,7 @@ entity_tick(struct cfs_rq *cfs_rq, struct sched_entity *curr, int queued)
> >  	 * Ensure that runnable average is periodically updated.
> >  	 */
> >  	update_load_avg(curr, UPDATE_TG);
> > -	update_cfs_shares(cfs_rq);
> > +	update_cfs_shares(curr);
> >
> >  #ifdef CONFIG_SCHED_HRTICK
> >  	/*
> > @@ -4761,7 +4766,7 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  			break;
> >
> >  		update_load_avg(se, UPDATE_TG);
> > -		update_cfs_shares(cfs_rq);
> > +		update_cfs_shares(se);
> >  	}
> >
> >  	if (!se)
> > @@ -4820,7 +4825,7 @@ static void dequeue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >  			break;
> >
> >  		update_load_avg(se, UPDATE_TG);
> > -		update_cfs_shares(cfs_rq);
> > +		update_cfs_shares(se);
> >  	}
> >
> >  	if (!se)
>
> This has a distinct pattern to it though; should we think about
> something like: UPDATE_SHARES for update_load_avg() or does that confuse
> things?

IMHO, keeping update_cfs_shares() separate from update_load_avg() makes
it clear when we update the shares, and it enables some optimizations,
like the one in dequeue_entity().

> > @@ -9316,7 +9321,7 @@ int sched_group_set_shares(struct task_group *tg, unsigned long shares)
> >  		/* Possible calls to update_curr() need rq clock */
> >  		update_rq_clock(rq);
> >  		for_each_sched_entity(se)
> > -			update_cfs_shares(group_cfs_rq(se));
> > +			update_cfs_shares(se);
>
> Should we not also catch up with our load before we frob the shares?

Yes, you're right: an update_load_avg() is missing. See the sketch
below.

> >  		raw_spin_unlock_irqrestore(&rq->lock, flags);
> >  	}
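For sched_group_set_shares(), something like this then; a sketch on top
of this patch, not yet tested:

		/* Possible calls to update_curr() need rq clock */
		update_rq_clock(rq);
		for_each_sched_entity(se) {
			/*
			 * Sync se's load_avg up to now before its
			 * weight is changed.
			 */
			update_load_avg(se, UPDATE_TG);
			update_cfs_shares(se);
		}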
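And a first draft of the ordering comments discussed above, against
this patch; the wording still needs to be refined:

	/*
	 * Enqueue: update_cfs_shares() must run after
	 * update_load_avg(), so the weight change is applied once
	 * @se's load_avg is in sync, and before
	 * enqueue_entity_load_avg()/account_entity_enqueue(), so @se
	 * is added to the cfs_rq with its new weight.
	 */
	update_load_avg(se, UPDATE_TG);
	update_cfs_shares(se);
	enqueue_entity_load_avg(cfs_rq, se);
	account_entity_enqueue(cfs_rq, se);

	/*
	 * Dequeue: update_cfs_shares() must run after
	 * update_load_avg(), so the time @se just spent running is
	 * accounted with the weight it actually had. Doing it after
	 * dequeue_entity_load_avg() also keeps reweight_entity()
	 * cheap, as @se->on_rq has already been cleared.
	 */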