From: Vincent Guittot
Date: Thu, 7 Apr 2016 15:04:27 +0200
Subject: Re: [PATCH RFC] sched/fair: let cpu's cfs_rq to reflect task migration
To: Dietmar Eggemann
Cc: Morten Rasmussen, Yuyang Du, Leo Yan, Steve Muckle, Peter Zijlstra,
 Ingo Molnar, linux-kernel, eas-dev@lists.linaro.org

Hi Dietmar,

On 6 April 2016 at 20:53, Dietmar Eggemann wrote:
> On 06/04/16 09:37, Morten Rasmussen wrote:
>> On Tue, Apr 05, 2016 at 06:00:40PM +0100, Dietmar Eggemann wrote:
>>> @@ -2893,8 +2906,12 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>>>         se->avg.last_update_time = cfs_rq->avg.last_update_time;
>>>         cfs_rq->avg.load_avg += se->avg.load_avg;
>>>         cfs_rq->avg.load_sum += se->avg.load_sum;
>>> -       cfs_rq->avg.util_avg += se->avg.util_avg;
>>> -       cfs_rq->avg.util_sum += se->avg.util_sum;
>>> +
>>> +       if (!entity_is_task(se))
>>> +               return;
>>> +
>>> +       rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
>>> +       rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
>>
>> To me it seems that you cannot be sure that the rq_of(cfs_rq)->cfs.avg
>> time stamp is aligned with the se->avg time stamp, which is necessary
>> before you can add/subtract two geometric series without introducing
>> an error.
>>
>> attach_entity_load_avg() is called (through a couple of other
>> functions) from the for_each_sched_entity() loop in enqueue_task_fair()
>> which works its way towards the root cfs_rq, i.e. rq_of(cfs_rq)->cfs.
>> So in the loop iteration where you attach the task sched_entity, we
>> haven't yet visited and updated rq_of(cfs_rq)->cfs.avg.
>>
>> If you just add the task contribution and discover later that there is
>> a time delta when you update rq_of(cfs_rq)->cfs.avg, you end up
>> decaying the task contribution which was already up-to-date, and its
>> util contribution to rq_of(cfs_rq)->cfs.avg ends up being smaller than
>> it should be.
>>
>> Am I missing something?
>
> No, that's definitely wrong in the current patch. I could defer the
> attach into the update_cfs_rq_load_avg() call for the root cfs_rq if
> '&rq_of(cfs_rq)->cfs != cfs_rq' in attach_entity_load_avg().
>
> Something like this (only lightly tested!):
>
> diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> index 51d675715776..d31d9cd453a1 100644
> --- a/kernel/sched/fair.c
> +++ b/kernel/sched/fair.c
> @@ -2856,6 +2856,16 @@ static inline int update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
>         decayed = __update_load_avg(now, cpu_of(rq_of(cfs_rq)), sa,
>                 scale_load_down(cfs_rq->load.weight), cfs_rq->curr != NULL, cfs_rq);
>
> +       if (cfs_rq->added_util_avg) {
> +               sa->util_avg += cfs_rq->added_util_avg;
> +               cfs_rq->added_util_avg = 0;
> +       }
> +
> +       if (cfs_rq->added_util_sum) {
> +               sa->util_sum += cfs_rq->added_util_sum;
> +               cfs_rq->added_util_sum = 0;
> +       }
> +
>  #ifndef CONFIG_64BIT
>         smp_wmb();
>         cfs_rq->load_last_update_time_copy = sa->last_update_time;
> @@ -2910,8 +2920,13 @@ static void attach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
>         if (!entity_is_task(se))
>                 return;
>
> -       rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
> -       rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
> +       if (&rq_of(cfs_rq)->cfs == cfs_rq) {
> +               rq_of(cfs_rq)->cfs.avg.util_avg += se->avg.util_avg;
> +               rq_of(cfs_rq)->cfs.avg.util_sum += se->avg.util_sum;
> +       } else {
> +               rq_of(cfs_rq)->cfs.added_util_avg = se->avg.util_avg;
> +               rq_of(cfs_rq)->cfs.added_util_sum = se->avg.util_sum;
> +       }
>  }

Don't you also need a similar deferral for the detach path?
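Something along these lines, mirroring your added_util_* scheme
(untested sketch; the detached_util_* fields and their consume site in
update_cfs_rq_load_avg() are made up here):

static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
{
        /* load_[avg|sum] handling unchanged ... */

        if (!entity_is_task(se))
                return;

        if (&rq_of(cfs_rq)->cfs == cfs_rq) {
                rq_of(cfs_rq)->cfs.avg.util_avg = max_t(long,
                        rq_of(cfs_rq)->cfs.avg.util_avg - se->avg.util_avg, 0);
                rq_of(cfs_rq)->cfs.avg.util_sum = max_t(s32,
                        rq_of(cfs_rq)->cfs.avg.util_sum - se->avg.util_sum, 0);
        } else {
                /* to be subtracted in update_cfs_rq_load_avg() after the decay */
                rq_of(cfs_rq)->cfs.detached_util_avg = se->avg.util_avg;
                rq_of(cfs_rq)->cfs.detached_util_sum = se->avg.util_sum;
        }
}

Otherwise a task that migrates away leaves its utilization behind in
the root cfs_rq.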
>
>  static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se)
> @@ -4344,8 +4359,11 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
>                 cfs_rq = cfs_rq_of(se);
>                 cfs_rq->h_nr_running++;
>
> -               if (cfs_rq_throttled(cfs_rq))
> +               if (cfs_rq_throttled(cfs_rq)) {
> +                       rq_of(cfs_rq)->cfs.added_util_avg = 0;
> +                       rq_of(cfs_rq)->cfs.added_util_sum = 0;
>                         break;
> +               }
>
>                 update_load_avg(se, 1);
>                 update_cfs_shares(cfs_rq);
> diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
> index b2ff5a2bd6df..f0ea3a7eaf07 100644
> --- a/kernel/sched/sched.h
> +++ b/kernel/sched/sched.h
> @@ -391,6 +391,8 @@ struct cfs_rq {
>         unsigned long tg_load_avg_contrib;
>  #endif
>         atomic_long_t removed_load_avg, removed_util_avg;
> +       u32 added_util_sum;
> +       unsigned long added_util_avg;
>  #ifndef CONFIG_64BIT
>         u64 load_last_update_time_copy;
>
> But attach_entity_load_avg() is not only called from
> enqueue_entity_load_avg() for migrated tasks, but also from
> attach_task_cfs_rq(), which is called from switched_to_fair() and
> task_move_group_fair(). There we can't assume that a call to
> update_cfs_rq_load_avg() follows the attach, like it does in
>
> enqueue_task_fair():
>
>     for_each_sched_entity(se)
>         enqueue_entity()
>             enqueue_entity_load_avg()
>                 update_cfs_rq_load_avg(now, cfs_rq)
>                 if (migrated) attach_entity_load_avg()
>
>     for_each_sched_entity(se)
>         update_load_avg()
>             update_cfs_rq_load_avg(now, cfs_rq)
>
> Not sure if we can just update the root cfs_rq to
> se->avg.last_update_time before we add se->avg.util_[avg/sum] to
> rq_of(cfs_rq)->cfs.avg.util_[avg/sum] in attach_entity_load_avg()?
>
> cfs_rq throttling has to be considered as well ...

IIUC this new proposal, the utilization of a task will be accounted in
the utilization of the root cfs_rq through the chain
tsk->se->cfs_rq->tg->se[cpu]->... down to the root cfs_rq. Then, you
also directly add the utilization of the newly enqueued task to the
root cfs_rq.
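To illustrate Morten's point about aligning the two geometric series
before summing them, here is a self-contained toy (plain userspace C,
not kernel code; it only models whole 32-period half-lives and ignores
the per-period decay factor):

#include <stdio.h>

#define HALF_LIFE 32    /* PELT: a contribution halves every 32 periods */

/* crude decay: val * y^n with y^32 = 0.5, whole half-lives only */
static unsigned long decay(unsigned long val, unsigned long periods)
{
        for (; periods >= HALF_LIFE; periods -= HALF_LIFE)
                val >>= 1;
        return val;
}

int main(void)
{
        unsigned long rq_util = 400;    /* root cfs_rq util, stamped at t0 */
        unsigned long task_util = 300;  /* task util, already stamped at t1 */
        unsigned long delta = 64;       /* periods between t0 and t1 */

        /* aligned: age the rq signal up to t1 first, then attach the task */
        unsigned long ok = decay(rq_util, delta) + task_util;

        /*
         * unaligned: attach first, then let the next update decay both
         * together, over-decaying the already up-to-date task part
         */
        unsigned long bad = decay(rq_util + task_util, delta);

        printf("aligned=%lu unaligned=%lu\n", ok, bad);  /* 400 vs 175 */
        return 0;
}

The difference is exactly the over-decay of the task contribution that
the deferred attach avoids.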