All of lore.kernel.org
 help / color / mirror / Atom feed
From: Vincent Guittot <vincent.guittot@linaro.org>
To: Valentin Schneider <valentin.schneider@arm.com>
Cc: linux-kernel <linux-kernel@vger.kernel.org>,
	Ingo Molnar <mingo@redhat.com>,
	Peter Zijlstra <peterz@infradead.org>,
	Phil Auld <pauld@redhat.com>,
	Srikar Dronamraju <srikar@linux.vnet.ibm.com>,
	Quentin Perret <quentin.perret@arm.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Morten Rasmussen <Morten.Rasmussen@arm.com>
Subject: Re: [PATCH v2 4/8] sched/fair: rework load_balance
Date: Mon, 26 Aug 2019 11:26:06 +0200	[thread overview]
Message-ID: <CAKfTPtA1-8u2LCiq5o1go_M7FywBao-EDxCHMfsxEN8es4pXcw@mail.gmail.com> (raw)
In-Reply-To: <ee4ac9c8-ac70-e56a-2aa9-9cce2e5aa25b@arm.com>

On Mon, 5 Aug 2019 at 19:07, Valentin Schneider
<valentin.schneider@arm.com> wrote:
>
> Hi Vincent,
>
> Here's another batch of comments, still need to go through some more of it.
>
> On 01/08/2019 15:40, Vincent Guittot wrote:
> > The load_balance algorithm contains some heuristics which have becomes
>
> s/becomes/become/

yep for this one and other typo mistakes

>
> > meaningless since the rework of metrics and the introduction of PELT.
>                                ^^^^^^^^^^
> Which metrics? I suppose you mean the s*_lb_stats structs?
>
> >
> > Furthermore, it's sometimes difficult to fix wrong scheduling decisions
> > because everything is based on load whereas some imbalances are not
> > related to the load.
>
> Hmm, well, they don't start as wrong decisions, right? It's just that
> certain imbalance scenarios can't be solved by looking only at load.
>
> What about something along those lines?
>
> """
> Furthermore, load is an ill-suited metric for solving certain task
> placement imbalance scenarios. For instance, in the presence of idle CPUs,
> we should simply try to get at least one task per CPU, whereas the current
> load-based algorithm can actually leave idle CPUs alone simply because the
> load is somewhat balanced.
> """
>
> > The current algorithm ends up to create virtual and
>
> s/to create/creating/
>
> > meaningless value like the avg_load_per_task or tweaks the state of a
> > group to make it overloaded whereas it's not, in order to try to migrate
> > tasks.
> >
> > load_balance should better qualify the imbalance of the group and define
> > cleary what has to be moved to fix this imbalance.
>
> s/define cleary/clearly define/
>
> >
> > The type of sched_group has been extended to better reflect the type of
> > imbalance. We now have :
> >       group_has_spare
> >       group_fully_busy
> >       group_misfit_task
> >       group_asym_capacity
> >       group_imbalanced
> >       group_overloaded
> >
> > Based on the type of sched_group, load_balance now sets what it wants to
> > move in order to fix the imnbalance. It can be some load as before but
>
> s/imnbalance/imbalance/
>
> > also some utilization, a number of task or a type of task:
> >       migrate_task
> >       migrate_util
> >       migrate_load
> >       migrate_misfit
> >
> > This new load_balance algorithm fixes several pending wrong tasks
> > placement:
> > - the 1 task per CPU case with asymetrics system
>
> s/asymetrics/asymmetric/
>
> > - the case of cfs task preempted by other class
> > - the case of tasks not evenly spread on groups with spare capacity
> >
> > The load balance decisions have been gathered in 3 functions:
> > - update_sd_pick_busiest() select the busiest sched_group.
> > - find_busiest_group() checks if there is an imabalance between local and
>
> s/imabalance/imbalance/
>
> >   busiest group.
> > - calculate_imbalance() decides what have to be moved.
>
> That's nothing new, isn't it? I think what you mean there is that the

There is 2 things:
-part of the algorithm is new and fixes wrong task placement
-everything has been consolidated in the 3 functions above whereas
there were some bypasses and hack in the current code

> decisions have been consolidated in those areas, rather than being spread

I would not say that the code was spread all over the place because
90% was already correctly placed but there were few cases that have
been added outside these functions

> all over the place.
>
> >
> > Signed-off-by: Vincent Guittot <vincent.guittot@linaro.org>
> > ---
> >  kernel/sched/fair.c | 581 ++++++++++++++++++++++++++++++++++------------------
> >  1 file changed, 379 insertions(+), 202 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index d7f4a7e..a8681c3 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -7136,13 +7136,28 @@ static unsigned long __read_mostly max_load_balance_interval = HZ/10;
> >
> >  enum fbq_type { regular, remote, all };
> >
> > +/*
> > + * group_type describes the group of CPUs at the moment of the load balance.
> > + * The enum is ordered by pulling priority, with the group with lowest priority
> > + * first so the groupe_type can be simply compared when selecting the busiest
> > + * group. see update_sd_pick_busiest().
> > + */
> >  enum group_type {
> > -     group_other = 0,
> > +     group_has_spare = 0,
> > +     group_fully_busy,
> >       group_misfit_task,
> > +     group_asym_capacity,
>
> That one got me confused - it's about asym packing, not asym capacity, and
> the name should reflect that. I'd vote for group_asym_packing to stay in
> line with what Quentin did for the sd shortcut pointers in

yep asym_packing is probably better

>
>   011b27bb5d31 ("sched/topology: Add lowest CPU asymmetry sched_domain level pointer")
>
> >       group_imbalanced,
> >       group_overloaded,
> >  };
> >
> > +enum migration_type {
> > +     migrate_task = 0,
> > +     migrate_util,
> > +     migrate_load,
> > +     migrate_misfit,
>
> nitpicking here: AFAICT the ordering of this doesn't really matter,
> could we place migrate_misfit next to migrate_task? As you call out in the
> header, we can migrate a number of tasks or a type of task, so these should
> be somewhat together.

misfit has been added last because it's specific whereas others are
somehow generic and I want to keep generic first and specific last

>
> If we're afraid that we'll add other types of tasks later on and that this
> won't result in a neat append-to-the-end, we could reverse the order like
> so:
>
> migrate_load
> migrate_util
> migrate_task
> migrate_misfit

I can put in this order

>
> [...]
> > @@ -7745,10 +7793,10 @@ struct sg_lb_stats {
> >  struct sd_lb_stats {
> >       struct sched_group *busiest;    /* Busiest group in this sd */
> >       struct sched_group *local;      /* Local group in this sd */
> > -     unsigned long total_running;
>
> Could be worth calling out in the log that this gets snipped out. Or it
> could go into its own small cleanup patch, since it's just an unused field.

I can mention it more specifically in the log but that's part of those
meaningless metrics which is no more used
>
> [...]> @@ -8147,11 +8223,67 @@ static bool update_sd_pick_busiest(struct lb_env *env,
> >       if (sgs->group_type < busiest->group_type)
> >               return false;
> >
> > -     if (sgs->avg_load <= busiest->avg_load)
> > +     /*
> > +      * The candidate and the current busiest group are the same type of
> > +      * group. Let check which one is the busiest according to the type.
> > +      */
> > +
> > +     switch (sgs->group_type) {
> > +     case group_overloaded:
> > +             /* Select the overloaded group with highest avg_load. */
> > +             if (sgs->avg_load <= busiest->avg_load)
> > +                     return false;
> > +             break;
> > +
> > +     case group_imbalanced:
> > +             /* Select the 1st imbalanced group as we don't have
> > +              * any way to choose one more than another
> > +              */
> >               return false;
> > +             break;
>
> You already have an unconditional return above.

good point

>
> >
> > -     if (!(env->sd->flags & SD_ASYM_CPUCAPACITY))
> > -             goto asym_packing;
> > +     case group_asym_capacity:
> > +             /* Prefer to move from lowest priority CPU's work */
> > +             if (sched_asym_prefer(sg->asym_prefer_cpu, sds->busiest->asym_prefer_cpu))
> > +                      return false;
>                         ^
>                 Stray whitespace
>
> [...]
> > @@ -8438,17 +8581,17 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >       local = &sds.local_stat;
> >       busiest = &sds.busiest_stat;
> >
> > -     /* ASYM feature bypasses nice load balance check */
> > -     if (busiest->group_asym_capacity)
> > -             goto force_balance;
> > -
> >       /* There is no busy sibling group to pull tasks from */
> >       if (!sds.busiest || busiest->sum_h_nr_running == 0)
>                          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
> That can go, since you now filter this in update_sd_pick_busiest()

yes

>
> [...]
> > @@ -8459,59 +8602,71 @@ static struct sched_group *find_busiest_group(struct lb_env *env)
> >               goto force_balance;
> >
> >       /*
> > -      * When dst_cpu is idle, prevent SMP nice and/or asymmetric group
> > -      * capacities from resulting in underutilization due to avg_load.
> > -      */
> > -     if (env->idle != CPU_NOT_IDLE && group_has_capacity(env, local) &&
> > -         busiest->group_no_capacity)
> > -             goto force_balance;
> > -
> > -     /* Misfit tasks should be dealt with regardless of the avg load */
> > -     if (busiest->group_type == group_misfit_task)
> > -             goto force_balance;
> > -
> > -     /*
> >        * If the local group is busier than the selected busiest group
> >        * don't try and pull any tasks.
> >        */
> > -     if (local->avg_load >= busiest->avg_load)
> > +     if (local->group_type > busiest->group_type)
> >               goto out_balanced;
> >
> >       /*
> > -      * Don't pull any tasks if this group is already above the domain
> > -      * average load.
> > +      * When groups are overloaded, use the avg_load to ensure fairness
> > +      * between tasks.
> >        */
> > -     if (local->avg_load >= sds.avg_load)
> > -             goto out_balanced;
> > +     if (local->group_type == group_overloaded) {
> > +             /*
> > +              * If the local group is more loaded than the selected
> > +              * busiest group don't try and pull any tasks.
> > +              */
> > +             if (local->avg_load >= busiest->avg_load)
> > +                     goto out_balanced;
> > +
> > +             /* XXX broken for overlapping NUMA groups */
> > +             sds.avg_load = (sds.total_load * SCHED_CAPACITY_SCALE) /
> > +                             sds.total_capacity;
> >
> > -     if (env->idle == CPU_IDLE) {
> >               /*
> > -              * This CPU is idle. If the busiest group is not overloaded
> > -              * and there is no imbalance between this and busiest group
> > -              * wrt idle CPUs, it is balanced. The imbalance becomes
> > -              * significant if the diff is greater than 1 otherwise we
> > -              * might end up to just move the imbalance on another group
> > +              * Don't pull any tasks if this group is already above the
> > +              * domain average load.
> >                */
> > -             if ((busiest->group_type != group_overloaded) &&
> > -                             (local->idle_cpus <= (busiest->idle_cpus + 1)))
> > +             if (local->avg_load >= sds.avg_load)
> >                       goto out_balanced;
> > -     } else {
> > +
> >               /*
> > -              * In the CPU_NEWLY_IDLE, CPU_NOT_IDLE cases, use
> > -              * imbalance_pct to be conservative.
> > +              * If the busiest group is more loaded, use imbalance_pct to be
> > +              * conservative.
> >                */
> >               if (100 * busiest->avg_load <=
> >                               env->sd->imbalance_pct * local->avg_load)
> >                       goto out_balanced;
> > +
> >       }
> >
> > +     /* Try to move all excess tasks to child's sibling domain */
> > +     if (sds.prefer_sibling && local->group_type == group_has_spare &&
> > +         busiest->sum_h_nr_running > local->sum_h_nr_running + 1)
> > +             goto force_balance;
> > +
> > +     if (busiest->group_type != group_overloaded &&
> > +          (env->idle == CPU_NOT_IDLE ||
> > +           local->idle_cpus <= (busiest->idle_cpus + 1)))
> > +             /*
> > +              * If the busiest group is not overloaded
> > +              * and there is no imbalance between this and busiest group
> > +              * wrt idle CPUs, it is balanced. The imbalance
> > +              * becomes significant if the diff is greater than 1 otherwise
> > +              * we might end up to just move the imbalance on another
> > +              * group.
> > +              */
> > +             goto out_balanced;
> > +
> >  force_balance:
> >       /* Looks like there is an imbalance. Compute it */
> > -     env->src_grp_type = busiest->group_type;
> >       calculate_imbalance(env, &sds);
> > +
>
> Stray newline?
>
> >       return env->imbalance ? sds.busiest : NULL;
> >
> >  out_balanced:
> > +
>
> Ditto?
>
> >       env->imbalance = 0;
> >       return NULL;
> >  }
> [...]

  reply	other threads:[~2019-08-26  9:26 UTC|newest]

Thread overview: 31+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2019-08-01 14:40 [PATCH v2 0/8] sched/fair: rework the CFS load balance Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 1/8] sched/fair: clean up asym packing Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 2/8] sched/fair: rename sum_nr_running to sum_h_nr_running Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 3/8] sched/fair: remove meaningless imbalance calculation Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 4/8] sched/fair: rework load_balance Vincent Guittot
2019-08-05 17:07   ` Valentin Schneider
2019-08-26  9:26     ` Vincent Guittot [this message]
2019-08-28 10:25       ` Valentin Schneider
2019-08-06 15:56   ` Peter Zijlstra
2019-08-26  9:31     ` Vincent Guittot
2019-08-06 17:17   ` Valentin Schneider
2019-08-07 11:16     ` Valentin Schneider
2019-08-26 10:11     ` Vincent Guittot
2019-08-28 14:19       ` Valentin Schneider
2019-08-29 14:26         ` Vincent Guittot
2019-08-30 14:33           ` Valentin Schneider
2019-08-01 14:40 ` [PATCH v2 5/8] sched/fair: use rq->nr_running when balancing load Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 6/8] sched/fair: use load instead of runnable load Vincent Guittot
2019-08-06 16:07   ` Peter Zijlstra
2019-08-26 15:45     ` Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 7/8] sched/fair: evenly spread tasks when not overloaded Vincent Guittot
2019-08-01 14:40 ` [PATCH v2 8/8] sched/fair: use utilization to select misfit task Vincent Guittot
2019-08-01 16:27   ` Valentin Schneider
2019-08-02  8:29     ` Vincent Guittot
2019-08-02 10:49       ` Valentin Schneider
2019-08-02 12:56   ` [PATCH v3] " Vincent Guittot
2019-08-02 14:27     ` Valentin Schneider
2019-08-05 11:01     ` Valentin Schneider
2019-08-29 19:23 ` [PATCH v2 0/8] sched/fair: rework the CFS load balance Phil Auld
2019-08-30  6:46   ` Vincent Guittot
     [not found] ` <20190809052124.13016-1-hdanton@sina.com>
2019-09-02 13:07   ` [PATCH v2 5/8] sched/fair: use rq->nr_running when balancing load Vincent Guittot

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=CAKfTPtA1-8u2LCiq5o1go_M7FywBao-EDxCHMfsxEN8es4pXcw@mail.gmail.com \
    --to=vincent.guittot@linaro.org \
    --cc=Morten.Rasmussen@arm.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@redhat.com \
    --cc=pauld@redhat.com \
    --cc=peterz@infradead.org \
    --cc=quentin.perret@arm.com \
    --cc=srikar@linux.vnet.ibm.com \
    --cc=valentin.schneider@arm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.