Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator

From: Quentin Perret <quentin.perret@arm.com>
To: Joel Fernandes <joelaf@google.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>,
	LKML <linux-kernel@vger.kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Thara Gopinath <thara.gopinath@linaro.org>,
	Linux PM <linux-pm@vger.kernel.org>,
	Morten Rasmussen <morten.rasmussen@arm.com>,
	Chris Redpath <chris.redpath@arm.com>,
	Patrick Bellasi <patrick.bellasi@arm.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	"Rafael J . Wysocki" <rjw@rjwysocki.net>,
	Greg Kroah-Hartman <gregkh@linuxfoundation.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Viresh Kumar <viresh.kumar@linaro.org>,
	Todd Kjos <tkjos@google.com>, Juri Lelli <juri.lelli@redhat.com>,
	Steve Muckle <smuckle@google.com>,
	Eduardo Valentin <edubezval@gmail.com>
Subject: Re: [RFC PATCH v2 3/6] sched: Add over-utilization/tipping point indicator
Date: Fri, 20 Apr 2018 09:31:18 +0100	[thread overview]
Message-ID: <20180420083118.GA14391@e108498-lin.cambridge.arm.com> (raw)
In-Reply-To: <CAJWu+oqGtfpwX9_s=Gwt5CKQNfRhztV48RjriGiBS7ZXG8g90w@mail.gmail.com>

On Friday 20 Apr 2018 at 01:14:35 (-0700), Joel Fernandes wrote:
> On Fri, Apr 20, 2018 at 1:13 AM, Joel Fernandes <joelaf@google.com> wrote:
> > On Wed, Apr 18, 2018 at 4:17 AM, Quentin Perret <quentin.perret@arm.com> wrote:
> >> On Friday 13 Apr 2018 at 16:56:39 (-0700), Joel Fernandes wrote:
> >>> Hi,
> >>>
> >>> On Fri, Apr 6, 2018 at 8:36 AM, Dietmar Eggemann
> >>> <dietmar.eggemann@arm.com> wrote:
> >>> > From: Thara Gopinath <thara.gopinath@linaro.org>
> >>> >
> >>> > Energy-aware scheduling should only operate when the system is not
> >>> > overutilized. There must be cpu time available to place tasks based on
> >>> > utilization in an energy-aware fashion, i.e. to pack tasks on
> >>> > energy-efficient cpus without harming the overall throughput.
> >>> >
> >>> > In case the system operates above this tipping point the tasks have to
> >>> > be placed based on task and cpu load in the classical way of spreading
> >>> > tasks across as many cpus as possible.
> >>> >
> >>> > The point in which a system switches from being not overutilized to
> >>> > being overutilized is called the tipping point.
> >>> >
> >>> > Such a tipping point indicator on a sched domain as the system
> >>> > boundary is introduced here. As soon as one cpu of a sched domain is
> >>> > overutilized the whole sched domain is declared overutilized as well.
> >>> > A cpu becomes overutilized when its utilization is higher that 80%
> >>> > (capacity_margin) of its capacity.
> >>> >
> >>> > The implementation takes advantage of the shared sched domain which is
> >>> > shared across all per-cpu views of a sched domain level. The new
> >>> > overutilized flag is placed in this shared sched domain.
> >>> >
> >>> > Load balancing is skipped in case the energy model is present and the
> >>> > sched domain is not overutilized because under this condition the
> >>> > predominantly load-per-capacity driven load-balancer should not
> >>> > interfere with the energy-aware wakeup placement based on utilization.
> >>> >
> >>> > In case the total utilization of a sched domain is greater than the
> >>> > total sched domain capacity the overutilized flag is set at the parent
> >>> > sched domain level to let other sched groups help getting rid of the
> >>> > overutilization of cpus.
> >>> >
> >>> > Signed-off-by: Thara Gopinath <thara.gopinath@linaro.org>
> >>> > Signed-off-by: Dietmar Eggemann <dietmar.eggemann@arm.com>
> >>> > ---
> >>> >  include/linux/sched/topology.h |  1 +
> >>> >  kernel/sched/fair.c            | 62 ++++++++++++++++++++++++++++++++++++++++--
> >>> >  kernel/sched/sched.h           |  1 +
> >>> >  kernel/sched/topology.c        | 12 +++-----
> >>> >  4 files changed, 65 insertions(+), 11 deletions(-)
> >>> >
> >>> > diff --git a/include/linux/sched/topology.h b/include/linux/sched/topology.h
> >>> > index 26347741ba50..dd001c232646 100644
> >>> > --- a/include/linux/sched/topology.h
> >>> > +++ b/include/linux/sched/topology.h
> >>> > @@ -72,6 +72,7 @@ struct sched_domain_shared {
> >>> >         atomic_t        ref;
> >>> >         atomic_t        nr_busy_cpus;
> >>> >         int             has_idle_cores;
> >>> > +       int             overutilized;
> >>> >  };
> >>> >
> >>> >  struct sched_domain {
> >>> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> >>> > index 0a76ad2ef022..6960e5ef3c14 100644
> >>> > --- a/kernel/sched/fair.c
> >>> > +++ b/kernel/sched/fair.c
> >>> > @@ -5345,6 +5345,28 @@ static inline void hrtick_update(struct rq *rq)
> >>> >  }
> >>> >  #endif
> >>> >
> >>> > +#ifdef CONFIG_SMP
> >>> > +static inline int cpu_overutilized(int cpu);
> >>> > +
> >>> > +static inline int sd_overutilized(struct sched_domain *sd)
> >>> > +{
> >>> > +       return READ_ONCE(sd->shared->overutilized);
> >>> > +}
> >>> > +
> >>> > +static inline void update_overutilized_status(struct rq *rq)
> >>> > +{
> >>> > +       struct sched_domain *sd;
> >>> > +
> >>> > +       rcu_read_lock();
> >>> > +       sd = rcu_dereference(rq->sd);
> >>> > +       if (sd && !sd_overutilized(sd) && cpu_overutilized(rq->cpu))
> >>> > +               WRITE_ONCE(sd->shared->overutilized, 1);
> >>> > +       rcu_read_unlock();
> >>> > +}
> >>> > +#else
> >>> > +static inline void update_overutilized_status(struct rq *rq) {}
> >>> > +#endif /* CONFIG_SMP */
> >>> > +
> >>> >  /*
> >>> >   * The enqueue_task method is called before nr_running is
> >>> >   * increased. Here we update the fair scheduling stats and
> >>> > @@ -5394,8 +5416,10 @@ enqueue_task_fair(struct rq *rq, struct task_struct *p, int flags)
> >>> >                 update_cfs_group(se);
> >>> >         }
> >>> >
> >>> > -       if (!se)
> >>> > +       if (!se) {
> >>> >                 add_nr_running(rq, 1);
> >>> > +               update_overutilized_status(rq);
> >>> > +       }
> >>>
> >>> I'm wondering if it makes sense for considering scenarios whether
> >>> other classes cause CPUs in the domain to go above the tipping point.
> >>> Then in that case also, it makes sense to not to do EAS in that domain
> >>> because of the overutilization.
> >>>
> >>> I guess task_fits using cpu_util which is PELT only at the moment...
> >>> so may require some other method like aggregation of CFS PELT, with
> >>> RT-PELT and DL running bw or something.
> >>>
> >>
> >> So at the moment in cpu_overutilized() we comapre cpu_util() to
> >> capacity_of() which should include RT and IRQ pressure IIRC. But
> >> you're right, we might be able to do more here... Perhaps we
> >> could also use cpu_util_dl() which is available in sched.h now ?
> >
> > Yes, should be Ok, and then when RT utilization stuff is available,
> > then that can be included in the equation as well (probably for now
> > you could use rt_avg).
> >
> > Another crazy idea is to check the contribution of higher classes in
> > one-shot with (capacity_orig_of - capacity_of) although I think that
> > method would be less instantaneous/accurate.
> 
> Just to add to the last point, the capacity_of also factors in the IRQ
> contribution if I remember correctly, which is probably a good thing?
> 

I think so too yes. But actually, since we compare cpu_util() to
capacity_of() in cpu_overutilized(), the current implementation should
already be fairly similar to the "capacity_orig_of - capacity_of"
implementation you're suggesting I guess.
And I agree that when Vincent's RT PELT patches get merged we should
probably use that :-)

Thanks !
Quentin