From: Aaron Lu <aaron.lu@intel.com>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, "Mel Gorman" <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Nitin Tekchandani <nitin.tekchandani@intel.com>,
	Yu Chen <yu.c.chen@intel.com>, Waiman Long <longman@redhat.com>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load
Date: Thu, 20 Jul 2023 22:42:33 +0800
Message-ID: <20230720144233.GA185317@ziqianlu-dell>
In-Reply-To: <CAKfTPtAqpAo8Y9BdWZ-fmnyYgA8PEtFbObqWJxc-hs2Ktqkt3Q@mail.gmail.com>

On Thu, Jul 20, 2023 at 03:10:30PM +0200, Vincent Guittot wrote:
> On Wed, 19 Jul 2023 at 15:29, Aaron Lu <aaron.lu@intel.com> wrote:
> >
> > On Wed, Jul 19, 2023 at 11:47:06AM +0200, Vincent Guittot wrote:
> > > On Wed, 19 Jul 2023 at 10:01, Aaron Lu <aaron.lu@intel.com> wrote:
> > > >
> > > > On Wed, Jul 19, 2023 at 01:18:26PM +0800, Aaron Lu wrote:
> > > > > Alternatively, I can remove some callsites of update_tg_load_avg() like
> > > > > you suggested below and only call update_tg_load_avg() when cfs_rq is
> > > > > decayed (really just decayed, not when it has removed load pending or
> > > > > load propagated from its children). This way it would give us a
> > > > > similar result as above (roughly once per ms).
> > > >
> > > > Something like this (I think this is better since it removes those
> > > > unnecessary calls to update_tg_load_avg(); the function is inline,
> > > > but the calls are still overhead):
> > > >
> > > >
> > > > From bc749aaefa6bed36aa946921a4006b3dddb69b77 Mon Sep 17 00:00:00 2001
> > > > From: Aaron Lu <aaron.lu@intel.com>
> > > > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > > > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> > > >
> > > > ---
> > > >  kernel/sched/fair.c | 22 +++++++---------------
> > > >  1 file changed, 7 insertions(+), 15 deletions(-)
> > > >
> > > > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > > > index a80a73909dc2..7d5b7352b8b5 100644
> > > > --- a/kernel/sched/fair.c
> > > > +++ b/kernel/sched/fair.c
> > > > @@ -3913,16 +3913,16 @@ static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum
> > > >  }
> > > >
> > > >  /* Update task and its cfs_rq load average */
> > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > >  {
> > > >         struct cfs_rq *cfs_rq, *gcfs_rq;
> > > >
> > > >         if (entity_is_task(se))
> > > > -               return 0;
> > > > +               return;
> > > >
> > > >         gcfs_rq = group_cfs_rq(se);
> > > >         if (!gcfs_rq->propagate)
> > > > -               return 0;
> > > > +               return;
> > > >
> > > >         gcfs_rq->propagate = 0;
> > > >
> > > > @@ -3936,8 +3936,6 @@ static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > >
> > > >         trace_pelt_cfs_tp(cfs_rq);
> > > >         trace_pelt_se_tp(se);
> > > > -
> > > > -       return 1;
> > > >  }
> > > >
> > > >  /*
> > > > @@ -3974,9 +3972,8 @@ static inline bool skip_blocked_update(struct sched_entity *se)
> > > >
> > > >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq) {}
> > > >
> > > > -static inline int propagate_entity_load_avg(struct sched_entity *se)
> > > > +static inline void propagate_entity_load_avg(struct sched_entity *se)
> > > >  {
> > > > -       return 0;
> > > >  }
> > > >
> > > >  static inline void add_tg_cfs_propagate(struct cfs_rq *cfs_rq, long runnable_sum) {}
> > > > @@ -4086,7 +4083,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > >  {
> > > >         unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> > > >         struct sched_avg *sa = &cfs_rq->avg;
> > > > -       int decayed = 0;
> > > > +       int decayed;
> > > >
> > > >         if (cfs_rq->removed.nr) {
> > > >                 unsigned long r;
> > > > @@ -4134,11 +4131,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> > > >                  */
> > > >                 add_tg_cfs_propagate(cfs_rq,
> > > >                         -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > > > -
> > > > -               decayed = 1;
> > > >         }
> > > >
> > > > -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > > > +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> > > >         u64_u32_store_copy(sa->last_update_time,
> > > >                            cfs_rq->last_update_time_copy,
> > > >                            sa->last_update_time);
> > > > @@ -4252,7 +4247,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > >                 __update_load_avg_se(now, cfs_rq, se);
> > > >
> > > >         decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> > > > -       decayed |= propagate_entity_load_avg(se);
> > > > +       propagate_entity_load_avg(se);
> > >
> > > but then you also skip the call to cfs_rq_util_change()
> >
> > Ah right, I missed that, thanks for catching this.
> >
> > Updated:
> >
> > From 09a649f8111cfca656b7b735da975ef607b00956 Mon Sep 17 00:00:00 2001
> > From: Aaron Lu <aaron.lu@intel.com>
> > Date: Wed, 19 Jul 2023 13:54:48 +0800
> > Subject: [PATCH] sched/fair: only update_tg_load_avg() when cfs_rq decayed
> >
> > ---
> >  kernel/sched/fair.c | 17 ++++++-----------
> >  1 file changed, 6 insertions(+), 11 deletions(-)
> >
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..8d4b9e0a19b6 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -4086,7 +4086,7 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >  {
> >         unsigned long removed_load = 0, removed_util = 0, removed_runnable = 0;
> >         struct sched_avg *sa = &cfs_rq->avg;
> > -       int decayed = 0;
> > +       int decayed;
> >
> >         if (cfs_rq->removed.nr) {
> >                 unsigned long r;
> > @@ -4134,11 +4134,9 @@ update_cfs_rq_load_avg(u64 now, struct cfs_rq *cfs_rq)
> >                  */
> >                 add_tg_cfs_propagate(cfs_rq,
> >                         -(long)(removed_runnable * divider) >> SCHED_CAPACITY_SHIFT);
> > -
> > -               decayed = 1;
> 
> We need this to propagate the change to other places like cpufreq

Ah, I just made the same mistake again, sorry.

So there are three cases for a cfs_rq: load decayed, load removed and
load propagated. cfs_rq_util_change() needs to be called for all three
cases, while update_tg_load_avg() only needs to be called for the
decayed case.

I'll update the patch accordingly.
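
Roughly like this (an untested sketch just to show the idea, not the
actual patch; it assumes update_cfs_rq_load_avg() grows a hypothetical
"removed" out parameter so that pure decay can be told apart from
removed load):

static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
{
	u64 now = cfs_rq_clock_pelt(cfs_rq);
	int decayed, removed = 0, propagated;

	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
		__update_load_avg_se(now, cfs_rq, se);

	/* assumed to return pure decay only and report removed load
	 * through the new out parameter */
	decayed = update_cfs_rq_load_avg(now, cfs_rq, &removed);
	propagated = propagate_entity_load_avg(se);

	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
		/* enqueueing a task migrated to this CPU */
		attach_entity_load_avg(cfs_rq, se);
	} else if (flags & DO_DETACH) {
		/* dequeueing a task migrating off this CPU */
		detach_entity_load_avg(cfs_rq, se);
	} else if (decayed || removed || propagated) {
		/* all three cases need to be reflected to cpufreq */
		cfs_rq_util_change(cfs_rq, 0);

		/* only a real decay needs to update tg->load_avg */
		if (decayed && (flags & UPDATE_TG))
			update_tg_load_avg(cfs_rq);
	}
}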

> >         }
> >
> > -       decayed |= __update_load_avg_cfs_rq(now, cfs_rq);
> > +       decayed = __update_load_avg_cfs_rq(now, cfs_rq);
> >         u64_u32_store_copy(sa->last_update_time,
> >                            cfs_rq->last_update_time_copy,
> >                            sa->last_update_time);
> > @@ -4242,7 +4240,7 @@ static void detach_entity_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >  static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *se, int flags)
> >  {
> >         u64 now = cfs_rq_clock_pelt(cfs_rq);
> > -       int decayed;
> > +       int decayed, propagated;
> >
> >         /*
> >          * Track task load average for carrying it to new CPU after migrated, and
> > @@ -4252,7 +4250,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >                 __update_load_avg_se(now, cfs_rq, se);
> >
> >         decayed  = update_cfs_rq_load_avg(now, cfs_rq);
> > -       decayed |= propagate_entity_load_avg(se);
> > +       propagated = propagate_entity_load_avg(se);
> >
> >         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> >
> > @@ -4264,19 +4262,16 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> >                  * IOW we're enqueueing a task on a new CPU.
> >                  */
> >                 attach_entity_load_avg(cfs_rq, se);
> > -               update_tg_load_avg(cfs_rq);
> > -
> >         } else if (flags & DO_DETACH) {
> >                 /*
> >                  * DO_DETACH means we're here from dequeue_entity()
> >                  * and we are migrating task out of the CPU.
> >                  */
> >                 detach_entity_load_avg(cfs_rq, se);
> > -               update_tg_load_avg(cfs_rq);
> > -       } else if (decayed) {
> > +       } else if (decayed || propagated) {
> >                 cfs_rq_util_change(cfs_rq, 0);
> >
> > -               if (flags & UPDATE_TG)
> > +               if (decayed && (flags & UPDATE_TG))
> 
> It would be simpler and more readable to clear UPDATE_TG or not set it
> from the beginning

Do you mean something like this?

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 8d4b9e0a19b6..084d63371355 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -4249,7 +4249,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	if (se->avg.last_update_time && !(flags & SKIP_AGE_LOAD))
 		__update_load_avg_se(now, cfs_rq, se);
 
-	decayed  = update_cfs_rq_load_avg(now, cfs_rq);
+	decayed  = update_cfs_rq_load_avg(now, cfs_rq) && (flags & UPDATE_TG);
 	propagated = propagate_entity_load_avg(se);
 
 	if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
@@ -4271,7 +4271,7 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
 	} else if (decayed || propagated) {
 		cfs_rq_util_change(cfs_rq, 0);
 
-		if (decayed && (flags & UPDATE_TG))
+		if (decayed)
 			update_tg_load_avg(cfs_rq);
 	}
 }

> IIUC, you rely on the fact that a decay happens every 1024 us of
> cfs_rq_clock_pelt(), which is scaled by frequency and cpu compute
> capacity. So you can end up with a cfs_rq_clock_pelt() that runs far
> slower than the real clock, and the 1ms can easily be extended to
> dozens of ms

Thanks for the info. I'm not familiar with this clock scaling part and
will need to take a closer look.
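
If I understand the scaling correctly, a rough back-of-envelope example:
on a CPU running at 50% of max frequency with 50% of max compute
capacity, cfs_rq_clock_pelt() advances at about 0.5 * 0.5 = 0.25 of the
wall clock, so one 1024us PELT period would correspond to roughly 4ms of
wall time, stretching the update interval accordingly.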

As you know, the intent is to turn the unbounded updates to
tg->load_avg into bounded ones for those wakeup-migration-heavy
workloads, and the way this change achieves that is to remove the
update to tg->load_avg on the attach and detach paths and leave only
the update on the load decay path. And if the current implementation
makes load decay take longer than 1024us, that shouldn't be a problem
for this change. I don't see an immediate problem if the update to
tg->load_avg happens less often than once per ms, but please let me
know if I missed something, thanks.

> 
> >                         update_tg_load_avg(cfs_rq);
> >         }
> >  }
> > --
> > 2.41.0
> >
> >
> > > >
> > > >         if (!se->avg.last_update_time && (flags & DO_ATTACH)) {
> > > >
> > > > @@ -4264,15 +4259,12 @@ static inline void update_load_avg(struct cfs_rq *cfs_rq, struct sched_entity *s
> > > >                  * IOW we're enqueueing a task on a new CPU.
> > > >                  */
> > > >                 attach_entity_load_avg(cfs_rq, se);
> > > > -               update_tg_load_avg(cfs_rq);
> > > > -
> > > >         } else if (flags & DO_DETACH) {
> > > >                 /*
> > > >                  * DO_DETACH means we're here from dequeue_entity()
> > > >                  * and we are migrating task out of the CPU.
> > > >                  */
> > > >                 detach_entity_load_avg(cfs_rq, se);
> > > > -               update_tg_load_avg(cfs_rq);
> > > >         } else if (decayed) {
> > > >                 cfs_rq_util_change(cfs_rq, 0);
> > > >
> > > > --
> > > > 2.41.0
> > > >

