From: Aaron Lu <aaron.lu@intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	"Daniel Bristot de Oliveira" <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Nitin Tekchandani <nitin.tekchandani@intel.com>,
	Yu Chen <yu.c.chen@intel.com>, Waiman Long <longman@redhat.com>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load
Date: Fri, 11 Aug 2023 17:28:11 +0800
Message-ID: <20230811092811.GA399195@ziqianlu-dell>
In-Reply-To: <20230721015704.GA212678@ziqianlu-dell>

On Fri, Jul 21, 2023 at 09:57:04AM +0800, Aaron Lu wrote:
> On Thu, Jul 20, 2023 at 05:02:32PM +0200, Vincent Guittot wrote:
> >
> > What was wrong with your proposal to limit the update inside
> > update_tg_load_avg()  ? except maybe s/1000000/NSEC_PER_MSEC/ and
> > computing delta after testing the time since last update
> 
> I was thinking it might be better to align the update_tg_load_avg() with
> cfs_rq's decay point but that's just my feeling.
> 
> Absolutely nothing wrong with the below approach, it's straightforward
> and effective. I'll fix the use of cfs_rq_clock_pelt() and collect
> some data and then send out v2.
> 
> Thank you Vincent for all your comments, they're very useful to me.
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..e48fd0e6982d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >  {
> >         long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > +       u64 now = cfs_rq_clock_pelt(cfs_rq);
> > 
> >         /*
> >          * No need to update load_avg for root_task_group as it is not used.
> > @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >         if (cfs_rq->tg == &root_task_group)
> >                 return;
> > 
> > -       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > +       if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
> > +           abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> >                 atomic_long_add(delta, &cfs_rq->tg->load_avg);
> >                 cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> > +               cfs_rq->last_update_tg_load_avg = now;
> >         }
> >  }
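
To make the v2 plan a bit more concrete, below is a rough sketch of how I
currently imagine update_tg_load_avg() looking with your two suggestions
folded in: NSEC_PER_MSEC instead of the magic 1000000, and testing the time
since the last update before computing delta. This is only a sketch, not the
final v2 patch; it assumes struct cfs_rq grows the last_update_tg_load_avg
field from the diff above, and the clock source (cfs_rq_clock_pelt() here,
as in the quoted diff) is still the part I need to fix:

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
	long delta;
	u64 now;

	/*
	 * No need to update load_avg for root_task_group as it is not used.
	 */
	if (cfs_rq->tg == &root_task_group)
		return;

	/*
	 * Rate limit updates to tg->load_avg to at most once per ms to cut
	 * down traffic on this shared counter. Bail out early so delta is
	 * only computed when an update is actually possible.
	 */
	now = cfs_rq_clock_pelt(cfs_rq);
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return;

	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
		cfs_rq->last_update_tg_load_avg = now;
	}
}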

While collecting data for v2, I found that once this "rate limit updates
to tg->load_avg to at most once per ms" patch is in place, the other two
optimizations, "skip some update_cfs_group() on en/dequeue_entity()" and
"Make tg->load_avg per node", no longer matter: with the "ratelimit"
patch applied, adding those two on top doesn't improve performance
further.

I think this is reasonable. The "ratelimit" patch reduces the number of
updates to tg->load_avg dramatically, from roughly 10 million during a 5s
window down to ~25k (about a 400x reduction), so the overhead of accessing
tg->load_avg drops greatly and further reducing the read side calls or
making the counter per-node no longer makes a measurable difference.
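
(For reference, the "per node" idea in the numbers below is roughly the
following: keep one atomic per NUMA node instead of a single shared
tg->load_avg, so writers on different nodes don't bounce the same
cacheline, and sum the per-node values on the read side. This is only a
sketch of the general shape; the node_load field and helper names here are
made up for illustration and the actual RFC PATCH 2/4 differs in details.)

/* Sketch only: per-node tg->load_avg, illustrative names. */
struct tg_node_load {
	atomic_long_t load_avg;
} ____cacheline_aligned_in_smp;

/* writer side: update only the current node's counter */
static inline void tg_add_load_avg(struct task_group *tg, int cpu, long delta)
{
	atomic_long_add(delta, &tg->node_load[cpu_to_node(cpu)].load_avg);
}

/* reader side: sum the per-node counters */
static inline long tg_read_load_avg(struct task_group *tg)
{
	long sum = 0;
	int node;

	for_each_node(node)
		sum += atomic_long_read(&tg->node_load[node].load_avg);

	return sum;
}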

So I think for v2, I'll go with this "ratelimit" patch alone, but let me
know if you think otherwise.

The tests I've run are listed below.
1 postgres_sysbench with nr_thread=25%/50%/75%/100% on a 2-socket, 112-core,
224-thread Intel Sapphire Rapids (SPR) machine;
2 hackbench with group=1/4/8/16 in pipe, thread mode on a 2-socket, 64-core,
128-thread Intel Ice Lake (ICL) machine and another 2-socket, 112-core,
224-thread Intel SPR machine;
3 netperf in tcp_stream and udp_rr modes. For each mode, nr_client is set to
25%/50%/75%/100% of nr_cpu. Tested on the same ICL and SPR machines as in 2).

The test summary is:

postgres_sysbench on SPR
- when nr_thread=25% and 50%, results are in noise range;
- when nr_thread=75% and 100%, the "ratelimit" patch improves transactions
  by 12.2% and 21.2% respectively.

hackbench:
- when nr_group=1 and 4, results are in noise range for both ICL and SPR;
- when nr_group=8 and 16, the "ratelimit" patch improves throughput by
  6.6% and 22.4% on ICL; on SPR, the improvement is 12.5% and 48.8%.

netperf:
- for tcp_stream mode, all results are in noise range on both ICL and SPR;
- for udp_rr mode, when nr_client=25% and 50% of nr_cpu, results are in
  noise range; when nr_client=75% and 100% of nr_cpu, the "ratelimit"
  patch improves throughput by 38.5% and 13.4% on ICL; on SPR, the
  improvement is 162% and 189%.

Full test results are listed below.

base: peterz's sched/core branch with commit e8d5ff8258bf7 ("sched/fair:
Block nohz tick_stop when cfs bandwidth in use") as head.
rate: the patch that rate limits updates to tg->load_avg to at most once
per ms; applied on top of "base".
skip: the patch "sched/fair: skip some update_cfs_group() on
en/dequeue_entity()" as in this v1 series; applied on top of "rate".
node: the patch "sched/fair: Make tg->load_avg per node" as in this v1
series, plus peterz's numa optimizations; applied on top of "skip".

All improvement percents are against "base".

postgres_sysbench on SPR:

25%  (all in noise range)
base:   42382±19.8%
rate:   50174±9.5%
skip:   46311±12.0%
node:   48690±13.9%

50%  (all in noise range)
base:   67626±1.3%
rate:   67365±3.1%
skip:   68218±4.9%
node:   66878±3.3%

75%
base:  100216±1.2%
rate:  112470±0.1% +12.2%
skip:  113345±0.3% +13.1%
node:  112962±0.1% +12.7%

100%
base:   93671±0.4%
rate:  113563±0.2% +21.2%
skip:  113741±0.2% +21.4%
node:  112793±0.2% +20.4%

hackbench on ICL:
group=1 (all in noise range)
base:   114912±5.2%
rate:   117857±2.5%
skip:   118501±4.3%
node:   119240±7.0%

group=4 (all in noise range)
base:   359902±1.6%
rate:   361685±2.7%
skip:   356354±1.6%
node:   358744±1.7%

group=8
base:   461070±0.8%
rate:   491713±0.3% +6.6%
skip:   490183±1.4% +6.3%
node:   490864±0.5% +6.5%

group=16
base:   309032±5.0%
rate:   378337±1.3% +22.4%
skip:   385637±1.4% +24.8%
node:   375087±0.9% +21.4%

hackbench on SPR:
group=1 (all in noise range)
base:   100768±2.9%
rate:   103134±2.9%
skip:    99642±1.7%
node:   104251±2.5%

group=4 (all in noise range)
base:   413830±12.5%
rate:   378660±16.6%
skip:   367234±13.1%
node:   372122±14.4%

group=8
base:   436124±0.6%
rate:   490787±3.2% +12.5%
skip:   493097±1.0% +13.0%
node:   497336±1.8% +14.0%

group=16
base:   457730±3.2%
rate:   680452±1.3% +48.8%
skip:   679271±0.9% +48.4%
node:   672365±0.5% +46.9%

netperf/udp_rr on ICL
25%
base:   114413±0.1%
rate:   115111±0.0%  +0.6%
skip:   115546±0.2%  +1.0%
node:   115214±0.3%  +0.7%

50%  (all in noise range)
base:   86803±0.5%
rate:   86611±0.0%
skip:   86986±0.5%
node:   87466±0.8%

75%
base:   35959±5.3%
rate:   49801±0.6% +38.5%
skip:   50745±1.5% +41.1%
node:   50457±1.5% +40.3%

100%
base:   61951±6.4%
rate:   70224±0.8% +13.4%
skip:   67321±4.3%  +8.7%
node:   70591±2.3% +13.9%

netperf/udp_rr on SPR
25%  (all in noise range)
base:   104954±1.3%
rate:   107312±2.8%
skip:   107688±1.7%
node:   106024±1.8%

50%  (all in noise range)
base:   55394±4.6%
rate:   54940±7.4%
skip:   57829±1.8%
node:   57144±6.4%

75%
base:   13779±3.1%
rate:   36105±1.1% +162%
skip:   35319±6.3% +156%
node:   36764±3.5% +167%

100%
base:   9703±3.7%
rate:  28011±0.2% +189%
skip:  27087±1.3% +179%
node:  27337±2.2% +182%

netperf/tcp_stream on ICL
25%  (all in noise range)
base:   43092±0.1%
rate:   42891±0.5%
skip:   43208±0.4%
node:   43044±0.6%

50%  (all in noise range)
base:   19278±14.9%
rate:   22369±7.2%
skip:   19375±16.0%
node:   21312±8.2%

75%  (all in noise range)
base:   16822±3.0%
rate:   17086±2.3%
skip:   17571±1.8%
node:   16801±2.4%

100%  (all in noise range)
base:   18216±0.6%
rate:   18078±2.9%
skip:   18040±1.0%
node:   18064±1.6%

netperf/tcp_stream on SPR
25%  (all in noise range)
base:   34491±0.3%
rate:   34886±0.5%
skip:   34894±0.9%
node:   34692±0.6%

50%  (all in noise range)
base:   19278±14.9%
rate:   22369±7.2%
skip:   19375±16.0%
node:   21312±8.2%

75%  (all in noise range)
base:   16822±3.0%
rate:   17086±2.3%
skip:   17571±1.8%
node:   16801±2.4%

100%  (all in noise range)
base:   18216±0.6%
rate:   18078±2.9%
skip:   18040±0.1%
node:   18064±1.6%

Thanks,
Aaron
