From: Aaron Lu <aaron.lu@intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Daniel Jordan <daniel.m.jordan@oracle.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	"Daniel Bristot de Oliveira" <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Nitin Tekchandani <nitin.tekchandani@intel.com>,
	Yu Chen <yu.c.chen@intel.com>, Waiman Long <longman@redhat.com>,
	<linux-kernel@vger.kernel.org>
Subject: Re: [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load
Date: Fri, 11 Aug 2023 17:28:11 +0800
Message-ID: <20230811092811.GA399195@ziqianlu-dell>
In-Reply-To: <20230721015704.GA212678@ziqianlu-dell>

On Fri, Jul 21, 2023 at 09:57:04AM +0800, Aaron Lu wrote:
> On Thu, Jul 20, 2023 at 05:02:32PM +0200, Vincent Guittot wrote:
> >
> > What was wrong with your proposal to limit the update inside
> > update_tg_load_avg()  ? except maybe s/1000000/NSEC_PER_MSEC/ and
> > computing delta after testing the time since last update
> 
> I was thinking it might be better to align the update_tg_load_avg() with
> cfs_rq's decay point but that's just my feeling.
> 
> Absolutely nothing wrong with the below approach, it's straightforward
> and effective. I'll fix the use of cfs_rq_clock_pelt() and collect
> some data and then send out v2.
> 
> Thank you Vincent for all your comments, they're very useful to me.
> 
> > diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
> > index a80a73909dc2..e48fd0e6982d 100644
> > --- a/kernel/sched/fair.c
> > +++ b/kernel/sched/fair.c
> > @@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
> >  static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >  {
> >         long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
> > +       u64 now = cfs_rq_clock_pelt(cfs_rq);
> > 
> >         /*
> >          * No need to update load_avg for root_task_group as it is not used.
> > @@ -3672,9 +3673,11 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
> >         if (cfs_rq->tg == &root_task_group)
> >                 return;
> > 
> > -       if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> > +       if ((now - cfs_rq->last_update_tg_load_avg > 1000000) &&
> > +           abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
> >                 atomic_long_add(delta, &cfs_rq->tg->load_avg);
> >                 cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
> > +               cfs_rq->last_update_tg_load_avg = now;
> >         }
> >  }
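
To make the v2 plan a bit more concrete, below is a rough sketch of how I
currently imagine update_tg_load_avg() looking with your two suggestions
folded in: NSEC_PER_MSEC instead of the magic 1000000, and testing the time
since the last update before computing delta. This is only a sketch, not the
final v2 patch; it assumes struct cfs_rq grows the last_update_tg_load_avg
field from the diff above, and the clock source (cfs_rq_clock_pelt() here,
as in the quoted diff) is still the part I need to fix:

static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
{
	long delta;
	u64 now;

	/*
	 * No need to update load_avg for root_task_group as it is not used.
	 */
	if (cfs_rq->tg == &root_task_group)
		return;

	/*
	 * Rate limit updates to tg->load_avg to at most once per ms to cut
	 * down traffic on this shared counter. Bail out early so delta is
	 * only computed when an update is actually possible.
	 */
	now = cfs_rq_clock_pelt(cfs_rq);
	if (now - cfs_rq->last_update_tg_load_avg < NSEC_PER_MSEC)
		return;

	delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
		atomic_long_add(delta, &cfs_rq->tg->load_avg);
		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
		cfs_rq->last_update_tg_load_avg = now;
	}
}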

While collecting data for v2, I found that once this "rate limit updates
to tg->load_avg to at most once per ms" patch is in place, the other two
optimizations, "skip some update_cfs_group() on en/dequeue_entity()" and
"Make tg->load_avg per node", no longer matter: with the "ratelimit"
patch applied, adding those two on top doesn't improve performance
further.

I think this is reasonable. The "ratelimit" patch reduces the number of
updates to tg->load_avg dramatically, from roughly 10 million during a 5s
window down to ~25k (about a 400x reduction), so the overhead of accessing
tg->load_avg drops greatly and further reducing the read side calls or
making the counter per-node no longer makes a measurable difference.
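
(For reference, the "per node" idea in the numbers below is roughly the
following: keep one atomic per NUMA node instead of a single shared
tg->load_avg, so writers on different nodes don't bounce the same
cacheline, and sum the per-node values on the read side. This is only a
sketch of the general shape; the node_load field and helper names here are
made up for illustration and the actual RFC PATCH 2/4 differs in details.)

/* Sketch only: per-node tg->load_avg, illustrative names. */
struct tg_node_load {
	atomic_long_t load_avg;
} ____cacheline_aligned_in_smp;

/* writer side: update only the current node's counter */
static inline void tg_add_load_avg(struct task_group *tg, int cpu, long delta)
{
	atomic_long_add(delta, &tg->node_load[cpu_to_node(cpu)].load_avg);
}

/* reader side: sum the per-node counters */
static inline long tg_read_load_avg(struct task_group *tg)
{
	long sum = 0;
	int node;

	for_each_node(node)
		sum += atomic_long_read(&tg->node_load[node].load_avg);

	return sum;
}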

So I think for v2, I'll go with this "ratelimit" patch alone, but let me
know if you think otherwise.

The tests I've run are listed below.
1 postgres_sysbench with nr_thread=25%/50%/75%/100% on a 2-socket, 112-core,
224-thread Intel Sapphire Rapids (SPR) machine;
2 hackbench with group=1/4/8/16 in pipe, thread mode on a 2-socket, 64-core,
128-thread Intel Ice Lake (ICL) machine and another 2-socket, 112-core,
224-thread Intel SPR machine;
3 netperf in tcp_stream and udp_rr modes. For each mode, nr_client is set to
25%/50%/75%/100% of nr_cpu. Tested on the same ICL and SPR machines as in 2).

The test summary is:

postgres_sysbench on SPR
- when nr_thread=25% and 50%, results are in noise range;
- when nr_thread=75% and 100%, the "ratelimit" patch improves transactions
  by 12.2% and 21.2% respectively.

hackbench:
- when nr_group=1 and 4, results are in noise range for both ICL and SPR;
- when nr_group=8 and 16, the "ratelimit" patch improves throughput by
  6.6% and 22.4% on ICL; on SPR, the improvement is 12.5% and 48.8%.

netperf:
- for tcp_stream mode, all results are in noise range on both ICL and SPR;
- for udp_rr mode, when nr_client=25% and 50% of nr_cpu, results are in
  noise range; when nr_client=75% and 100% of nr_cpu, the "ratelimit"
  patch improves throughput by 38.5% and 13.4% on ICL; on SPR, the
  improvement is 162% and 189%.

Full test results are listed below.

base: peterz's sched/core branch with commit e8d5ff8258bf7 ("sched/fair:
Block nohz tick_stop when cfs bandwidth in use") as head.
rate: the patch that rate limits updates to tg->load_avg to at most once
per ms; applied on top of "base".
skip: the patch "sched/fair: skip some update_cfs_group() on
en/dequeue_entity()" as in this v1 series; applied on top of "rate".
node: the patch "sched/fair: Make tg->load_avg per node" as in this v1
series, plus peterz's numa optimizations; applied on top of "skip".

All improvement percents are against "base".

postgres_sysbench on SPR:

25%  (all in noise range)
base:   42382±19.8%
rate:   50174±9.5%
skip:   46311±12.0%
node:   48690±13.9%

50%  (all in noise range)
base:   67626±1.3%
rate:   67365±3.1%
skip:   68218±4.9%
node:   66878±3.3%

75%
base:  100216±1.2%
rate:  112470±0.1% +12.2%
skip:  113345±0.3% +13.1%
node:  112962±0.1% +12.7%

100%
base:   93671±0.4%
rate:  113563±0.2% +21.2%
skip:  113741±0.2% +21.4%
node:  112793±0.2% +20.4%

hackbench on ICL:
group=1 (all in noise range)
base:   114912±5.2%
rate:   117857±2.5%
skip:   118501±4.3%
node:   119240±7.0%

group=4 (all in noise range)
base:   359902±1.6%
rate:   361685±2.7%
skip:   356354±1.6%
node:   358744±1.7%

group=8
base:   461070±0.8%
rate:   491713±0.3% +6.6%
skip:   490183±1.4% +6.3%
node:   490864±0.5% +6.5%

group=16
base:   309032±5.0%
rate:   378337±1.3% +22.4%
skip:   385637±1.4% +24.8%
node:   375087±0.9% +21.4%

hackbench on SPR:
group=1 (all in noise range)
base:   100768±2.9%
rate:   103134±2.9%
skip:    99642±1.7%
node:   104251±2.5%

group=4 (all in noise range)
base:   413830±12.5%
rate:   378660±16.6%
skip:   367234±13.1%
node:   372122±14.4%

group=8
base:   436124±0.6%
rate:   490787±3.2% +12.5%
skip:   493097±1.0% +13.0%
node:   497336±1.8% +14.0%

group=16
base:   457730±3.2%
rate:   680452±1.3% +48.8%
skip:   679271±0.9% +48.4%
node:   672365±0.5% +46.9%

netperf/udp_rr on ICL
25%
base:   114413±0.1%
rate:   115111±0.0%  +0.6%
skip:   115546±0.2%  +1.0%
node:   115214±0.3%  +0.7%

50%  (all in noise range)
base:   86803±0.5%
rate:   86611±0.0%
skip:   86986±0.5%
node:   87466±0.8%

75%
base:   35959±5.3%
rate:   49801±0.6% +38.5%
skip:   50745±1.5% +41.1%
node:   50457±1.5% +40.3%

100%
base:   61951±6.4%
rate:   70224±0.8% +13.4%
skip:   67321±4.3%  +8.7%
node:   70591±2.3% +13.9%

netperf/udp_rr on SPR
25%  (all in noise range)
base:   104954±1.3%
rate:   107312±2.8%
skip:   107688±1.7%
node:   106024±1.8%

50%  (all in noise range)
base:   55394±4.6%
rate:   54940±7.4%
skip:   57829±1.8%
node:   57144±6.4%

75%
base:   13779±3.1%
rate:   36105±1.1% +162%
skip:   35319±6.3% +156%
node:   36764±3.5% +167%

100%
base:   9703±3.7%
rate:  28011±0.2% +189%
skip:  27087±1.3% +179%
node:  27337±2.2% +182%

netperf/tcp_stream on ICL
25%  (all in noise range)
base:   43092±0.1%
rate:   42891±0.5%
skip:   43208±0.4%
node:   43044±0.6%

50%  (all in noise range)
base:   19278±14.9%
rate:   22369±7.2%
skip:   19375±16.0%
node:   21312±8.2%

75%  (all in noise range)
base:   16822±3.0%
rate:   17086±2.3%
skip:   17571±1.8%
node:   16801±2.4%

100%  (all in noise range)
base:   18216±0.6%
rate:   18078±2.9%
skip:   18040±1.0%
node:   18064±1.6%

netperf/tcp_stream on SPR
25%  (all in noise range)
base:   34491±0.3%
rate:   34886±0.5%
skip:   34894±0.9%
node:   34692±0.6%

50%  (all in noise range)
base:   19278±14.9%
rate:   22369±7.2%
skip:   19375±16.0%
node:   21312±8.2%

75%  (all in noise range)
base:   16822±3.0%
rate:   17086±2.3%
skip:   17571±1.8%
node:   16801±2.4%

100%  (all in noise range)
base:   18216±0.6%
rate:   18078±2.9%
skip:   18040±0.1%
node:   18064±1.6%

Thanks,
Aaron
