All of lore.kernel.org
 help / color / mirror / Atom feed
From: Aaron Lu <aaron.lu@intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Nitin Tekchandani <nitin.tekchandani@intel.com>,
	Yu Chen <yu.c.chen@intel.com>, Waiman Long <longman@redhat.com>,
	linux-kernel@vger.kernel.org
Subject: [RFC PATCH 0/4] Reduce cost of accessing tg->load_avg
Date: Tue, 18 Jul 2023 21:41:16 +0800	[thread overview]
Message-ID: <20230718134120.81199-1-aaron.lu@intel.com> (raw)

Nitin Tekchandani noticed some scheduler functions have high cost
according to perf/cycles while running postgres_sysbench workload.
I perf/annotated the high cost functions: update_cfs_group() and
update_load_avg() and found the costs were ~90% due to accessing to
tg->load_avg. This series is an attempt to reduce the cost of accessing
tg->load_avg.

The 1st patch is a bug fix. I probably should try PeterZ's SBRM which
should make things a lot cleaner but I'm not sure if this fix is stable
material and if so, writing in traditional way is easier for backport?

The 2nd patch is actually v2, the first version was sent here:
https://lore.kernel.org/lkml/20230327053955.GA570404@ziqianlu-desk2/
since it is part of this series, I didn't maintain its version. The idea
is from PeterZ: by making the hot variable per-node, at least, the write
side update_load_avg() will become local. The sad part is, the read side
is still global because it has to sum up each node's part and that hurts
performance. Also, update_cfs_group() is called very frequently, e.g. on
every en/dequeue_task_fair(). With this patch, postgres_sysbench on SPR
sees some improvement. hackbench/netperf also see some improvement on
SPR and Cascade Lake. For details, please see patch2.

While there are improvments from making tg->load_avg per node, the two
functions cost is still there while running postgres_sysbench on SPR: ~7%
for update_cfs_group() and ~5% for update_load_avg(). To further reduce
the cost of accessing tg->load_avg, the 3rd patch tried to reduce the
number of updates to tg->load_avg when a task migrates on wakeup: current
code will immediately update_tg_load_avg() once update_load_avg() found
cfs_rq has removed_load pending; patch3 changed this behaviour: it ignores
cfs_rq's pending removed_load and rely on following events like task
attaching so that to aggregate the process of tg->load_avg. For a wakeup
heavy workload, this can roughly reduce the call number to
update_tg_load_avg() by half.

After patch3, the cost of the two functions while running
postgres_sysbench on SPR dropped to ~1% but running netperf/UDP_RR on
SPR the two functions cost are still ~20% and ~10%. So patch4 tried to
reduce the number of calls to update_cfs_group() on en/dequeue path and
that made the two functions cost dropped to ~2% when running netperf/UDP_RR
on SPR.

One issue with patch3 and patch4 is, they both reduced the tracking
accuracy of task group's load_avg in favor of reducing cost of accessing
tg->load_avg. patch3 is probably better than patch4 in this regard,
because we already delay processing cfs_rq's removed_load although patch3
made the delay even longer; patch4 skipped some calls to
update_cfs_group() on en/dequeue path and that may affect things. I made
a test inspired by an old commit and result seem to suggest it's not bad
but I may miss some scenarios.

This series is based on v6.5-rc1 with one more patch applied that made
tg->load_avg stay in a dedicated cacheline to avoid any false sharing
issue as discovered by Deng Pan:
https://lore.kernel.org/lkml/20230621081425.420607-1-pan.deng@intel.com/

For performance data in each commit's changelog, node means patch2,
delay means patch3 and skip means patch4. All performance changes percent
are against base.

Comments are welcome.

Aaron Lu (4):
  sched/fair: free allocated memory on error in alloc_fair_sched_group()
  sched/fair: Make tg->load_avg per node
  sched/fair: delay update_tg_load_avg() for cfs_rq's removed load
  sched/fair: skip some update_cfs_group() on en/dequeue_entity()

 kernel/sched/debug.c |  2 +-
 kernel/sched/fair.c  | 80 ++++++++++++++++++++++++++++++++++----------
 kernel/sched/sched.h | 44 ++++++++++++++++++------
 3 files changed, 97 insertions(+), 29 deletions(-)

-- 
2.41.0


             reply	other threads:[~2023-07-18 13:41 UTC|newest]

Thread overview: 33+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-18 13:41 Aaron Lu [this message]
2023-07-18 13:41 ` [PATCH 1/4] sched/fair: free allocated memory on error in alloc_fair_sched_group() Aaron Lu
2023-07-18 15:13   ` Chen Yu
2023-07-19  2:13     ` Aaron Lu
2023-08-02  7:01       ` Aaron Lu
2023-08-02  8:17         ` Chen Yu
2023-07-18 13:41 ` [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node Aaron Lu
2023-07-19 11:53   ` Peter Zijlstra
2023-07-19 13:45     ` Aaron Lu
2023-07-19 13:53       ` Peter Zijlstra
2023-07-19 14:22         ` Aaron Lu
2023-08-02 11:28       ` Peter Zijlstra
2023-08-11  9:48         ` Aaron Lu
2023-07-19 15:59     ` Yury Norov
2023-07-18 13:41 ` [RFC PATCH 3/4] sched/fair: delay update_tg_load_avg() for cfs_rq's removed load Aaron Lu
2023-07-18 16:01   ` Vincent Guittot
2023-07-19  5:18     ` Aaron Lu
2023-07-19  8:01       ` Aaron Lu
2023-07-19  9:47         ` Vincent Guittot
2023-07-19 13:29           ` Aaron Lu
2023-07-20 13:10             ` Vincent Guittot
2023-07-20 14:42               ` Aaron Lu
2023-07-20 15:02                 ` Vincent Guittot
2023-07-20 15:22                   ` Dietmar Eggemann
2023-07-20 15:24                     ` Vincent Guittot
2023-07-21  6:42                     ` Aaron Lu
2023-07-21  1:57                   ` Aaron Lu
2023-08-11  9:28                     ` Aaron Lu
2023-07-20 15:04                 ` Vincent Guittot
2023-07-19  8:11       ` Aaron Lu
2023-07-19  9:12         ` Vincent Guittot
2023-07-19  9:09       ` Vincent Guittot
2023-07-18 13:41 ` [RFC PATCH 4/4] sched/fair: skip some update_cfs_group() on en/dequeue_entity() Aaron Lu

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20230718134120.81199-1-aaron.lu@intel.com \
    --to=aaron.lu@intel.com \
    --cc=bristot@redhat.com \
    --cc=bsegall@google.com \
    --cc=daniel.m.jordan@oracle.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=longman@redhat.com \
    --cc=mgorman@suse.de \
    --cc=mingo@redhat.com \
    --cc=nitin.tekchandani@intel.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=tim.c.chen@intel.com \
    --cc=vincent.guittot@linaro.org \
    --cc=vschneid@redhat.com \
    --cc=yu.c.chen@intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.