From: Aaron Lu <aaron.lu@intel.com>
To: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@redhat.com>,
	Juri Lelli <juri.lelli@redhat.com>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Daniel Jordan <daniel.m.jordan@oracle.com>
Cc: Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>, Mel Gorman <mgorman@suse.de>,
	Daniel Bristot de Oliveira <bristot@redhat.com>,
	Valentin Schneider <vschneid@redhat.com>,
	Tim Chen <tim.c.chen@intel.com>,
	Nitin Tekchandani <nitin.tekchandani@intel.com>,
	Yu Chen <yu.c.chen@intel.com>, Waiman Long <longman@redhat.com>,
	linux-kernel@vger.kernel.org
Subject: [RFC PATCH 2/4] sched/fair: Make tg->load_avg per node
Date: Tue, 18 Jul 2023 21:41:18 +0800
Message-ID: <20230718134120.81199-3-aaron.lu@intel.com>
In-Reply-To: <20230718134120.81199-1-aaron.lu@intel.com>

When using sysbench to benchmark Postgres in a single Docker instance
with sysbench's nr_threads set to nr_cpu, it is observed that at times
update_cfs_group() and update_load_avg() show noticeable overhead on a
2-socket/112-core/224-CPU Intel Sapphire Rapids (SPR) machine:

        13.75%    13.74%  [kernel.vmlinux]           [k] update_cfs_group
        10.63%    10.04%  [kernel.vmlinux]           [k] update_load_avg

perf annotate shows the cycles are mostly spent on accessing tg->load_avg,
with update_load_avg() being the write side and update_cfs_group() being
the read side.
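
For reference, a profile like the one above can be collected with a
plain system-wide cycles profile along these lines (the exact options
used for these numbers are not recorded in this changelog):

        # whole-machine cycles profile while the benchmark is running,
        # then per-symbol annotation of the hot functions
        perf record -a -g -- sleep 10
        perf report
        perf annotate update_load_avg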

Tim Chen told me that PeterZ once mentioned a way to solve a similar
problem by making the counter per node, so do the same for tg->load_avg.
After this change, the cost of the two functions is reduced and sysbench
transactions increase on SPR.
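
In essence, each node gets its own copy of the counter: the write side
only updates the copy local to the node it runs on, while the read side
sums all per-node copies. A simplified sketch of that scheme follows;
the names (pernode_counter, counter_add, counter_read) are illustrative
only, the real change to tg/cfs_rq is in the diff below:

        /*
         * Simplified per-node counter sketch (illustrative names, not
         * the actual kernel change). Assumes <linux/atomic.h> and
         * <linux/topology.h>.
         */
        struct pernode_counter {
                struct {
                        atomic_long_t   val;
                } ____cacheline_aligned_in_smp;
        };

        static struct pernode_counter *counter[MAX_NUMNODES];

        /* Write side: only a node-local cacheline is dirtied. */
        static void counter_add(long delta)
        {
                atomic_long_add(delta, &counter[numa_node_id()]->val);
        }

        /* Read side: walks all nodes and sums their contributions. */
        static long counter_read(void)
        {
                long sum = 0;
                int n;

                for_each_node(n)
                        sum += atomic_long_read(&counter[n]->val);
                return sum;
        }

The read side becomes more expensive since it now walks all nodes, but
the write side no longer bounces a single cacheline across sockets,
which is what dominated the profile above. Test results are below.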

===============================================
postgres_sysbench(transaction, higher is better)
nr_thread=100%/75%/50% were tested on 2-socket SPR and Ice Lake and the
results that have a measurable difference are:

nr_thread=100% on SPR
base:  90569.11±1.15%
node: 104152.26±0.34%  +15.0%

nr_thread=75% on SPR
base: 100803.96±0.57%
node: 107333.58±0.44%   +6.5%

=======================================================================
hackbench/pipe/threads/fd=20/loop=1000000 (throughput, higher is better)
group=1/4/8/16 were tested on 2-socket SPR and Cascade Lake and the
results that have a measurable difference are:

group=8 on SPR:
base:  437163±2.6%
node:  471203±1.2%   +7.8%

group=16 on SPR:
base:  468279±1.9%
node:  580385±1.7%  +23.9%

=============================================
netperf/TCP_STREAM
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and there is no measurable difference.

=============================================
netperf/UDP_RR (throughput, higher is better)
nr_thread=1/25%/50%/75%/100% were tested on 2-socket SPR and Cascade
Lake and the results that have a measurable difference are:

nr_thread=75% on Cascade Lake:
base:  36701±1.7%
node:  39949±1.4%   +8.8%

nr_thread=75% on SPR:
base:  14249±3.8%
node:  19890±2.0%   +39.6%

nr_thread=100% on Cascade Lake:
base:  52275±0.6%
node:  53827±0.4%   +3.0%

nr_thread=100% on SPR
base:   9560±1.6%
node:  14186±3.9%   +48.4%

Reported-by: Nitin Tekchandani <nitin.tekchandani@intel.com>
Signed-off-by: Aaron Lu <aaron.lu@intel.com>
---
 kernel/sched/debug.c |  2 +-
 kernel/sched/fair.c  | 29 ++++++++++++++++++++++++++---
 kernel/sched/sched.h | 43 +++++++++++++++++++++++++++++++++----------
 3 files changed, 60 insertions(+), 14 deletions(-)

diff --git a/kernel/sched/debug.c b/kernel/sched/debug.c
index 066ff1c8ae4e..3af965a18866 100644
--- a/kernel/sched/debug.c
+++ b/kernel/sched/debug.c
@@ -691,7 +691,7 @@ void print_cfs_rq(struct seq_file *m, int cpu, struct cfs_rq *cfs_rq)
 	SEQ_printf(m, "  .%-30s: %lu\n", "tg_load_avg_contrib",
 			cfs_rq->tg_load_avg_contrib);
 	SEQ_printf(m, "  .%-30s: %ld\n", "tg_load_avg",
-			atomic_long_read(&cfs_rq->tg->load_avg));
+			tg_load_avg(cfs_rq->tg));
 #endif
 #endif
 #ifdef CONFIG_CFS_BANDWIDTH
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 0f913487928d..aceb8f5922cb 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -3496,7 +3496,7 @@ static long calc_group_shares(struct cfs_rq *cfs_rq)
 
 	load = max(scale_load_down(cfs_rq->load.weight), cfs_rq->avg.load_avg);
 
-	tg_weight = atomic_long_read(&tg->load_avg);
+	tg_weight = tg_load_avg(tg);
 
 	/* Ensure tg_weight >= load */
 	tg_weight -= cfs_rq->tg_load_avg_contrib;
@@ -3665,6 +3665,7 @@ static inline bool cfs_rq_is_decayed(struct cfs_rq *cfs_rq)
 static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 {
 	long delta = cfs_rq->avg.load_avg - cfs_rq->tg_load_avg_contrib;
+	int node = cpu_to_node(smp_processor_id());
 
 	/*
 	 * No need to update load_avg for root_task_group as it is not used.
@@ -3673,7 +3674,7 @@ static inline void update_tg_load_avg(struct cfs_rq *cfs_rq)
 		return;
 
 	if (abs(delta) > cfs_rq->tg_load_avg_contrib / 64) {
-		atomic_long_add(delta, &cfs_rq->tg->load_avg);
+		atomic_long_add(delta, &cfs_rq->tg->node_info[node]->load_avg);
 		cfs_rq->tg_load_avg_contrib = cfs_rq->avg.load_avg;
 	}
 }
@@ -12439,7 +12440,7 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 {
 	struct sched_entity *se;
 	struct cfs_rq *cfs_rq;
-	int i;
+	int i, nodes;
 
 	tg->cfs_rq = kcalloc(nr_cpu_ids, sizeof(cfs_rq), GFP_KERNEL);
 	if (!tg->cfs_rq)
@@ -12468,8 +12469,30 @@ int alloc_fair_sched_group(struct task_group *tg, struct task_group *parent)
 		init_entity_runnable_average(se);
 	}
 
+#ifdef CONFIG_SMP
+	nodes = num_possible_nodes();
+	tg->node_info = kcalloc(nodes, sizeof(struct tg_node_info *), GFP_KERNEL);
+	if (!tg->node_info)
+		goto err_free;
+
+	for_each_node(i) {
+		tg->node_info[i] = kzalloc_node(sizeof(struct tg_node_info), GFP_KERNEL, i);
+		if (!tg->node_info[i])
+			goto err_free_node;
+	}
+#endif
+
 	return 1;
 
+#ifdef CONFIG_SMP
+err_free_node:
+	for_each_node(i) {
+		kfree(tg->node_info[i]);
+		if (!tg->node_info[i])
+			break;
+	}
+	kfree(tg->node_info);
+#endif
 err_free:
 	for_each_possible_cpu(i) {
 		kfree(tg->cfs_rq[i]);
diff --git a/kernel/sched/sched.h b/kernel/sched/sched.h
index 14dfaafb3a8f..9cece2dbc95b 100644
--- a/kernel/sched/sched.h
+++ b/kernel/sched/sched.h
@@ -359,6 +359,17 @@ struct cfs_bandwidth {
 #endif
 };
 
+struct tg_node_info {
+	/*
+	 * load_avg can be heavily contended at clock tick time and task
+	 * enqueue/dequeue time, so put it in its own cacheline separated
+	 * from other fields.
+	 */
+	struct {
+		atomic_long_t		load_avg;
+	} ____cacheline_aligned_in_smp;
+};
+
 /* Task group related information */
 struct task_group {
 	struct cgroup_subsys_state css;
@@ -373,15 +384,8 @@ struct task_group {
 	/* A positive value indicates that this is a SCHED_IDLE group. */
 	int			idle;
 
-#ifdef	CONFIG_SMP
-	/*
-	 * load_avg can be heavily contended at clock tick time, so put
-	 * it in its own cacheline separated from the fields above which
-	 * will also be accessed at each tick.
-	 */
-	struct {
-		atomic_long_t		load_avg;
-	} ____cacheline_aligned_in_smp;
+#ifdef CONFIG_SMP
+	struct tg_node_info	**node_info;
 #endif
 #endif
 
@@ -413,9 +417,28 @@ struct task_group {
 	/* Effective clamp values used for a task group */
 	struct uclamp_se	uclamp[UCLAMP_CNT];
 #endif
-
 };
 
+#if defined(CONFIG_FAIR_GROUP_SCHED) && defined(CONFIG_SMP)
+static inline long tg_load_avg(struct task_group *tg)
+{
+	long load_avg = 0;
+	int i;
+
+	/*
+	 * The only path that can give us a root_task_group
+	 * here is from print_cfs_rq() thus unlikely.
+	 */
+	if (unlikely(tg == &root_task_group))
+		return 0;
+
+	for_each_node(i)
+		load_avg += atomic_long_read(&tg->node_info[i]->load_avg);
+
+	return load_avg;
+}
+#endif
+
 #ifdef CONFIG_FAIR_GROUP_SCHED
 #define ROOT_TASK_GROUP_LOAD	NICE_0_LOAD
 
-- 
2.41.0

