From: "tip-bot2 for Mel Gorman" <tip-bot2@linutronix.de>
To: linux-tip-commits@vger.kernel.org
Cc: Mel Gorman <mgorman@techsingularity.net>,
	"Peter Zijlstra (Intel)" <peterz@infradead.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	x86@kernel.org, linux-kernel@vger.kernel.org
Subject: [tip: sched/core] sched: Limit the amount of NUMA imbalance that can exist at fork time
Date: Wed, 25 Nov 2020 14:02:53 -0000	[thread overview]
Message-ID: <160631297307.3364.9271634433734932379.tip-bot2@tip-bot2> (raw)
In-Reply-To: <20201120090630.3286-5-mgorman@techsingularity.net>

The following commit has been merged into the sched/core branch of tip:

Commit-ID:     23e6082a522e32232f7377540b4d42d8304253b8
Gitweb:        https://git.kernel.org/tip/23e6082a522e32232f7377540b4d42d8304253b8
Author:        Mel Gorman <mgorman@techsingularity.net>
AuthorDate:    Fri, 20 Nov 2020 09:06:30 
Committer:     Peter Zijlstra <peterz@infradead.org>
CommitterDate: Tue, 24 Nov 2020 16:47:48 +01:00

sched: Limit the amount of NUMA imbalance that can exist at fork time

At fork time, the local node is currently allowed to fill completely,
leaving the periodic load balancer to fix the problem afterwards. This can
be problematic in cases where a task creates lots of threads that idle until
woken as part of a worker pool, causing a memory bandwidth problem.

However, a "real" workload suffers badly from this behaviour. The workload
in question is mostly NUMA aware but spawns large numbers of threads
that act as a worker pool that can be called from anywhere. These need
to spread early to get reasonable behaviour.

This patch limits how much a local node can fill before spilling over
to another node. It will not be a universal win: very short-lived
workloads that fit within a single NUMA node would prefer to stay local
for the memory bandwidth.
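
To make the cut-off concrete (this is an illustrative reading of the
allow_numa_imbalance() helper added by the diff below, not a statement about
the exact values passed at each call site), the imbalance is only tolerated
while

	dst_running < (dst_weight >> 2)

so with a domain weight of 80 CPUs, as on the test machine below, fork starts
spreading once roughly 80 >> 2 = 20 tasks are running.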

As I cannot describe the "real" workload, the best proxy measure I found
for illustration was a page fault microbenchmark. It's not representative
of the workload but demonstrates the hazard of the current behaviour.

pft timings
                                 5.10.0-rc2             5.10.0-rc2
                          imbalancefloat-v2          forkspread-v2
Amean     elapsed-1        46.37 (   0.00%)       46.05 *   0.69%*
Amean     elapsed-4        12.43 (   0.00%)       12.49 *  -0.47%*
Amean     elapsed-7         7.61 (   0.00%)        7.55 *   0.81%*
Amean     elapsed-12        4.79 (   0.00%)        4.80 (  -0.17%)
Amean     elapsed-21        3.13 (   0.00%)        2.89 *   7.74%*
Amean     elapsed-30        3.65 (   0.00%)        2.27 *  37.62%*
Amean     elapsed-48        3.08 (   0.00%)        2.13 *  30.69%*
Amean     elapsed-79        2.00 (   0.00%)        1.90 *   4.95%*
Amean     elapsed-80        2.00 (   0.00%)        1.90 *   4.70%*

This shows the time to fault regions belonging to the threads. The target
machine has 80 logical CPUs across two NUMA nodes (40 logical CPUs per node).
Note the ~30-38% gains for the elapsed-30 and elapsed-48 rows, i.e. around
the point where one node becomes fully utilised. The slower results are
borderline noise.

Kernel building shows similar benefits around the same balance point.
Generally, performance was either neutral or better in the tests conducted.
The main consideration with this patch is the point at which fork stops
spreading a task: some workloads may benefit from a different balance
point, but exposing it as a tuning parameter would be risky.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Link: https://lkml.kernel.org/r/20201120090630.3286-5-mgorman@techsingularity.net
---
 kernel/sched/fair.c | 17 +++++++++++++++--
 1 file changed, 15 insertions(+), 2 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 377c77b..2e8aade 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8762,6 +8762,16 @@ static bool update_pick_idlest(struct sched_group *idlest,
 }
 
 /*
+ * Allow a NUMA imbalance if busy CPUs is less than 25% of the domain.
+ * This is an approximation as the number of running tasks may not be
+ * related to the number of busy CPUs due to sched_setaffinity.
+ */
+static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
+{
+	return (dst_running < (dst_weight >> 2));
+}
+
+/*
  * find_idlest_group() finds and returns the least busy CPU group within the
  * domain.
  *
@@ -8893,7 +8903,7 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p, int this_cpu)
 			 * a real need of migration, periodic load balance will
 			 * take care of it.
 			 */
-			if (local_sgs.idle_cpus)
+			if (allow_numa_imbalance(local_sgs.sum_nr_running, sd->span_weight))
 				return NULL;
 		}
 
@@ -9000,11 +9010,14 @@ next_group:
 static inline long adjust_numa_imbalance(int imbalance,
 				int dst_running, int dst_weight)
 {
+	if (!allow_numa_imbalance(dst_running, dst_weight))
+		return imbalance;
+
 	/*
 	 * Allow a small imbalance based on a simple pair of communicating
 	 * tasks that remain local when the destination is lightly loaded.
 	 */
-	if (dst_running < (dst_weight >> 2) && imbalance <= NUMA_IMBALANCE_MIN)
+	if (imbalance <= NUMA_IMBALANCE_MIN)
 		return 0;
 
 	return imbalance;
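
For readers tracing the combined effect, below is a minimal user-space sketch
of adjust_numa_imbalance() as it reads after this patch. NUMA_IMBALANCE_MIN is
assumed to be 2 (the constant introduced earlier in this series), and the
dst_running/dst_weight values in main() are illustrative only.

#include <stdio.h>
#include <stdbool.h>

/* Assumed value of the constant introduced in patch 1/4 of this series. */
#define NUMA_IMBALANCE_MIN	2

/* Same check as the kernel helper: tolerate imbalance below 25% occupancy. */
static inline bool allow_numa_imbalance(int dst_running, int dst_weight)
{
	return dst_running < (dst_weight >> 2);
}

/* User-space copy of adjust_numa_imbalance() as it reads after this patch. */
static long adjust_numa_imbalance(int imbalance, int dst_running, int dst_weight)
{
	if (!allow_numa_imbalance(dst_running, dst_weight))
		return imbalance;

	/*
	 * Allow a small imbalance based on a simple pair of communicating
	 * tasks that remain local when the destination is lightly loaded.
	 */
	if (imbalance <= NUMA_IMBALANCE_MIN)
		return 0;

	return imbalance;
}

int main(void)
{
	/* With dst_weight = 40, the imbalance is only forgiven below 10 running tasks. */
	printf("%ld\n", adjust_numa_imbalance(2,  4, 40));	/* 0: small imbalance tolerated */
	printf("%ld\n", adjust_numa_imbalance(2, 16, 40));	/* 2: destination too busy */
	printf("%ld\n", adjust_numa_imbalance(8,  4, 40));	/* 8: imbalance above the minimum */
	return 0;
}

Compiled as plain C this prints 0, 2 and 8, matching the three branches of the
function.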


Thread overview: 15+ messages
2020-11-20  9:06 [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing Mel Gorman
2020-11-20  9:06 ` [PATCH 1/4] sched/numa: Rename nr_running and break out the magic number Mel Gorman
2020-11-20 13:32   ` Vincent Guittot
2020-11-25 14:02   ` [tip: sched/core] " tip-bot2 for Mel Gorman
2020-11-20  9:06 ` [PATCH 2/4] sched: Avoid unnecessary calculation of load imbalance at clone time Mel Gorman
2020-11-20 13:32   ` Vincent Guittot
2020-11-25 14:02   ` [tip: sched/core] " tip-bot2 for Mel Gorman
2020-11-20  9:06 ` [PATCH 3/4] sched/numa: Allow a floating imbalance between NUMA nodes Mel Gorman
2020-11-20 13:33   ` Vincent Guittot
2020-11-25 14:02   ` [tip: sched/core] " tip-bot2 for Mel Gorman
2020-11-20  9:06 ` [PATCH 4/4] sched: Limit the amount of NUMA imbalance that can exist at fork time Mel Gorman
2020-11-20 13:33   ` Vincent Guittot
2020-11-25 14:02   ` tip-bot2 for Mel Gorman [this message]
2020-11-20 12:58 ` [PATCH v3 0/4] Revisit NUMA imbalance tolerance and fork balancing Peter Zijlstra
2020-11-20 14:02   ` Mel Gorman
