linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Mel Gorman <mgorman@techsingularity.net>
To: Vincent Guittot <vincent.guittot@linaro.org>
Cc: Ingo Molnar <mingo@kernel.org>,
	Peter Zijlstra <peterz@infradead.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Phil Auld <pauld@redhat.com>, LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH 01/11] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains
Date: Wed, 12 Feb 2020 09:36:44 +0000	[thread overview]
Message-ID: <20200212093654.4816-2-mgorman@techsingularity.net> (raw)
In-Reply-To: <20200212093654.4816-1-mgorman@techsingularity.net>

The CPU load balancer balances between different domains to spread load
and strives to have equal balance everywhere. Communicating tasks can
migrate so they are topologically close to each other but these decisions
are independent. On a lightly loaded NUMA machine, two communicating tasks
pulled together at wakeup time can be pushed apart by the load balancer.
In isolation, the load balancer decision is fine but it ignores the tasks
data locality and the wakeup/LB paths continually conflict. NUMA balancing
is also a factor but it also simply conflicts with the load balancer.

This patch allows a fixed degree of imbalance of two tasks to exist between
NUMA domains when the source domain is almost idle. In some cases, this
prevents communicating tasks being pulled apart. It was evaluated whether
the imbalance should be scaled to the domain size. However, no additional
benefit was measured across a range of workloads and machines and scaling
adds the risk that lower domains have to be rebalanced. While this could
change again in the future, such a change should specify the use case
and benefit.

The most obvious impact is on netperf TCP_STREAM -- two simple
communicating tasks with some softirq offload depending on the
transmission rate.

 2-socket Haswell machine 48 core, HT enabled
 netperf-tcp -- mmtests config config-network-netperf-unbound
			      baseline              lbnuma-v3
 Hmean     64         568.73 (   0.00%)      577.56 *   1.55%*
 Hmean     128       1089.98 (   0.00%)     1128.06 *   3.49%*
 Hmean     256       2061.72 (   0.00%)     2104.39 *   2.07%*
 Hmean     1024      7254.27 (   0.00%)     7557.52 *   4.18%*
 Hmean     2048     11729.20 (   0.00%)    13350.67 *  13.82%*
 Hmean     3312     15309.08 (   0.00%)    18058.95 *  17.96%*
 Hmean     4096     17338.75 (   0.00%)    20483.66 *  18.14%*
 Hmean     8192     25047.12 (   0.00%)    27806.84 *  11.02%*
 Hmean     16384    27359.55 (   0.00%)    33071.88 *  20.88%*
 Stddev    64           2.16 (   0.00%)        2.02 (   6.53%)
 Stddev    128          2.31 (   0.00%)        2.19 (   5.05%)
 Stddev    256         11.88 (   0.00%)        3.22 (  72.88%)
 Stddev    1024        23.68 (   0.00%)        7.24 (  69.43%)
 Stddev    2048        79.46 (   0.00%)       71.49 (  10.03%)
 Stddev    3312        26.71 (   0.00%)       57.80 (-116.41%)
 Stddev    4096       185.57 (   0.00%)       96.15 (  48.19%)
 Stddev    8192       245.80 (   0.00%)      100.73 (  59.02%)
 Stddev    16384      207.31 (   0.00%)      141.65 (  31.67%)

In this case, there was a sizable improvement to performance and
a general reduction in variance. However, this is not univeral.
For most machines, the impact was roughly a 3% performance gain.

 Ops NUMA base-page range updates       19796.00         292.00
 Ops NUMA PTE updates                   19796.00         292.00
 Ops NUMA PMD updates                       0.00           0.00
 Ops NUMA hint faults                   16113.00         143.00
 Ops NUMA hint local faults %            8407.00         142.00
 Ops NUMA hint local percent               52.18          99.30
 Ops NUMA pages migrated                 4244.00           1.00

Without the patch, only 52.18% of sampled accesses are local.  In an
earlier changelog, 100% of sampled accesses are local and indeed on
most machines, this was still the case. In this specific case, the
local sampled rates was 99.3% but note the "base-page range updates"
and "PTE updates".  The activity with the patch is negligible as were
the number of faults. The small number of pages migrated were related to
shared libraries.  A 2-socket Broadwell showed better results on average
but are not presented for brevity as the performance was similar except
it showed 100% of the sampled NUMA hints were local. The patch holds up
for a 4-socket Haswell, an AMD EPYC and AMD Epyc 2 machine.

For dbench, the impact depends on the filesystem used and the number of
clients. On XFS, there is little difference as the clients typically
communicate with workqueues which have a separate class of scheduler
problem at the moment. For ext4, performance is generally better,
particularly for small numbers of clients as NUMA balancing activity is
negligible with the patch applied.

A more interesting example is the Facebook schbench which uses a
number of messaging threads to communicate with worker threads. In this
configuration, one messaging thread is used per NUMA node and the number of
worker threads is varied. The 50, 75, 90, 95, 99, 99.5 and 99.9 percentiles
for response latency is then reported.

 Lat 50.00th-qrtle-1        44.00 (   0.00%)       37.00 (  15.91%)
 Lat 75.00th-qrtle-1        53.00 (   0.00%)       41.00 (  22.64%)
 Lat 90.00th-qrtle-1        57.00 (   0.00%)       42.00 (  26.32%)
 Lat 95.00th-qrtle-1        63.00 (   0.00%)       43.00 (  31.75%)
 Lat 99.00th-qrtle-1        76.00 (   0.00%)       51.00 (  32.89%)
 Lat 99.50th-qrtle-1        89.00 (   0.00%)       52.00 (  41.57%)
 Lat 99.90th-qrtle-1        98.00 (   0.00%)       55.00 (  43.88%)
 Lat 50.00th-qrtle-2        42.00 (   0.00%)       42.00 (   0.00%)
 Lat 75.00th-qrtle-2        48.00 (   0.00%)       47.00 (   2.08%)
 Lat 90.00th-qrtle-2        53.00 (   0.00%)       52.00 (   1.89%)
 Lat 95.00th-qrtle-2        55.00 (   0.00%)       53.00 (   3.64%)
 Lat 99.00th-qrtle-2        62.00 (   0.00%)       60.00 (   3.23%)
 Lat 99.50th-qrtle-2        63.00 (   0.00%)       63.00 (   0.00%)
 Lat 99.90th-qrtle-2        68.00 (   0.00%)       66.00 (   2.94%

For higher worker threads, the differences become negligible but it's
interesting to note the difference in wakeup latency at low utilisation
and mpstat confirms that activity was almost all on one node until
the number of worker threads increase.

Hackbench generally showed neutral results across a range of machines.
This is different to earlier versions of the patch which allowed imbalances
for higher degrees of utilisation. perf bench pipe showed negligible
differences in overall performance as the differences are very close to
the noise.

An earlier prototype of the patch showed major regressions for NAS C-class
when running with only half of the available CPUs -- 20-30% performance
hits were measured at the time. With this version of the patch, the impact
is negligible with small gains/losses within the noise measured. This is
because the number of threads far exceeds the small imbalance the aptch
cares about. Similarly, there were report of regressions for the autonuma
benchmark against earlier versions but again, normal load balancing now
applies for that workload.

In general, the patch simply seeks to avoid unnecessary cross-node
migrations in the basic case where imbalances are very small.  For low
utilisation communicating workloads, this patch generally behaves better
with less NUMA balancing activity. For high utilisation, there is no
change in behaviour.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org>
Signed-off-by: Ingo Molnar <mingo@kernel.org>
Reviewed-by: Valentin Schneider <valentin.schneider@arm.com>
Reviewed-by: Vincent Guittot <vincent.guittot@linaro.org>
Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com>
Acked-by: Phil Auld <pauld@redhat.com>
Tested-by: Phil Auld <pauld@redhat.com>
Link: https://lkml.kernel.org/r/20200114101319.GO3466@techsingularity.net
---
 kernel/sched/fair.c | 35 +++++++++++++++++++++++------------
 1 file changed, 23 insertions(+), 12 deletions(-)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index fe4e0d775375..199d1476bb90 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -8658,10 +8658,6 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 	/*
 	 * Try to use spare capacity of local group without overloading it or
 	 * emptying busiest.
-	 * XXX Spreading tasks across NUMA nodes is not always the best policy
-	 * and special care should be taken for SD_NUMA domain level before
-	 * spreading the tasks. For now, load_balance() fully relies on
-	 * NUMA_BALANCING and fbq_classify_group/rq to override the decision.
 	 */
 	if (local->group_type == group_has_spare) {
 		if (busiest->group_type > group_fully_busy) {
@@ -8701,16 +8697,31 @@ static inline void calculate_imbalance(struct lb_env *env, struct sd_lb_stats *s
 			env->migration_type = migrate_task;
 			lsub_positive(&nr_diff, local->sum_nr_running);
 			env->imbalance = nr_diff >> 1;
-			return;
-		}
+		} else {
 
-		/*
-		 * If there is no overload, we just want to even the number of
-		 * idle cpus.
-		 */
-		env->migration_type = migrate_task;
-		env->imbalance = max_t(long, 0, (local->idle_cpus -
+			/*
+			 * If there is no overload, we just want to even the number of
+			 * idle cpus.
+			 */
+			env->migration_type = migrate_task;
+			env->imbalance = max_t(long, 0, (local->idle_cpus -
 						 busiest->idle_cpus) >> 1);
+		}
+
+		/* Consider allowing a small imbalance between NUMA groups */
+		if (env->sd->flags & SD_NUMA) {
+			unsigned int imbalance_min;
+
+			/*
+			 * Allow a small imbalance based on a simple pair of
+			 * communicating tasks that remain local when the
+			 * source domain is almost idle.
+			 */
+			imbalance_min = 2;
+			if (busiest->sum_nr_running <= imbalance_min)
+				env->imbalance = 0;
+		}
+
 		return;
 	}
 
-- 
2.16.4


  reply	other threads:[~2020-02-12  9:37 UTC|newest]

Thread overview: 21+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2020-02-12  9:36 [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Mel Gorman
2020-02-12  9:36 ` Mel Gorman [this message]
2020-02-12  9:36 ` [PATCH 02/11] sched/fair: Optimize select_idle_core() Mel Gorman
2020-02-12  9:36 ` [PATCH 03/11] sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression Mel Gorman
2020-02-12  9:36 ` [PATCH 04/11] sched/numa: Trace when no candidate CPU was found on the preferred node Mel Gorman
2020-02-12  9:36 ` [PATCH 05/11] sched/numa: Distinguish between the different task_numa_migrate failure cases Mel Gorman
2020-02-12 14:43   ` Steven Rostedt
2020-02-12 15:59     ` Mel Gorman
2020-02-12  9:36 ` [PATCH 06/11] sched/numa: Prefer using an idle cpu as a migration target instead of comparing tasks Mel Gorman
2020-02-12  9:36 ` [PATCH 07/11] sched/numa: Find an alternative idle CPU if the CPU is part of an active NUMA balance Mel Gorman
2020-02-12  9:36 ` [PATCH 08/11] sched/numa: Bias swapping tasks based on their preferred node Mel Gorman
2020-02-13 10:31   ` Peter Zijlstra
2020-02-13 11:18     ` Mel Gorman
2020-02-12 13:22 ` [RFC PATCH 00/11] Reconcile NUMA balancing decisions with the load balancer Vincent Guittot
2020-02-12 14:07   ` Valentin Schneider
2020-02-12 15:48   ` Mel Gorman
2020-02-12 16:13     ` Vincent Guittot
2020-02-12 15:45 ` [PATCH 09/11] sched/fair: Split out helper to adjust imbalances between domains Mel Gorman
2020-02-12 15:46 ` [PATCH 10/11] sched/numa: Use similar logic to the load balancer for moving between domains with spare capacity Mel Gorman
2020-02-12 15:46 ` [PATCH 11/11] sched/numa: Use similar logic to the load balancer for moving between overloaded domains Mel Gorman
     [not found] ` <20200214041232.18904-1-hdanton@sina.com>
2020-02-14  7:50   ` [PATCH 08/11] sched/numa: Bias swapping tasks based on their preferred node Mel Gorman

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=20200212093654.4816-2-mgorman@techsingularity.net \
    --to=mgorman@techsingularity.net \
    --cc=bsegall@google.com \
    --cc=dietmar.eggemann@arm.com \
    --cc=juri.lelli@redhat.com \
    --cc=linux-kernel@vger.kernel.org \
    --cc=mingo@kernel.org \
    --cc=pauld@redhat.com \
    --cc=peterz@infradead.org \
    --cc=rostedt@goodmis.org \
    --cc=valentin.schneider@arm.com \
    --cc=vincent.guittot@linaro.org \
    --subject='Re: [PATCH 01/11] sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains' \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).