From: Mel Gorman <mgorman@techsingularity.net>
To: Peter Zijlstra <peterz@infradead.org>
Cc: Ingo Molnar <mingo@kernel.org>, Mike Galbraith <efault@gmx.de>,
	Matt Fleming <matt@codeblueprint.co.uk>,
	Giovanni Gherdovich <ggherdovich@suse.cz>,
	LKML <linux-kernel@vger.kernel.org>,
	Mel Gorman <mgorman@techsingularity.net>
Subject: [PATCH 5/6] sched/fair: Consider SD_NUMA when selecting the most idle group to schedule on
Date: Tue, 13 Feb 2018 13:37:29 +0000	[thread overview]
Message-ID: <20180213133730.24064-6-mgorman@techsingularity.net> (raw)
In-Reply-To: <20180213133730.24064-1-mgorman@techsingularity.net>

find_idlest_group() compares a local group with each other group to select
the one that is most idle. When comparing groups in different NUMA domains,
a very slight imbalance is enough to select a remote NUMA node even if the
runnable load on both groups is 0 or close to 0. This ignores the cost of
remote accesses entirely and is a problem when selecting the CPU a newly
forked task will run on: a forking server is almost guaranteed to start
children on a remote node, incurring numerous remote accesses and
potentially causing automatic NUMA balancing to try to migrate the task
back or migrate the data to another node. Similar weirdness is observed
if a basic shell command pipes output to another, as each process in the
pipeline is likely to start on a different node and only get adjusted
later by wake_affine.

This patch adds an imbalance to the runnable load of remote groups when
deciding whether to select CPUs from a remote domain. If the local domain
is selected, the imbalance is still used to try to select a CPU from a
lower scheduler domain's group instead of stacking tasks on the same CPU.

A variety of workloads and machines were tested and, as expected, there is
no difference on UMA. The difference on NUMA can be dramatic. This is a
comparison of elapsed times running the git regression test suite, which
is fork-intensive with short-lived processes:

                                 4.15.0                 4.15.0
                           noexit-v1r23           sdnuma-v1r23
Elapsed min          1706.06 (   0.00%)     1435.94 (  15.83%)
Elapsed mean         1709.53 (   0.00%)     1436.98 (  15.94%)
Elapsed stddev          2.16 (   0.00%)        1.01 (  53.38%)
Elapsed coeffvar        0.13 (   0.00%)        0.07 (  44.54%)
Elapsed max          1711.59 (   0.00%)     1438.01 (  15.98%)

              4.15.0      4.15.0
        noexit-v1r23 sdnuma-v1r23
User         5434.12     5188.41
System       4878.77     3467.09
Elapsed     10259.06     8624.21

That shows a considerable reduction in elapsed times. It's important to
note that automatic NUMA balancing does not affect this workload as the
processes are too short-lived.

There is also a noticeable impact on hackbench, such as this example using
processes and pipes:

hackbench-process-pipes
                              4.15.0                 4.15.0
                        noexit-v1r23           sdnuma-v1r23
Amean     1        1.0973 (   0.00%)      0.9393 (  14.40%)
Amean     4        1.3427 (   0.00%)      1.3730 (  -2.26%)
Amean     7        1.4233 (   0.00%)      1.6670 ( -17.12%)
Amean     12       3.0250 (   0.00%)      3.3013 (  -9.13%)
Amean     21       9.0860 (   0.00%)      9.5343 (  -4.93%)
Amean     30      14.6547 (   0.00%)     13.2433 (   9.63%)
Amean     48      22.5447 (   0.00%)     20.4303 (   9.38%)
Amean     79      29.2010 (   0.00%)     26.7853 (   8.27%)
Amean     110     36.7443 (   0.00%)     35.8453 (   2.45%)
Amean     141     45.8533 (   0.00%)     42.6223 (   7.05%)
Amean     172     55.1317 (   0.00%)     50.6473 (   8.13%)
Amean     203     64.4420 (   0.00%)     58.3957 (   9.38%)
Amean     234     73.2293 (   0.00%)     67.1047 (   8.36%)
Amean     265     80.5220 (   0.00%)     75.7330 (   5.95%)
Amean     296     88.7567 (   0.00%)     82.1533 (   7.44%)

It's not a universal win, as there are occasions when spreading wide and
quickly is a benefit, but it's more of a win than a loss. For other
workloads there is little difference, but netperf is interesting. Without
the patch, the server and client start on different nodes but quickly get
migrated due to wake_affine. Hence, the difference in overall performance
is marginal but detectable:

                                     4.15.0                 4.15.0
                               noexit-v1r23           sdnuma-v1r23
Hmean     send-64         349.09 (   0.00%)      354.67 (   1.60%)
Hmean     send-128        699.16 (   0.00%)      702.91 (   0.54%)
Hmean     send-256       1316.34 (   0.00%)     1350.07 (   2.56%)
Hmean     send-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
Hmean     send-2048      9705.19 (   0.00%)     9687.44 (  -0.18%)
Hmean     send-3312     14359.48 (   0.00%)    14577.64 (   1.52%)
Hmean     send-4096     16324.20 (   0.00%)    16393.62 (   0.43%)
Hmean     send-8192     26112.61 (   0.00%)    26877.26 (   2.93%)
Hmean     send-16384    37208.44 (   0.00%)    38683.43 (   3.96%)
Hmean     recv-64         349.09 (   0.00%)      354.67 (   1.60%)
Hmean     recv-128        699.16 (   0.00%)      702.91 (   0.54%)
Hmean     recv-256       1316.34 (   0.00%)     1350.07 (   2.56%)
Hmean     recv-1024      5063.99 (   0.00%)     5124.38 (   1.19%)
Hmean     recv-2048      9705.16 (   0.00%)     9687.43 (  -0.18%)
Hmean     recv-3312     14359.42 (   0.00%)    14577.59 (   1.52%)
Hmean     recv-4096     16323.98 (   0.00%)    16393.55 (   0.43%)
Hmean     recv-8192     26111.85 (   0.00%)    26876.96 (   2.93%)
Hmean     recv-16384    37206.99 (   0.00%)    38682.41 (   3.97%)

However, what is very interesting is how automatic NUMA balancing behaves.
Each netperf instance runs long enough for balancing to activate.

                          noexit-v1r23 sdnuma-v1r23
NUMA base PTE updates             4620         1473
NUMA huge PMD updates                0            0
NUMA page range updates           4620         1473
NUMA hint faults                  4301         1383
NUMA hint local faults            1309          451
NUMA hint local percent             30           32
NUMA pages migrated               1335          491
AutoNUMA cost                      21%           6%

There is an unfortunate number of remote faults although tracing indicated
that the vast majority are in shared libraries. However, the tendency to
start tasks on the same node if there is capacity means that there were
far fewer PTE updates and faults incurred overall.

Signed-off-by: Mel Gorman <mgorman@techsingularity.net>
---
 kernel/sched/fair.c | 12 ++++++++++++
 1 file changed, 12 insertions(+)

diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index e5bbcbefd01b..b1efd7570c88 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -5911,6 +5911,18 @@ find_idlest_group(struct sched_domain *sd, struct task_struct *p,
 	if (!idlest)
 		return NULL;
 
+	/*
+	 * When comparing groups across NUMA domains, it's possible for the
+	 * local domain to be very lightly loaded relative to the remote
+	 * domains but "imbalance" skews the comparison making remote CPUs
+	 * look much more favourable. When considering cross-domain, add
+	 * imbalance to the runnable load on the remote node and consider
+	 * staying local.
+	 */
+	if ((sd->flags & SD_NUMA) &&
+	    min_runnable_load + imbalance >= this_runnable_load)
+		return NULL;
+
 	if (min_runnable_load > (this_runnable_load + imbalance))
 		return NULL;
 
-- 
2.15.1

