From: Mel Gorman <mgorman@techsingularity.net>
To: Phil Auld <pauld@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Hillf Danton <hdanton@sina.com>,
	Jirka Hladky <jhladky@redhat.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
Date: Thu, 12 Mar 2020 09:54:32 +0000
Message-ID: <20200312095432.GW3818@techsingularity.net>
In-Reply-To: <20200309203625.GU3818@techsingularity.net>

On Mon, Mar 09, 2020 at 08:36:25PM +0000, Mel Gorman wrote:
> > The actual data reports are on an intranet web page so they are harder to 
> > share. I can create PDFs or screenshots but I didn't want to just blast 
> > those to the list. I'd be happy to send some directly if you are interested.
> > 
> 
> Send them to me privately please.
> 
> > Some data in text format I can easily include shows imbalances across the
> > numa nodes. This is for NAS sp.C.x benchmark because it was easiest to
> > pull and see the data in text. The regressions can be seen in other tests
> > as well.
> > 
> 
> What was the value for x?
> 
> I ask because I ran NAS across a variety of machines for C class in two
> configurations -- one using as many CPUs as possible and one running
> with a third of the available CPUs for both MPI and OMP. Generally there
> were small gains and losses across multiple kernels but often within the
> noise or within a few percent of each other.
> 

On re-examining the case, this pattern matches. There are some obvious
corner cases for large machines that have low utilisation. With the old
behaviour, load balancing would spread load evenly across all available
NUMA nodes while NUMA balancing would constantly adjust it for locality.
The old load balancer does this even if a task starts with all of its
memory local to one node.

The problem appears to be most pronounced for task counts lower than
roughly 2 * NR_NODES, as per the small imbalance allowed by
adjust_numa_imbalance, but the actual distribution is variable. It's not
always 2 per node, sometimes it can be a little higher depending on when
idle balancing happens and other machine activity. This is not universal
as other machine sizes and workloads are fine with the new behaviour and
generally benefit.
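
As a rough illustration of the threshold described above, the following
Python sketch models the idea rather than the kernel code. The constant
IMBALANCE_MIN = 2 is an assumption taken from the "2 * NR_NODES" figure,
and in_affected_range() is a hypothetical helper, not a kernel function.

```python
# Toy model only, not the kernel implementation.
# Assumption: an imbalance is tolerated while the source node runs at
# most IMBALANCE_MIN tasks, matching the "2 * NR_NODES" figure.
IMBALANCE_MIN = 2


def adjust_numa_imbalance(imbalance, src_nr_running):
    """Return 0 (treat as balanced) when the source node is almost idle."""
    if src_nr_running <= IMBALANCE_MIN:
        return 0
    return imbalance


def in_affected_range(nr_tasks, nr_nodes):
    """Hypothetical helper: is the task count below roughly 2 * NR_NODES?"""
    return nr_tasks < IMBALANCE_MIN * nr_nodes


if __name__ == "__main__":
    # Two tasks stacked on one node: the imbalance is ignored.
    print(adjust_numa_imbalance(2, 2))
    # Five tasks on one node: the imbalance is reported unchanged.
    print(adjust_numa_imbalance(3, 5))
    # 4 tasks on a 4-node machine fall inside the range.
    print(in_affected_range(4, 4))
```

In this model, 4 tasks on a 4-node machine fall inside the range where
tasks may stay packed, while 16 tasks on an 8-node machine do not.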

The problem is particularly visible when the only active tasks in the
system have set numa_preferred_nid because as far as the load balancer and
NUMA balancer are concerned, there is no reason to force the SP workload
to spread wide.

> The largest machine I had available was 4 sockets.
> 
> The other curiosity is that you used C class. On bigger machines, that
> is very short lived to the point of being almost useless. Is D class
> similarly affected?
> 

I expect D class to be similarly affected because the same pattern holds
-- tasks stay on CPUs local to their memory even though more memory
bandwidth may be available on remote nodes.

> > 5.6.0_rc3.tip_lb_numa+
> > sp.C.x_008_02  - CPU load average across the individual NUMA nodes 
> > (timestep is 5 seconds)
> > # NUMA | AVR | Utilization over time in percentage
> >   0    | 5   |  12  9  3  0  0 11  8  0  1  3  5 17  9  5  0  0  0 11  3
> >   1    | 16  |  20 21 10 10  2  6  9 12 11  9  9 23 24 23 24 24 24 19 20
> >   2    | 21  |  19 23 26 22 22 23 25 20 25 34 38 17 13 13 13 13 13 27 13
> >   3    | 15  |  19 23 20 21 21 15 15 20 20 18 10 10  9  9  9  9  9  9 11
> >   4    | 19  |  13 14 15 22 23 20 19 20 17 12 15 15 25 25 24 24 24 14 24
> >   5    | 3   |   0  2 11  6 20  8  0  0  0  0  0  0  0  0  0  0  0  0  9
> >   6    | 0   |   0  0  0  5  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
> >   7    | 4   |   0  0  0  0  0  0  4 11  9  0  0  0  0  5 12 12 12  3  0
> > 
> > 5.6.0-0.rc3.1.elrdy
> > sp.C.x_008_01  - CPU load average across the individual NUMA nodes 
> > (timestep is 5 seconds)
> > # NUMA | AVR | Utilization over time in percentage
> >   0    | 13  |   6  8 10 10 11 10 18 13 20 17 14 15
> >   1    | 11  |  10 10 11 11  9 16 12 14  9 11 11 10
> >   2    | 17  |  25 19 16 11 13 12 11 16 17 22 22 16
> >   3    | 21  |  21 22 22 23 21 23 23 21 21 17 22 21
> >   4    | 14  |  20 23 11 12 15 18 12 10  9 13 12 18
> >   5    | 4   |   0  0  8 10  7  0  8  2  0  0  8  2
> >   6    | 1   |   0  5  1  0  0  0  0  0  0  1  0  0
> >   7    | 7   |   7  3 10 10 10 11  3  8 10  4  0  5
> > 
> 
> A critical difference with the series is that large imbalances shouldn't
> happen but prior to the series the NUMA balancing would keep trying to
> move tasks to a node with load balancing moving them back. That should
> not happen any more but there are cases where it's actually faster to
> have the fight between NUMA balancing and load balancing. Ideally a
> degree of imbalance would be allowed but I haven't found a way of doing
> that without side effects.
> 

So this is what's happening -- at low utilisation, tasks are staying local
to their memory. For a lot of cases, this is a good thing -- communicating
tasks stay local for example and tasks that are not completely memory
bound benefit. Machines that have sufficient local memory bandwidth also
appear to benefit.

sp.C appears to be a significant corner case when the degree of
parallelisation is lower than the number of NUMA nodes in the system
and, of the NAS workloads, bt is also mildly affected. In each case,
memory was almost completely local and there was low NUMA activity but
performance suffered. This is the BT case:

                            5.6.0-rc3              5.6.0-rc3
                              vanilla     schedcore-20200227
Min       bt.C      176.05 (   0.00%)      185.03 (  -5.10%)
Amean     bt.C      178.62 (   0.00%)      185.54 *  -3.88%*
Stddev    bt.C        4.26 (   0.00%)        0.60 (  85.95%)
CoeffVar  bt.C        2.38 (   0.00%)        0.32 (  86.47%)
Max       bt.C      186.09 (   0.00%)      186.48 (  -0.21%)
BAmean-50 bt.C      176.18 (   0.00%)      185.08 (  -5.06%)
BAmean-95 bt.C      176.75 (   0.00%)      185.31 (  -4.84%)
BAmean-99 bt.C      176.75 (   0.00%)      185.31 (  -4.84%)

Note the spread in performance. tip/sched/core looks worse on average but
its coefficient of variation was just 0.32% versus 2.38% with the vanilla
kernel. The vanilla kernel is a lot less stable in terms of performance
due to the fighting between CPU Load and NUMA Balancing.
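
For reference, the CoeffVar row can be recomputed from the Amean and
Stddev rows. This is a minimal sketch that assumes CoeffVar is the
standard deviation expressed as a percentage of the mean and that the
bracketed percentages are relative to the vanilla column. The function
names are mine, not those of the reporting tool.

```python
def coeff_var(stddev, mean):
    """Coefficient of variation as a percentage of the mean."""
    return stddev / mean * 100.0


def pct_change(baseline, value):
    """Relative change against the baseline.

    Negative means the value went up, which the table above treats as a
    regression.
    """
    return (baseline - value) / baseline * 100.0


# Amean and Stddev for bt.C, copied from the table above.
vanilla = {"amean": 178.62, "stddev": 4.26}
schedcore = {"amean": 185.54, "stddev": 0.60}

if __name__ == "__main__":
    for name, run in (("vanilla", vanilla), ("schedcore", schedcore)):
        print(f"{name:10s} CoeffVar {coeff_var(run['stddev'], run['amean']):.2f}")
    print(f"Amean change {pct_change(vanilla['amean'], schedcore['amean']):.2f}%")
```

With the rounded inputs, this gives 2.38 and 0.32 for CoeffVar. The Amean
change recomputed this way can differ from the table in the last digit,
presumably because the table was generated from unrounded samples.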

For tip/sched/core, a heatmap of the CPU usage per LLC showed 4 tasks
running on 2 nodes with two nodes idle -- there was almost no other system
activity that would allow the load balancer to balance on tasks that are
unconcerned with locality. The vanilla case was interesting -- of the 5
iterations, 4 spread with one task on each of 4 nodes but one iteration
stacked 4 tasks on 2 nodes so it's not even consistent. The NUMA activity
looked like this for the overall workload.

Ops NUMA alloc hit                   3450166.00     2406738.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local                 1047975.00       41131.00
Ops NUMA base-page range updates    15864254.00    16283456.00
Ops NUMA PTE updates                15148478.00    15563584.00
Ops NUMA PMD updates                    1398.00        1406.00
Ops NUMA hint faults                15128332.00    15535357.00
Ops NUMA hint local faults %        12253847.00    14471269.00
Ops NUMA hint local percent               81.00          93.15
Ops NUMA pages migrated               993033.00           4.00
Ops AutoNUMA cost                      75771.58       77790.77

PTE hinting was more or less the same but look at the locality. 81%
local for the baseline vanilla kernel and 93.15% for what's in
tip/sched/core. The baseline kernel migrates almost 1 million pages over
15 minutes (5 iterations) and tip/sched/core migrates ... 4 pages.
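
The derived rows in the table can be cross-checked from the raw counters.
The sketch below assumes that "hint local percent" is local hint faults
divided by all hint faults and that each PMD update covers 512 base pages
(x86-64 with 4K pages). Both are assumptions on my part that happen to
match the figures above, and the function names are illustrative only.

```python
# Assumption: 512 base pages per PMD-sized huge page (x86-64, 4K pages).
PAGES_PER_PMD = 512


def local_percent(local_faults, hint_faults):
    """Share of NUMA hinting faults that were node-local, in percent."""
    return local_faults / hint_faults * 100.0


def base_page_range_updates(pte_updates, pmd_updates):
    """PTE updates plus PMD updates expressed in base pages."""
    return pte_updates + pmd_updates * PAGES_PER_PMD


# bt.C counters copied from the table above: (vanilla, schedcore-20200227).
hint_faults = (15128332, 15535357)
local_faults = (12253847, 14471269)
pte_updates = (15148478, 15563584)
pmd_updates = (1398, 1406)

if __name__ == "__main__":
    for i, name in enumerate(("vanilla", "schedcore")):
        pct = local_percent(local_faults[i], hint_faults[i])
        rng = base_page_range_updates(pte_updates[i], pmd_updates[i])
        print(f"{name:10s} local {pct:6.2f}%  range updates {rng}")
```

Under those assumptions, the recomputed values agree with the "hint local
percent" and "base-page range updates" rows for both kernels.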

Looking at the faults over time, the baseline kernel initially faults
with pages local, drops to 80% shortly after starting and then starts
climbing back up again as pages get migrated. Initially the number of
hints the baseline kernel traps is extremely high and drops as pages
migrate.

Most others were almost neutral with the impact of the series more
obvious in some than others. is.C is really short-lived, for example, but
locality of faults went from 43% to 95% local.

sp.C was by far the most obvious impact:

                            5.6.0-rc3              5.6.0-rc3
                              vanilla     schedcore-20200227
Min       sp.C      141.52 (   0.00%)      173.61 ( -22.68%)
Amean     sp.C      141.87 (   0.00%)      174.00 * -22.65%*
Stddev    sp.C        0.26 (   0.00%)        0.25 (   5.06%)
CoeffVar  sp.C        0.18 (   0.00%)        0.14 (  22.59%)
Max       sp.C      142.10 (   0.00%)      174.25 ( -22.62%)
BAmean-50 sp.C      141.59 (   0.00%)      173.79 ( -22.74%)
BAmean-95 sp.C      141.81 (   0.00%)      173.93 ( -22.65%)
BAmean-99 sp.C      141.81 (   0.00%)      173.93 ( -22.65%)

That's a big hit in terms of performance and it looks less
variable. Looking at the NUMA stats:

Ops NUMA alloc hit                   3100836.00     2161667.00
Ops NUMA alloc miss                        0.00           0.00
Ops NUMA interleave hit                    0.00           0.00
Ops NUMA alloc local                  915700.00       98531.00
Ops NUMA base-page range updates    12178032.00    13483382.00
Ops NUMA PTE updates                11809904.00    12792182.00
Ops NUMA PMD updates                     719.00        1350.00
Ops NUMA hint faults                11791912.00    12782512.00
Ops NUMA hint local faults %         9345987.00    11467427.00
Ops NUMA hint local percent               79.26          89.71
Ops NUMA pages migrated               871805.00       21505.00
Ops AutoNUMA cost                      59061.37       64007.35

Note the locality -- 79.26% to 89.71% but the vanilla kernel migrated 871K
pages and the new kernel migrates 21K. Looking at migrations over time,
I can see that the vanilla kernel migrates 180K pages in the first 10
seconds of each iteration while tip/sched/core migrated few enough that
it's not even clear on the graph. The workload was long-lived enough that
the initial disruption became less visible over the rest of the run.

The problem is that there is nothing the kernel measures that I can think
of that uniquely identifies that SP should spread wide and migrate early
to move its shared pages, as distinct from other processes that are less
memory bound or communicating heavily. The state is simply not maintained
and it cannot be inferred from the runqueue or task state. From both a
locality point of view and available CPUs, leaving SP alone makes sense
but we do not detect that memory bandwidth is an issue. In other cases, the
cost of migrations alone would damage performance and SP is an exception as
it's long-lived enough to benefit once the first few seconds have passed.

I experimented with a few different approaches but without being able to
detect the bandwidth, the outcome was that SP can be improved but almost
everything else suffers. For example, SP on 2-socket degrades when spread
too quickly on machines with enough memory bandwidth so with tip/sched/core
SP either benefits or suffers depending on the machine. Basic communicating
tasks degrade 4-8% depending on the machine and exact workload when moving
back to the vanilla kernel and that is fairly universal AFAIS.

So I think that the new behaviour generally is more sane -- do not
excessively fight between memory and CPU balancing but if there are
suggestions on how to distinguish between tasks that should spread wide
and evenly regardless of initial memory locality then I'm all ears.
I do not think migrating like crazy hoping it happens to work out and
having CPU Load and NUMA Balancing using very different criteria for
evaluation is a better approach.

-- 
Mel Gorman
SUSE Labs
