From: Jirka Hladky <jhladky@redhat.com>
To: LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
Date: Thu, 12 Mar 2020 13:17:36 +0100
Message-ID: <CAE4VaGB7+sR1nf3Ux8W=hgN46gNXRYr0uBWJU0oYnk7h00Y_dw@mail.gmail.com>
In-Reply-To: <20200312095432.GW3818@techsingularity.net>

Hi Mel,

thanks a lot for analyzing it!

My big concern is that the performance drop for low thread counts
(roughly up to 2x the number of NUMA nodes) is not just a rare corner
case but might be more common. We see the drop in the following
benchmarks/tests, especially on 8-NUMA-node servers, although 4- and
even 2-NUMA-node servers are affected as well.

The numbers show the performance drop (based on the median runtime of
5 consecutive runs) compared to the vanilla kernel.
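
For reference, the drop for each test is computed roughly like this (a
minimal illustrative sketch, not our actual harness; the runtimes below
are made-up sample values):

#include <stdio.h>
#include <stdlib.h>

static int cmp(const void *a, const void *b)
{
        double d = *(const double *)a - *(const double *)b;
        return (d > 0) - (d < 0);
}

static double median5(double r[5])
{
        qsort(r, 5, sizeof(r[0]), cmp);  /* sorts in place */
        return r[2];                     /* middle value of 5 */
}

int main(void)
{
        double vanilla[5] = { 141.5, 141.7, 141.9, 142.0, 142.1 };
        double patched[5] = { 173.6, 173.8, 174.0, 174.1, 174.3 };
        double change = (median5(patched) - median5(vanilla)) /
                        median5(vanilla) * 100.0;

        /* a longer runtime means a performance drop */
        printf("runtime change: %+.1f%%\n", change);
        return 0;
}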

2x AMD 7351 (EPYC Naples), 8 NUMA nodes
=======================================
NAS: sp_C test: -50% (peak perf. drop with 8 threads)
NAS: mg_D: -10% with 16 threads
SPECjvm2008: co_sunflow test: -20% (peak drop with 8 threads)
SPECjvm2008: compress and cr_signverify tests: -10% (peak drop with 8 threads)
SPECjbb2005: -10% for 16 threads

4x INTEL Xeon GOLD-6126 with Sub-NUMA clustering enabled, 8 NUMA nodes
======================================================================
NAS: sp_C test: -35% (peak perf. drop with 16 threads)
SPECjvm2008: co_sunflow, compress and cr_signverify tests: -10% (peak drop with 8 threads)
SPECjbb2005: -10% for 24 threads

So far, I have run only a limited number of our tests. I can run our
full testing suite next week if required. Please let me know.

Thanks!
Jirka


On Thu, Mar 12, 2020 at 10:54 AM Mel Gorman <mgorman@techsingularity.net> wrote:
>
> On Mon, Mar 09, 2020 at 08:36:25PM +0000, Mel Gorman wrote:
> > > The actual data reports are on an intranet web page so they are harder to
> > > share. I can create PDFs or screenshots but I didn't want to just blast
> > > those to the list. I'd be happy to send some directly if you are interested.
> > >
> >
> > Send them to me privately please.
> >
> > > Some data in text format that I can easily include shows imbalances across
> > > the NUMA nodes. This is for the NAS sp.C.x benchmark because it was easiest
> > > to pull and see the data in text. The regressions can be seen in other tests
> > > as well.
> > >
> >
> > What was the value for x?
> >
> > I ask because I ran NAS across a variety of machines for C class in two
> > configurations -- one using as many CPUs as possible and one running
> > with a third of the available CPUs for both MPI and OMP. Generally there
> > were small gains and losses across multiple kernels but often within the
> > noise or within a few percent of each other.
> >
>
> On re-examining the case, this pattern matches. There are some obvious
> corner cases for large machines with low utilisation. With the old
> behaviour, load balancing would spread the load evenly across all
> available NUMA nodes while NUMA balancing constantly adjusted it for
> locality. The old load balancer does this even if a task starts with all
> of its memory local to one node.
>
> The case where it causes the most problems appears to be roughly task
> counts lower than 2 * NR_NODES, matching the small imbalance allowed by
> adjust_numa_imbalance, but the actual distribution is variable. It's not
> always 2 per node; sometimes it can be a little higher depending on when
> idle balancing happens and other machine activity. This is not universal
> as other machine sizes and workloads are fine with the new behaviour and
> generally benefit.
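>
> For reference, the allowed imbalance has roughly this shape (a
> simplified sketch of the idea only, not the exact code in
> kernel/sched/fair.c; the helper name and threshold here are
> illustrative):
>
> /*
>  * Sketch: ignore a small imbalance when the busier group runs only a
>  * handful of tasks, so a lightly utilised workload is not forcibly
>  * spread across nodes.
>  */
> static long adjust_numa_imbalance_sketch(long imbalance, int nr_running,
>                                          int nr_nodes)
> {
>         /* threshold is illustrative; the text above uses 2 * NR_NODES */
>         if (nr_running <= 2 * nr_nodes && imbalance <= 2)
>                 return 0;       /* treat the groups as balanced */
>
>         return imbalance;
> }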
>
> The problem is particularly visible when the only active tasks in the
> system have set numa_preferred_nid because, as far as the load balancer
> and NUMA balancer are concerned, there is no reason to force the SP
> workload to spread wide.
>
> > The largest machine I had available was 4 sockets.
> >
> > The other curiosity is that you used C class. On bigger machines, that
> > is very short lived to the point of being almost useless. Is D class
> > similarly affected?
> >
>
> I expect D class to be similarly affected because the same pattern holds
> -- tasks stay on CPUs local to their memory even though more memory
> bandwidth may be available on remote nodes.
>
> > > 5.6.0_rc3.tip_lb_numa+
> > > sp.C.x_008_02  - CPU load average across the individual NUMA nodes
> > > (timestep is 5 seconds)
> > > # NUMA | AVR | Utilization over time in percentage
> > >   0    | 5   |  12  9  3  0  0 11  8  0  1  3  5 17  9  5  0  0  0 11  3
> > >   1    | 16  |  20 21 10 10  2  6  9 12 11  9  9 23 24 23 24 24 24 19 20
> > >   2    | 21  |  19 23 26 22 22 23 25 20 25 34 38 17 13 13 13 13 13 27 13
> > >   3    | 15  |  19 23 20 21 21 15 15 20 20 18 10 10  9  9  9  9  9  9 11
> > >   4    | 19  |  13 14 15 22 23 20 19 20 17 12 15 15 25 25 24 24 24 14 24
> > >   5    | 3   |   0  2 11  6 20  8  0  0  0  0  0  0  0  0  0  0  0  0  9
> > >   6    | 0   |   0  0  0  5  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
> > >   7    | 4   |   0  0  0  0  0  0  4 11  9  0  0  0  0  5 12 12 12  3  0
> > >
> > > 5.6.0-0.rc3.1.elrdy
> > > sp.C.x_008_01  - CPU load average across the individual NUMA nodes
> > > (timestep is 5 seconds)
> > > # NUMA | AVR | Utilization over time in percentage
> > >   0    | 13  |   6  8 10 10 11 10 18 13 20 17 14 15
> > >   1    | 11  |  10 10 11 11  9 16 12 14  9 11 11 10
> > >   2    | 17  |  25 19 16 11 13 12 11 16 17 22 22 16
> > >   3    | 21  |  21 22 22 23 21 23 23 21 21 17 22 21
> > >   4    | 14  |  20 23 11 12 15 18 12 10  9 13 12 18
> > >   5    | 4   |   0  0  8 10  7  0  8  2  0  0  8  2
> > >   6    | 1   |   0  5  1  0  0  0  0  0  0  1  0  0
> > >   7    | 7   |   7  3 10 10 10 11  3  8 10  4  0  5
> > >
> >
> > A critical difference with the series is that large imbalances shouldn't
> > happen any more. Prior to the series, NUMA balancing would keep trying to
> > move tasks to a node while load balancing moved them back, and there are
> > cases where it's actually faster to have that fight between NUMA balancing
> > and load balancing. Ideally a degree of imbalance would be allowed but I
> > haven't found a way of doing that without side effects.
> >
>
> So this is what's happening -- at low utilisation, tasks are staying
> local to their memory. In a lot of cases this is a good thing --
> communicating tasks stay local, for example, and tasks that are not
> completely memory bound benefit. Machines that have sufficient local
> memory bandwidth also appear to benefit.
>
> sp.C appears to be a significant corner case when the degree of
> parallelisation is lower than the number of NUMA nodes in the system;
> of the NAS workloads, bt is also mildly affected. In each case,
> memory was almost completely local and there was low NUMA activity but
> performance suffered. This is the BT case:
>
>                             5.6.0-rc3              5.6.0-rc3
>                               vanilla     schedcore-20200227
> Min       bt.C      176.05 (   0.00%)      185.03 (  -5.10%)
> Amean     bt.C      178.62 (   0.00%)      185.54 *  -3.88%*
> Stddev    bt.C        4.26 (   0.00%)        0.60 (  85.95%)
> CoeffVar  bt.C        2.38 (   0.00%)        0.32 (  86.47%)
> Max       bt.C      186.09 (   0.00%)      186.48 (  -0.21%)
> BAmean-50 bt.C      176.18 (   0.00%)      185.08 (  -5.06%)
> BAmean-95 bt.C      176.75 (   0.00%)      185.31 (  -4.84%)
> BAmean-99 bt.C      176.75 (   0.00%)      185.31 (  -4.84%)
>
> Note the spread in performance. tip/sched/core looks worse on average but
> its coefficient of variation was just 0.32% versus 2.38% with the vanilla
> kernel. The vanilla kernel is a lot less stable in terms of performance
> due to the fighting between CPU load balancing and NUMA balancing.
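>
> (The coefficient of variation is just stddev / mean: 4.26 / 178.62 is
> roughly 2.38% for the vanilla kernel versus 0.60 / 185.54, roughly
> 0.32%, for tip/sched/core, which is where the stability claim comes
> from.)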
>
> A heatmap of the CPU usage per LLC showed 4 tasks running on 2 nodes
> with 2 nodes idle -- there was almost no other system activity that
> would allow the load balancer to balance on tasks that are unconcerned
> with locality. The vanilla case was interesting -- of the 5 iterations,
> 4 spread the tasks one per node across 4 nodes but one iteration stacked
> 4 tasks on 2 nodes, so it's not even consistent. The NUMA activity looked
> like this for the overall workload.
>
> Ops NUMA alloc hit                   3450166.00     2406738.00
> Ops NUMA alloc miss                        0.00           0.00
> Ops NUMA interleave hit                    0.00           0.00
> Ops NUMA alloc local                 1047975.00       41131.00
> Ops NUMA base-page range updates    15864254.00    16283456.00
> Ops NUMA PTE updates                15148478.00    15563584.00
> Ops NUMA PMD updates                    1398.00        1406.00
> Ops NUMA hint faults                15128332.00    15535357.00
> Ops NUMA hint local faults %        12253847.00    14471269.00
> Ops NUMA hint local percent               81.00          93.15
> Ops NUMA pages migrated               993033.00           4.00
> Ops AutoNUMA cost                      75771.58       77790.77
>
> PTE hinting was more or less the same but look at the locality. 81%
> local for the baseline vanilla kernel and 93.15% for what's in
> tip/sched/core. The baseline kernel migrates almost 1 million pages over
> 15 minutes (5 iterations) and tip/sched/core migrates ... 4 pages.
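>
> (The "hint local percent" rows are simply local hint faults divided by
> total hint faults: 12253847 / 15128332 is roughly 81.0% for the vanilla
> kernel and 14471269 / 15535357 is roughly 93.15% for tip/sched/core.)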
>
> Looking at the faults over time, the baseline kernel initially has most
> faults local, drops to 80% shortly after starting and then climbs back
> up again as pages get migrated. Initially the number of hinting faults
> the baseline kernel traps is extremely high and it drops as pages
> migrate.
>
> Most others were almost neutral, with the impact of the series more
> obvious in some than others. is.C is really short-lived, for example,
> but its fault locality went from 43% to 95% local.
>
> sp.C showed by far the most obvious impact:
>
>                             5.6.0-rc3              5.6.0-rc3
>                               vanilla     schedcore-20200227
> Min       sp.C      141.52 (   0.00%)      173.61 ( -22.68%)
> Amean     sp.C      141.87 (   0.00%)      174.00 * -22.65%*
> Stddev    sp.C        0.26 (   0.00%)        0.25 (   5.06%)
> CoeffVar  sp.C        0.18 (   0.00%)        0.14 (  22.59%)
> Max       sp.C      142.10 (   0.00%)      174.25 ( -22.62%)
> BAmean-50 sp.C      141.59 (   0.00%)      173.79 ( -22.74%)
> BAmean-95 sp.C      141.81 (   0.00%)      173.93 ( -22.65%)
> BAmean-99 sp.C      141.81 (   0.00%)      173.93 ( -22.65%)
>
> That's a big hit in terms of performance, although it looks less
> variable. Looking at the NUMA stats:
>
> Ops NUMA alloc hit                   3100836.00     2161667.00
> Ops NUMA alloc miss                        0.00           0.00
> Ops NUMA interleave hit                    0.00           0.00
> Ops NUMA alloc local                  915700.00       98531.00
> Ops NUMA base-page range updates    12178032.00    13483382.00
> Ops NUMA PTE updates                11809904.00    12792182.00
> Ops NUMA PMD updates                     719.00        1350.00
> Ops NUMA hint faults                11791912.00    12782512.00
> Ops NUMA hint local faults %         9345987.00    11467427.00
> Ops NUMA hint local percent               79.26          89.71
> Ops NUMA pages migrated               871805.00       21505.00
> Ops AutoNUMA cost                      59061.37       64007.35
>
> Note the locality -- 79.26% versus 89.71% -- but the vanilla kernel
> migrated 871K pages and the new kernel migrates 21K. Looking at
> migrations over time, I can see that the vanilla kernel migrates 180K
> pages in the first 10 seconds of each iteration while tip/sched/core
> migrated few enough that it's not even clear on the graph. The workload
> was long-lived enough that the initial disruption became less visible.
>
> The problem is that the kernel measures nothing I can think of that
> uniquely identifies that SP should spread wide and migrate early, as
> distinct from other processes that are less memory bound or
> communicating heavily. The state is simply not maintained and it cannot
> be inferred from the runqueue or task state. From both a locality point
> of view and available CPUs, leaving SP alone makes sense, but we do not
> detect that memory bandwidth is an issue. In other cases, the cost of
> migrations alone would damage performance; SP is an exception as it's
> long-lived enough to benefit once the first few seconds have passed.
>
> I experimented with a few different approaches, but without being able
> to detect the bandwidth it was a case of SP improving while almost
> everything else suffers. For example, SP on 2-socket machines with
> enough memory bandwidth degrades when spread too quickly, so with
> tip/sched/core SP either benefits or suffers depending on the machine.
> Basic communicating tasks degrade 4-8%, depending on the machine and
> exact workload, when moving back to the vanilla kernel, and that is
> fairly universal AFAIS.
>
> So I think that the new behaviour is generally more sane -- do not have
> memory and CPU balancing fight excessively -- but if there are
> suggestions on how to identify tasks that should spread wide and evenly
> regardless of initial memory locality then I'm all ears. I do not think
> that migrating like crazy in the hope that it happens to work out, with
> CPU load balancing and NUMA balancing using very different evaluation
> criteria, is a better approach.
>
> --
> Mel Gorman
> SUSE Labs
>


-- 
-Jirka

