From: Mel Gorman <mgorman@techsingularity.net>
To: Phil Auld <pauld@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>,
	Ingo Molnar <mingo@kernel.org>,
	Vincent Guittot <vincent.guittot@linaro.org>,
	Juri Lelli <juri.lelli@redhat.com>,
	Dietmar Eggemann <dietmar.eggemann@arm.com>,
	Steven Rostedt <rostedt@goodmis.org>,
	Ben Segall <bsegall@google.com>,
	Valentin Schneider <valentin.schneider@arm.com>,
	Hillf Danton <hdanton@sina.com>,
	LKML <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH 00/13] Reconcile NUMA balancing decisions with the load balancer v6
Date: Mon, 9 Mar 2020 20:36:25 +0000
Message-ID: <20200309203625.GU3818@techsingularity.net>
In-Reply-To: <20200309191233.GG10065@pauld.bos.csb>

On Mon, Mar 09, 2020 at 03:12:34PM -0400, Phil Auld wrote:
> > A variety of other workloads were evaluated and appear to be mostly
> > neutral or improved. netperf running on localhost shows gains between 1-8%
> > depending on the machine. hackbench is a mixed bag -- small regressions
> > on one machine around 1-2% depending on the group count but up to 15%
> > gain on another machine. dbench looks generally ok, very small performance
> > gains. pgbench looks ok, small gains and losses, much of which is within
> > the noise. schbench (Facebook workload that is sensitive to wakeup
> > latencies) is mostly good.  The autonuma benchmark also generally looks
> > good, most differences are within the noise but with higher locality
> > success rates and fewer page migrations. Other long lived workloads are
> > still running but I'm not expecting many surprises.
> > 
> >  include/linux/sched.h        |  31 ++-
> >  include/trace/events/sched.h |  49 ++--
> >  kernel/sched/core.c          |  13 -
> >  kernel/sched/debug.c         |  17 +-
> >  kernel/sched/fair.c          | 626 ++++++++++++++++++++++++++++---------------
> >  kernel/sched/pelt.c          |  59 ++--
> >  kernel/sched/sched.h         |  42 ++-
> >  7 files changed, 535 insertions(+), 302 deletions(-)
> > 
> > -- 
> > 2.16.4
> > 
> 
> Our Perf team has been comparing tip/sched/core (v5.6-rc3 + lb_numa series) 
> with upstream v5.6-rc3 and has noticed some regressions. 
> 
> Here's a summary of what Jirka Hladky reported to me:
> 
> ---
> We see the following problems when comparing 5.6.0_rc3.tip_lb_numa+-5 against
> 5.6.0-0.rc3.1.elrdy:
> 
>   • performance drop by 20% - 40% across almost all benchmarks, especially
>     for low and medium thread counts, and especially on 4 NUMA node and
>     8 NUMA node servers
>   • 2 NUMA node servers are affected as well, but the performance drop is
>     less significant (10-20%)
>   • servers with just one NUMA node are NOT affected
>   • we see big load imbalances between the different NUMA nodes
> ---
> 

UMA being unaffected is not a surprise; the rest obviously is.

> The actual data reports are on an intranet web page so they are harder to 
> share. I can create PDFs or screenshots but I didn't want to just blast 
> those to the list. I'd be happy to send some directly if you are interested. 
> 

Send them to me privately please.

> Some data in text format I can easily include shows imbalances across the
> numa nodes. This is for the NAS sp.C.x benchmark because it was easiest to
> pull and see the data in text. The regressions can be seen in other tests
> as well.
> 

What was the value for x?

I ask because I ran NAS across a variety of machines for C class in two
configurations -- one using as many CPUs as possible and one running
with a third of the available CPUs for both MPI and OMP. Generally there
were small gains and losses across multiple kernels but often within the
noise or within a few percent of each other.

The largest machine I had available was 4 sockets.

The other curiosity is that you used C class. On bigger machines, that
is very short lived to the point of being almost useless. Is D class
similarly affected?

> For example:
> 
> 5.6.0_rc3.tip_lb_numa+
> sp.C.x_008_02  - CPU load average across the individual NUMA nodes 
> (timestep is 5 seconds)
> # NUMA | AVR | Utilization over time in percentage
>   0    | 5   |  12  9  3  0  0 11  8  0  1  3  5 17  9  5  0  0  0 11  3
>   1    | 16  |  20 21 10 10  2  6  9 12 11  9  9 23 24 23 24 24 24 19 20
>   2    | 21  |  19 23 26 22 22 23 25 20 25 34 38 17 13 13 13 13 13 27 13
>   3    | 15  |  19 23 20 21 21 15 15 20 20 18 10 10  9  9  9  9  9  9 11
>   4    | 19  |  13 14 15 22 23 20 19 20 17 12 15 15 25 25 24 24 24 14 24
>   5    | 3   |   0  2 11  6 20  8  0  0  0  0  0  0  0  0  0  0  0  0  9
>   6    | 0   |   0  0  0  5  0  0  0  0  0  0  1  0  0  0  0  0  0  0  0
>   7    | 4   |   0  0  0  0  0  0  4 11  9  0  0  0  0  5 12 12 12  3  0
> 
> 5.6.0-0.rc3.1.elrdy
> sp.C.x_008_01  - CPU load average across the individual NUMA nodes 
> (timestep is 5 seconds)
> # NUMA | AVR | Utilization over time in percentage
>   0    | 13  |   6  8 10 10 11 10 18 13 20 17 14 15
>   1    | 11  |  10 10 11 11  9 16 12 14  9 11 11 10
>   2    | 17  |  25 19 16 11 13 12 11 16 17 22 22 16
>   3    | 21  |  21 22 22 23 21 23 23 21 21 17 22 21
>   4    | 14  |  20 23 11 12 15 18 12 10  9 13 12 18
>   5    | 4   |   0  0  8 10  7  0  8  2  0  0  8  2
>   6    | 1   |   0  5  1  0  0  0  0  0  0  1  0  0
>   7    | 7   |   7  3 10 10 10 11  3  8 10  4  0  5
> 

A critical difference with the series is that large imbalances shouldn't
happen. Prior to the series, NUMA balancing would keep trying to move
tasks to a node while load balancing moved them back. That should not
happen any more, but there are cases where it's actually faster to have
that fight between NUMA balancing and load balancing. Ideally a degree
of imbalance would be allowed, but I haven't found a way of doing that
without side effects.
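
To be concrete about what a tolerated degree of imbalance could look
like, here is a toy sketch -- the helper name and threshold are invented
and this is *not* what the series implements:

#include <stdio.h>

/*
 * Toy sketch with an invented threshold: treat a small difference in
 * running tasks between two nodes as "balanced", so communicating
 * tasks can stay where NUMA balancing placed them instead of being
 * pulled straight back by the load balancer.
 */
#define IMBALANCE_ALLOWED	2

static long adjust_imbalance(long imbalance, unsigned int dst_nr_running)
{
	/* Only tolerate the gap while the destination is nearly idle */
	if (dst_nr_running <= 2 && imbalance <= IMBALANCE_ALLOWED)
		return 0;
	return imbalance;
}

int main(void)
{
	/* A 2-task gap next to an almost idle node is left alone... */
	printf("%ld\n", adjust_imbalance(2, 1));	/* prints 0 */
	/* ...but the same gap beside a busy node is still corrected. */
	printf("%ld\n", adjust_imbalance(2, 6));	/* prints 2 */
	return 0;
}

The hard part is the threshold: set it too generously and genuinely
overloaded nodes stay overloaded, which is the sort of side effect I
keep running into.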

Generally the utilisation looks low in either kernel, making me think
the value for x is relatively small.
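
A quick average of the AVR columns quoted above backs that up (values
transcribed from your two tables):

#include <stdio.h>

int main(void)
{
	/* AVR columns transcribed from the two tables above */
	double tip[]  = {  5, 16, 21, 15, 19, 3, 0, 4 };	/* tip_lb_numa+ */
	double base[] = { 13, 11, 17, 21, 14, 4, 1, 7 };	/* rc3.1.elrdy  */
	double st = 0.0, sb = 0.0;
	int i;

	for (i = 0; i < 8; i++) {
		st += tip[i];
		sb += base[i];
	}
	printf("mean node utilisation: tip %.1f%%, baseline %.1f%%\n",
	       st / 8.0, sb / 8.0);
	return 0;
}

That comes out at roughly 10-11% per node on either kernel, i.e. the
machine was mostly idle for the run.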

> 
> mops/s sp_C_x
> kernel      		threads	8 	16 	32 	48 	56 	64
> 5.6.0-0.rc3.1.elrdy 	mean 	22819.8 39955.3 34301.6 31086.8 30316.2 30089.2
> 			max 	23911.4 47185.6 37995.9 33742.6 33048.0 30853.4
> 			min 	20799.3 36518.0 31459.4 27559.9 28565.7 29306.3
> 			stdev 	1096.7 	3965.3 	2350.2 	2134.7 	1871.1 	612.1
> 			first_q 22780.4 37263.7 32081.7 29955.8 28609.8 29632.0
> 			median 	22936.7 37577.6 34866.0 32020.8 29299.9 29906.3
> 			third_q 23671.0 41231.7 35105.1 32154.7 32057.8 30748.0
> 5.6.0_rc3.tip_lb_numa 	mean 	12111.1 24712.6 30147.8 32560.7 31040.4 28219.4
> 			max 	17772.9 28934.4 33828.3 33659.3 32275.3 30434.9
> 			min 	9869.9 	18407.9 25092.7 31512.9 29964.3 25652.8
> 			stdev 	2868.4 	3673.6 	2877.6 	832.2 	765.8 	1800.6
> 			first_q 10763.4 23346.1 29608.5 31827.2 30609.4 27008.8
> 			median 	10855.0 25415.4 30804.4 32462.1 31061.8 27992.6
> 			third_q 11294.5 27459.2 31405.0 33341.8 31291.2 30007.9
> Comparison 		mean 	-47 	-38 	-12 	5 	2 	-6
> 			median 	-53 	-32 	-12 	1 	6 	-6
> 

So I *think* this is observing the difference in imbalance. The range
for 8 threads is massive but it stabilises when the thread count is
higher. When fewer threads are used than a single NUMA node can hold,
it can be beneficial to run everything on one node, but the load
balancer doesn't let that happen and the NUMA balancer no longer
fights it.
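
The shape of the comparison row is consistent with that. As a toy
illustration -- assuming 8 nodes of 8 CPUs, which is only a guess at
your topology -- the gap between how many nodes an even spread occupies
and how many the job actually needs closes as the thread count grows:

#include <stdio.h>

int main(void)
{
	/* 8 nodes of 8 CPUs each is an assumption about the topology */
	int nodes = 8, cpus_per_node = 8;
	int threads;

	for (threads = 8; threads <= 64; threads *= 2) {
		/* nodes an even spread by the load balancer occupies */
		int spread = threads < nodes ? threads : nodes;
		/* nodes the job would need if packed */
		int needed = (threads + cpus_per_node - 1) / cpus_per_node;

		printf("%2d threads: spread over %d nodes, %d would do\n",
		       threads, spread, needed);
	}
	return 0;
}

At 8 threads every inter-thread access is remote when spread out even
though one node would hold the whole job; at 64 threads spreading and
packing converge, which lines up with the -47% at 8 threads fading
into the noise at the higher thread counts.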

> On 5.6.0-rc3.tip-lb_numa+ we see:
> 
>   • BIG fluctuation in runtime
>   • NAS running up to 2x slower than on 5.6.0-0.rc3.1.elrdy
> 
> $ grep "Time in seconds" *log
> sp.C.x.defaultRun.008threads.loop01.log: Time in seconds = 125.73
> sp.C.x.defaultRun.008threads.loop02.log: Time in seconds = 87.54
> sp.C.x.defaultRun.008threads.loop03.log: Time in seconds = 86.93
> sp.C.x.defaultRun.008threads.loop04.log: Time in seconds = 165.98
> sp.C.x.defaultRun.008threads.loop05.log: Time in seconds = 114.78
> 
> On the other hand, runtime on 5.6.0-0.rc3.1.elrdy is stable:
> $ grep "Time in seconds" *log
> sp.C.x.defaultRun.008threads.loop01.log: Time in seconds = 59.83
> sp.C.x.defaultRun.008threads.loop02.log: Time in seconds = 67.72
> sp.C.x.defaultRun.008threads.loop03.log: Time in seconds = 63.62
> sp.C.x.defaultRun.008threads.loop04.log: Time in seconds = 55.01
> sp.C.x.defaultRun.008threads.loop05.log: Time in seconds = 65.20
> 
> It looks like things are moving around a lot but not getting balanced
> as well across the numa nodes. I have a couple of nice heat maps that 
> show this if you want to see them. 
> 

I'd like to see the heatmap. I just looked at the ones I had for NAS and
I'm not seeing a bad pattern with either all CPUs used or a third. What
I'm looking for is a pattern showing higher utilisation on one node over
another in the baseline kernel and showing relatively even utilisation
in tip/sched/core.

-- 
Mel Gorman
SUSE Labs
