* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found] <CAE4VaGBRFBM-uZEE=DdUzQkcNmpnUHdjK-7hgEeywmG8bvOOgw@mail.gmail.com>
@ 2018-06-11 14:11 ` Mel Gorman
       [not found]   ` <CAE4VaGCMS2pXfPVSnMbudexv_m5wRCTuBKA5ijh2x==11uQg9g@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2018-06-11 14:11 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Jakub Racek, linux-kernel, Rafael J. Wysocki, Len Brown,
	linux-acpi, kkolakow

On Mon, Jun 11, 2018 at 12:04:34PM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> your suggestion about the commit which has caused the regression was right
> - it's indeed this commit:
> 
> 2c83362734dad8e48ccc0710b5cd2436a0323893
> 
> The question now is what can be done to improve the results. I have made
> stream to run longer and I see that data are moved very slowly from NODE#1
> to NODE#0.
> 

Ok, this is somewhat expected although I suspect the scan rate slowed a lot
in the early phase of the program and that's why the migration is slow --
slow scan means fewer samples and it takes longer to pass the 2-pass filter.
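As an aside, the per-node residency can be watched from userspace while the
job runs. A rough sketch, assuming the benchmark binary is literally called
"stream" (it sums the N<node>=<pages> fields from numa_maps, so counts are in
the mapping's kernel page size):

    PID=$(pgrep -n stream)
    while kill -0 "$PID" 2>/dev/null; do
        awk '{ for (i = 1; i <= NF; i++)
                 if ($i ~ /^N[0-9]+=/) { split($i, kv, "="); pages[kv[1]] += kv[2] } }
             END { for (n in pages) printf "%s=%d ", n, pages[n]; print "" }' \
            /proc/$PID/numa_maps
        sleep 5
    done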

> The process has started on NODE#1 where all memory has been allocated.
> Right after the start, the process has been moved to NODE#0 but only part
> of the memory has been moved to that node. numa_preferred_nid has stayed 1
> for 30 seconds. The numa_preferred_nid has changed to 0 at
> 2018-Jun-09_03h35m58s and most of the memory has been finally reallocated.
> See the logs below.
> 
> Could we try to make numa_preferred_nid to change faster?
> 

What catches us is that each element in itself makes sense, it's just not a
universal win. The identified patch makes a reasonable choice in that fork
shouldn't necessarily spread across the machine as it hurts short-lived
or communicating processes. Unfortunately, if a load is NUMA-aware
and the processes are independent then automatic NUMA balancing has to
take action which means there is a period of time where performance is
sub-optimal. Similarly, the load balancer is making a reasonable decision
when a socket gets overloaded. Fixing any part of it for STREAM will end
up regressing something else.

The numa_preferred_nid can probably be changed faster by adjusting the scan
rate. Unfortunately, it comes with the penalty that system CPU overhead
will be higher and the process will stall more often to handle the PTE updates
and the subsequent faults. This might help STREAM but anything that is
latency sensitive will be hurt. Worse, if a socket is over-saturated and
there is a high frequency of cross-node migrations to load balance then
the scan rate might always stay at the maximum frequency with a very high cost
incurred, so we end up with another class of regression.
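For experiments it is also possible to bias the scan rate from userspace via
the numa_balancing sysctls instead of a code change. A rough example (paths as
on a 4.17-era kernel, values purely illustrative and subject to the system CPU
cost described above):

    # More aggressive scanning: shorter delay before a new task starts
    # scanning and tighter bounds on the per-task scan period.
    echo 100   > /proc/sys/kernel/numa_balancing_scan_delay_ms
    echo 200   > /proc/sys/kernel/numa_balancing_scan_period_min_ms
    echo 10000 > /proc/sys/kernel/numa_balancing_scan_period_max_ms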

Srikar Dronamraju did have a series with two patches that increase the scan
rate when there is a cross-node migration. It may be the case that it
also has the impact of changing numa_preferred_nid faster but it has a
real risk of introducing regressions. Still, for the purposes of testing
you might be interested in trying the following two patches:

Srikar Dronamraju [PATCH 17/19] sched/numa: Pass destination cpu as a parameter to migrate_task_rq
Srikar Dronamraju [PATCH 18/19] sched/numa: Reset scan rate whenever task moves across nodes

-- 
Mel Gorman
SUSE Labs


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]   ` <CAE4VaGCMS2pXfPVSnMbudexv_m5wRCTuBKA5ijh2x==11uQg9g@mail.gmail.com>
@ 2018-06-14  8:36     ` Mel Gorman
       [not found]       ` <CAE4VaGCzB99es_TpAaYvtjX8fqzFA=7HX-ezqgO6FaEB5if4zg@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2018-06-14  8:36 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Jakub Racek, linux-kernel, Rafael J. Wysocki, Len Brown,
	linux-acpi, kkolakow

On Mon, Jun 11, 2018 at 06:07:58PM +0200, Jirka Hladky wrote:
> >
> > Fixing any part of it for STREAM will end up regressing something else.
> 
> 
> I fully understand that. We run a set of benchmarks and we always look at
> the results as the ensemble. Looking only at one benchmark would be
> completely wrong.
> 

Indeed

> And in fact, we do see regression on NAS benchmark going from 4.16 to 4.17
> kernel as well. On 4 NUMA node server with Xeon Gold CPUs we see the
> regression around 26% for ft_C,   35% for mg_C_x and 25% for sp_C_x. The
> biggest regression is with 32 threads (the box has 96 CPUs in total). I
> have not yet tried if it's
> linked to 2c83362734dad8e48ccc0710b5cd2436a0323893. I will do that
> testing tomorrow.
> 

It would be worthwhile. However, it's also worth noting that 32 threads
out of 96 implies that 4 nodes would not be evenly used and it may
account for some of the discrepancy. ft and mg for C class are typically
short-lived on modern hardware and sp is not particularly long-lived
either. Hence, they are most likely to see problems with a patch that
avoids spreading tasks across the machine early. Admittedly, I have not
seen similar slowdowns but NAS has a lot of configuration options.

In terms of the speed of migration, it may be worth checking how often the
mm_numa_migrate_ratelimit tracepoint is triggered with bonus points for using
the nr_pages to calculate how many pages get throttled from migrating. If
it's high frequency then you could test increasing ratelimit_pages (which
is set at compile time despite not being a macro). It still may not work
for tasks that are too short-lived for misplaced memory to be identified
and migrated.
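If the events are recorded with trace-cmd, a rough way to get both numbers out
of the trace is to parse the nr_pages= field the tracepoint prints (file name
purely illustrative):

    trace-cmd report -i ft.trace.dat | \
        awk '/mm_numa_migrate_ratelimit/ {
                events++
                for (i = 1; i <= NF; i++)
                    if ($i ~ /^nr_pages=/) { split($i, kv, "="); pages += kv[2] }
             }
             END { printf "events=%d throttled_pages=%d\n", events, pages }'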

-- 
Mel Gorman
SUSE Labs


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]       ` <CAE4VaGCzB99es_TpAaYvtjX8fqzFA=7HX-ezqgO6FaEB5if4zg@mail.gmail.com>
@ 2018-06-15 11:25         ` Mel Gorman
       [not found]           ` <CAE4VaGBtasbDBoZ-c5R-AY++Y1BXgjrE7DwN0zOt113xmV95xw@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2018-06-15 11:25 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Jakub Racek, linux-kernel, Rafael J. Wysocki, Len Brown,
	linux-acpi, kkolakow

On Fri, Jun 15, 2018 at 01:07:32AM +0200, Jirka Hladky wrote:
> >
> > In terms of the speed of migration, it may be worth checking how often the
> > mm_numa_migrate_ratelimit tracepoint is triggered with bonus points for
> > using
> > the nr_pages to calculate how many pages get throttled from migrating. If
> > it's high frequency then you could test increasing ratelimit_pages (which
> > is set at compile time despite not being a macro). It still may not work
> > for tasks that are too short-lived to have enough time to identify a
> > misplacement and migration.
> 
> 
> I have done testing on 2 NUMA and 4 NUMA servers, all equipped with the
> same CPUs ( Gold 6126) with 48 and 96 cores respectively.
> 
> I have used ft.C.x and ft.D.x tests with 20 threads on 2 NUMA box and 32
> threads on 4 NUMA box. (This is where I see the biggest perf. drop between
> 4.16 and 4.17 kernels).  While ft.C is a short-lived test (it takes few
> seconds to finish), ft.D is a long test with runtime over 3 minutes with 20
> threads and 4.5 minutes with 20 threads.
> 

Understood.

> I have used this command to run the test:
> 
> OMP_NUM_THREADS=${THREADS} trace-cmd record -e
> migrate:mm_numa_migrate_ratelimit -o
> ${DIR}/${BIN}_${THREADS}_threads_with_trace.trace.dat ./${BIN}
> 

Ok, the fact you're using OpenMP instead of MPI is an important detail.
OpenMP threads inherit the numa_preferred_nid from their parent while
MPI ranks are usually separate processes and do not inherit the preferred
nid. The threads also share the parent's page tables so, even though there
is a preferred nid, they can still take NUMA hinting faults. This has an important
impact on what the hints look like if there is a window before a thread
gets migrated to another socket.
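A rough way to see this in practice is to read the preferred node for every
thread from the per-task sched files (requires CONFIG_NUMA_BALANCING and the
sched debug output; the binary name ft.C.x is only an example):

    PID=$(pgrep -n ft.C.x)
    grep -H numa_preferred_nid /proc/$PID/task/*/sched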

> I can see that 2c83362734dad8e48ccc0710b5cd2436a0323893 has caused big
> increase in number of mm_numa_migrate_ratelimit events.
> 

That implies the threads are getting throttled and, for NAS at least,
indicates why migration is slow. It doesn't apply to STREAM.

> I have tested following 3 kernels: 4.16, 4.16_p1
> (2c83362734dad8e48ccc0710b5cd2436a0323893) and 4.16_p2 (4.16_p1 + 2 patched
> from Srikar Dronamra).
> 
> There is clear performance drop going from 4.16 to 4.16_p1. 4.16_p2 shows a
> small improvement over 4.16_p1 for ft.C but additional perf. drop for ft.D
> on 4 NUMA node server.
> 

Ok, so as expected a higher scan rate is not necessarily a good thing.
I've observed before that often it simply increases system CPU usage
without any improvement in locality.

> I think you have mentioned that you are using NAS benchmark but you don't
> see the regression.

Correct.

> I do wonder if you run NAS with the number of
> threads being roughly 1/3 of the available cores - this is the scenario
> where I consistently see big perf. drop caused by
> 2c83362734dad8e48ccc0710b5cd2436a0323893.
> 

It's possible. Until relatively recently, the NAS configurations used as
many CPUs as possible rounded down to a power-of-two or square number
where required if MPI was in use. Due to the fact that saturating the
machine alters how MPI behaves (and is not great for OpenMP either),
I added configurations that used half of the CPUs. However, that would
mean it fits too nicely within sockets. I've added another set for one
third of the CPUs and scheduled the tests. Unfortunately, they will not
complete quickly as my test grid has a massive backlog of work.

> Results are bellow:
> 

Nice one, thanks. It's fairly clear that rate limiting may be a major
component and it's worth testing with the ratelimit increased. Given that
there have been a lot of improvements on locality and corner cases since
the rate limit was first introduced, it may also be worth considering
eliminating the rate limiting entirely and seeing what falls out.

-- 
Mel Gorman
SUSE Labs


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]           ` <CAE4VaGBtasbDBoZ-c5R-AY++Y1BXgjrE7DwN0zOt113xmV95xw@mail.gmail.com>
@ 2018-06-15 13:52             ` Mel Gorman
       [not found]               ` <CAE4VaGAdXNYXMUn4eQgMqQtLKfp6-YHMa1NUSpL-L078oX7C-w@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2018-06-15 13:52 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Jakub Racek, linux-kernel, Rafael J. Wysocki, Len Brown,
	linux-acpi, kkolakow

On Fri, Jun 15, 2018 at 02:23:17PM +0200, Jirka Hladky wrote:
> I added configurations that used half of the CPUs. However, that would
> > mean it fits too nicely within sockets. I've added another set for one
> > third of the CPUs and scheduled the tests. Unfortunately, they will not
> > complete quickly as my test grid has a massive backlog of work.
> 
> 
> We always use the number of threads being an integer multiple of the number
> of sockets.  With another number of threads, we have seen the bigger
> variation in results (that's variation between subsequent runs of the same
> test).
> 

It's not immediately obvious what's special about those numbers. I did
briefly recheck the variability of NAS on one of the machines but the
coefficient of variation was usually quite low with occasional outliers
of +/- 5% or +/- 7%. Anyway, it's a side-issue.

>  Nice one, thanks. It's fairly clear that rate limiting may be a major
> > component and it's worth testing with the ratelimit increased. Given that
> > there have been a lot of improvements on locality and corner cases since
> > the rate limit was first introduced, it may also be worth considering
> > elimintating the rate limiting entirely and see what falls out.
> 
> 
> How can we tune mm_numa_migrate_ratelimit? It doesn't seem to be a runtime
> tunable nor kernel boot parameter. Could you please share some hints on how
> to change it and what value to use? I would be interested to try it out.
> 

It's not runtime tunable I'm afraid. It's a code change and recompile.
For example the following allows more pages to be migrated within a
100ms window.

diff --git a/mm/migrate.c b/mm/migrate.c
index 8c0af0f7cab1..edb550493f06 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1862,7 +1862,7 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
  * window of time. Default here says do not migrate more than 1280M per second.
  */
 static unsigned int migrate_interval_millisecs __read_mostly = 100;
-static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
+static unsigned int ratelimit_pages __read_mostly = 512 << (20 - PAGE_SHIFT);
 
 /* Returns true if the node is migrate rate-limited after the update */
 static bool numamigrate_update_ratelimit(pg_data_t *pgdat,

-- 
Mel Gorman
SUSE Labs


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]                 ` <CAE4VaGBeTpxd1phR4rVAjqOXuLgLWPtVMPoRSOcG3HXfWDF=8w@mail.gmail.com>
@ 2018-06-19 15:18                   ` Mel Gorman
       [not found]                     ` <CAE4VaGAPOfy0RtQehKoe+443C1GRrJXCveBFgcAZ1nChVavp1g@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2018-06-19 15:18 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Jakub Racek, linux-kernel, Rafael J. Wysocki, Len Brown,
	linux-acpi, kkolakow

On Tue, Jun 19, 2018 at 03:36:53PM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> we have tested following variants:
> 
> var1: 4.16 + 2c83362734dad8e48ccc0710b5cd2436a0323893
> fix1: var1+ ratelimit_pages __read_mostly increased by factor 4x
> -static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
> +static unsigned int ratelimit_pages __read_mostly = 512 << (20 - PAGE_SHIFT);
> fix2: var1+ ratelimit_pages __read_mostly increased by factor 8x
> -static unsigned int ratelimit_pages __read_mostly = 512 << (20 - PAGE_SHIFT);
> +static unsigned int ratelimit_pages __read_mostly = 1024 << (20 - PAGE_SHIFT);
> fix3: var1+ ratelimit_pages __read_mostly increased by factor 16x
> -static unsigned int ratelimit_pages __read_mostly = 1024 << (20 - PAGE_SHIFT);
> +static unsigned int ratelimit_pages __read_mostly = 2048 << (20 - PAGE_SHIFT);
> 
> Results for the stream benchmark (standalone processes) have gradually
> improved. For fix3, stream benchmark with runtime 60 seconds does not show
> performance drop compared to 4.16 kernel anymore.
> 

Ok, so at least one option is to remove the rate limiting.  It'll be ok as
long as we don't have all of the following at once: a) cross-node migrations
are a regular event, b) each migrated task remains on the new socket long
enough for memory migrations to occur and c) the bandwidth used for
cross-node migration interferes badly with tasks accessing local memory.
It'll vary depending on workload and machine unfortunately, but the rate
limiting never accounted for the real capabilities of hardware and cannot
detect bandwidth used for regular accesses.
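One way to sanity-check point c) once the limit is gone is to watch how much
data automatic NUMA balancing actually moves, e.g. from the
numa_pages_migrated counter in /proc/vmstat. A rough sketch, assuming 4K base
pages:

    prev=$(awk '/^numa_pages_migrated/ { print $2 }' /proc/vmstat)
    while sleep 1; do
        cur=$(awk '/^numa_pages_migrated/ { print $2 }' /proc/vmstat)
        echo "$(( (cur - prev) * 4 / 1024 )) MB/s migrated"
        prev=$cur
    done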

> For the OpenMP NAS, results are still worse than with vanilla 4.16 kernel.
> Increasing the ratelimit has helped, but even the best results of {fix1,
> fix2, fix3} are still some 5-10% slower than with vanilla 4.16 kernel.  If
> I should pick the best value for ratelimit_pages __read_mostly it would be
> fix2:
> +static unsigned int ratelimit_pages __read_mostly = 1024 << (20 -
> PAGE_SHIFT);
> 
> I have also used the Intel LINPACK (OpenMP) benchmark on 4 NUMA server - it
> gives the similar results as NAS test.
> 
> I think that patch 2c83362734dad8e48ccc0710b5cd2436a0323893 needs a review
> for the OpenMP and standalone processes workflow.
> 

I did get some results although testing of the different potential patches
(revert, numabalance series, faster scanning in isolation etc.) is still
in progress. However, I did find that rate limiting was not a factor for
NAS at least (STREAM was too short lived in the configuration I used) on
the machine I used. That does not prevent the ratelimiting being removed
but it highlights that the impact is workload and machine specific.

Second, I did manage to see the regression and the fix from the revert *but*
it required both one third of the CPUs to be used and the OpenMP parallelisation
method. Using all CPUs shows no regression and using a third of the CPUs
with MPI shows no regression. In other words, the impact is specific to
the workload, the configuration and the machine.

I don't have a LINPACK configuration but you say that the behaviour is
similar so I'll stick with NAS.

On the topic of STREAM, it's meant to be a memory bandwidth benchmark and
there is no knowledge within the scheduler for automatically moving tasks
to a memory controller. It really should be tuned to run as one instance
bound to one controller for the figures to make sense. For automatic NUMA
balancing to fix it up, it needs to run long enough and it's not guaranteed
to be optimally located. I think it's less relevant as a workload in this
instance and it'll never be optimal as even spreading early does not mean
it'll spread to each memory controller.
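In other words, something along these lines: one instance per node with CPUs
and memory pinned locally, so the figures reflect per-controller bandwidth
rather than NUMA placement (thread count per instance is illustrative):

    for node in /sys/devices/system/node/node[0-9]*; do
        n=${node##*node}
        OMP_NUM_THREADS=12 numactl --cpunodebind=$n --membind=$n ./stream &
    done
    wait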

Given the specific requirement of CPUs used and parallelisation method, I
think a plain revert is not the answer because it'll fix one particular
workload and reintroduce regressions on others (as laid out in the
original changelog). There are other observations we can make about the
NAS-OpenMP workload though:

1. The locality sucks
Parallelising with MPI indicates that locality as measured by the NUMA
hinting faults achieves 94% local hits and minimal migration. With OpenMP,
locality is 66% with large amounts of migration. Many of the updates are
huge PMDs so this may be an instance of false sharing or it might be the
timing of when migrations start (the local-hit ratio can be recomputed from
vmstat, see the sketch below).

2. Migrations with the revert are lower
There are fewer migrations when the patch is reverted and this may be an
indication that it simply benefits by spreading early before any memory
is allocated so that migrations are avoided. Unfortunately, this is not
a universal win for every workload.
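On the locality point, the same local-hit ratio can be recomputed on any run
from /proc/vmstat (counter names as exported when NUMA balancing is enabled;
the counters are cumulative, so take the difference across a benchmark run):

    awk '/^numa_hint_faults / { total = $2 }
         /^numa_hint_faults_local/ { loc = $2 }
         END { if (total) printf "local hits: %.1f%%\n", 100 * loc / total }' /proc/vmstat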

I'll experiment a bit with faster migrations on cross-node accesses
but I think no matter which way we jump on this one it'll be a case of
"it helps one workload and hurts another".

-- 
Mel Gorman
SUSE Labs


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]                       ` <CAE4VaGBMeL82SJK53gtcWkor-9eXeLX6VP9juw=FW=BOyp+hMA@mail.gmail.com>
@ 2018-06-21  9:23                         ` Mel Gorman
       [not found]                           ` <CAE4VaGCQV+cS-vhdLyMwzftbB-xBHPt4Y4chg_0ykLHTE9cRfw@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2018-06-21  9:23 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Jakub Racek, linux-kernel, Rafael J. Wysocki, Len Brown,
	linux-acpi, kkolakow

On Wed, Jun 20, 2018 at 07:25:19PM +0200, Jirka Hladky wrote:
> Hi Mel and others,
> 
> I would like to let you know that I have tested following patch
> 

Understood. FWIW, there is a lot in flight at the moment but the first
likely patch is removing rate limiting entirely and seeing what falls out.
The rest of the experiment series deals with fast-scan-start, reset of
preferred_nid on cross-node load balancing and dealing with THP false
sharing but it's all preliminary and untested.

Furthermore, matters have been complicated by the posting of "Fixes for
sched/numa_balancing". My own testing indicates that this helped which
means that I need to review this first and then rebase anything else on
top of it.

I would also suggest you test that series, paying particular attention to
a) whether it improves performance and b) how close it gets to the revert
in terms of overall performance.

-- 
Mel Gorman
SUSE Labs


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]                             ` <CAE4VaGDHcZbnDpJ+FiQLfA1DRftY0j_GJSnh3FDRi34OztVH6Q@mail.gmail.com>
@ 2018-06-27  8:49                               ` Mel Gorman
       [not found]                                 ` <CAE4VaGA9KzX05rdfw2PhEATLisV-NVMc9rOyjzSg-rX1rug9Dw@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2018-06-27  8:49 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Jakub Racek, linux-kernel, Rafael J. Wysocki, Len Brown,
	linux-acpi, kkolakow

On Wed, Jun 27, 2018 at 12:18:37AM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> we have results for the "Fixes for sched/numa_balancing" series and overall
> it looks very promising.
> 
> We see improvements in the range 15-20% for the stream benchmark and
> upto 60% for the OpenMP NAS benchmark. While NAS results are noisy (have
> quite big variations)  we see improvements on a wide range of 2 and 4 NUMA
> systems.  More importantly, we don't see any regressions. I have posted
> screenshots of the median differences on 4 NUMA servers for NAS benchmark
> with
> 
> 4x Gold 6126 CPU @ 2.60GHz
> 4x E5-4627 v2 @ 3.30GHz
> 
> to illustrate the typical results.
> 
> How are things looking at your side?
> 

I saw similar results in that it was a general win. I'm also trialing
the following monolithic patch on top if you want to try it. The timing
of this will depend on when/if the numa_balancing fixes get picked up and I'm
still waiting on test results to come through. Unfortunately, on Friday,
I'll be unavailable for two weeks so this may drag on a bit.  The expanded
set of patches is at

git://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git sched-numa-fast-crossnode-v1r12

If the results are positive, I'll update the series and post it as a
RFC. Monolithic patch on top of fixes for sched/numa_balancing is as follows.

diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h
index 0dbe1d5bb936..eea5f82ca447 100644
--- a/include/linux/mmzone.h
+++ b/include/linux/mmzone.h
@@ -669,11 +669,6 @@ typedef struct pglist_data {
 	struct task_struct *kcompactd;
 #endif
 #ifdef CONFIG_NUMA_BALANCING
-	/* Rate limiting time interval */
-	unsigned long numabalancing_migrate_next_window;
-
-	/* Number of pages migrated during the rate limiting time interval */
-	unsigned long numabalancing_migrate_nr_pages;
 	int active_node_migrate;
 #endif
 	/*
diff --git a/include/trace/events/migrate.h b/include/trace/events/migrate.h
index 711372845945..de8c73f9abcf 100644
--- a/include/trace/events/migrate.h
+++ b/include/trace/events/migrate.h
@@ -71,32 +71,6 @@ TRACE_EVENT(mm_migrate_pages,
 		__print_symbolic(__entry->reason, MIGRATE_REASON))
 );
 
-TRACE_EVENT(mm_numa_migrate_ratelimit,
-
-	TP_PROTO(struct task_struct *p, int dst_nid, unsigned long nr_pages),
-
-	TP_ARGS(p, dst_nid, nr_pages),
-
-	TP_STRUCT__entry(
-		__array(	char,		comm,	TASK_COMM_LEN)
-		__field(	pid_t,		pid)
-		__field(	int,		dst_nid)
-		__field(	unsigned long,	nr_pages)
-	),
-
-	TP_fast_assign(
-		memcpy(__entry->comm, p->comm, TASK_COMM_LEN);
-		__entry->pid		= p->pid;
-		__entry->dst_nid	= dst_nid;
-		__entry->nr_pages	= nr_pages;
-	),
-
-	TP_printk("comm=%s pid=%d dst_nid=%d nr_pages=%lu",
-		__entry->comm,
-		__entry->pid,
-		__entry->dst_nid,
-		__entry->nr_pages)
-);
 #endif /* _TRACE_MIGRATE_H */
 
 /* This part must be outside protection */
diff --git a/kernel/sched/fair.c b/kernel/sched/fair.c
index 6ca3be059872..c020af2c58ec 100644
--- a/kernel/sched/fair.c
+++ b/kernel/sched/fair.c
@@ -1394,6 +1394,17 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	int last_cpupid, this_cpupid;
 
 	this_cpupid = cpu_pid_to_cpupid(dst_cpu, current->pid);
+	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
+
+	/*
+	 * Allow first faults or private faults to migrate immediately early in
+	 * the lifetime of a task. The magic number 4 is based on waiting for
+	 * two full passes of the "multi-stage node selection" test that is
+	 * executed below.
+	 */
+	if ((p->numa_preferred_nid == -1 || p->numa_scan_seq <= 4) &&
+	    (cpupid_pid_unset(last_cpupid) || cpupid_match_pid(p, last_cpupid)))
+		return true;
 
 	/*
 	 * Multi-stage node selection is used in conjunction with a periodic
@@ -1412,7 +1423,6 @@ bool should_numa_migrate_memory(struct task_struct *p, struct page * page,
 	 * This quadric squishes small probabilities, making it less likely we
 	 * act on an unlikely task<->page relation.
 	 */
-	last_cpupid = page_cpupid_xchg_last(page, this_cpupid);
 	if (!cpupid_pid_unset(last_cpupid) &&
 				cpupid_to_nid(last_cpupid) != dst_nid)
 		return false;
@@ -6702,6 +6712,9 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
 	p->se.exec_start = 0;
 
 #ifdef CONFIG_NUMA_BALANCING
+	if (!static_branch_likely(&sched_numa_balancing))
+		return;
+
 	if (!p->mm || (p->flags & PF_EXITING))
 		return;
 
@@ -6709,8 +6722,26 @@ static void migrate_task_rq_fair(struct task_struct *p, int new_cpu __maybe_unus
 		int src_nid = cpu_to_node(task_cpu(p));
 		int dst_nid = cpu_to_node(new_cpu);
 
-		if (src_nid != dst_nid)
-			p->numa_scan_period = task_scan_start(p);
+		if (src_nid == dst_nid)
+			return;
+
+		/*
+		 * Allow resets if faults have been trapped before one scan
+		 * has completed. This is most likely due to a new task that
+		 * is pulled cross-node due to wakeups or load balancing.
+		 */
+		if (p->numa_scan_seq) {
+			/*
+			 * Avoid scan adjustments if moving to the preferred
+			 * node or if the task was not previously running on
+			 * the preferred node.
+			 */
+			if (dst_nid == p->numa_preferred_nid ||
+			    (p->numa_preferred_nid != -1 && src_nid != p->numa_preferred_nid))
+				return;
+		}
+
+		p->numa_scan_period = task_scan_start(p);
 	}
 #endif
 }
diff --git a/mm/migrate.c b/mm/migrate.c
index c7749902a160..f935f4781036 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1856,54 +1856,6 @@ static struct page *alloc_misplaced_dst_page(struct page *page,
 	return newpage;
 }
 
-/*
- * page migration rate limiting control.
- * Do not migrate more than @pages_to_migrate in a @migrate_interval_millisecs
- * window of time. Default here says do not migrate more than 1280M per second.
- */
-static unsigned int migrate_interval_millisecs __read_mostly = 100;
-static unsigned int ratelimit_pages __read_mostly = 128 << (20 - PAGE_SHIFT);
-
-/* Returns true if the node is migrate rate-limited after the update */
-static bool numamigrate_update_ratelimit(pg_data_t *pgdat,
-					unsigned long nr_pages)
-{
-	unsigned long next_window, interval;
-
-	next_window = READ_ONCE(pgdat->numabalancing_migrate_next_window);
-	interval = msecs_to_jiffies(migrate_interval_millisecs);
-
-	/*
-	 * Rate-limit the amount of data that is being migrated to a node.
-	 * Optimal placement is no good if the memory bus is saturated and
-	 * all the time is being spent migrating!
-	 */
-	if (time_after(jiffies, next_window)) {
-		if (xchg(&pgdat->numabalancing_migrate_nr_pages, 0)) {
-			do {
-				next_window += interval;
-			} while (unlikely(time_after(jiffies, next_window)));
-
-			WRITE_ONCE(pgdat->numabalancing_migrate_next_window,
-							       next_window);
-		}
-	}
-	if (pgdat->numabalancing_migrate_nr_pages > ratelimit_pages) {
-		trace_mm_numa_migrate_ratelimit(current, pgdat->node_id,
-								nr_pages);
-		return true;
-	}
-
-	/*
-	 * This is an unlocked non-atomic update so errors are possible.
-	 * The consequences are failing to migrate when we potentiall should
-	 * have which is not severe enough to warrant locking. If it is ever
-	 * a problem, it can be converted to a per-cpu counter.
-	 */
-	pgdat->numabalancing_migrate_nr_pages += nr_pages;
-	return false;
-}
-
 static int numamigrate_isolate_page(pg_data_t *pgdat, struct page *page)
 {
 	int page_lru;
@@ -1976,14 +1928,6 @@ int migrate_misplaced_page(struct page *page, struct vm_area_struct *vma,
 	if (page_is_file_cache(page) && PageDirty(page))
 		goto out;
 
-	/*
-	 * Rate-limit the amount of data that is being migrated to a node.
-	 * Optimal placement is no good if the memory bus is saturated and
-	 * all the time is being spent migrating!
-	 */
-	if (numamigrate_update_ratelimit(pgdat, 1))
-		goto out;
-
 	isolated = numamigrate_isolate_page(pgdat, page);
 	if (!isolated)
 		goto out;
@@ -2030,14 +1974,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 	unsigned long mmun_start = address & HPAGE_PMD_MASK;
 	unsigned long mmun_end = mmun_start + HPAGE_PMD_SIZE;
 
-	/*
-	 * Rate-limit the amount of data that is being migrated to a node.
-	 * Optimal placement is no good if the memory bus is saturated and
-	 * all the time is being spent migrating!
-	 */
-	if (numamigrate_update_ratelimit(pgdat, HPAGE_PMD_NR))
-		goto out_dropref;
-
 	new_page = alloc_pages_node(node,
 		(GFP_TRANSHUGE_LIGHT | __GFP_THISNODE),
 		HPAGE_PMD_ORDER);
@@ -2134,7 +2070,6 @@ int migrate_misplaced_transhuge_page(struct mm_struct *mm,
 
 out_fail:
 	count_vm_events(PGMIGRATE_FAIL, HPAGE_PMD_NR);
-out_dropref:
 	ptl = pmd_lock(mm, pmd);
 	if (pmd_same(*pmd, entry)) {
 		entry = pmd_modify(entry, vma->vm_page_prot);
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index a4fc9b0798df..9049e7b26e92 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -6211,11 +6211,6 @@ static void __paginginit free_area_init_core(struct pglist_data *pgdat)
 	int nid = pgdat->node_id;
 
 	pgdat_resize_init(pgdat);
-#ifdef CONFIG_NUMA_BALANCING
-	pgdat->numabalancing_migrate_nr_pages = 0;
-	pgdat->active_node_migrate = 0;
-	pgdat->numabalancing_migrate_next_window = jiffies;
-#endif
 #ifdef CONFIG_TRANSPARENT_HUGEPAGE
 	spin_lock_init(&pgdat->split_queue_lock);
 	INIT_LIST_HEAD(&pgdat->split_queue);

-- 
Mel Gorman
SUSE Labs


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]                                       ` <CAE4VaGArxDYHzg8G203yKjgkuw3mULFSw8yCYbCcqvAUSUxy+A@mail.gmail.com>
@ 2018-07-17 10:03                                         ` Mel Gorman
       [not found]                                           ` <CAE4VaGA_L1AEj+Un0oQEEqZp_jgaFLk+Z=vNoad08oXnU2T1nw@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2018-07-17 10:03 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Kamil Kolakowski, Jakub Racek, linux-kernel, Rafael J. Wysocki,
	Len Brown, linux-acpi

On Tue, Jul 17, 2018 at 10:45:51AM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> we have compared 4.18 + git://
> git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
> sched-numa-fast-crossnode-v1r12 against 4.16 kernel and performance results
> look very good!
> 

Excellent, thanks to both Kamil and yourself for collecting the data.
It's helpful to have independent verification.

> We see performance gains about 10-20% for SPECjbb2005. NAS results are a
> little bit noisy but show overall performance gains as well (total runtime
> was reduced from 6 hours 34 minutes to 6 hours 26 minutes to give you a
> specific example).

Great.

> The only benchmark showing a slight regression is stream
> - but the regression is just a few percents ( upto 10%) and I think it's
> not a real concern given that it's an artificial benchmark.
> 

Agreed.

> How is your testing going? Do you think
> that sched-numa-fast-crossnode-v1r12 series can make it into the 4.18?
> 

My own testing completed and the results are within expectations and I
saw no red flags. Unfortunately, I consider it unlikely they'll be merged
for 4.18. Srikar Dronamraju's series is likely to need another update
and I would need to rebase my patches on top of that. Given the scope
and complexity, I find it unlikely they would be accepted for an -rc,
particularly this late of an rc. Whether we hit the 4.19 merge window or
not will depend on when Srikar's series gets updated.

> Thanks a lot for your efforts to improve the performance!

My pleasure.

-- 
Mel Gorman
SUSE Labs


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]                                           ` <CAE4VaGA_L1AEj+Un0oQEEqZp_jgaFLk+Z=vNoad08oXnU2T1nw@mail.gmail.com>
@ 2018-09-03 15:07                                             ` Jirka Hladky
  2018-09-04  9:00                                               ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Jirka Hladky @ 2018-09-03 15:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Kamil Kolakowski, Jakub Racek, linux-kernel, Rafael J. Wysocki,
	Len Brown, linux-acpi

Resending in the plain text mode.

> My own testing completed and the results are within expectations and I
> saw no red flags. Unfortunately, I consider it unlikely they'll be merged
> for 4.18. Srikar Dronamraju's series is likely to need another update
> and I would need to rebase my patches on top of that. Given the scope
> and complexity, I find it unlikely they would be accepted for an -rc,
> particularly this late of an rc. Whether we hit the 4.19 merge window or
> not will depend on when Srikar's series gets updated.


Hi Mel,

we have collaborated back in July on the scheduler patch, improving
the performance by allowing faster memory migration. You came up with
the "sched-numa-fast-crossnode-v1r12" series here:

https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git

which has shown good performance results both in your and our testing.

Do you have some update on the latest status? Is there any plan to
merge this series into 4.19 kernel? We have just tested 4.19.0-0.rc1.1
and based on the results it seems that the patch is not included (and
I don't see it listed in  git shortlog v4.18..v4.19-rc1
./kernel/sched)

With 4.19rc1 we see performance drop
  * up to 40% (NAS bench) relatively to  4.18 + sched-numa-fast-crossnode-v1r12
  * up to 20% (NAS, Stream, SPECjbb2005, SPECjvm2008) relatively to 4.18 vanilla
The performance is dropping. It's quite unclear what the next steps
are - should we wait for "sched-numa-fast-crossnode-v1r12" to be
merged or should we start looking at what has caused the drop in
performance going from 4.18 to 4.19rc1?

We would appreciate any guidance on how to proceed.

Thanks a lot!
Jirka

On Mon, Sep 3, 2018 at 5:04 PM, Jirka Hladky <jhladky@redhat.com> wrote:
>> My own testing completed and the results are within expectations and I
>> saw no red flags. Unfortunately, I consider it unlikely they'll be merged
>> for 4.18. Srikar Dronamraju's series is likely to need another update
>> and I would need to rebase my patches on top of that. Given the scope
>> and complexity, I find it unlikely they would be accepted for an -rc,
>> particularly this late of an rc. Whether we hit the 4.19 merge window or
>> not will depend on when Srikar's series gets updated.
>
>
> Hi Mel,
>
> we have collaborated back in July on the scheduler patch, improving the
> performance by allowing faster memory migration. You came up with the
> "sched-numa-fast-crossnode-v1r12" series here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
>
> which has shown good performance results both in your and our testing.
>
> Do you have some update on the latest status? Is there any plan to merge
> this series into 4.19 kernel? We have just tested 4.19.0-0.rc1.1 and based
> on the results it seems that the patch is not included (and I don't see it
> listed in  git shortlog v4.18..v4.19-rc1 ./kernel/sched)
>
> With 4.19rc1 we see performance drop
>
> up to 40% (NAS bench) relatively to  4.18 + sched-numa-fast-crossnode-v1r12
> up to 20% (NAS, Stream, SPECjbb2005, SPECjvm2008) relatively to 4.18 vanilla
>
> The performance is dropping. It's quite unclear what are the next steps -
> should we wait for "sched-numa-fast-crossnode-v1r12" to be merged or should
> we start looking at what has caused the drop in performance going from
> 4.19rc1 to 4.18?
>
> We would appreciate any guidance on how to proceed.
>
> Thanks a lot!
> Jirka
>
>
>
>
> On Tue, Jul 17, 2018 at 12:03 PM, Mel Gorman <mgorman@techsingularity.net>
> wrote:
>>
>> On Tue, Jul 17, 2018 at 10:45:51AM +0200, Jirka Hladky wrote:
>> > Hi Mel,
>> >
>> > we have compared 4.18 + git://
>> > git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
>> > sched-numa-fast-crossnode-v1r12 against 4.16 kernel and performance
>> > results
>> > look very good!
>> >
>>
>> Excellent, thanks to both Kamil and yourself for collecting the data.
>> It's helpful to have independent verification.
>>
>> > We see performance gains about 10-20% for SPECjbb2005. NAS results are a
>> > little bit noisy but show overall performance gains as well (total
>> > runtime
>> > for reduced from 6 hours 34 minutes to 6 hours 26 minutes to give you a
>> > specific example).
>>
>> Great.
>>
>> > The only benchmark showing a slight regression is stream
>> > - but the regression is just a few percents ( upto 10%) and I think it's
>> > not a real concern given that it's an artificial benchmark.
>> >
>>
>> Agreed.
>>
>> > How is your testing going? Do you think
>> > that sched-numa-fast-crossnode-v1r12 series can make it into the 4.18?
>> >
>>
>> My own testing completed and the results are within expectations and I
>> saw no red flags. Unfortunately, I consider it unlikely they'll be merged
>> for 4.18. Srikar Dronamraju's series is likely to need another update
>> and I would need to rebase my patches on top of that. Given the scope
>> and complexity, I find it unlikely they would be accepted for an -rc,
>> particularly this late of an rc. Whether we hit the 4.19 merge window or
>> not will depend on when Srikar's series gets updated.
>>
>> > Thanks a lot for your efforts to improve the performance!
>>
>> My pleasure.
>>
>> --
>> Mel Gorman
>> SUSE Labs
>
>


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-09-03 15:07                                             ` Jirka Hladky
@ 2018-09-04  9:00                                               ` Mel Gorman
  2018-09-04 10:07                                                 ` Jirka Hladky
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2018-09-04  9:00 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Kamil Kolakowski, Jakub Racek, linux-kernel, Rafael J. Wysocki,
	Len Brown, linux-acpi

On Mon, Sep 03, 2018 at 05:07:15PM +0200, Jirka Hladky wrote:
> Resending in the plain text mode.
> 
> > My own testing completed and the results are within expectations and I
> > saw no red flags. Unfortunately, I consider it unlikely they'll be merged
> > for 4.18. Srikar Dronamraju's series is likely to need another update
> > and I would need to rebase my patches on top of that. Given the scope
> > and complexity, I find it unlikely they would be accepted for an -rc,
> > particularly this late of an rc. Whether we hit the 4.19 merge window or
> > not will depend on when Srikar's series gets updated.
> 
> 
> Hi Mel,
> 
> we have collaborated back in July on the scheduler patch, improving
> the performance by allowing faster memory migration. You came up with
> the "sched-numa-fast-crossnode-v1r12" series here:
> 
> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
> 
> which has shown good performance results both in your and our testing.
> 

I remember.

> Do you have some update on the latest status? Is there any plan to
> merge this series into 4.19 kernel? We have just tested 4.19.0-0.rc1.1
> and based on the results it seems that the patch is not included (and
> I don't see it listed in  git shortlog v4.18..v4.19-rc1
> ./kernel/sched)
> 

Srikar's series that mine depended upon was only partially merged due to
a review bottleneck. He posted a v2 but it was during the merge window
and likely will need a v3 to avoid falling through the cracks. When it
is merged, I'll rebase my series on top and post it. While I didn't
check against 4.19-rc1, I did find that rebasing on top of the partial
series in 4.18 did not have as big an improvement.

> With 4.19rc1 we see performance drop
>   * up to 40% (NAS bench) relatively to  4.18 + sched-numa-fast-crossnode-v1r12
>   * up to 20% (NAS, Stream, SPECjbb2005, SPECjvm2008) relatively to 4.18 vanilla
> The performance is dropping. It's quite unclear what are the next
> steps - should we wait for "sched-numa-fast-crossnode-v1r12" to be
> merged or should we start looking at what has caused the drop in
> performance going from 4.19rc1 to 4.18?
> 

Both are valid options. If you take the latter option, I suggest looking
at whether 2d4056fafa196e1ab4e7161bae4df76f9602d56d is the source of the
issue as at least one auto-bisection found that it may be problematic.
Whether it is an issue or not depends heavily on the number of threads
relative to a socket size.

-- 
Mel Gorman
SUSE Labs


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-09-04  9:00                                               ` Mel Gorman
@ 2018-09-04 10:07                                                 ` Jirka Hladky
  2018-09-06  8:16                                                   ` Jirka Hladky
  0 siblings, 1 reply; 24+ messages in thread
From: Jirka Hladky @ 2018-09-04 10:07 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Kamil Kolakowski, Jakub Racek, linux-kernel, Rafael J. Wysocki,
	Len Brown, linux-acpi

Hi Mel,

thanks for sharing the background information! We will check if
2d4056fafa196e1ab4e7161bae4df76f9602d56d is causing the current
regression in 4.19 rc1 and let you know the outcome.

Jirka

On Tue, Sep 4, 2018 at 11:00 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
> On Mon, Sep 03, 2018 at 05:07:15PM +0200, Jirka Hladky wrote:
>> Resending in the plain text mode.
>>
>> > My own testing completed and the results are within expectations and I
>> > saw no red flags. Unfortunately, I consider it unlikely they'll be merged
>> > for 4.18. Srikar Dronamraju's series is likely to need another update
>> > and I would need to rebase my patches on top of that. Given the scope
>> > and complexity, I find it unlikely they would be accepted for an -rc,
>> > particularly this late of an rc. Whether we hit the 4.19 merge window or
>> > not will depend on when Srikar's series gets updated.
>>
>>
>> Hi Mel,
>>
>> we have collaborated back in July on the scheduler patch, improving
>> the performance by allowing faster memory migration. You came up with
>> the "sched-numa-fast-crossnode-v1r12" series here:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
>>
>> which has shown good performance results both in your and our testing.
>>
>
> I remember.
>
>> Do you have some update on the latest status? Is there any plan to
>> merge this series into 4.19 kernel? We have just tested 4.19.0-0.rc1.1
>> and based on the results it seems that the patch is not included (and
>> I don't see it listed in  git shortlog v4.18..v4.19-rc1
>> ./kernel/sched)
>>
>
> Srikar's series that mine depended upon was only partially merged due to
> a review bottleneck. He posted a v2 but it was during the merge window
> and likely will need a v3 to avoid falling through the cracks. When it
> is merged, I'll rebase my series on top and post it. While I didn't
> check against 4.19-rc1, I did find that rebasing on top of the partial
> series in 4.18 did not have as big an improvement.
>
>> With 4.19rc1 we see performance drop
>>   * up to 40% (NAS bench) relatively to  4.18 + sched-numa-fast-crossnode-v1r12
>>   * up to 20% (NAS, Stream, SPECjbb2005, SPECjvm2008) relatively to 4.18 vanilla
>> The performance is dropping. It's quite unclear what are the next
>> steps - should we wait for "sched-numa-fast-crossnode-v1r12" to be
>> merged or should we start looking at what has caused the drop in
>> performance going from 4.19rc1 to 4.18?
>>
>
> Both are valid options. If you take the latter option, I suggest looking
> at whether 2d4056fafa196e1ab4e7161bae4df76f9602d56d is the source of the
> issue as at least one auto-bisection found that it may be problematic.
> Whether it is an issue or not depends heavily on the number of threads
> relative to a socket size.
>
> --
> Mel Gorman
> SUSE Labs


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-09-04 10:07                                                 ` Jirka Hladky
@ 2018-09-06  8:16                                                   ` Jirka Hladky
  2018-09-06 12:58                                                     ` Mel Gorman
  0 siblings, 1 reply; 24+ messages in thread
From: Jirka Hladky @ 2018-09-06  8:16 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Kamil Kolakowski, Jakub Racek, linux-kernel, Rafael J. Wysocki,
	Len Brown, linux-acpi

Hi Mel,

we have results with 2d4056fafa196e1ab4e7161bae4df76f9602d56d reverted.

  * Compared to 4.18, there is still performance regression -
especially with NAS (sp_C_x subtest) and SPECjvm2008. On 4 NUMA
systems, regression is around 10-15%
  * Compared to 4.19rc1 there is a clear gain across all benchmarks around 20%

While reverting 2d4056fafa196e1ab4e7161bae4df76f9602d56d has helped a
lot there is another issue as well. Could you please recommend some
commit prior to 2d4056fafa196e1ab4e7161bae4df76f9602d56d to try?

Regarding the current results, how do we proceed? Could you please
contact Srikar and ask for the advice or should we contact him
directly?

Thanks a lot!
Jirka

On Tue, Sep 4, 2018 at 12:07 PM, Jirka Hladky <jhladky@redhat.com> wrote:
> Hi Mel,
>
> thanks for sharing the background information! We will check if
> 2d4056fafa196e1ab4e7161bae4df76f9602d56d is causing the current
> regression in 4.19 rc1 and let you know the outcome.
>
> Jirka
>
> On Tue, Sep 4, 2018 at 11:00 AM, Mel Gorman <mgorman@techsingularity.net> wrote:
>> On Mon, Sep 03, 2018 at 05:07:15PM +0200, Jirka Hladky wrote:
>>> Resending in the plain text mode.
>>>
>>> > My own testing completed and the results are within expectations and I
>>> > saw no red flags. Unfortunately, I consider it unlikely they'll be merged
>>> > for 4.18. Srikar Dronamraju's series is likely to need another update
>>> > and I would need to rebase my patches on top of that. Given the scope
>>> > and complexity, I find it unlikely they would be accepted for an -rc,
>>> > particularly this late of an rc. Whether we hit the 4.19 merge window or
>>> > not will depend on when Srikar's series gets updated.
>>>
>>>
>>> Hi Mel,
>>>
>>> we have collaborated back in July on the scheduler patch, improving
>>> the performance by allowing faster memory migration. You came up with
>>> the "sched-numa-fast-crossnode-v1r12" series here:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/mel/linux.git
>>>
>>> which has shown good performance results both in your and our testing.
>>>
>>
>> I remember.
>>
>>> Do you have some update on the latest status? Is there any plan to
>>> merge this series into 4.19 kernel? We have just tested 4.19.0-0.rc1.1
>>> and based on the results it seems that the patch is not included (and
>>> I don't see it listed in  git shortlog v4.18..v4.19-rc1
>>> ./kernel/sched)
>>>
>>
>> Srikar's series that mine depended upon was only partially merged due to
>> a review bottleneck. He posted a v2 but it was during the merge window
>> and likely will need a v3 to avoid falling through the cracks. When it
>> is merged, I'll rebase my series on top and post it. While I didn't
>> check against 4.19-rc1, I did find that rebasing on top of the partial
>> series in 4.18 did not have as big an improvement.
>>
>>> With 4.19rc1 we see performance drop
>>>   * up to 40% (NAS bench) relatively to  4.18 + sched-numa-fast-crossnode-v1r12
>>>   * up to 20% (NAS, Stream, SPECjbb2005, SPECjvm2008) relatively to 4.18 vanilla
>>> The performance is dropping. It's quite unclear what are the next
>>> steps - should we wait for "sched-numa-fast-crossnode-v1r12" to be
>>> merged or should we start looking at what has caused the drop in
>>> performance going from 4.19rc1 to 4.18?
>>>
>>
>> Both are valid options. If you take the latter option, I suggest looking
>> at whether 2d4056fafa196e1ab4e7161bae4df76f9602d56d is the source of the
>> issue as at least one auto-bisection found that it may be problematic.
>> Whether it is an issue or not depends heavily on the number of threads
>> relative to a socket size.
>>
>> --
>> Mel Gorman
>> SUSE Labs


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-09-06  8:16                                                   ` Jirka Hladky
@ 2018-09-06 12:58                                                     ` Mel Gorman
  2018-09-07  8:09                                                       ` Jirka Hladky
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2018-09-06 12:58 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Kamil Kolakowski, Jakub Racek, linux-kernel, Rafael J. Wysocki,
	Len Brown, linux-acpi

On Thu, Sep 06, 2018 at 10:16:28AM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> we have results with 2d4056fafa196e1ab4e7161bae4df76f9602d56d reverted.
> 
>   * Compared to 4.18, there is still performance regression -
> especially with NAS (sp_C_x subtest) and SPECjvm2008. On 4 NUMA
> systems, regression is around 10-15%
>   * Compared to 4.19rc1 there is a clear gain across all benchmarks around 20%
> 

Ok.

> While reverting 2d4056fafa196e1ab4e7161bae4df76f9602d56d has helped a
> lot there is another issue as well. Could you please recommend some
> commit prior to 2d4056fafa196e1ab4e7161bae4df76f9602d56d to try?
> 

Maybe 305c1fac3225dfa7eeb89bfe91b7335a6edd5172. That introduces a weird
condition in terms of idle CPU handling that has been problematic.

> Regarding the current results, how do we proceed? Could you please
> contact Srikar and ask for the advice or should we contact him
> directly?
> 

I would suggest contacting Srikar directly. While I'm working on a
series that touches off some similar areas, there is no guarantee it'll
be a success as I'm not primarily upstream focused at the moment.

Restarting the thread would also end up with a much more sensible cc
list.

-- 
Mel Gorman
SUSE Labs


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-09-06 12:58                                                     ` Mel Gorman
@ 2018-09-07  8:09                                                       ` Jirka Hladky
  2018-09-14 16:50                                                         ` Jirka Hladky
  0 siblings, 1 reply; 24+ messages in thread
From: Jirka Hladky @ 2018-09-07  8:09 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Kamil Kolakowski, Jakub Racek, linux-kernel, Rafael J. Wysocki,
	Len Brown, linux-acpi

> Maybe 305c1fac3225dfa7eeb89bfe91b7335a6edd5172. That introduces a weird
> condition in terms of idle CPU handling that has been problematic.


We will try that, thanks!

>  I would suggest contacting Srikar directly.


I will do that right away. Whom should I put on Cc? Just you and
linux-kernel@vger.kernel.org ? Should I put Ingo and Peter on Cc as
well?

$scripts/get_maintainer.pl -f kernel/sched
Ingo Molnar <mingo@redhat.com> (maintainer:SCHEDULER)
Peter Zijlstra <peterz@infradead.org> (maintainer:SCHEDULER)
linux-kernel@vger.kernel.org (open list:SCHEDULER)

Jirka

On Thu, Sep 6, 2018 at 2:58 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
> On Thu, Sep 06, 2018 at 10:16:28AM +0200, Jirka Hladky wrote:
>> Hi Mel,
>>
>> we have results with 2d4056fafa196e1ab4e7161bae4df76f9602d56d reverted.
>>
>>   * Compared to 4.18, there is still performance regression -
>> especially with NAS (sp_C_x subtest) and SPECjvm2008. On 4 NUMA
>> systems, regression is around 10-15%
>>   * Compared to 4.19rc1 there is a clear gain across all benchmarks around 20%
>>
>
> Ok.
>
>> While reverting 2d4056fafa196e1ab4e7161bae4df76f9602d56d has helped a
>> lot there is another issue as well. Could you please recommend some
>> commit prior to 2d4056fafa196e1ab4e7161bae4df76f9602d56d to try?
>>
>
> Maybe 305c1fac3225dfa7eeb89bfe91b7335a6edd5172. That introduces a weird
> condition in terms of idle CPU handling that has been problematic.
>
>> Regarding the current results, how do we proceed? Could you please
>> contact Srikar and ask for the advice or should we contact him
>> directly?
>>
>
> I would suggest contacting Srikar directly. While I'm working on a
> series that touches off some similar areas, there is no guarantee it'll
> be a success as I'm not primarily upstream focused at the moment.
>
> Restarting the thread would also end up with a much more sensible cc
> list.
>
> --
> Mel Gorman
> SUSE Labs


* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-09-07  8:09                                                       ` Jirka Hladky
@ 2018-09-14 16:50                                                         ` Jirka Hladky
  0 siblings, 0 replies; 24+ messages in thread
From: Jirka Hladky @ 2018-09-14 16:50 UTC (permalink / raw)
  To: Mel Gorman
  Cc: Kamil Kolakowski, Jakub Racek, linux-kernel, Rafael J. Wysocki,
	Len Brown, linux-acpi

Hi Mel,

we have tried to revert following 2 commits:

305c1fac3225
2d4056fafa196e1ab

We had to revert 10864a9e222048a862da2c21efa28929a4dfed15 as well.

The performance of the kernel was better than when only
2d4056fafa196e1ab was reverted but still worse than the performance of
4.18 kernel.

Since the patch series from Srikar shows very good results we will
wait until it's merged into the mainline kernel and stop the bisecting
efforts for now. Your patch series sched-numa-fast-crossnode-v1r12 (on
top of 4.18) is giving in some cases slightly better results than
Srikar's series so it would be really great if both series could be
merged together. Removing NUMA migration rate limit helps performance.

Thanks a lot for your help on this!
Jirka


On Fri, Sep 7, 2018 at 10:09 AM, Jirka Hladky <jhladky@redhat.com> wrote:
>> Maybe 305c1fac3225dfa7eeb89bfe91b7335a6edd5172. That introduces a weird
>> condition in terms of idle CPU handling that has been problematic.
>
>
> We will try that, thanks!
>
>>  I would suggest contacting Srikar directly.
>
>
> I will do that right away. Whom should I put on Cc? Just you and
> linux-kernel@vger.kernel.org ? Should I put Ingo and Peter on Cc as
> well?
>
> $scripts/get_maintainer.pl -f kernel/sched
> Ingo Molnar <mingo@redhat.com> (maintainer:SCHEDULER)
> Peter Zijlstra <peterz@infradead.org> (maintainer:SCHEDULER)
> linux-kernel@vger.kernel.org (open list:SCHEDULER)
>
> Jirka
>
> On Thu, Sep 6, 2018 at 2:58 PM, Mel Gorman <mgorman@techsingularity.net> wrote:
>> On Thu, Sep 06, 2018 at 10:16:28AM +0200, Jirka Hladky wrote:
>>> Hi Mel,
>>>
>>> we have results with 2d4056fafa196e1ab4e7161bae4df76f9602d56d reverted.
>>>
>>>   * Compared to 4.18, there is still performance regression -
>>> especially with NAS (sp_C_x subtest) and SPECjvm2008. On 4 NUMA
>>> systems, regression is around 10-15%
>>>   * Compared to 4.19rc1 there is a clear gain across all benchmarks around 20%
>>>
>>
>> Ok.
>>
>>> While reverting 2d4056fafa196e1ab4e7161bae4df76f9602d56d has helped a
>>> lot there is another issue as well. Could you please recommend some
>>> commit prior to 2d4056fafa196e1ab4e7161bae4df76f9602d56d to try?
>>>
>>
>> Maybe 305c1fac3225dfa7eeb89bfe91b7335a6edd5172. That introduces a weird
>> condition in terms of idle CPU handling that has been problematic.
>>
>>> Regarding the current results, how do we proceed? Could you please
>>> contact Srikar and ask for the advice or should we contact him
>>> directly?
>>>
>>
>> I would suggest contacting Srikar directly. While I'm working on a
>> series that touches off some similar areas, there is no guarantee it'll
>> be a success as I'm not primarily upstream focused at the moment.
>>
>> Restarting the thread would also end up with a much more sensible cc
>> list.
>>
>> --
>> Mel Gorman
>> SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]           ` <CAE4VaGATk3_Hr_2Wh44BZvXDc06A=rxUZXRFj+D=Xwh2x1YOyg@mail.gmail.com>
@ 2018-06-08 11:15             ` Mel Gorman
  0 siblings, 0 replies; 24+ messages in thread
From: Mel Gorman @ 2018-06-08 11:15 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Jakub Racek, linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi

On Fri, Jun 08, 2018 at 01:02:54PM +0200, Jirka Hladky wrote:
> >
> > Unknown and unknowable. It depends entirely on the reference pattern of
> > the different threads. If they are fully parallelised with private buffers
> > that are page-aligned then I expect it to be quick (to pass the 2-reference
> > filter).
> 
> 
> I'm running 20 parallel processes. There is no connection between them. If
> I read it correctly, the migration should happen fast in this case, right?
> 
> I have checked the source code and the variables are global and static (and
> thus allocated in the data segment). They are NOT 4k aligned:
> 
> variable a is at address: 0x9e999e0
> variable b is at address: 0x524e5e0
> variable c is at address: 0x6031e0
> 
> static double a[N],
> b[N],
> c[N];
> 

If these are 20 completely independent processes (and not sharing data via
MPI if you're using that version of STREAM) then the migration should be
relatively quick. Migrations should start within 3 seconds of the process
starting. How long it takes depends on the size of the STREAM processes,
as the address space is only scanned in chunks and migrations won't start
until there have been two full passes of it. You can partially monitor the
progress using /proc/pid/numa_maps. More detailed monitoring needs ftrace
for some activity and the use of probes on specific functions to get
detailed information.
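
For illustration, a minimal sketch of the numa_maps side of that monitoring
(not something posted in the thread): it sums the per-node page counts that
/proc/<pid>/numa_maps reports for one process, assuming 4 kB base pages and
at most 8 NUMA nodes, so migration progress can be sampled over time.

#include <stdio.h>
#include <string.h>

#define MAX_NODES 8

int main(int argc, char **argv)
{
    unsigned long pages[MAX_NODES] = { 0 };
    char path[64], line[4096];
    FILE *f;

    if (argc != 2) {
        fprintf(stderr, "usage: %s <pid>\n", argv[0]);
        return 1;
    }

    snprintf(path, sizeof(path), "/proc/%s/numa_maps", argv[1]);
    f = fopen(path, "r");
    if (!f) {
        perror(path);
        return 1;
    }

    while (fgets(line, sizeof(line), f)) {
        /* Each mapping line carries tokens like "N0=123 N1=45". */
        for (char *tok = strtok(line, " \n"); tok; tok = strtok(NULL, " \n")) {
            int node;
            unsigned long n;

            if (sscanf(tok, "N%d=%lu", &node, &n) == 2 &&
                node >= 0 && node < MAX_NODES)
                pages[node] += n;
        }
    }
    fclose(f);

    for (int node = 0; node < MAX_NODES; node++)
        if (pages[node])
            printf("N%d: %lu pages (~%lu MiB at 4 kB/page)\n",
                   node, pages[node], pages[node] * 4 / 1024);
    return 0;
}

Sampling this once per second for each stream pid gives roughly the same
picture as the numastat logs mentioned elsewhere in the thread.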

It may also be worth examining /proc/pid/sched and seeing if a task
sets numa_preferred_nid to node 0 and keeps it there even after
migrating to node 1, but that's doubtful.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]       ` <CAE4VaGAgC7vDwaa-9AzJYst9hdQ5KbnrBUnk_mfp=NeTEe5dAQ@mail.gmail.com>
@ 2018-06-08  9:24         ` Mel Gorman
       [not found]           ` <CAE4VaGATk3_Hr_2Wh44BZvXDc06A=rxUZXRFj+D=Xwh2x1YOyg@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2018-06-08  9:24 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Jakub Racek, linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi

On Fri, Jun 08, 2018 at 10:49:03AM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> > automatic NUMA balancing doesn't run long enough to migrate all the
> > memory. That would definitely be the case for STREAM.
> 
> This could explain the behavior we observe. stream is running for ~20 seconds
> at the moment. I can easily change the runtime by changing the number of
> iterations. Over what time period would you expect the memory to be fully
> migrated?
> 

Unknown and unknowable. It depends entirely on the reference pattern of
the different threads. If they are fully parallelised with private buffers
that are page-aligned then I expect it to be quick (to pass the 2-reference
filter). If threads are sharing data on a 4K (base page case) or 2M boundary
(THP enabled) then it may take longer as two or more threads will disagree
on what the appropriate placement for a page is.

> I have now checked the numastat logs and after 15 seconds I see roughly 80MiB
> out of 200MiB of the allocated memory migrated for each of the 10 processes
> which have changed NUMA node after starting. This is on a 2-socket
> Gold 6126 CPU @ 2.60GHz server with DDR4 2666 MHz. That's 800 MiB of
> memory migrated in 15 seconds, which results in an average migration
> rate of roughly 50MiB/s - is this an expected value?
> 

I expect that to be far short of the capabilities of the machine.
Again, migrations can be delayed indefinitely if threads have buffers
that are not page-aligned (4K or 2M depending).
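
As an aside, this is roughly what page-aligned buffers could look like. A
minimal sketch only, not the benchmark's actual source; the array length N
and the 2 MB THP-friendly alignment are illustrative assumptions.

#include <stdalign.h>
#include <stdlib.h>

#define N 20000000    /* hypothetical array length */

/* Page-aligned static arrays: each array starts on its own 4 kB page,
 * so no two arrays share a page across an array boundary. */
static alignas(4096) double a[N];
static alignas(4096) double b[N];
static alignas(4096) double c[N];

/* For heap buffers, posix_memalign() gives the same guarantee; aligning
 * to 2 MB keeps THP-backed buffers from straddling a huge page. */
static double *alloc_aligned(size_t n)
{
    void *p = NULL;

    if (posix_memalign(&p, 2UL * 1024 * 1024, n * sizeof(double)))
        return NULL;
    return p;
}

Aligned private buffers make it more likely that no two tasks ever reference
the same page, which is what the 2-reference filter mentioned above cares about.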

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]   ` <CAE4VaGBAZ0HCy-M2rC3ce9ePOBhE6H-LDVBuJDJMNFf40j70Aw@mail.gmail.com>
@ 2018-06-08  7:40     ` Mel Gorman
       [not found]       ` <CAE4VaGAgC7vDwaa-9AzJYst9hdQ5KbnrBUnk_mfp=NeTEe5dAQ@mail.gmail.com>
  0 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2018-06-08  7:40 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Jakub Racek, linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi

On Fri, Jun 08, 2018 at 07:49:37AM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> we will do the bisection today and report the results back.
> 

The most likely outcome is 2c83362734dad8e48ccc0710b5cd2436a0323893,
which is a patch that restricts newly forked processes from selecting a
remote node when the local node is similarly loaded. The upside is that
a task forked on an almost idle node will not be queued on a remote node.
The downside is that there are cases where the newly forked task allocates
a lot of memory and then the idle balancer spreads it anyway. It'll be a
classic case of "win some, lose some".

That would match this pattern

> > > * all processes are started at NODE #1

So at fork time, the local node is almost idle and is used

> > > * memory is also allocated on NODE #1

Early in the lifetime of the task

> > > * roughly half of the processes are moved to NODE #0 very quickly.

Idle balancer kicks in

> > > * however, memory is not moved to NODE #0 and stays allocated on NODE #1
> > >

automatic NUMA balancing doesn't run long enough to migrate all the
memory. That would definitely be the case for STREAM. It's less clear
for NAS where, depending on the parallelisation, wake_affine can keep a
task away from its memory, or the task is cross-node migrating a lot. As before,
I've no idea about linpack.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-06-06 12:27 Jakub Racek
  2018-06-06 12:34 ` Rafael J. Wysocki
  2018-06-07 11:07 ` Michal Hocko
@ 2018-06-07 12:39 ` Mel Gorman
       [not found]   ` <CAE4VaGBAZ0HCy-M2rC3ce9ePOBhE6H-LDVBuJDJMNFf40j70Aw@mail.gmail.com>
  2 siblings, 1 reply; 24+ messages in thread
From: Mel Gorman @ 2018-06-07 12:39 UTC (permalink / raw)
  To: Jakub Racek; +Cc: linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi

On Wed, Jun 06, 2018 at 02:27:32PM +0200, Jakub Racek wrote:
> There is a huge performance regression on 2 and 4 NUMA node systems on the
> stream benchmark with the 4.17 kernel compared to the 4.16 kernel. Stream, Linpack
> and NAS parallel benchmarks show up to a 50% performance drop.
> 

I have not observed this yet but NAS is the only one I'll see and that could
be a week or more away before I have data. I'll keep an eye out at least.

> When running for example 20 stream processes in parallel, we see the following behavior:
> 
> * all processes are started at NODE #1
> * memory is also allocated on NODE #1
> * roughly half of the processes are moved to NODE #0 very quickly.
> * however, memory is not moved to NODE #0 and stays allocated on NODE #1
> 

Ok, 20 processes getting rescheduled to another node is not unreasonable
from a load-balancing perspective but memory locality is not always taken
into account. You also don't state what parallelisation method you used
for STREAM and it's relevant because of how tasks end up communicating
and what that means for placement.

The only automatic NUMA balancing patch I can think of that has a high
chance of being a factor is 7347fc87dfe6b7315e74310ee1243dc222c68086
but I cannot see how STREAM would be affected as I severely doubt
the processes are communicating heavily (unless openmp and then it's
a maybe). It might affect NAS because that does a lot of wakeups
via futex that has "interesting" characteristics (either openmp or
openmpi). 082f764a2f3f2968afa1a0b04a1ccb1b70633844 might also be a factor
but it's doubtful. I don't know about Linpack as I've never characterised
it so I don't know how it behaves.

There are a few patches that affect utilisation calculation which might
affect the load balancer but I can't pinpoint a single likely candidate.

Given that STREAM is usually short-lived, is bisection an option?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-06-07 11:07 ` Michal Hocko
@ 2018-06-07 11:19   ` Jakub Raček
  0 siblings, 0 replies; 24+ messages in thread
From: Jakub Raček @ 2018-06-07 11:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi,
	Mel Gorman, linux-mm

Hi,

On 06/07/2018 01:07 PM, Michal Hocko wrote:
> [CCing Mel and MM mailing list]
> 
> On Wed 06-06-18 14:27:32, Jakub Racek wrote:
>> Hi,
>>
>> There is a huge performance regression on 2 and 4 NUMA node systems on the
>> stream benchmark with the 4.17 kernel compared to the 4.16 kernel. Stream, Linpack
>> and NAS parallel benchmarks show up to a 50% performance drop.
>>
>> When running for example 20 stream processes in parallel, we see the following behavior:
>>
>> * all processes are started at NODE #1
>> * memory is also allocated on NODE #1
>> * roughly half of the processes are moved to NODE #0 very quickly.
>> * however, memory is not moved to NODE #0 and stays allocated on NODE #1
>>
>> As a result, half of the processes are running on NODE#0 with memory
>> still allocated on NODE#1. This leads to non-local memory accesses,
>> visible as a high Remote-To-Local Memory Access Ratio on the numatop charts.
>>
>> So it seems that 4.17 is not doing a good job of moving the memory to the right NUMA
>> node after the process has been moved.
>>
>> ----8<----
>>
>> The above is an excerpt from performance testing on 4.16 and 4.17 kernels.
>>
>> For now I'm merely making sure the problem is reported.
> 
> Do you have numa balancing enabled?
> 

Yes. The relevant settings are:

kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256
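
A minimal sketch (not from the thread itself) that dumps these knobs straight
from /proc/sys/kernel/, which is where the kernel.* sysctls listed above live
on a 4.17-era kernel:

#include <stdio.h>

int main(void)
{
    static const char * const knobs[] = {
        "numa_balancing",
        "numa_balancing_scan_delay_ms",
        "numa_balancing_scan_period_min_ms",
        "numa_balancing_scan_period_max_ms",
        "numa_balancing_scan_size_mb",
    };
    char path[128], value[64];

    for (unsigned int i = 0; i < sizeof(knobs) / sizeof(knobs[0]); i++) {
        FILE *f;

        snprintf(path, sizeof(path), "/proc/sys/kernel/%s", knobs[i]);
        f = fopen(path, "r");
        if (!f)
            continue;
        /* The value read from procfs already ends with a newline. */
        if (fgets(value, sizeof(value), f))
            printf("kernel.%s = %s", knobs[i], value);
        fclose(f);
    }
    return 0;
}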


-- 
Best regards,
Jakub Racek
FMK

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-06-06 12:27 Jakub Racek
  2018-06-06 12:34 ` Rafael J. Wysocki
@ 2018-06-07 11:07 ` Michal Hocko
  2018-06-07 11:19   ` Jakub Raček
  2018-06-07 12:39 ` Mel Gorman
  2 siblings, 1 reply; 24+ messages in thread
From: Michal Hocko @ 2018-06-07 11:07 UTC (permalink / raw)
  To: Jakub Racek
  Cc: linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi,
	Mel Gorman, linux-mm

[CCing Mel and MM mailing list]

On Wed 06-06-18 14:27:32, Jakub Racek wrote:
> Hi,
> 
> There is a huge performance regression on 2 and 4 NUMA node systems on the
> stream benchmark with the 4.17 kernel compared to the 4.16 kernel. Stream, Linpack
> and NAS parallel benchmarks show up to a 50% performance drop.
> 
> When running for example 20 stream processes in parallel, we see the following behavior:
> 
> * all processes are started at NODE #1
> * memory is also allocated on NODE #1
> * roughly half of the processes are moved to NODE #0 very quickly.
> * however, memory is not moved to NODE #0 and stays allocated on NODE #1
> 
> As a result, half of the processes are running on NODE#0 with memory
> still allocated on NODE#1. This leads to non-local memory accesses,
> visible as a high Remote-To-Local Memory Access Ratio on the numatop charts.
>
> So it seems that 4.17 is not doing a good job of moving the memory to the right NUMA
> node after the process has been moved.
> 
> ----8<----
> 
> The above is an excerpt from performance testing on 4.16 and 4.17 kernels.
> 
> For now I'm merely making sure the problem is reported.

Do you have numa balancing enabled?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-06-06 12:34 ` Rafael J. Wysocki
@ 2018-06-06 12:44   ` Rafael J. Wysocki
  0 siblings, 0 replies; 24+ messages in thread
From: Rafael J. Wysocki @ 2018-06-06 12:44 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Jakub Racek, Linux Kernel Mailing List, Rafael J. Wysocki,
	Len Brown, ACPI Devel Maling List, Peter Zijlstra

On Wed, Jun 6, 2018 at 2:34 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Jun 6, 2018 at 2:27 PM, Jakub Racek <jracek@redhat.com> wrote:
>> Hi,
>>
>> There is a huge performance regression on 2 and 4 NUMA node systems on the
>> stream benchmark with the 4.17 kernel compared to the 4.16 kernel. Stream, Linpack
>> and NAS parallel benchmarks show up to a 50% performance drop.
>>
>> When running for example 20 stream processes in parallel, we see the
>> following behavior:
>>
>> * all processes are started at NODE #1
>> * memory is also allocated on NODE #1
>> * roughly half of the processes are moved to NODE #0 very quickly.
>> * however, memory is not moved to NODE #0 and stays allocated on NODE #1
>>
>> As a result, half of the processes are running on NODE#0 with memory
>> still allocated on NODE#1. This leads to non-local memory accesses,
>> visible as a high Remote-To-Local Memory Access Ratio on the numatop charts.
>> So it seems that 4.17 is not doing a good job of moving the memory to the
>> right NUMA node after the process has been moved.
>>
>> ----8<----
>>
>> The above is an excerpt from performance testing on 4.16 and 4.17 kernels.
>>
>> For now I'm merely making sure the problem is reported.
>
> OK, and why do you think that it is related to ACPI?

In any case, we need more information here.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-06-06 12:27 Jakub Racek
@ 2018-06-06 12:34 ` Rafael J. Wysocki
  2018-06-06 12:44   ` Rafael J. Wysocki
  2018-06-07 11:07 ` Michal Hocko
  2018-06-07 12:39 ` Mel Gorman
  2 siblings, 1 reply; 24+ messages in thread
From: Rafael J. Wysocki @ 2018-06-06 12:34 UTC (permalink / raw)
  To: Jakub Racek
  Cc: Linux Kernel Mailing List, Rafael J. Wysocki, Len Brown,
	ACPI Devel Maling List

On Wed, Jun 6, 2018 at 2:27 PM, Jakub Racek <jracek@redhat.com> wrote:
> Hi,
>
> There is a huge performance regression on 2 and 4 NUMA node systems on the
> stream benchmark with the 4.17 kernel compared to the 4.16 kernel. Stream, Linpack
> and NAS parallel benchmarks show up to a 50% performance drop.
>
> When running for example 20 stream processes in parallel, we see the
> following behavior:
>
> * all processes are started at NODE #1
> * memory is also allocated on NODE #1
> * roughly half of the processes are moved to NODE #0 very quickly.
> * however, memory is not moved to NODE #0 and stays allocated on NODE #1
>
> As a result, half of the processes are running on NODE#0 with memory
> still allocated on NODE#1. This leads to non-local memory accesses,
> visible as a high Remote-To-Local Memory Access Ratio on the numatop charts.
> So it seems that 4.17 is not doing a good job of moving the memory to the
> right NUMA node after the process has been moved.
>
> ----8<----
>
> The above is an excerpt from performance testing on 4.16 and 4.17 kernels.
>
> For now I'm merely making sure the problem is reported.

OK, and why do you think that it is related to ACPI?

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 24+ messages in thread

* [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
@ 2018-06-06 12:27 Jakub Racek
  2018-06-06 12:34 ` Rafael J. Wysocki
                   ` (2 more replies)
  0 siblings, 3 replies; 24+ messages in thread
From: Jakub Racek @ 2018-06-06 12:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rafael J. Wysocki, Len Brown, linux-acpi, jracek

Hi,

There is a huge performance regression on 2 and 4 NUMA node systems on the stream
benchmark with the 4.17 kernel compared to the 4.16 kernel.
Stream, Linpack and NAS parallel benchmarks show up to a 50% performance drop.

When running for example 20 stream processes in parallel, we see the following behavior:

* all processes are started at NODE #1
* memory is also allocated on NODE #1
* roughly half of the processes are moved to NODE #0 very quickly.
* however, memory is not moved to NODE #0 and stays allocated on NODE #1

As a result, half of the processes are running on NODE#0 with memory still
allocated on NODE#1. This leads to non-local memory accesses,
visible as a high Remote-To-Local Memory Access Ratio on the numatop charts.

So it seems that 4.17 is not doing a good job of moving the memory to the right NUMA
node after the process has been moved.

----8<----

The above is an excerpt from performance testing on 4.16 and 4.17 kernels.

For now I'm merely making sure the problem is reported.
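
For anyone wanting to confirm the placement described above from inside a
process, per-page node residency can be queried with move_pages(2) when it is
passed a NULL node list (report-only, no migration). A minimal sketch, not
part of the original measurement, assuming a 2-node system, linking against
libnuma (-lnuma), and with "buf"/"bytes" standing in for one of the benchmark
arrays:

#include <numaif.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Count how many of the buffer's pages currently sit on node 0 vs node 1. */
static void report_placement(void *buf, size_t bytes)
{
    long pagesize = sysconf(_SC_PAGESIZE);
    size_t npages = (bytes + pagesize - 1) / pagesize;
    void **pages = calloc(npages, sizeof(void *));
    int *status = calloc(npages, sizeof(int));
    long on_node[2] = { 0, 0 };

    if (!pages || !status)
        goto out;

    for (size_t i = 0; i < npages; i++)
        pages[i] = (char *)buf + i * pagesize;

    /* nodes == NULL: nothing is moved; status[] reports each page's node. */
    if (move_pages(0 /* self */, npages, pages, NULL, status, 0) == 0) {
        for (size_t i = 0; i < npages; i++)
            if (status[i] == 0 || status[i] == 1)
                on_node[status[i]]++;
        printf("N0=%ld pages, N1=%ld pages\n", on_node[0], on_node[1]);
    }
out:
    free(pages);
    free(status);
}

Called periodically from one of the stream processes, this shows whether the
pages actually follow the task to NODE #0.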

Thank you.

Best regards,
Jakub Racek

^ permalink raw reply	[flat|nested] 24+ messages in thread

end of thread, other threads:[~2018-09-14 16:51 UTC | newest]

Thread overview: 24+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <CAE4VaGBRFBM-uZEE=DdUzQkcNmpnUHdjK-7hgEeywmG8bvOOgw@mail.gmail.com>
2018-06-11 14:11 ` [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks Mel Gorman
     [not found]   ` <CAE4VaGCMS2pXfPVSnMbudexv_m5wRCTuBKA5ijh2x==11uQg9g@mail.gmail.com>
2018-06-14  8:36     ` Mel Gorman
     [not found]       ` <CAE4VaGCzB99es_TpAaYvtjX8fqzFA=7HX-ezqgO6FaEB5if4zg@mail.gmail.com>
2018-06-15 11:25         ` Mel Gorman
     [not found]           ` <CAE4VaGBtasbDBoZ-c5R-AY++Y1BXgjrE7DwN0zOt113xmV95xw@mail.gmail.com>
2018-06-15 13:52             ` Mel Gorman
     [not found]               ` <CAE4VaGAdXNYXMUn4eQgMqQtLKfp6-YHMa1NUSpL-L078oX7C-w@mail.gmail.com>
     [not found]                 ` <CAE4VaGBeTpxd1phR4rVAjqOXuLgLWPtVMPoRSOcG3HXfWDF=8w@mail.gmail.com>
2018-06-19 15:18                   ` Mel Gorman
     [not found]                     ` <CAE4VaGAPOfy0RtQehKoe+443C1GRrJXCveBFgcAZ1nChVavp1g@mail.gmail.com>
     [not found]                       ` <CAE4VaGBMeL82SJK53gtcWkor-9eXeLX6VP9juw=FW=BOyp+hMA@mail.gmail.com>
2018-06-21  9:23                         ` Mel Gorman
     [not found]                           ` <CAE4VaGCQV+cS-vhdLyMwzftbB-xBHPt4Y4chg_0ykLHTE9cRfw@mail.gmail.com>
     [not found]                             ` <CAE4VaGDHcZbnDpJ+FiQLfA1DRftY0j_GJSnh3FDRi34OztVH6Q@mail.gmail.com>
2018-06-27  8:49                               ` Mel Gorman
     [not found]                                 ` <CAE4VaGA9KzX05rdfw2PhEATLisV-NVMc9rOyjzSg-rX1rug9Dw@mail.gmail.com>
     [not found]                                   ` <CABuKy6MUNX85PBVchz_hqXy+FxXU2x0U9ZEZB13rVSLGpWOWvQ@mail.gmail.com>
     [not found]                                     ` <CAE4VaGD12BLS_kk=pRwgTKL8YOU63Nowwa42cEdZObQ=P1MFnA@mail.gmail.com>
     [not found]                                       ` <CAE4VaGArxDYHzg8G203yKjgkuw3mULFSw8yCYbCcqvAUSUxy+A@mail.gmail.com>
2018-07-17 10:03                                         ` Mel Gorman
     [not found]                                           ` <CAE4VaGA_L1AEj+Un0oQEEqZp_jgaFLk+Z=vNoad08oXnU2T1nw@mail.gmail.com>
2018-09-03 15:07                                             ` Jirka Hladky
2018-09-04  9:00                                               ` Mel Gorman
2018-09-04 10:07                                                 ` Jirka Hladky
2018-09-06  8:16                                                   ` Jirka Hladky
2018-09-06 12:58                                                     ` Mel Gorman
2018-09-07  8:09                                                       ` Jirka Hladky
2018-09-14 16:50                                                         ` Jirka Hladky
2018-06-06 12:27 Jakub Racek
2018-06-06 12:34 ` Rafael J. Wysocki
2018-06-06 12:44   ` Rafael J. Wysocki
2018-06-07 11:07 ` Michal Hocko
2018-06-07 11:19   ` Jakub Raček
2018-06-07 12:39 ` Mel Gorman
     [not found]   ` <CAE4VaGBAZ0HCy-M2rC3ce9ePOBhE6H-LDVBuJDJMNFf40j70Aw@mail.gmail.com>
2018-06-08  7:40     ` Mel Gorman
     [not found]       ` <CAE4VaGAgC7vDwaa-9AzJYst9hdQ5KbnrBUnk_mfp=NeTEe5dAQ@mail.gmail.com>
2018-06-08  9:24         ` Mel Gorman
     [not found]           ` <CAE4VaGATk3_Hr_2Wh44BZvXDc06A=rxUZXRFj+D=Xwh2x1YOyg@mail.gmail.com>
2018-06-08 11:15             ` Mel Gorman

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).