[4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks

All of lore.kernel.org
 help / color / mirror / Atom feed

* [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
@ 2018-06-06 12:27 Jakub Racek
  2018-06-06 12:34 ` Rafael J. Wysocki
                   ` (2 more replies)
  0 siblings, 3 replies; 12+ messages in thread
From: Jakub Racek @ 2018-06-06 12:27 UTC (permalink / raw)
  To: linux-kernel; +Cc: Rafael J. Wysocki, Len Brown, linux-acpi, jracek

Hi,

There is a huge performance regression on the 2 and 4 NUMA node systems on stream 
benchmark with 4.17 kernel compared to 4.16 kernel. 
Stream, Linpack and NAS parallel benchmarks show upto 50% performance drop.

When running for example 20 stream processes in parallel, we see the following behavior:

* all processes are started at NODE #1
* memory is also allocated on NODE #1
* roughly half of the processes are moved to the NODE #0 very quickly. 
* however, memory is not moved to NODE #0 and stays allocated on NODE #1

As the result, half of the processes are running on NODE#0 with memory being still 
allocated on NODE#1. This leads to non-local memory accesses
on the high Remote-To-Local Memory Access Ratio on the numatop charts.  

So it seems that 4.17 is not doing a good job to move the memory to the right NUMA
node after the process has been moved.

----8<----

The above is an excerpt from performance testing on 4.16 and 4.17 kernels.

For now I'm merely making sure the problem is reported.

Thank you.

Best regards,
Jakub Racek

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-06-06 12:27 [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks Jakub Racek
@ 2018-06-06 12:34 ` Rafael J. Wysocki
  2018-06-06 12:44   ` Rafael J. Wysocki
  2018-06-06 12:50   ` Jakub Racek
  2018-06-07 11:07 ` [4.17 regression] " Michal Hocko
  2018-06-07 12:39 ` Mel Gorman
  2 siblings, 2 replies; 12+ messages in thread
From: Rafael J. Wysocki @ 2018-06-06 12:34 UTC (permalink / raw)
  To: Jakub Racek
  Cc: Linux Kernel Mailing List, Rafael J. Wysocki, Len Brown,
	ACPI Devel Maling List

On Wed, Jun 6, 2018 at 2:27 PM, Jakub Racek <jracek@redhat.com> wrote:
> Hi,
>
> There is a huge performance regression on the 2 and 4 NUMA node systems on
> stream benchmark with 4.17 kernel compared to 4.16 kernel. Stream, Linpack
> and NAS parallel benchmarks show upto 50% performance drop.
>
> When running for example 20 stream processes in parallel, we see the
> following behavior:
>
> * all processes are started at NODE #1
> * memory is also allocated on NODE #1
> * roughly half of the processes are moved to the NODE #0 very quickly. *
> however, memory is not moved to NODE #0 and stays allocated on NODE #1
>
> As the result, half of the processes are running on NODE#0 with memory being
> still allocated on NODE#1. This leads to non-local memory accesses
> on the high Remote-To-Local Memory Access Ratio on the numatop charts.
> So it seems that 4.17 is not doing a good job to move the memory to the
> right NUMA
> node after the process has been moved.
>
> ----8<----
>
> The above is an excerpt from performance testing on 4.16 and 4.17 kernels.
>
> For now I'm merely making sure the problem is reported.

OK, and why do you think that it is related to ACPI?

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-06-06 12:34 ` Rafael J. Wysocki
@ 2018-06-06 12:44   ` Rafael J. Wysocki
  2018-06-06 12:50   ` Jakub Racek
  1 sibling, 0 replies; 12+ messages in thread
From: Rafael J. Wysocki @ 2018-06-06 12:44 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Jakub Racek, Linux Kernel Mailing List, Rafael J. Wysocki,
	Len Brown, ACPI Devel Maling List, Peter Zijlstra

On Wed, Jun 6, 2018 at 2:34 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Jun 6, 2018 at 2:27 PM, Jakub Racek <jracek@redhat.com> wrote:
>> Hi,
>>
>> There is a huge performance regression on the 2 and 4 NUMA node systems on
>> stream benchmark with 4.17 kernel compared to 4.16 kernel. Stream, Linpack
>> and NAS parallel benchmarks show upto 50% performance drop.
>>
>> When running for example 20 stream processes in parallel, we see the
>> following behavior:
>>
>> * all processes are started at NODE #1
>> * memory is also allocated on NODE #1
>> * roughly half of the processes are moved to the NODE #0 very quickly. *
>> however, memory is not moved to NODE #0 and stays allocated on NODE #1
>>
>> As the result, half of the processes are running on NODE#0 with memory being
>> still allocated on NODE#1. This leads to non-local memory accesses
>> on the high Remote-To-Local Memory Access Ratio on the numatop charts.
>> So it seems that 4.17 is not doing a good job to move the memory to the
>> right NUMA
>> node after the process has been moved.
>>
>> ----8<----
>>
>> The above is an excerpt from performance testing on 4.16 and 4.17 kernels.
>>
>> For now I'm merely making sure the problem is reported.
>
> OK, and why do you think that it is related to ACPI?

In any case, we need more information here.

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-06-06 12:34 ` Rafael J. Wysocki
  2018-06-06 12:44   ` Rafael J. Wysocki
@ 2018-06-06 12:50   ` Jakub Racek
  2018-06-06 12:56     ` Rafael J. Wysocki
  1 sibling, 1 reply; 12+ messages in thread
From: Jakub Racek @ 2018-06-06 12:50 UTC (permalink / raw)
  To: Rafael J. Wysocki
  Cc: Linux Kernel Mailing List, Rafael J. Wysocki, Len Brown,
	ACPI Devel Maling List

+++ Rafael J. Wysocki [06/06/18 14:34 +0200]:
>On Wed, Jun 6, 2018 at 2:27 PM, Jakub Racek <jracek@redhat.com> wrote:
>> Hi,
>>
>> There is a huge performance regression on the 2 and 4 NUMA node systems on
>> stream benchmark with 4.17 kernel compared to 4.16 kernel. Stream, Linpack
>> and NAS parallel benchmarks show upto 50% performance drop.
>
>OK, and why do you think that it is related to ACPI?

I don't know where the problems is or who to CC. 

What information should be added? I can probably provide it, if I know
what.

>Thanks,
>Rafael

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-06-06 12:50   ` Jakub Racek
@ 2018-06-06 12:56     ` Rafael J. Wysocki
  0 siblings, 0 replies; 12+ messages in thread
From: Rafael J. Wysocki @ 2018-06-06 12:56 UTC (permalink / raw)
  To: Jakub Racek
  Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Rafael J. Wysocki,
	Len Brown, ACPI Devel Maling List

On Wed, Jun 6, 2018 at 2:50 PM, Jakub Racek <jracek@redhat.com> wrote:
> +++ Rafael J. Wysocki [06/06/18 14:34 +0200]:
>>
>> On Wed, Jun 6, 2018 at 2:27 PM, Jakub Racek <jracek@redhat.com> wrote:
>>>
>>> Hi,
>>>
>>> There is a huge performance regression on the 2 and 4 NUMA node systems
>>> on
>>> stream benchmark with 4.17 kernel compared to 4.16 kernel. Stream,
>>> Linpack
>>> and NAS parallel benchmarks show upto 50% performance drop.
>>
>>
>> OK, and why do you think that it is related to ACPI?
>
>
> I don't know where the problems is or who to CC.
> What information should be added? I can probably provide it, if I know
> what.

The problem appears to be reproducible 100% of the time, so can you
possibly carry out a "git bisect" binary search to find the
problematic commit?

Thanks,
Rafael

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-06-06 12:27 [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks Jakub Racek
  2018-06-06 12:34 ` Rafael J. Wysocki
@ 2018-06-07 11:07 ` Michal Hocko
  2018-06-07 11:19   ` Jakub Raček
  2018-06-07 12:39 ` Mel Gorman
  2 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2018-06-07 11:07 UTC (permalink / raw)
  To: Jakub Racek
  Cc: linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi,
	Mel Gorman, linux-mm

[CCing Mel and MM mailing list]

On Wed 06-06-18 14:27:32, Jakub Racek wrote:
> Hi,
> 
> There is a huge performance regression on the 2 and 4 NUMA node systems on
> stream benchmark with 4.17 kernel compared to 4.16 kernel. Stream, Linpack
> and NAS parallel benchmarks show upto 50% performance drop.
> 
> When running for example 20 stream processes in parallel, we see the following behavior:
> 
> * all processes are started at NODE #1
> * memory is also allocated on NODE #1
> * roughly half of the processes are moved to the NODE #0 very quickly. *
> however, memory is not moved to NODE #0 and stays allocated on NODE #1
> 
> As the result, half of the processes are running on NODE#0 with memory being
> still allocated on NODE#1. This leads to non-local memory accesses
> on the high Remote-To-Local Memory Access Ratio on the numatop charts.
> 
> So it seems that 4.17 is not doing a good job to move the memory to the right NUMA
> node after the process has been moved.
> 
> ----8<----
> 
> The above is an excerpt from performance testing on 4.16 and 4.17 kernels.
> 
> For now I'm merely making sure the problem is reported.

Do you have numa balancing enabled?
-- 
Michal Hocko
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-06-07 11:07 ` [4.17 regression] " Michal Hocko
@ 2018-06-07 11:19   ` Jakub Raček
  2018-06-07 11:56     ` Jirka Hladky
  0 siblings, 1 reply; 12+ messages in thread
From: Jakub Raček @ 2018-06-07 11:19 UTC (permalink / raw)
  To: Michal Hocko
  Cc: linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi,
	Mel Gorman, linux-mm

Hi,

On 06/07/2018 01:07 PM, Michal Hocko wrote:
> [CCing Mel and MM mailing list]
> 
> On Wed 06-06-18 14:27:32, Jakub Racek wrote:
>> Hi,
>>
>> There is a huge performance regression on the 2 and 4 NUMA node systems on
>> stream benchmark with 4.17 kernel compared to 4.16 kernel. Stream, Linpack
>> and NAS parallel benchmarks show upto 50% performance drop.
>>
>> When running for example 20 stream processes in parallel, we see the following behavior:
>>
>> * all processes are started at NODE #1
>> * memory is also allocated on NODE #1
>> * roughly half of the processes are moved to the NODE #0 very quickly. *
>> however, memory is not moved to NODE #0 and stays allocated on NODE #1
>>
>> As the result, half of the processes are running on NODE#0 with memory being
>> still allocated on NODE#1. This leads to non-local memory accesses
>> on the high Remote-To-Local Memory Access Ratio on the numatop charts.
>>
>> So it seems that 4.17 is not doing a good job to move the memory to the right NUMA
>> node after the process has been moved.
>>
>> ----8<----
>>
>> The above is an excerpt from performance testing on 4.16 and 4.17 kernels.
>>
>> For now I'm merely making sure the problem is reported.
> 
> Do you have numa balancing enabled?
> 

Yes. The relevant settings are:

kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256


-- 
Best regards,
Jakub Racek
FMK

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-06-07 11:19   ` Jakub Raček
@ 2018-06-07 11:56     ` Jirka Hladky
  0 siblings, 0 replies; 12+ messages in thread
From: Jirka Hladky @ 2018-06-07 11:56 UTC (permalink / raw)
  To: Jakub Raček
  Cc: Michal Hocko, linux-kernel, Rafael J. Wysocki, Len Brown,
	linux-acpi, Mel Gorman, linux-mm, jhladky

[-- Attachment #1: Type: text/plain, Size: 1781 bytes --]

Adding myself to Cc.

On Thu, Jun 7, 2018 at 1:19 PM, Jakub Raček <jracek@redhat.com> wrote:

> Hi,
>
> On 06/07/2018 01:07 PM, Michal Hocko wrote:
>
>> [CCing Mel and MM mailing list]
>>
>> On Wed 06-06-18 14:27:32, Jakub Racek wrote:
>>
>>> Hi,
>>>
>>> There is a huge performance regression on the 2 and 4 NUMA node systems
>>> on
>>> stream benchmark with 4.17 kernel compared to 4.16 kernel. Stream,
>>> Linpack
>>> and NAS parallel benchmarks show upto 50% performance drop.
>>>
>>> When running for example 20 stream processes in parallel, we see the
>>> following behavior:
>>>
>>> * all processes are started at NODE #1
>>> * memory is also allocated on NODE #1
>>> * roughly half of the processes are moved to the NODE #0 very quickly. *
>>> however, memory is not moved to NODE #0 and stays allocated on NODE #1
>>>
>>> As the result, half of the processes are running on NODE#0 with memory
>>> being
>>> still allocated on NODE#1. This leads to non-local memory accesses
>>> on the high Remote-To-Local Memory Access Ratio on the numatop charts.
>>>
>>> So it seems that 4.17 is not doing a good job to move the memory to the
>>> right NUMA
>>> node after the process has been moved.
>>>
>>> ----8<----
>>>
>>> The above is an excerpt from performance testing on 4.16 and 4.17
>>> kernels.
>>>
>>> For now I'm merely making sure the problem is reported.
>>>
>>
>> Do you have numa balancing enabled?
>>
>>
> Yes. The relevant settings are:
>
> kernel.numa_balancing = 1
> kernel.numa_balancing_scan_delay_ms = 1000
> kernel.numa_balancing_scan_period_max_ms = 60000
> kernel.numa_balancing_scan_period_min_ms = 1000
> kernel.numa_balancing_scan_size_mb = 256
>
>
> --
> Best regards,
> Jakub Racek
> FMK
>

[-- Attachment #2: Type: text/html, Size: 2424 bytes --]

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
  2018-06-06 12:27 [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks Jakub Racek
  2018-06-06 12:34 ` Rafael J. Wysocki
  2018-06-07 11:07 ` [4.17 regression] " Michal Hocko
@ 2018-06-07 12:39 ` Mel Gorman
       [not found]   ` <CAE4VaGBAZ0HCy-M2rC3ce9ePOBhE6H-LDVBuJDJMNFf40j70Aw@mail.gmail.com>
  2 siblings, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2018-06-07 12:39 UTC (permalink / raw)
  To: Jakub Racek; +Cc: linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi

On Wed, Jun 06, 2018 at 02:27:32PM +0200, Jakub Racek wrote:
> There is a huge performance regression on the 2 and 4 NUMA node systems on
> stream benchmark with 4.17 kernel compared to 4.16 kernel. Stream, Linpack
> and NAS parallel benchmarks show upto 50% performance drop.
> 

I have not observed this yet but NAS is the only one I'll see and that could
be a week or more away before I have data. I'll keep an eye out at least.

> When running for example 20 stream processes in parallel, we see the following behavior:
> 
> * all processes are started at NODE #1
> * memory is also allocated on NODE #1
> * roughly half of the processes are moved to the NODE #0 very quickly. *
> however, memory is not moved to NODE #0 and stays allocated on NODE #1
> 

Ok, 20 processes getting rescheduled to another node is not unreasonable
from a load-balancing perspective but memory locality is not always taken
into account. You also don't state what parallelisation method you used
for STREAM and it's relevant because of how tasks end up communicating
and what that means for placement.

The only automatic NUMA balancing patch I can think of that has a high
chance of being a factor is 7347fc87dfe6b7315e74310ee1243dc222c68086
but I cannot see how STREAM would be affected as I severely doubt
the processes are communicating heavily (unless openmp and then it's
a maybe). It might affect NAS because that does a lot of wakeups
via futex that has "interesting" characteristics (either openmp or
openmpi). 082f764a2f3f2968afa1a0b04a1ccb1b70633844 might also be a factor
but it's doubtful. I don't know about Linpack as I've never characterised
it so I don't know how it behaves.

There are a few patches that affect utilisation calculation which might
affect the load balancer but I can't pinpoint a single likely candidate.

Given that STREAM is usually short-lived, is bisection an option?

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <CAE4VaGBAZ0HCy-M2rC3ce9ePOBhE6H-LDVBuJDJMNFf40j70Aw@mail.gmail.com>]

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]   ` <CAE4VaGBAZ0HCy-M2rC3ce9ePOBhE6H-LDVBuJDJMNFf40j70Aw@mail.gmail.com>
@ 2018-06-08  7:40     ` Mel Gorman
       [not found]       ` <CAE4VaGAgC7vDwaa-9AzJYst9hdQ5KbnrBUnk_mfp=NeTEe5dAQ@mail.gmail.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2018-06-08  7:40 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Jakub Racek, linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi

On Fri, Jun 08, 2018 at 07:49:37AM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> we will do the bisection today and report the results back.
> 

The most likely outcome is 2c83362734dad8e48ccc0710b5cd2436a0323893
which is a patch that restricts newly forked processes from selecting a
remote node when the local node is similarly loaded. The upside is that
an almost idle node will not queue that task on a remote node. The
downside is that there are cases that the newly forked task allocates a
lot of memory and then the idle balancer spreads it anyway. It'll be a
classic case of "win some, lose some".

That would match this pattern

> > > * all processes are started at NODE #1

So at fork time, the local node is almost idle and is used

> > > * memory is also allocated on NODE #1

Early in the lifetime of the task

> > > * roughly half of the processes are moved to the NODE #0 very quickly. *

Idle balancer kicks in

> > > however, memory is not moved to NODE #0 and stays allocated on NODE #1
> > >

automatic NUMA balancing doesn't run long enough to migrate all the
memory. That would definitely be the case for STREAM. It's less clear
for NAS where, depending on the parallelisation, wake_affine can keep a
task away from its memory or it's cross-node migrating a lot. As before,
I've no idea about linpack.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <CAE4VaGAgC7vDwaa-9AzJYst9hdQ5KbnrBUnk_mfp=NeTEe5dAQ@mail.gmail.com>]

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]       ` <CAE4VaGAgC7vDwaa-9AzJYst9hdQ5KbnrBUnk_mfp=NeTEe5dAQ@mail.gmail.com>
@ 2018-06-08  9:24         ` Mel Gorman
       [not found]           ` <CAE4VaGATk3_Hr_2Wh44BZvXDc06A=rxUZXRFj+D=Xwh2x1YOyg@mail.gmail.com>
  0 siblings, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2018-06-08  9:24 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Jakub Racek, linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi

On Fri, Jun 08, 2018 at 10:49:03AM +0200, Jirka Hladky wrote:
> Hi Mel,
> 
> automatic NUMA balancing doesn't run long enough to migrate all the
> > memory. That would definitely be the case for STREAM.
> 
> This could explain the behavior we observe. stream is running ~20 seconds
> at the moment. I can easily change the runtime by changing the number of
> iterations. What is the time period when you expect the memory to be fully
> migrated?
> 

Unknown and unknowable. It depends entirely on the reference pattern of
the different threads. If they are fully parallelised with private buffers
that are page-aligned then I expect it to be quick (to pass the 2-reference
filter). If threads are sharing data on a 4K (base page case) or 2M boundary
(THP enabled) then it may take longer as two or more threads will disagree
on what the appropriate placement for a page is.

> I have now checked numastat logs and after 15 seconds I see roughly 80MiB
> out of 200MiB of the allocated memory migrated for each of 10 processes
> which have changed the NUMA CPU node after started. This is on 2
> socket Gold 6126 CPU @ 2.60GHz server with DDR4 2666 MHz. That's 800 MiB of
> memory migrated in 15 seconds which is results in the average migration
> rate of 50MiB/s - is this an expected value?
> 

I expect that to be far short of the capabilities of the machine.
Again, migrations can be delayed indefinitely if threads have buffers
that are not page-aligned (4K or 2M depending).

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

[parent not found: <CAE4VaGATk3_Hr_2Wh44BZvXDc06A=rxUZXRFj+D=Xwh2x1YOyg@mail.gmail.com>]

* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
       [not found]           ` <CAE4VaGATk3_Hr_2Wh44BZvXDc06A=rxUZXRFj+D=Xwh2x1YOyg@mail.gmail.com>
@ 2018-06-08 11:15             ` Mel Gorman
  0 siblings, 0 replies; 12+ messages in thread
From: Mel Gorman @ 2018-06-08 11:15 UTC (permalink / raw)
  To: Jirka Hladky
  Cc: Jakub Racek, linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi

On Fri, Jun 08, 2018 at 01:02:54PM +0200, Jirka Hladky wrote:
> >
> > Unknown and unknowable. It depends entirely on the reference pattern of
> > the different threads. If they are fully parallelised with private buffers
> > that are page-aligned then I expect it to be quick (to pass the 2-reference
> > filter).
> 
> 
> I'm running 20 parallel processes. There is no connection between them. If
> I read it correctly the migration should happen fast in this case, right?
> 
> I have checked the source code and variables are global and static (and
> thus allocated in the data segment). They are NOT 4k aligned:
> 
> variable a is at address: 0x9e999e0
> variable b is at address: 0x524e5e0
> variable c is at address: 0x6031e0
> 
> static double a[N],
> b[N],
> c[N];
> 

If these are 20 completely indepent processes (and not sharing data via
MPI if you're using that version of STREAM) then the migration should be
relatively quick. Migrations should start within 3 seconds of the process
starting. How long it takes depends on the size of the STREAM processes
as it's only scanned in chunks and migrations won't start until there
are two full passes of the address space. You can partially monitor the
progress using /proc/pid/numa_maps. More detailed monitoring needs ftrace
for some activity and the use of probes on specific functions to get
detailed information.

It may also be worth examining /proc/pid/sched and seeing if a task
sets numa_preferred_nid to node 0 and keeps it there even after
migrating to node 1 but that's doubtful.

-- 
Mel Gorman
SUSE Labs

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2018-06-08 11:15 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-06 12:27 [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks Jakub Racek
2018-06-06 12:34 ` Rafael J. Wysocki
2018-06-06 12:44   ` Rafael J. Wysocki
2018-06-06 12:50   ` Jakub Racek
2018-06-06 12:56     ` Rafael J. Wysocki
2018-06-07 11:07 ` [4.17 regression] " Michal Hocko
2018-06-07 11:19   ` Jakub Raček
2018-06-07 11:56     ` Jirka Hladky
2018-06-07 12:39 ` Mel Gorman
     [not found]   ` <CAE4VaGBAZ0HCy-M2rC3ce9ePOBhE6H-LDVBuJDJMNFf40j70Aw@mail.gmail.com>
2018-06-08  7:40     ` Mel Gorman
     [not found]       ` <CAE4VaGAgC7vDwaa-9AzJYst9hdQ5KbnrBUnk_mfp=NeTEe5dAQ@mail.gmail.com>
2018-06-08  9:24         ` Mel Gorman
     [not found]           ` <CAE4VaGATk3_Hr_2Wh44BZvXDc06A=rxUZXRFj+D=Xwh2x1YOyg@mail.gmail.com>
2018-06-08 11:15             ` Mel Gorman

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.