* [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
@ 2018-06-06 12:27 Jakub Racek
2018-06-06 12:34 ` Rafael J. Wysocki
` (2 more replies)
0 siblings, 3 replies; 12+ messages in thread
From: Jakub Racek @ 2018-06-06 12:27 UTC (permalink / raw)
To: linux-kernel; +Cc: Rafael J. Wysocki, Len Brown, linux-acpi, jracek
Hi,
There is a huge performance regression on 2- and 4-NUMA-node systems in the
Stream benchmark with the 4.17 kernel compared to the 4.16 kernel. The Stream,
Linpack and NAS parallel benchmarks show up to a 50% performance drop.
When running, for example, 20 Stream processes in parallel, we see the
following behavior:
* all processes are started on NODE #1
* memory is also allocated on NODE #1
* roughly half of the processes are moved to NODE #0 very quickly
* however, memory is not moved to NODE #0 and stays allocated on NODE #1
As a result, half of the processes are running on NODE #0 while their memory
stays allocated on NODE #1. This leads to non-local memory accesses, visible
as a high Remote-To-Local Memory Access Ratio in the numatop charts.
So it seems that 4.17 is not doing a good job of moving memory to the right
NUMA node after a process has been moved.
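To confirm that memory really stays on NODE #1, the per-node residency of each
Stream process can be read from /proc/<pid>/numa_maps. Below is a minimal
sketch of such a check; the sample mapping lines are made up, but follow the
numa_maps format (N<node>=<pages> tokens per mapping):

```python
import re
from collections import Counter

def node_pages(numa_maps_text):
    """Tally pages per NUMA node from the contents of /proc/<pid>/numa_maps.

    Each mapping line may carry N<node>=<pages> tokens; summing them shows
    on which nodes a process's memory actually resides.
    """
    totals = Counter()
    for line in numa_maps_text.splitlines():
        for node, pages in re.findall(r"\bN(\d+)=(\d+)", line):
            totals[int(node)] += int(pages)
    return dict(totals)

# Made-up sample in the numa_maps format (not real output):
sample = (
    "7f5e8c000000 default anon=4096 dirty=4096 N1=4096 kernelpagesize_kB=4\n"
    "7f5e8d000000 default file=/usr/lib64/libc.so.6 mapped=256 N0=200 N1=56\n"
)
print(node_pages(sample))  # {1: 4152, 0: 200}
```

Running this over the numa_maps of each Stream process after it has been
migrated would show whether the page counts remain concentrated on one node.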
----8<----
The above is an excerpt from performance testing on 4.16 and 4.17 kernels.
For now I'm merely making sure the problem is reported.
Thank you.
Best regards,
Jakub Racek
^ permalink raw reply [flat|nested] 12+ messages in thread
* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
2018-06-06 12:27 [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks Jakub Racek
@ 2018-06-06 12:34 ` Rafael J. Wysocki
2018-06-06 12:44 ` Rafael J. Wysocki
2018-06-06 12:50 ` Jakub Racek
2018-06-07 11:07 ` [4.17 regression] " Michal Hocko
2018-06-07 12:39 ` Mel Gorman
2 siblings, 2 replies; 12+ messages in thread
From: Rafael J. Wysocki @ 2018-06-06 12:34 UTC (permalink / raw)
To: Jakub Racek
Cc: Linux Kernel Mailing List, Rafael J. Wysocki, Len Brown,
ACPI Devel Mailing List
On Wed, Jun 6, 2018 at 2:27 PM, Jakub Racek <jracek@redhat.com> wrote:
> Hi,
>
> There is a huge performance regression on 2- and 4-NUMA-node systems in
> the Stream benchmark with the 4.17 kernel compared to the 4.16 kernel.
> The Stream, Linpack and NAS parallel benchmarks show up to a 50%
> performance drop.
>
> When running for example 20 stream processes in parallel, we see the
> following behavior:
>
> * all processes are started at NODE #1
> * memory is also allocated on NODE #1
> * roughly half of the processes are moved to NODE #0 very quickly
> * however, memory is not moved to NODE #0 and stays allocated on NODE #1
>
> As a result, half of the processes are running on NODE #0 while their
> memory stays allocated on NODE #1. This leads to non-local memory accesses,
> visible as a high Remote-To-Local Memory Access Ratio in the numatop charts.
> So it seems that 4.17 is not doing a good job of moving memory to the
> right NUMA node after a process has been moved.
>
> ----8<----
>
> The above is an excerpt from performance testing on 4.16 and 4.17 kernels.
>
> For now I'm merely making sure the problem is reported.
OK, and why do you think that it is related to ACPI?
Thanks,
Rafael
* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
2018-06-06 12:34 ` Rafael J. Wysocki
@ 2018-06-06 12:44 ` Rafael J. Wysocki
2018-06-06 12:50 ` Jakub Racek
1 sibling, 0 replies; 12+ messages in thread
From: Rafael J. Wysocki @ 2018-06-06 12:44 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Jakub Racek, Linux Kernel Mailing List, Rafael J. Wysocki,
Len Brown, ACPI Devel Mailing List, Peter Zijlstra
On Wed, Jun 6, 2018 at 2:34 PM, Rafael J. Wysocki <rafael@kernel.org> wrote:
> On Wed, Jun 6, 2018 at 2:27 PM, Jakub Racek <jracek@redhat.com> wrote:
>> There is a huge performance regression on 2- and 4-NUMA-node systems in
>> the Stream benchmark with the 4.17 kernel compared to the 4.16 kernel.
>> The Stream, Linpack and NAS parallel benchmarks show up to a 50%
>> performance drop.
>> [...]
>
> OK, and why do you think that it is related to ACPI?
In any case, we need more information here.
Thanks,
Rafael
* Re: Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
2018-06-06 12:34 ` Rafael J. Wysocki
2018-06-06 12:44 ` Rafael J. Wysocki
@ 2018-06-06 12:50 ` Jakub Racek
2018-06-06 12:56 ` Rafael J. Wysocki
1 sibling, 1 reply; 12+ messages in thread
From: Jakub Racek @ 2018-06-06 12:50 UTC (permalink / raw)
To: Rafael J. Wysocki
Cc: Linux Kernel Mailing List, Rafael J. Wysocki, Len Brown,
ACPI Devel Mailing List
+++ Rafael J. Wysocki [06/06/18 14:34 +0200]:
>On Wed, Jun 6, 2018 at 2:27 PM, Jakub Racek <jracek@redhat.com> wrote:
>> Hi,
>>
>> There is a huge performance regression on 2- and 4-NUMA-node systems in
>> the Stream benchmark with the 4.17 kernel compared to the 4.16 kernel.
>> The Stream, Linpack and NAS parallel benchmarks show up to a 50%
>> performance drop.
>
>OK, and why do you think that it is related to ACPI?
I don't know where the problem is or who to Cc.
What information should I add? I can probably provide it once I know
what is needed.
>Thanks,
>Rafael
* Re: Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
2018-06-06 12:50 ` Jakub Racek
@ 2018-06-06 12:56 ` Rafael J. Wysocki
0 siblings, 0 replies; 12+ messages in thread
From: Rafael J. Wysocki @ 2018-06-06 12:56 UTC (permalink / raw)
To: Jakub Racek
Cc: Rafael J. Wysocki, Linux Kernel Mailing List, Rafael J. Wysocki,
Len Brown, ACPI Devel Mailing List
On Wed, Jun 6, 2018 at 2:50 PM, Jakub Racek <jracek@redhat.com> wrote:
> +++ Rafael J. Wysocki [06/06/18 14:34 +0200]:
>>
>> On Wed, Jun 6, 2018 at 2:27 PM, Jakub Racek <jracek@redhat.com> wrote:
>>>
>>> Hi,
>>>
>>> There is a huge performance regression on 2- and 4-NUMA-node systems
>>> in the Stream benchmark with the 4.17 kernel compared to the 4.16
>>> kernel. The Stream, Linpack and NAS parallel benchmarks show up to a
>>> 50% performance drop.
>>
>>
>> OK, and why do you think that it is related to ACPI?
>
>
> I don't know where the problems is or who to CC.
> What information should be added? I can probably provide it, if I know
> what.
The problem appears to be reproducible 100% of the time, so can you
possibly carry out a "git bisect" binary search to find the
problematic commit?
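For what it's worth, such a bisect between v4.16 and v4.17 might look like
the sketch below; the build/boot/benchmark step is left as a placeholder,
since it depends entirely on the reporter's test setup:

```shell
# Sketch only: run inside a Linux kernel git tree. The benchmark step is
# whatever reproduces the regression (e.g. the parallel Stream runs).
git bisect start
git bisect bad v4.17     # regression observed on this release
git bisect good v4.16    # known-good baseline
# For each revision git checks out: build, boot, run the benchmark,
# then mark the result and repeat until the first bad commit is named:
#   git bisect good    (performance is fine)
#   git bisect bad     (performance drop reproduced)
git bisect reset         # restore the tree when finished
```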
Thanks,
Rafael
* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
2018-06-06 12:27 [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks Jakub Racek
2018-06-06 12:34 ` Rafael J. Wysocki
@ 2018-06-07 11:07 ` Michal Hocko
2018-06-07 11:19 ` Jakub Raček
2018-06-07 12:39 ` Mel Gorman
2 siblings, 1 reply; 12+ messages in thread
From: Michal Hocko @ 2018-06-07 11:07 UTC (permalink / raw)
To: Jakub Racek
Cc: linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi,
Mel Gorman, linux-mm
[CCing Mel and MM mailing list]
On Wed 06-06-18 14:27:32, Jakub Racek wrote:
> Hi,
>
> There is a huge performance regression on 2- and 4-NUMA-node systems in
> the Stream benchmark with the 4.17 kernel compared to the 4.16 kernel.
> The Stream, Linpack and NAS parallel benchmarks show up to a 50%
> performance drop.
>
> When running, for example, 20 Stream processes in parallel, we see the following behavior:
>
> * all processes are started on NODE #1
> * memory is also allocated on NODE #1
> * roughly half of the processes are moved to NODE #0 very quickly
> * however, memory is not moved to NODE #0 and stays allocated on NODE #1
>
> As a result, half of the processes are running on NODE #0 while their
> memory stays allocated on NODE #1. This leads to non-local memory accesses,
> visible as a high Remote-To-Local Memory Access Ratio in the numatop charts.
>
> So it seems that 4.17 is not doing a good job of moving memory to the
> right NUMA node after a process has been moved.
>
> ----8<----
>
> The above is an excerpt from performance testing on 4.16 and 4.17 kernels.
>
> For now I'm merely making sure the problem is reported.
Do you have numa balancing enabled?
--
Michal Hocko
SUSE Labs
* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
2018-06-07 11:07 ` [4.17 regression] " Michal Hocko
@ 2018-06-07 11:19 ` Jakub Raček
2018-06-07 11:56 ` Jirka Hladky
0 siblings, 1 reply; 12+ messages in thread
From: Jakub Raček @ 2018-06-07 11:19 UTC (permalink / raw)
To: Michal Hocko
Cc: linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi,
Mel Gorman, linux-mm
Hi,
On 06/07/2018 01:07 PM, Michal Hocko wrote:
> [CCing Mel and MM mailing list]
>
> On Wed 06-06-18 14:27:32, Jakub Racek wrote:
>> Hi,
>>
>> There is a huge performance regression on 2- and 4-NUMA-node systems in
>> the Stream benchmark with the 4.17 kernel compared to the 4.16 kernel.
>> The Stream, Linpack and NAS parallel benchmarks show up to a 50%
>> performance drop.
>>
>> When running, for example, 20 Stream processes in parallel, we see the following behavior:
>>
>> * all processes are started on NODE #1
>> * memory is also allocated on NODE #1
>> * roughly half of the processes are moved to NODE #0 very quickly
>> * however, memory is not moved to NODE #0 and stays allocated on NODE #1
>>
>> As a result, half of the processes are running on NODE #0 while their
>> memory stays allocated on NODE #1. This leads to non-local memory accesses,
>> visible as a high Remote-To-Local Memory Access Ratio in the numatop charts.
>>
>> So it seems that 4.17 is not doing a good job of moving memory to the
>> right NUMA node after a process has been moved.
>>
>> ----8<----
>>
>> The above is an excerpt from performance testing on 4.16 and 4.17 kernels.
>>
>> For now I'm merely making sure the problem is reported.
>
> Do you have numa balancing enabled?
>
Yes. The relevant settings are:
kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256
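For what it's worth, these knobs can be flipped at run time, which gives a
quick way to check whether the regression tracks automatic NUMA balancing at
all. A sketch (needs root; values match the settings listed above):

```shell
# Show the current automatic NUMA balancing settings
grep . /proc/sys/kernel/numa_balancing*

# Temporarily disable balancing, re-run the benchmark, then restore it.
# If the drop disappears with balancing off, it points at NUMA balancing
# rather than, say, the scheduler's load balancer.
sysctl -w kernel.numa_balancing=0
# ... run the Stream processes here ...
sysctl -w kernel.numa_balancing=1
```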
--
Best regards,
Jakub Racek
FMK
* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
2018-06-07 11:19 ` Jakub Raček
@ 2018-06-07 11:56 ` Jirka Hladky
0 siblings, 0 replies; 12+ messages in thread
From: Jirka Hladky @ 2018-06-07 11:56 UTC (permalink / raw)
To: Jakub Raček
Cc: Michal Hocko, linux-kernel, Rafael J. Wysocki, Len Brown,
linux-acpi, Mel Gorman, linux-mm, jhladky
Adding myself to Cc.
On Thu, Jun 7, 2018 at 1:19 PM, Jakub Raček <jracek@redhat.com> wrote:
> Hi,
>
> On 06/07/2018 01:07 PM, Michal Hocko wrote:
>
>> [...]
>>
>> Do you have numa balancing enabled?
> Yes. The relevant settings are:
>
> kernel.numa_balancing = 1
> kernel.numa_balancing_scan_delay_ms = 1000
> kernel.numa_balancing_scan_period_max_ms = 60000
> kernel.numa_balancing_scan_period_min_ms = 1000
> kernel.numa_balancing_scan_size_mb = 256
>
>
> --
> Best regards,
> Jakub Racek
> FMK
>
* Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
2018-06-06 12:27 [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks Jakub Racek
2018-06-06 12:34 ` Rafael J. Wysocki
2018-06-07 11:07 ` [4.17 regression] " Michal Hocko
@ 2018-06-07 12:39 ` Mel Gorman
[not found] ` <CAE4VaGBAZ0HCy-M2rC3ce9ePOBhE6H-LDVBuJDJMNFf40j70Aw@mail.gmail.com>
2 siblings, 1 reply; 12+ messages in thread
From: Mel Gorman @ 2018-06-07 12:39 UTC (permalink / raw)
To: Jakub Racek; +Cc: linux-kernel, Rafael J. Wysocki, Len Brown, linux-acpi
On Wed, Jun 06, 2018 at 02:27:32PM +0200, Jakub Racek wrote:
> There is a huge performance regression on 2- and 4-NUMA-node systems in
> the Stream benchmark with the 4.17 kernel compared to the 4.16 kernel.
> The Stream, Linpack and NAS parallel benchmarks show up to a 50%
> performance drop.
>
I have not observed this yet, but NAS is the only one of those I'll see,
and it could be a week or more before I have data. I'll keep an eye out at least.
> When running for example 20 stream processes in parallel, we see the following behavior:
>
> * all processes are started at NODE #1
> * memory is also allocated on NODE #1
> * roughly half of the processes are moved to NODE #0 very quickly
> * however, memory is not moved to NODE #0 and stays allocated on NODE #1
>
OK, 20 processes getting rescheduled to another node is not unreasonable
from a load-balancing perspective, but memory locality is not always taken
into account. You also don't state what parallelisation method you used
for STREAM; that's relevant because of how the tasks end up communicating
and what that means for placement.
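To illustrate why the parallelisation method matters, here is a rough sketch
of the two launch styles (the binary name `stream` and the thread counts are
assumptions, not details from the report), plus a way to watch per-process
memory placement with numastat while the benchmark runs:

```shell
# 20 independent single-threaded processes: no shared memory between them,
# so the load balancer may spread them while their pages stay put.
for i in $(seq 20); do OMP_NUM_THREADS=1 ./stream & done
wait

# Versus a single OpenMP instance with 20 communicating threads:
OMP_NUM_THREADS=20 ./stream

# While either variant runs, show each process's per-node memory usage:
for pid in $(pgrep stream); do numastat -p "$pid"; done
```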
The only automatic NUMA balancing patch I can think of with a high
chance of being a factor is 7347fc87dfe6b7315e74310ee1243dc222c68086,
but I cannot see how STREAM would be affected, as I severely doubt
the processes are communicating heavily (unless it uses OpenMP, and then
it's a maybe). It might affect NAS, because that does a lot of wakeups
via futexes with "interesting" characteristics (either OpenMP or
OpenMPI). 082f764a2f3f2968afa1a0b04a1ccb1b70633844 might also be a factor,
but it's doubtful. I don't know about Linpack, as I've never characterised
it, so I don't know how it behaves.
There are a few patches that affect utilisation calculation which might
affect the load balancer but I can't pinpoint a single likely candidate.
Given that STREAM is usually short-lived, is bisection an option?
--
Mel Gorman
SUSE Labs
end of thread, other threads:[~2018-06-08 11:15 UTC | newest]
Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2018-06-06 12:27 [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks Jakub Racek
2018-06-06 12:34 ` Rafael J. Wysocki
2018-06-06 12:44 ` Rafael J. Wysocki
2018-06-06 12:50 ` Jakub Racek
2018-06-06 12:56 ` Rafael J. Wysocki
2018-06-07 11:07 ` [4.17 regression] " Michal Hocko
2018-06-07 11:19 ` Jakub Raček
2018-06-07 11:56 ` Jirka Hladky
2018-06-07 12:39 ` Mel Gorman
[not found] ` <CAE4VaGBAZ0HCy-M2rC3ce9ePOBhE6H-LDVBuJDJMNFf40j70Aw@mail.gmail.com>
2018-06-08 7:40 ` Mel Gorman
[not found] ` <CAE4VaGAgC7vDwaa-9AzJYst9hdQ5KbnrBUnk_mfp=NeTEe5dAQ@mail.gmail.com>
2018-06-08 9:24 ` Mel Gorman
[not found] ` <CAE4VaGATk3_Hr_2Wh44BZvXDc06A=rxUZXRFj+D=Xwh2x1YOyg@mail.gmail.com>
2018-06-08 11:15 ` Mel Gorman