From: Jirka Hladky
Date: Thu, 7 Jun 2018 13:56:14 +0200
Subject: Re: [4.17 regression] Performance drop on kernel-4.17 visible on Stream, Linpack and NAS parallel benchmarks
To: Jakub Raček
Cc: Michal Hocko, linux-kernel, "Rafael J. Wysocki", Len Brown, linux-acpi@vger.kernel.org, Mel Gorman, linux-mm@kvack.org, jhladky@redhat.com
Adding myself to Cc.

On Thu, Jun 7, 2018 at 1:19 PM, Jakub Raček <jracek@redhat.com> wrote:
Hi,

On 06/07/2018 01:07 PM, Michal Hocko wrote:
[CCing Mel and MM mailing list]

On Wed 06-06-18 14:27:32, Jakub Racek wrote:
Hi,

There is a huge performance regression on 2- and 4-NUMA-node systems with the stream benchmark on the 4.17 kernel compared to the 4.16 kernel. Stream, Linpack and NAS parallel benchmarks show up to a 50% performance drop.

When running, for example, 20 stream processes in parallel, we see the following behavior:

* all processes are started at NODE #1
* memory is also allocated on NODE #1
* roughly half of the processes are moved to NODE #0 very quickly
* however, memory is not moved to NODE #0 and stays allocated on NODE #1

As a result, half of the processes are running on NODE #0 with memory still allocated on NODE #1. This leads to non-local memory accesses and a high Remote-To-Local Memory Access Ratio on the numatop charts.

So it seems that 4.17 is not doing a good job of moving the memory to the right NUMA node after the process has been moved.
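
The per-node placement can also be checked directly from /proc/<pid>/numa_maps, without numatop. Below is a minimal Python sketch of that check (an illustration only; it assumes a NUMA-enabled Linux kernel and permission to read the target process's proc files). It sums the "N<node>=<pages>" tokens per node, so a stream process that has been moved to NODE #0 but whose pages still show up under N1 exhibits exactly the behavior described above.

#!/usr/bin/env python3
# Minimal sketch: sum the per-node page counts that the kernel reports in
# /proc/<pid>/numa_maps.  Assumes a NUMA-enabled Linux kernel and permission
# to read the target process's proc files.
import re
import sys
from collections import defaultdict

def pages_per_node(pid):
    counts = defaultdict(int)
    with open(f"/proc/{pid}/numa_maps") as f:
        for line in f:
            # Each mapping line carries tokens such as "N0=2048" (pages on node 0).
            for node, pages in re.findall(r"\bN(\d+)=(\d+)", line):
                counts[int(node)] += int(pages)
    return counts

if __name__ == "__main__":
    pid = sys.argv[1] if len(sys.argv) > 1 else "self"
    for node, pages in sorted(pages_per_node(pid).items()):
        print(f"node {node}: {pages} pages")

Running it with a PID of one of the stream processes as the argument prints the page count per node for that process (the script and PID are placeholders, not part of the benchmark setup).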

----8<----

The above is an excerpt from performance testing on 4.16 and 4.17 kernels.
For now I'm merely making sure the problem is reported.

Do you have numa balancing enabled?


Yes. The relevant settings are:

kernel.numa_balancing = 1
kernel.numa_balancing_scan_delay_ms = 1000
kernel.numa_balancing_scan_period_max_ms = 60000
kernel.numa_balancing_scan_period_min_ms = 1000
kernel.numa_balancing_scan_size_mb = 256
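
For completeness, the same values can be read back straight from /proc/sys/kernel; a short Python sketch of that check is below (it assumes the knobs are exposed there, which is the case on 4.16/4.17 kernels built with CONFIG_NUMA_BALANCING; a missing knob is reported rather than treated as an error).

#!/usr/bin/env python3
# Minimal sketch: read the NUMA-balancing sysctls straight from /proc/sys/kernel.
# Assumes the knobs are exposed there (true for 4.16/4.17 with
# CONFIG_NUMA_BALANCING); a missing knob is reported instead of failing.
from pathlib import Path

KNOBS = [
    "numa_balancing",
    "numa_balancing_scan_delay_ms",
    "numa_balancing_scan_period_max_ms",
    "numa_balancing_scan_period_min_ms",
    "numa_balancing_scan_size_mb",
]

for knob in KNOBS:
    path = Path("/proc/sys/kernel") / knob
    value = path.read_text().strip() if path.exists() else "<not present>"
    print(f"kernel.{knob} = {value}")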


--
Best regards,
Jakub Racek
FMK
