Re: [mm/page_alloc] 7fef431be9: vm-scalability.throughput 87.8% improvement

From: David Hildenbrand <david@redhat.com>
To: Axel Rasmussen <axelrasmussen@google.com>
Cc: David Hildenbrand <david@redhat.com>,
	David Rientjes <rientjes@google.com>,
	kernel test robot <rong.a.chen@intel.com>,
	Kevin Ko <kevko@google.com>,
	Linus Torvalds <torvalds@linux-foundation.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	Vlastimil Babka <vbabka@suse.cz>,
	Oscar Salvador <osalvador@suse.de>,
	Wei Yang <richard.weiyang@linux.alibaba.com>,
	Pankaj Gupta <pankaj.gupta.linux@gmail.com>,
	Michal Hocko <mhocko@suse.com>,
	Alexander Duyck <alexander.h.duyck@linux.intel.com>,
	Mel Gorman <mgorman@techsingularity.net>,
	Dave Hansen <dave.hansen@intel.com>,
	Mike Rapoport <rppt@kernel.org>,
	"K. Y. Srinivasan" <kys@microsoft.com>,
	Haiyang Zhang <haiyangz@microsoft.com>,
	Stephen Hemminger <sthemmin@microsoft.com>,
	Wei Liu <wei.liu@kernel.org>,
	Matthew Wilcox <willy@infradead.org>,
	Michael Ellerman <mpe@ellerman.id.au>,
	Michal Hocko <mhocko@kernel.org>,
	Scott Cheloha <cheloha@linux.ibm.com>,
	LKML <linux-kernel@vger.kernel.org>,
	lkp@lists.01.org, lkp@intel.com, ying.huang@intel.com,
	feng.tang@intel.com, zhengjun.xing@intel.com
Subject: Re: [mm/page_alloc] 7fef431be9: vm-scalability.throughput 87.8% improvement
Date: Mon, 26 Oct 2020 20:09:43 +0100
Message-ID: <494C73E0-452D-4503-8ED6-DAE11A8471E5@redhat.com>
In-Reply-To: <CAJHvVcj1NToZO9ZoyWZKWzCe2jMOrLjLAxESiD84Q_V+8er9Ag@mail.gmail.com>

> On 26.10.2020 at 19:11, Axel Rasmussen <axelrasmussen@google.com> wrote:
> 
> On Mon, Oct 26, 2020 at 1:31 AM David Hildenbrand <david@redhat.com> wrote:
>> 
>>> On 23.10.20 21:44, Axel Rasmussen wrote:
>>> On Fri, Oct 23, 2020 at 12:29 PM David Rientjes <rientjes@google.com> wrote:
>>>> 
>>>> On Wed, 21 Oct 2020, kernel test robot wrote:
>>>> 
>>>>> Greetings,
>>>>> 
>>>>> FYI, we noticed an 87.8% improvement of vm-scalability.throughput due to commit:
>>>>> 
>>>>> 
>>>>> commit: 7fef431be9c9ac255838a9578331567b9dba4477 ("mm/page_alloc: place pages to tail in __free_pages_core()")
>>>>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git master
>>>>> 
>>>>> 
>>>>> in testcase: vm-scalability
>>>>> on test machine: 192 threads Intel(R) Xeon(R) Platinum 9242 CPU @ 2.30GHz with 192G memory
>>>>> with following parameters:
>>>>> 
>>>>>      runtime: 300s
>>>>>      size: 512G
>>>>>      test: anon-wx-rand-mt
>>>>>      cpufreq_governor: performance
>>>>>      ucode: 0x5002f01
>>>>> 
>>>>> test-description: The motivation behind this suite is to exercise functions and regions of the mm/ of the Linux kernel which are of interest to us.
>>>>> test-url: https://git.kernel.org/cgit/linux/kernel/git/wfg/vm-scalability.git/
>>>>> 
>>>> 
>>>> I'm curious why we are not able to reproduce this improvement on Skylake
>>>> and actually see a slight performance degradation, at least for
>>>> 300s_128G_truncate_throughput.
>>>> 
>>>> Axel Rasmussen <axelrasmussen@google.com> can provide more details on our
>>>> results.
>>> 
>>> Right, our results show a slight regression on a Skylake machine [1],
>>> and a slight performance increase on a Rome machine [2]. For these
>>> tests, I used Linus' v5.9 tag as a baseline, and then applied this
>>> patchset onto that tag as a test kernel (the patches applied cleanly
>>> apart from one comment; I didn't have to do any code fixups). This is
>>> running the same anon-wx-rand-mt test defined in the upstream
>>> lkp-tests job file:
>>> https://github.com/intel/lkp-tests/blob/master/jobs/vm-scalability.yaml
>> 
>> Hi,
>> 
>> looking at the yaml, am I right that each test is run after a fresh boot?
> 
> Yes-ish. For the results I posted, the larger context would have been
> something like:
> 
> - Kernel installed, machine freshly rebooted.
> - Various machine management daemons start by default, some are
> stopped so as not to interfere with the test.
> - Some packages are installed on the machine (the thing which
> orchestrates the testing in particular).
> - The test is run.
> 
> So, the machine is somewhat fresh in the sense that it hasn't been
> e.g. serving production traffic just before running the test, but it's
> also not as clean as it could be. It seems plausible this difference
> explains the difference in the results (I'm not too familiar with how
> the Intel kernel test robot is implemented).

Ah, okay. So most memory in the system is indeed untouched. Thanks!

> 
>> 
>> As I already replied to David, this patch merely changes the initial
>> order of the freelists. The general end result is that, within a zone,
>> lower memory addresses are allocated before higher memory addresses the
>> first time memory is allocated. Before, it was the other way around.
>> Once a system has run for some time, the freelists are effectively
>> randomized.
>> 
>> There might be benchmarks/systems where this initial system state might
>> now be better suited - or worse. It doesn't really tell you that core-mm
>> is behaving better/worse now - it merely means that the initial system
>> state under which the benchmark was started affected the benchmark.
>> 
>> Looks like so far there is one benchmark+system where it's really
>> beneficial, there is one benchmark+system where it's slightly
>> beneficial, and one benchmark+system where there is a slight regression.
>> 
>> 
>> Something like the following would revert to the previous behavior:
>> 
>> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
>> index 23f5066bd4a5..fac82420cc3d 100644
>> --- a/mm/page_alloc.c
>> +++ b/mm/page_alloc.c
>> @@ -1553,7 +1553,9 @@ void __free_pages_core(struct page *page, unsigned int order)
>>         * Bypass PCP and place fresh pages right to the tail, primarily
>>         * relevant for memory onlining.
>>         */
>> -       __free_pages_ok(page, order, FPI_TO_TAIL);
>> +       __free_pages_ok(page, order,
>> +                       system_state < SYSTEM_RUNNING ? FPI_NONE :
>> +                                                       FPI_TO_TAIL);
>> }
>> 
>> #ifdef CONFIG_NEED_MULTIPLE_NODES
>> 
>> 
>> (Or better, passing the expected behavior via MEMINIT_EARLY/... to
>> __free_pages_core().)
>> 
>> 
>> But then, I am not convinced we should perform that change: having a
>> clean (initial) state might be true for these benchmarks, but it's far
>> from reality. The change in numbers doesn't show you that core-mm is
>> operating better/worse, just that the baseline for your tests changed due
>> to a changed initial system state.
> 
> Not to put words in David's mouth :) but at least from my perspective,
> our original interest was "wow, an 87% improvement! maybe we should
> deploy this patch to production!", and I'm mostly sharing my results
> just to say "it actually doesn't seem to be a huge *general*
> improvement", rather than to advocate for further changes / fixes.

Ah, yes, I saw the +87% and thought "that can't be right".

> IIUC the original motivation for this patch was to fix somewhat of an
> edge case, not to make a very general improvement, so this seems fine.
> 

Exactly.
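
To illustrate the freelist-ordering effect discussed in the quoted text,
here is a minimal user-space sketch. It is a toy model, not kernel code:
the names toy_freelist, toy_free_page(), toy_alloc_page() and
toy_first_alloc() are made up for this example, and it assumes a single
free list, no buddy merging, and pages being freed in ascending address
order during init.

```c
#include <assert.h>
#include <stddef.h>

#define NPAGES 8

/* Toy free list: a fixed-size ring buffer of page frame numbers (PFNs). */
struct toy_freelist {
	unsigned long pfn[NPAGES];
	size_t head;	/* index of the entry that is allocated next */
	size_t count;	/* number of entries currently on the list */
};

/* Hypothetical stand-ins for "place to head" vs. "place to tail". */
enum toy_placement { TOY_TO_HEAD, TOY_TO_TAIL };

/* Put a page on the list, either in front of or behind all other pages. */
void toy_free_page(struct toy_freelist *fl, unsigned long pfn,
		   enum toy_placement where)
{
	assert(fl->count < NPAGES);
	if (where == TOY_TO_HEAD) {
		fl->head = (fl->head + NPAGES - 1) % NPAGES;
		fl->pfn[fl->head] = pfn;
	} else {
		fl->pfn[(fl->head + fl->count) % NPAGES] = pfn;
	}
	fl->count++;
}

/* Allocation always takes the page at the head of the list. */
unsigned long toy_alloc_page(struct toy_freelist *fl)
{
	unsigned long pfn;

	assert(fl->count > 0);
	pfn = fl->pfn[fl->head];
	fl->head = (fl->head + 1) % NPAGES;
	fl->count--;
	return pfn;
}

/*
 * Free PFNs 0..NPAGES-1 in ascending order (the assumed init order) and
 * return the PFN handed out by the very first allocation.
 */
unsigned long toy_first_alloc(enum toy_placement where)
{
	struct toy_freelist fl = { .head = 0, .count = 0 };
	unsigned long pfn;

	for (pfn = 0; pfn < NPAGES; pfn++)
		toy_free_page(&fl, pfn, where);
	return toy_alloc_page(&fl);
}
```

In this model, toy_first_alloc(TOY_TO_HEAD) returns the highest PFN (7)
and toy_first_alloc(TOY_TO_TAIL) returns the lowest (0). The allocator
logic is identical in both cases; only the initial list order differs,
which is why a benchmark started right after boot can see a different
baseline.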