Re: [linus:master] [mm] 0ba09b1733: will-it-scale.per_thread_ops -21.1% regression in mmap1 benchmark

From: "Yin, Fengwei" <fengwei.yin@intel.com>
To: Liam Howlett <liam.howlett@oracle.com>
Cc: Yang Shi <shy828301@gmail.com>, Yujie Liu <yujie.liu@intel.com>,
	"Linus Torvalds" <torvalds@linux-foundation.org>,
	"oe-lkp@lists.linux.dev" <oe-lkp@lists.linux.dev>,
	"lkp@intel.com" <lkp@intel.com>,
	Nathan Chancellor <nathan@kernel.org>,
	"Huang, Ying" <ying.huang@intel.com>,
	Rik van Riel <riel@surriel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"feng.tang@intel.com" <feng.tang@intel.com>,
	"zhengjun.xing@linux.intel.com" <zhengjun.xing@linux.intel.com>
Subject: Re: [linus:master] [mm] 0ba09b1733: will-it-scale.per_thread_ops -21.1% regression in mmap1 benchmark
Date: Fri, 23 Dec 2022 14:27:01 +0800
Message-ID: <99bc3831-8f5f-4a79-b0c3-8a492a7e5867@intel.com>
In-Reply-To: <20221223024321.itxwvcdyckepnyiz@revolver>


On 12/23/2022 10:45 AM, Liam Howlett wrote:
> * Yin, Fengwei <fengwei.yin@intel.com> [221221 20:19]:
>>
>>
>> On 12/22/2022 12:45 AM, Yang Shi wrote:
>>>> We caught two mmap1 regressions on mainline; please see the data below:
>>>>
>>>> 830b3c68c1fb1 Linux 6.1                                                              2085 2355 2088
>>>> 76dcd734eca23 Linux 6.1-rc8                                                          2093 2082 2094 2073 2304 2088
>>>> 0ba09b1733878 Revert "mm: align larger anonymous mappings on THP boundaries"         2124 2286 2086 2114 2065 2081
>>>> 23393c6461422 char: tpm: Protect tpm_pm_suspend with locks                           2756 2711 2689 2696 2660 2665
>>>> b7b275e60bcd5 Linux 6.1-rc7                                                          2670 2656 2720 2691 2667
>>>> ...
>>>> 9abf2313adc1c Linux 6.1-rc1                                                          2725 2717 2690 2691 2710
>>>> 3b0e81a1cdc9a mmap: change zeroing of maple tree in __vma_adjust()                   2736 2781 2748
>>>> 524e00b36e8c5 mm: remove rb tree.                                                    2747 2744 2747
>>>> 0c563f1480435 proc: remove VMA rbtree use from nommu
>>>> d0cf3dd47f0d5 damon: convert __damon_va_three_regions to use the VMA iterator
>>>> 3499a13168da6 mm/mmap: use maple tree for unmapped_area{_topdown}
>>>> 7fdbd37da5c6f mm/mmap: use the maple tree for find_vma_prev() instead of the rbtree
>>>> f39af05949a42 mm: add VMA iterator
>>>> d4af56c5c7c67 mm: start tracking VMAs with maple tree
>>>> e15e06a839232 lib/test_maple_tree: add testing for maple tree                        4638 4628 4502
>>>> 9832fb87834e2 mm/demotion: expose memory tier details via sysfs                      4625 4509 4548
>>>> 4fe89d07dcc28 Linux 6.0                                                              4385 4205 4348 4228 4504
>>>>
>>>>
>>>> The first regression was between v6.0 and v6.1-rc1. The score dropped
>>>> from 4600 to 2700, and bisected to the patches switching from rb tree to
>>>> maple tree. This was reported at
>>>> https://lore.kernel.org/oe-lkp/202212191714.524e00b3-yujie.liu@intel.com/
>>>> Thanks for the explanation that it is an expected regression as a trade
>>>> off to benefit read performance.
>>>>
>>>> The second regression was between v6.1-rc7 and v6.1-rc8. The score
>>>> dropped from 2700 to 2100, and bisected to this "Revert "mm: align larger
>>>> anonymous mappings on THP boundaries"" commit.
>>> So it means "mm: align larger anonymous mappings on THP boundaries"
>>> actually improved the mmap1 benchmark? But it caused a regression for
>>> other use cases, for example building the kernel with clang, which is
>>> a real-life use case.
>> Yes. The patch "mm: align larger anonymous mappings on THP boundaries"
>> can improve the mmap1 benchmark.
>>
> 
> If the aligned VMAs cannot be merged, then they do not need to be split
> on freeing.  This means we are just allocating a new vma, writing it to
> the tree, removing it from the tree, and freeing the vma.  We can do this
> 4600 times a second, apparently.
> 
> If the VMAs do get merged, we will go through __vma_adjust() to expand a
> boundary, write it to the tree, allocate a new vma, __vma_adjust() the
> vma boundary back, insert the new VMA that covers the boundary area,
> remove the new vma from the tree, free the vma.  We can only do this
> 2700 times a second.  Note this is writing 3 times to the tree in this
> loop vs 2 in the other option.
Thanks a lot for quantifying the two paths.
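
For reference, the loop being measured looks roughly like the sketch
below. It assumes that each mmap1 iteration maps and then unmaps one
anonymous private region without touching it. MEMSIZE (128 MB here) and
run_loop() are illustrative names chosen for this mail, not copied from
the will-it-scale source.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/* Assumed mapping size; large enough to qualify for THP alignment. */
#define MEMSIZE (128UL * 1024 * 1024)

/*
 * Map and immediately unmap an anonymous region `iterations` times.
 * Returns the number of completed map/unmap pairs.
 */
static unsigned long run_loop(unsigned long iterations)
{
	unsigned long done = 0;

	for (unsigned long i = 0; i < iterations; i++) {
		void *p = mmap(NULL, MEMSIZE, PROT_READ | PROT_WRITE,
			       MAP_ANONYMOUS | MAP_PRIVATE, -1, 0);
		if (p == MAP_FAILED)
			break;

		/* No page is touched, so the cost is VMA bookkeeping only. */
		if (munmap(p, MEMSIZE) != 0)
			break;

		done++;
	}
	return done;
}
```

Whether a given iteration takes the merge/split path or the plain
insert/remove path depends on where the kernel places the new mapping
relative to its neighbours, which is what the alignment patch changes.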

> 
> So yes, merging/splitting is more work and always has been.  We are
> doing this to avoid having too many VMAs though.  There really isn't a
> good reason an application would do this for any meaningful number of
> iterations.
> 
>> For the kernel build regression, it looks like it is not directly
>> related to the patch "mm: align larger anonymous mappings on THP
>> boundaries". It is an existing behavior that the patch makes more visible.
>> https://lore.kernel.org/all/a4bcddad-e56f-cedc-891a-916e86d9a02c@intel.com/
>>
> 
> Having a snapshot of the VMA layout would help here since the THP
Can you share how to take a snapshot of the VMA layout?
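
The only way I know of is to read /proc/<pid>/maps, which lists one VMA
per line. Below is a minimal sketch, assuming that is the kind of
snapshot you mean. snapshot_vmas() and snapshot_self() are hypothetical
helper names for illustration only.

```c
#include <stdio.h>
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Copy /proc/<pid>/maps to `out` (if non-NULL) and return the number
 * of VMA lines seen, or -1 if the file cannot be opened.
 */
static long snapshot_vmas(pid_t pid, FILE *out)
{
	char path[64];
	char line[4096];
	long count = 0;
	FILE *f;

	snprintf(path, sizeof(path), "/proc/%ld/maps", (long)pid);
	f = fopen(path, "r");
	if (!f)
		return -1;

	while (fgets(line, sizeof(line), f)) {
		if (out)
			fputs(line, out);
		/* Count only complete lines so long paths are not counted twice. */
		if (strchr(line, '\n'))
			count++;
	}
	fclose(f);
	return count;
}

/* Convenience wrapper for the calling process. */
static long snapshot_self(FILE *out)
{
	return snapshot_vmas(getpid(), out);
}
```

Taking one snapshot with the alignment patch and one without it, at the
same point in the workload, should show whether the VMA count differs.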


> boundary alignment may be changing if the VMAs can be merged or not.  I
> suspect it is not able to merge and is fragmenting the VMA space which
> would speed up this benchmark at the expense of having more VMAs.
I have the same thought about the additional VMAs here. Thanks.


Regards
Yin, Fengwei

> 
> Thanks,
> Liam