From: Zi Yan <ziy@nvidia.com>
To: David Hildenbrand <david@redhat.com>
Cc: <linux-mm@kvack.org>, Matthew Wilcox <willy@infradead.org>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Roman Gushchin <guro@fb.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Yang Shi <shy828301@gmail.com>, Michal Hocko <mhocko@suse.com>,
	John Hubbard <jhubbard@nvidia.com>,
	"Ralph Campbell" <rcampbell@nvidia.com>,
	David Nellans <dnellans@nvidia.com>,
	"Jason Gunthorpe" <jgg@ziepe.ca>,
	David Rientjes <rientjes@google.com>,
	"Vlastimil Babka" <vbabka@suse.cz>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Song Liu <songliubraving@fb.com>
Subject: Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
Date: Thu, 25 Feb 2021 17:13:38 -0500
Message-ID: <67B2C538-45DB-4678-A64D-295A9703EDE1@nvidia.com>
In-Reply-To: <c0caebd1-9da6-9147-b30e-cc8ae1121228@redhat.com>


On 25 Feb 2021, at 6:02, David Hildenbrand wrote:

> On 24.02.21 23:35, Zi Yan wrote:
>> From: Zi Yan <ziy@nvidia.com>
>>
>> Hi all,
>>
>> I have rebased my 1GB PUD THP support patches on v5.11-mmotm-2021-02-18-18-29
>> and the code is available at
>> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.11-mmotm-2021-02-18-18-29
>> if you want to give it a try. The actual 49 patches are not sent out with this
>> cover letter. :)
>>
>> Instead of asking for code review, I would like to discuss the concerns I got
>> from previous RFCs. I think there are two major ones:
>>
>> 1. 1GB page allocation. The current implementation allocates 1GB pages from CMA
>>     regions that are reserved at boot time, like hugetlbfs does. The concern with
>>     using CMA is that an educated guess is needed to avoid depleting kernel
>>     memory in case the CMA regions are set too large. Recently, David Rientjes
>>     proposed using process_madvise() for hugepage collapse, which is an
>>     alternative [1] but might not work for 1GB pages, since there is no way of
>
> I see two core ideas of THP:
>
> 1) Transparent to the user: you get a speedup without really caring, *except* for having to enable/disable the optimization manually sometimes (i.e., MADV_HUGEPAGE) - because in corner cases (e.g., userfaultfd), it's not completely transparent and might have performance impacts. mprotect(), mmap(MAP_FIXED), mremap() work as expected.
>
> 2) Transparent to other subsystems of the kernel: the page size of the mapping is in base pages - we can split anytime on demand in case we cannot handle THP. In addition, no special requirements: no CMA, no movability restrictions, no swappability restrictions, ... most stuff works transparently by splitting.
>
> Your current approach messes with 2). Your proposal here messes with 1).
>
> Any kind of explicit placement by the user can silently get reverted at any time. So process_madvise() would really only be useful in cases where a temporary split might get reverted later on by the OS automatically - like we have for 2MB THP right now.
>
> So process_madvise() is less likely to help if the system won't try collapsing automatically (more below).
>>     _allocating_ a 1GB page into which to collapse pages. I proposed a similar
>>     approach at LSF/MM 2019, generating physically contiguous memory after pages
>>     are allocated [2], which is usable for 1GB THPs. That approach does in-place
>>     huge page promotion and thus does not require page allocation.
>
> I like the idea of forming a 1GB THP at a location where already consecutive pages allow for it. It can be applied generically - and both 1) and 2) keep working as expected. Anytime there was a split, we can retry forming a THP later.
>
> However, I don't follow how this is actually feasible at scale. You could only ever collapse into a 1GB THP if you happen to have 1GB of consecutive 2MB THPs / 4k pages already. Sounds to me like this happens only when the stars align.

Both the process_madvise() approach and my proposal require page migration to bring back THPs, since, as you said, having consecutive pages ready is extremely rare. IIUC, the process_madvise() approach reuses the khugepaged code to collapse huge pages,
namely first allocating a 2MB THP, then copying data over, and finally freeing the old base pages. My proposal would migrate pages within
a virtual address range (>1GB and 1GB-aligned) to make all the physical pages contiguous, then promote the resulting 1GB of consecutive
pages to a 1GB THP. No new page allocation is needed.
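
For concreteness, the daemon-driven flow could look roughly like the sketch below. This is my own illustration, not code from the series: pidfd_open() and process_madvise() are existing syscalls (since 5.3 and 5.10), but the MADV_COLLAPSE advice value is only the proposed collapse hint, so both the name and the numeric fallback are placeholders, and SYS_process_madvise needs new enough libc headers.

#define _GNU_SOURCE
#include <stddef.h>
#include <sys/syscall.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#ifndef MADV_COLLAPSE
#define MADV_COLLAPSE 25        /* placeholder for the proposed collapse advice */
#endif

/*
 * Ask the kernel to collapse/promote huge pages in another task's
 * huge-page-aligned range.  pid, addr and len come from whatever
 * policy the daemon implements.
 */
static int request_collapse(pid_t pid, void *addr, size_t len)
{
        struct iovec iov = { .iov_base = addr, .iov_len = len };
        int pidfd = syscall(SYS_pidfd_open, pid, 0);
        long ret;

        if (pidfd < 0)
                return -1;

        /* raw syscall: no glibc wrapper for process_madvise() here */
        ret = syscall(SYS_process_madvise, pidfd, &iov, 1, MADV_COLLAPSE, 0);
        close(pidfd);

        return ret < 0 ? -1 : 0;
}

The same call shape would work for the in-place promotion path as well; only the advice value (and the kernel-side work behind it) would differ.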

Both approaches would need user-space invocation, assuming either that the application itself wants to get THPs for a specific region or that a user-space daemon would do this for a group of applications, instead of waiting for khugepaged to slowly (4096 pages every 10s) scan and collapse huge pages. The user pays the cost of getting THPs. This also means THPs are not completely transparent to users, but I think that should be fine when users explicitly invoke these two methods to get THPs for better performance.
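
To put the khugepaged rate above in perspective (a back-of-the-envelope number, assuming the defaults of 4096 pages scanned per 10s pass): a single 1GB range is 1GB / 4KB = 262144 base pages, so merely scanning it takes about (262144 / 4096) * 10s = 640s, i.e. more than ten minutes, before any collapse work even happens.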

The difference with my proposal is that it does not need a 1GB THP allocation, so there are no special requirements like using CMA
or increasing MAX_ORDER in the buddy allocator to allow 1GB page allocations. It makes creating THPs with orders > MAX_ORDER possible
without other intrusive changes.
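
One obvious user-visible constraint is the alignment and size of the candidate range. Here is a minimal sketch (again my own, not from the series) of how a caller might pick the 1GB-aligned subrange of an arbitrary mapping before asking for promotion:

#include <stddef.h>
#include <stdint.h>

#define PUD_THP_SIZE    (1UL << 30)     /* 1GB */

/*
 * Given a mapping [start, start + len), return the start of the largest
 * 1GB-aligned, 1GB-multiple subrange and store its length in *out_len.
 * Returns 0 with *out_len == 0 if no full 1GB unit fits.
 */
static uintptr_t pud_aligned_subrange(uintptr_t start, size_t len, size_t *out_len)
{
        uintptr_t aligned = (start + PUD_THP_SIZE - 1) & ~(PUD_THP_SIZE - 1);
        uintptr_t end = start + len;

        if (aligned >= end || end - aligned < PUD_THP_SIZE) {
                *out_len = 0;
                return 0;
        }

        *out_len = (end - aligned) & ~(PUD_THP_SIZE - 1);
        return aligned;
}

Anything smaller than that simply stays as base pages or 2MB THPs, exactly as today.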


—
Best Regards,
Yan Zi

