Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.

From: Zi Yan <ziy@nvidia.com>
To: Hugh Dickins <hughd@google.com>
Cc: Vlastimil Babka <vbabka@suse.cz>,
	David Hildenbrand <david@redhat.com>,
	linux-mm@kvack.org, Matthew Wilcox <willy@infradead.org>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Michal Hocko <mhocko@kernel.org>,
	John Hubbard <jhubbard@nvidia.com>,
	linux-kernel@vger.kernel.org, Roman Gushchin <guro@fb.com>
Subject: Re: [RFC PATCH 00/15] Make MAX_ORDER adjustable as a kernel boot time parameter.
Date: Fri, 06 Aug 2021 17:26:09 -0400	[thread overview]
Message-ID: <13DF8783-289F-4ED7-AC13-E60DF7CD0710@nvidia.com> (raw)
In-Reply-To: <6ae6cd92-3ff4-7ed3-b337-a4dfe33da1c@google.com>

[-- Attachment #1: Type: text/plain, Size: 2862 bytes --]

On 6 Aug 2021, at 16:27, Hugh Dickins wrote:

> On Fri, 6 Aug 2021, Zi Yan wrote:
>>
>> In addition, I would like to share more detail on my plan on supporting 1GB PUD THP.
>> This patchset is the first step, enabling kernel to allocate 1GB pages, so that
>> user can get 1GB THPs from ZONE_NORMAL and ZONE_MOVABLE without using
>> alloc_contig_pages() or CMA allocator. The next step is to improve kernel memory
>> fragmentation handling for pages up to MAX_ORDER, since currently pageblock size
>> is still limited by memory section size. As a result, I will explore solutions
>> like having additional larger pageblocks (up to MAX_ORDER) to counter memory
>> fragmentation. I will discover what else needs to be solved as I gradually improve
>> 1GB PUD THP support.
>
> Sorry to be blunt, but let me state my opinion: 2MB THPs have given and
> continue to give us more than enough trouble.  Complicating the kernel's
> mm further, just to allow 1GB THPs, seems a very bad tradeoff to me.  I
> understand that it's an appealing personal project; but for the sake of
> of all the rest of us, please leave 1GB huge pages to hugetlbfs (until
> the day when we are all using 2MB base pages).

I do not agree with you. 2MB THP provides good performance, while letting us
keep using 4KB base pages. The 2MB THP implementation is the price we pay
to get the performance. This patchset removes the tie between MAX_ORDER
and section size to allow >2MB page allocation, which is useful in many
places. 1GB THP is one of the users. Gigantic pages also improve
device performance, like GPUs (e.g., AMD GPUs can use any power of two up to
1GB pages[1], which I just learnt). Also could you point out which part
of my patchset complicates kernel’s mm? I could try to simplify it if
possible.

In addition, I am not sure hugetlbfs is the way to go. THP is managed by
core mm, whereas hugetlbfs has its own code for memory management.
As hugetlbfs gets popular, more core mm functionalities have been
replicated and added to hugetlbfs codebase. It is not a good tradeoff
either. One of the reasons I work on 1GB THP is that Roman from Facebook
explicitly mentioned they want to use THP in place of hugetlbfs[2].

I think it might be more constructive to point out the existing issues
in THP so that we can improve the code together. BTW, I am also working
on simplifying THP code like generalizing THP split[3] and planning to
simplify page table manipulation code by reviving Kirill’s idea[4].

[1] https://lore.kernel.org/linux-mm/bdec12bd-9188-9f3e-c442-aa33e25303a6@amd.com/
[2] https://lore.kernel.org/linux-mm/20200903162527.GF60440@carbon.dhcp.thefacebook.com/
[3] https://lwn.net/Articles/837928/
[4] https://lore.kernel.org/linux-mm/20180424154355.mfjgkf47kdp2by4e@black.fi.intel.com/

—
Best Regards,
Yan, Zi

[-- Attachment #2: OpenPGP digital signature --]
[-- Type: application/pgp-signature, Size: 854 bytes --]