Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64

From: Roman Gushchin <guro@fb.com>
To: Zi Yan <ziy@nvidia.com>
Cc: <linux-mm@kvack.org>, Matthew Wilcox <willy@infradead.org>,
	"Kirill A . Shutemov" <kirill.shutemov@linux.intel.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Yang Shi <shy828301@gmail.com>, Michal Hocko <mhocko@suse.com>,
	John Hubbard <jhubbard@nvidia.com>,
	Ralph Campbell <rcampbell@nvidia.com>,
	David Nellans <dnellans@nvidia.com>,
	Jason Gunthorpe <jgg@ziepe.ca>,
	David Rientjes <rientjes@google.com>,
	Vlastimil Babka <vbabka@suse.cz>,
	David Hildenbrand <david@redhat.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Song Liu <songliubraving@fb.com>
Subject: Re: [RFC PATCH v3 00/49] 1GB PUD THP support on x86_64
Date: Tue, 30 Mar 2021 11:02:07 -0700
Message-ID: <YGNnnzwDIfdy2B/G@carbon.dhcp.thefacebook.com>
In-Reply-To: <06D1034A-DE8B-4970-9056-6CA1C436D2E8@nvidia.com>

On Tue, Mar 30, 2021 at 01:24:14PM -0400, Zi Yan wrote:
> Hi Roman,
> 
> 
> On 4 Mar 2021, at 11:45, Roman Gushchin wrote:
> 
> > On Thu, Mar 04, 2021 at 11:26:03AM -0500, Zi Yan wrote:
> >> On 1 Mar 2021, at 20:59, Roman Gushchin wrote:
> >>
> >>> On Wed, Feb 24, 2021 at 05:35:36PM -0500, Zi Yan wrote:
> >>>> From: Zi Yan <ziy@nvidia.com>
> >>>>
> >>>> Hi all,
> >>>>
> >>>> I have rebased my 1GB PUD THP support patches on v5.11-mmotm-2021-02-18-18-29
> >>>> and the code is available at
> >>>> https://github.com/x-y-z/linux-1gb-thp/tree/1gb_thp_v5.11-mmotm-2021-02-18-18-29
> >>>> if you want to give it a try. The actual 49 patches are not sent out with this
> >>>> cover letter. :)
> >>>>
> >>>> Instead of asking for code review, I would like to discuss the concerns I got
> >>>> from previous RFCs. I think there are two major ones:
> >>>>
> >>>> 1. 1GB page allocation. Current implementation allocates 1GB pages from CMA
> >>>>    regions that are reserved at boot time like hugetlbfs. The concern with
> >>>>    using CMA is that an educated guess is needed to avoid depleting kernel
> >>>>    memory in case CMA regions are set too large. Recently David Rientjes
> >>>>    proposed using process_madvise() for hugepage collapse, which is an
> >>>>    alternative [1] but might not work for 1GB pages, since there is no way of
> >>>>    _allocating_ a 1GB page into which to collapse pages. I proposed a similar
> >>>>    approach at LSF/MM 2019, generating physically contiguous memory after pages
> >>>>    are allocated [2], which is usable for 1GB THPs. This approach does in-place
> >>>>    huge page promotion thus does not require page allocation.
> >>>
> >>> Well, I don't think there is an alternative to cma as of now. Once the memory has been
> >>> almost filled at least once, any subsequent activity leading to substantial slab allocations
> >>> (e.g. running git gc) will fragment the memory, so that there is virtually no chance
> >>> of finding a contiguous GB.
> >>>
> >>> It's possible in theory to reduce the fragmentation on 1GB scale by grouping
> >>> non-movable pageblocks, but it seems a separate project.
> >>
> >> My experiments showed that finding contiguous GBs is possible, but I agree that
> >> CMA is more reliable and 1GB scale defragmentation should be a separate project.
> >
> > I actually ran a large scale experiment (on tens of thousands of machines) in the last
> > several months. It was about hugetlbfs 1GB pages, but the allocation mechanism is the same.
> 
> Thanks for the information. I finally have time to come back to this. Do you mind sharing
> the total memory of these machines? I want to have some idea of the scale of this issue to
> make sure I reproduce it on a suitable machine. Are you trying to get <20% of 10s of GBs,
> 100s of GBs, or TBs of memory?

There are different configurations, but in general they have hundreds of GBs of memory or less.

> 
> >
> > My goal was to allocate a relatively small number of 1GB pages (<20% of the total memory).
> > Without cma the chances drop to 0% very quickly after reboot, and even manual manipulations
> > like shutting down all workloads, dropping caches, calling sync, compaction, etc. do not
> > help much. Sometimes you can allocate maybe 1-2 pages, but that's about it.
> 
> Is there a way of replicating such an environment with publicly available software?
> I really want to understand the root cause and am willing to find a possible solution.
> It would be much easier if I can reproduce this locally.

There is nothing fb-specific: once the memory is filled with anon/pagecache, any subsequent
allocations of non-movable memory (slabs, percpu, etc.) will fragment the memory. There
is a pageblock mechanism which prevents fragmentation on the 2MB scale, but nothing prevents
fragmentation on the 1GB scale. It's just a matter of runtime (and the number of mm operations).
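
To put rough numbers on the 2MB vs 1GB point, here is a toy model in Python. It is
only a sketch and not how the kernel places pages: it assumes unmovable pageblocks end
up scattered uniformly at random, and that a 1GB-aligned region is usable only if none
of its 512 2MB pageblocks is unmovable. The function names are made up for this
illustration.

```python
import random

# On x86_64 a 1GB region covers 512 pageblocks of 2MB each.
PAGEBLOCKS_PER_GB = 512


def free_gb_regions(total_gb, unmovable_blocks, seed=0):
    """Count 1GB-aligned regions that contain no unmovable pageblock.

    Assumption (toy model): the unmovable pageblocks are placed uniformly
    at random over all pageblocks in the machine.
    """
    rng = random.Random(seed)
    total_blocks = total_gb * PAGEBLOCKS_PER_GB
    polluted = rng.sample(range(total_blocks), unmovable_blocks)
    # A single unmovable pageblock is enough to spoil its whole 1GB region.
    dirty_regions = {block // PAGEBLOCKS_PER_GB for block in polluted}
    return total_gb - len(dirty_regions)


def expected_clean_fraction(fraction_unmovable):
    """Chance that one 1GB region has no unmovable pageblock at all,
    assuming each pageblock is unmovable independently with this probability."""
    return (1.0 - fraction_unmovable) ** PAGEBLOCKS_PER_GB


if __name__ == "__main__":
    # 128GB machine, about 1% of pageblocks unmovable and scattered.
    total_gb = 128
    unmovable = total_gb * PAGEBLOCKS_PER_GB // 100
    print("clean 1GB regions:", free_gb_regions(total_gb, unmovable))
    print("expected clean fraction:", expected_clean_fraction(0.01))
```

Under these assumptions even about 1% of scattered unmovable pageblocks leaves well
under 1% of the 1GB regions clean, since (1 - 0.01)^512 is tiny. Real placement is not
uniform, so treat this as an illustration of the trend rather than a prediction.
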

> 
> >
> > Even with cma we had to fix a number of additional problems (like sub-optimal placement
> > of cma areas, 2MB THP migration, some ext4 and btrfs page migration issues) to have
> > a reasonable success rate of ~95-99%. And it's not 100% anyway.
> >
> > The problem with artificial tests is that you're likely experimenting on a freshly
> > rebooted machine which isn't/wasn't doing much. It's a bad model of the real memory
> > state of a production server.
> 
> Yes, I agree that my experiment is not representative. Can you provide more information
> on what application behavior(s) lead to this memory fragmentation? I guess it is
> because non-movable pages are spread across the entire physical memory space. Is there
> a quick reproducer for that?

I have a simple C program which is able to fragment the memory; you can play with it:
https://github.com/rgushchin/fragm

But as I said, basically any load which is actively using the whole memory
will fragment it.

Thanks!