Re: [PATCH v2] mm: hugetlb: optionally allocate gigantic hugepages using cma

From: Roman Gushchin <guro@fb.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>, <linux-mm@kvack.org>,
	<kernel-team@fb.com>, <linux-kernel@vger.kernel.org>,
	Rik van Riel <riel@surriel.com>
Subject: Re: [PATCH v2] mm: hugetlb: optionally allocate gigantic hugepages using cma
Date: Tue, 10 Mar 2020 11:05:58 -0700	[thread overview]
Message-ID: <20200310180558.GD85000@carbon.dhcp.thefacebook.com> (raw)
In-Reply-To: <5cfa9031-fc15-2bcc-adb9-9779285ef0f7@oracle.com>

On Tue, Mar 10, 2020 at 10:27:01AM -0700, Mike Kravetz wrote:
> On 3/9/20 5:25 PM, Roman Gushchin wrote:
> > Commit 944d9fec8d7a ("hugetlb: add support for gigantic page allocation
> > at runtime") has added the run-time allocation of gigantic pages. However
> > it actually works only at early stages of the system loading, when
> > the majority of memory is free. After some time the memory gets
> > fragmented by non-movable pages, so the chances to find a contiguous
> > 1 GB block are getting close to zero. Even dropping caches manually
> > doesn't help a lot.
> > 
> > At large scale rebooting servers in order to allocate gigantic hugepages
> > is quite expensive and complex. At the same time keeping some constant
> > percentage of memory in reserved hugepages even if the workload isn't
> > using it is a big waste: not all workloads can benefit from using 1 GB
> > pages.
> > 
> > The following solution can solve the problem:
> > 1) On boot time a dedicated cma area* is reserved. The size is passed
> >    as a kernel argument.
> > 2) Run-time allocations of gigantic hugepages are performed using the
> >    cma allocator and the dedicated cma area
> > 
> > In this case gigantic hugepages can be allocated successfully with a
> > high probability, however the memory isn't completely wasted if nobody
> > is using 1GB hugepages: it can be used for pagecache, anon memory,
> > THPs, etc.
> > 
> > * On a multi-node machine a per-node cma area is allocated on each node.
> >   Following gigantic hugetlb allocation are using the first available
> >   numa node if the mask isn't specified by a user.
> > 
> > Usage:
> > 1) configure the kernel to allocate a cma area for hugetlb allocations:
> >    pass hugetlb_cma=10G as a kernel argument
> > 
> > 2) allocate hugetlb pages as usual, e.g.
> >    echo 10 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
> > 
> > If the option isn't enabled or the allocation of the cma area failed,
> > the current behavior of the system is preserved.
> > 
> > Only x86 is covered by this patch, but it's trivial to extend it to
> > cover other architectures as well.
> > 
> > v2: fixed !CONFIG_CMA build, suggested by Andrew Morton
> > 
> > Signed-off-by: Roman Gushchin <guro@fb.com>
> 
> Thanks!  I really like this idea.

Thank you!

> 
> > ---
> >  .../admin-guide/kernel-parameters.txt         |   7 ++
> >  arch/x86/kernel/setup.c                       |   3 +
> >  include/linux/hugetlb.h                       |   2 +
> >  mm/hugetlb.c                                  | 115 ++++++++++++++++++
> >  4 files changed, 127 insertions(+)
> > 
> <snip>
> > diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
> > index a74262c71484..ceeb06ddfd41 100644
> > --- a/arch/x86/kernel/setup.c
> > +++ b/arch/x86/kernel/setup.c
> > @@ -16,6 +16,7 @@
> >  #include <linux/pci.h>
> >  #include <linux/root_dev.h>
> >  #include <linux/sfi.h>
> > +#include <linux/hugetlb.h>
> >  #include <linux/tboot.h>
> >  #include <linux/usb/xhci-dbgp.h>
> >  
> > @@ -1158,6 +1159,8 @@ void __init setup_arch(char **cmdline_p)
> >  	initmem_init();
> >  	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
> >  
> > +	hugetlb_cma_reserve();
> > +
> 
> I know this is called from arch specific code here to fit in with the timing
> of CMA setup/reservation calls.  However, there really is nothing architecture
> specific about this functionality.  It would be great IMO if we could make
> this architecture independent.  However, I can not think of a straight forward
> way to do this.

I agree. Unfortunately I have no better idea than having an arch-dependent hook.

> 
> >  	/*
> >  	 * Reserve memory for crash kernel after SRAT is parsed so that it
> >  	 * won't consume hotpluggable memory.
> <snip>
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> <snip>
> > +void __init hugetlb_cma_reserve(void)
> > +{
> > +	unsigned long totalpages = 0;
> > +	unsigned long start_pfn, end_pfn;
> > +	phys_addr_t size;
> > +	int nid, i, res;
> > +
> > +	if (!hugetlb_cma_size && !hugetlb_cma_percent)
> > +		return;
> > +
> > +	if (hugetlb_cma_percent) {
> > +		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
> > +				       NULL)
> > +			totalpages += end_pfn - start_pfn;
> > +
> > +		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
> > +			10000UL;
> > +	} else {
> > +		size = hugetlb_cma_size;
> > +	}
> > +
> > +	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
> > +		size / nr_online_nodes);
> > +
> > +	size /= nr_online_nodes;
> > +
> > +	for_each_node_state(nid, N_ONLINE) {
> > +		unsigned long min_pfn = 0, max_pfn = 0;
> > +
> > +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
> > +			if (!min_pfn)
> > +				min_pfn = start_pfn;
> > +			max_pfn = end_pfn;
> > +		}
> > +
> > +		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
> > +					     PFN_PHYS(max_pfn), (1UL << 30),
> 
> The alignment is hard coded for x86 gigantic page size.  If this supports
> more architectures or becomes arch independent we will need to determine
> what this alignment should be.  Perhaps an arch specific call back to get
> the alignment for gigantic pages.  That will require a little thought as
> some arch's support multiple gigantic page sizes.

Good point!
Should we take the biggest possible size as a reference?
Or the smallest (larger than MAX_ORDER)?

Thanks!