From: Mike Kravetz <mike.kravetz@oracle.com>
To: Roman Gushchin <guro@fb.com>
Cc: Andrew Morton <akpm@linux-foundation.org>,
	Johannes Weiner <hannes@cmpxchg.org>,
	Michal Hocko <mhocko@kernel.org>,
	linux-mm@kvack.org, kernel-team@fb.com,
	linux-kernel@vger.kernel.org, Rik van Riel <riel@surriel.com>
Subject: Re: [PATCH v2] mm: hugetlb: optionally allocate gigantic hugepages using cma
Date: Tue, 10 Mar 2020 11:33:47 -0700	[thread overview]
Message-ID: <4b78a8a9-7b5a-eb62-acaa-2677e615bea1@oracle.com> (raw)
In-Reply-To: <20200310180558.GD85000@carbon.dhcp.thefacebook.com>

On 3/10/20 11:05 AM, Roman Gushchin wrote:
> On Tue, Mar 10, 2020 at 10:27:01AM -0700, Mike Kravetz wrote:
>> On 3/9/20 5:25 PM, Roman Gushchin wrote:
>>> diff --git a/arch/x86/kernel/setup.c b/arch/x86/kernel/setup.c
>>> index a74262c71484..ceeb06ddfd41 100644
>>> --- a/arch/x86/kernel/setup.c
>>> +++ b/arch/x86/kernel/setup.c
>>> @@ -16,6 +16,7 @@
>>>  #include <linux/pci.h>
>>>  #include <linux/root_dev.h>
>>>  #include <linux/sfi.h>
>>> +#include <linux/hugetlb.h>
>>>  #include <linux/tboot.h>
>>>  #include <linux/usb/xhci-dbgp.h>
>>>  
>>> @@ -1158,6 +1159,8 @@ void __init setup_arch(char **cmdline_p)
>>>  	initmem_init();
>>>  	dma_contiguous_reserve(max_pfn_mapped << PAGE_SHIFT);
>>>  
>>> +	hugetlb_cma_reserve();
>>> +
>>
>> I know this is called from arch specific code here to fit in with the timing
>> of CMA setup/reservation calls.  However, there really is nothing architecture
>> specific about this functionality.  It would be great IMO if we could make
>> this architecture independent.  However, I cannot think of a straightforward
>> way to do this.
> 
> I agree. Unfortunately I have no better idea than having an arch-dependent hook.
> 
>>
>>>  	/*
>>>  	 * Reserve memory for crash kernel after SRAT is parsed so that it
>>>  	 * won't consume hotpluggable memory.
>> <snip>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> <snip>
>>> +void __init hugetlb_cma_reserve(void)
>>> +{
>>> +	unsigned long totalpages = 0;
>>> +	unsigned long start_pfn, end_pfn;
>>> +	phys_addr_t size;
>>> +	int nid, i, res;
>>> +
>>> +	if (!hugetlb_cma_size && !hugetlb_cma_percent)
>>> +		return;
>>> +
>>> +	if (hugetlb_cma_percent) {
>>> +		for_each_mem_pfn_range(i, MAX_NUMNODES, &start_pfn, &end_pfn,
>>> +				       NULL)
>>> +			totalpages += end_pfn - start_pfn;
>>> +
>>> +		size = PAGE_SIZE * (hugetlb_cma_percent * 100 * totalpages) /
>>> +			10000UL;
>>> +	} else {
>>> +		size = hugetlb_cma_size;
>>> +	}
>>> +
>>> +	pr_info("hugetlb_cma: reserve %llu, %llu per node\n", size,
>>> +		size / nr_online_nodes);
>>> +
>>> +	size /= nr_online_nodes;
>>> +
>>> +	for_each_node_state(nid, N_ONLINE) {
>>> +		unsigned long min_pfn = 0, max_pfn = 0;
>>> +
>>> +		for_each_mem_pfn_range(i, nid, &start_pfn, &end_pfn, NULL) {
>>> +			if (!min_pfn)
>>> +				min_pfn = start_pfn;
>>> +			max_pfn = end_pfn;
>>> +		}
>>> +
>>> +		res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
>>> +					     PFN_PHYS(max_pfn), (1UL << 30),
>>
>> The alignment is hard-coded for the x86 gigantic page size.  If this supports
>> more architectures or becomes arch-independent, we will need to determine
>> what this alignment should be.  Perhaps an arch-specific callback to get
>> the alignment for gigantic pages.  That will require a little thought as
>> some arches support multiple gigantic page sizes.
> 
> Good point!
> Should we take the biggest possible size as a reference?
> Or the smallest (larger than MAX_ORDER)?

As mentioned, it is pretty simple for architectures like x86 that only
have one gigantic page size.  Just a random thought, but since
hugetlb_cma_reserve is called from arch-specific code, perhaps the arch
code could pass in at least the alignment as an argument to this function?
That way we can somewhat push the issue to the architectures.  For example,
powerpc supports a 16GB gigantic page size, but I believe those pages are
allocated via firmware somehow.  So, powerpc would not pass 16G as the
alignment.  In this case, setup of the CMA area is somewhat architecture
dependent.  So, perhaps the call from arch-specific code is not such a bad idea.
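
To illustrate, something like this (a rough, untested sketch; using
PUD_SIZE as the x86 value is just my example, not something in the patch):

	/* mm/hugetlb.c: take the gigantic page alignment from the arch */
	void __init hugetlb_cma_reserve(phys_addr_t align);

	/* arch/x86/kernel/setup.c: gigantic pages are 1GB, i.e. PUD sized */
	hugetlb_cma_reserve(PUD_SIZE);

An arch like powerpc, where the gigantic pages come from firmware, would
then simply not make the call at all.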

With that in mind, we may want some type of coordination between the
arch-specific and arch-independent code.  Right now,
cmdline_parse_hugetlb_cma will accept a hugetlb_cma command line option
without complaint even if the architecture never calls hugetlb_cma_reserve.
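
One (untested) way to catch that -- the flag and the late_initcall are my
own sketch, not something from the patch:

	static bool hugetlb_cma_reserve_called __initdata;

	void __init hugetlb_cma_reserve(void)
	{
		hugetlb_cma_reserve_called = true;
		/* existing reservation logic follows */
	}

	/* late_initcall runs well after setup_arch(), so the flag is settled */
	static int __init hugetlb_cma_check(void)
	{
		if ((hugetlb_cma_size || hugetlb_cma_percent) &&
		    !hugetlb_cma_reserve_called)
			pr_warn("hugetlb_cma: not supported on this architecture\n");
		return 0;
	}
	late_initcall(hugetlb_cma_check);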

Just a nit, but cma_declare_contiguous is going to round the size up to the
alignment.  So, the actual reserved size may not match what is printed with,
+		pr_info("hugetlb_cma: successfully reserved %llu on node %d\n",
+			size, nid);

I found this out by testing the code and specifying hugetlb_cma=2M.  The
messages in the log were:
	kernel: hugetlb_cma: reserve 2097152, 1048576 per node
	kernel: hugetlb_cma: successfully reserved 1048576 on node 0
	kernel: hugetlb_cma: successfully reserved 1048576 on node 1
But, it really reserved 1GB per node.
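
The fix could be as simple as printing the size CMA actually set aside,
after the call succeeds.  An untested sketch -- the hugetlb_cma[] array
name is my own, and cma_get_size() returns the rounded-up size that was
actually reserved:

	static struct cma *hugetlb_cma[MAX_NUMNODES];

	res = cma_declare_contiguous(PFN_PHYS(min_pfn), size,
				     PFN_PHYS(max_pfn), (1UL << 30),
				     0, false, "hugetlb",
				     &hugetlb_cma[nid]);
	if (!res)
		pr_info("hugetlb_cma: reserved %lu on node %d\n",
			cma_get_size(hugetlb_cma[nid]), nid);

With that, the hugetlb_cma=2M test above would have reported the 1GB per
node that was actually reserved.
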
-- 
Mike Kravetz

