From: Hushijie <hushijie3@huawei.com>
To: Mike Kravetz <mike.kravetz@oracle.com>
Cc: "will@kernel.org" <will@kernel.org>,
	"akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>,
	Nixiaoming <nixiaoming@huawei.com>,
	"wangxu (AE)" <wangxu72@huawei.com>,
	"Wangkefeng (OS Kernel Lab)" <wangkefeng.wang@huawei.com>,
	yangerkun <yangerkun@huawei.com>,
	"Wangle (RTOS FAE)" <wangle6@huawei.com>,
	"Chengang (L)" <cg.chen@huawei.com>,
	"Chenjie (K)" <chenjie6@huawei.com>,
	"Huangjianhui (Alex)" <alex.huangjianhui@huawei.com>
Subject: Re: [PATCH v5] hugetlbfs: Get unmapped area below TASK_UNMAPPED_BASE for hugetlbfs
Date: Sat, 16 May 2020 07:47:35 +0000	[thread overview]
Message-ID: <87d618c778584d2386c0f0c81be6319b@huawei.com> (raw)

>On 5/14/20 7:31 AM, Shijie Hu wrote:
>> Here is a final patch to solve that hugetlb_get_unmapped_area() can't
>> get unmapped area below mmap base for huge pages based on a few previous
>> discussions and patches from me.
>> 
>> I'm so sorry. When sending the v2 and v3 patches, I forgot to cc:
>> linux-mm@kvack.org and linux-kernel@vger.kernel.org, so there are no
>> records of these two patches on https://lkml.org/lkml/.
>> 
>> Patch V1: https://lkml.org/lkml/2020/4/27/440
>> Patch V4: https://lkml.org/lkml/2020/5/13/980
>> 
>> Changes in V2:
>> * Following Mike's suggestions, move the hugetlb_get_unmapped_area()
>> routines from "fs/hugetlbfs/inode.c" to "arch/arm64/mm/hugetlbpage.c",
>> without changing core code.
>> * Add an mmap_is_legacy() function, and only fall back to the bottom-up
>> function in non-legacy mode.
>> 
>> Changes in V3:
>> * Add two functions, *bottomup() and *topdown(), and check whether
>> mm->get_unmapped_area is equal to arch_get_unmapped_area() instead of
>> checking mmap_is_legacy() to determine which function should be used.
>> 
>> Changes in V4:
>> * Following the suggestions of Will and Mike, move this patch back to
>> core code.
>> 
>> Changes in V5:
>> * Fix kbuild test error.
>> 
>> ------
>> 
>> In a 32-bit program running on the arm64 architecture, when the address
>> space below the mmap base is completely exhausted, shmat() for huge pages
>> returns ENOMEM, while shmat() for normal pages can still succeed in
>> non-legacy mode. This seems unfair.
>> 
>> For normal pages, the get_unmapped_area() calling flow is:
>>     => mm->get_unmapped_area()
>> 	in legacy mode,
>> 	    => arch_get_unmapped_area()
>> 			=> vm_unmapped_area()
>> 	in non-legacy mode,
>> 	    => arch_get_unmapped_area_topdown()
>> 			=> vm_unmapped_area()
>> 
>> For huge pages, the get_unmapped_area() calling flow is:
>>     => file->f_op->get_unmapped_area()
>> 		=> hugetlb_get_unmapped_area()
>> 			=> vm_unmapped_area()
>> 
>> To solve this issue, we only need to make hugetlb_get_unmapped_area()
>> follow the same path as mm->get_unmapped_area(). Add two functions,
>> *bottomup() and *topdown(), and check the current mm->get_unmapped_area
>> to decide which one to use. If mm->get_unmapped_area is equal to
>> arch_get_unmapped_area(), hugetlb_get_unmapped_area() calls the bottomup
>> routine; otherwise it calls the topdown routine.
>> 
>> Signed-off-by: Shijie Hu <hushijie3@huawei.com>
>> Reported-by: kbuild test robot <lkp@intel.com>
>> ---
>>  fs/hugetlbfs/inode.c | 62 +++++++++++++++++++++++++++++++++++++++++++++-------
>>  1 file changed, 54 insertions(+), 8 deletions(-)
>> 
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index 991c60c7ffe0..61418380f492 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -38,6 +38,7 @@
>>  #include <linux/uio.h>
>>  
>>  #include <linux/uaccess.h>
>> +#include <linux/sched/mm.h>
>>  
>>  static const struct super_operations hugetlbfs_ops;
>>  static const struct address_space_operations hugetlbfs_aops;
>> @@ -191,13 +192,60 @@ static int hugetlbfs_file_mmap(struct file *file, struct vm_area_struct *vma)
>>  
>>  #ifndef HAVE_ARCH_HUGETLB_UNMAPPED_AREA
>>  static unsigned long
>> +hugetlb_get_unmapped_area_bottomup(struct file *file, unsigned long addr,
>> +		unsigned long len, unsigned long pgoff, unsigned long flags)
>> +{
>> +	struct hstate *h = hstate_file(file);
>> +	struct vm_unmapped_area_info info;
>> +
>> +	info.flags = 0;
>> +	info.length = len;
>> +	info.low_limit = current->mm->mmap_base;
>> +	info.high_limit = TASK_SIZE;
>> +	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>> +	info.align_offset = 0;
>> +	return vm_unmapped_area(&info);
>> +}
>> +
>> +static unsigned long
>> +hugetlb_get_unmapped_area_topdown(struct file *file, unsigned long addr,
>> +		unsigned long len, unsigned long pgoff, unsigned long flags)
>> +{
>> +	struct hstate *h = hstate_file(file);
>> +	struct vm_unmapped_area_info info;
>> +
>> +	info.flags = VM_UNMAPPED_AREA_TOPDOWN;
>> +	info.length = len;
>> +	info.low_limit = max(PAGE_SIZE, mmap_min_addr);
>> +	info.high_limit = current->mm->mmap_base;
>> +	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>> +	info.align_offset = 0;
>> +	addr = vm_unmapped_area(&info);
>> +
>> +	/*
>> +	 * A failed mmap() very likely causes application failure,
>> +	 * so fall back to the bottom-up function here. This scenario
>> +	 * can happen with large stack limits and large mmap()
>> +	 * allocations.
>> +	 */
>> +	if (unlikely(offset_in_page(addr))) {
>> +		VM_BUG_ON(addr != -ENOMEM);
>> +		info.flags = 0;
>> +		info.low_limit = current->mm->mmap_base;
>> +		info.high_limit = TASK_SIZE;
>> +		addr = vm_unmapped_area(&info);
>> +	}
>> +
>> +	return addr;
>> +}
>> +
>> +static unsigned long
>>  hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>>  		unsigned long len, unsigned long pgoff, unsigned long flags)
>>  {
>>  	struct mm_struct *mm = current->mm;
>>  	struct vm_area_struct *vma;
>>  	struct hstate *h = hstate_file(file);
>> -	struct vm_unmapped_area_info info;
>>  
>>  	if (len & ~huge_page_mask(h))
>>  		return -EINVAL;
>> @@ -218,13 +266,11 @@ hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>>  			return addr;
>>  	}
>>  
>> -	info.flags = 0;
>> -	info.length = len;
>> -	info.low_limit = TASK_UNMAPPED_BASE;
>> -	info.high_limit = TASK_SIZE;
>> -	info.align_mask = PAGE_MASK & ~huge_page_mask(h);
>> -	info.align_offset = 0;
>> -	return vm_unmapped_area(&info);
>> +	if (mm->get_unmapped_area == arch_get_unmapped_area)
>> +		return hugetlb_get_unmapped_area_bottomup(file, addr, len,
>> +				pgoff, flags);
>> +	return hugetlb_get_unmapped_area_topdown(file, addr, len,
>> +			pgoff, flags);
>
>I like this code using the value of mm->get_unmapped_area to determine
>which routine to call.  It is used by a few architectures.   However, I
>noticed that on at least one architecture (powerpc) mm->get_unmapped_area
>may be assigned to routines other than arch_get_unmapped_area or
>arch_get_unmapped_area_topdown.  In such a case, we would call the 'new'
>topdown routine.  I would prefer that we call the bottomup routine in this
>default case.
>
>In reality, this does not impact powerpc as that architecture has its
>own hugetlb_get_unmapped_area routine.
>

Yes, I also noticed this before. powerpc uses radix__arch_get_unmapped_area*()
when CONFIG_PPC_RADIX_MMU is set to 'y' and radix_enabled() returns true.
However, powerpc implements its own hugetlb_get_unmapped_area(), so this
patch actually has no effect on powerpc.

>Because of this, I suggest we add a comment above this code and switch
>the if/else order.  For example,
>
>+       /*
>+        * Use mm->get_unmapped_area value as a hint to use topdown routine.
>+        * If architectures have special needs, they should define their own
>+        * version of hugetlb_get_unmapped_area.
>+        */
>+       if (mm->get_unmapped_area == arch_get_unmapped_area_topdown)
>+               return hugetlb_get_unmapped_area_topdown(file, addr, len,
>+                               pgoff, flags);
>+       return hugetlb_get_unmapped_area_bottomup(file, addr, len,
>+                       pgoff, flags);
>
>Thoughts?
>-- 
>Mike Kravetz
>
I agree with you. It's clever to switch the if/else order. If there is a
case where mm->get_unmapped_area() is neither arch_get_unmapped_area() nor
arch_get_unmapped_area_topdown(), it is indeed more appropriate to make the
bottomup routine the default behavior.

May I put this code and comment you show above into patch v6 and add 
"Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>" to it?
-- 
Shijie Hu

>
>>  }
>>  #endif
>>  
>> 

