* [PATCH 0/4] Cleanup and fixup for hugetlb
@ 2021-04-02  9:32 Miaohe Lin
  2021-04-02  9:32 ` [PATCH 1/4] mm/hugetlb: remove redundant VM_BUG_ON() in region_add() Miaohe Lin
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Miaohe Lin @ 2021-04-02  9:32 UTC (permalink / raw)
  To: akpm, mike.kravetz
  Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm, linmiaohe

Hi all,
This series contains cleanups to remove a redundant VM_BUG_ON() and
simplify the return code of __vma_reservation_common(). It also fixes a
potentially wrong gbl_reserve value and handles the error case in
hugetlb_fix_reserve_counts(). More details can be found in the
respective changelogs. Thanks!

Miaohe Lin (4):
  mm/hugetlb: remove redundant VM_BUG_ON() in region_add()
  mm/hugetlb: simplify the return code of __vma_reservation_common()
  mm/hugetlb: fix potential wrong gbl_reserve value for
    hugetlb_acct_memory()
  mm/hugetlb: handle the error case in hugetlb_fix_reserve_counts()

 mm/hugetlb.c | 18 +++++++++++++-----
 1 file changed, 13 insertions(+), 5 deletions(-)

-- 
2.19.1



* [PATCH 1/4] mm/hugetlb: remove redundant VM_BUG_ON() in region_add()
  2021-04-02  9:32 [PATCH 0/4] Cleanup and fixup for hugetlb Miaohe Lin
@ 2021-04-02  9:32 ` Miaohe Lin
  2021-04-07  0:16   ` Mike Kravetz
  2021-04-02  9:32 ` [PATCH 2/4] mm/hugetlb: simplify the return code of __vma_reservation_common() Miaohe Lin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2021-04-02  9:32 UTC (permalink / raw)
  To: akpm, mike.kravetz
  Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm, linmiaohe

The same VM_BUG_ON() check is already done in the callee. Remove this extra
one to simplify the code slightly.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 1 -
 1 file changed, 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c22111f3da20..a03a50b7c410 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -556,7 +556,6 @@ static long region_add(struct resv_map *resv, long f, long t,
 	resv->adds_in_progress -= in_regions_needed;
 
 	spin_unlock(&resv->lock);
-	VM_BUG_ON(add < 0);
 	return add;
 }
 
-- 
2.19.1



* [PATCH 2/4] mm/hugetlb: simplify the return code of __vma_reservation_common()
  2021-04-02  9:32 [PATCH 0/4] Cleanup and fixup for hugetlb Miaohe Lin
  2021-04-02  9:32 ` [PATCH 1/4] mm/hugetlb: remove redundant VM_BUG_ON() in region_add() Miaohe Lin
@ 2021-04-02  9:32 ` Miaohe Lin
  2021-04-07  0:53   ` Mike Kravetz
  2021-04-02  9:32 ` [PATCH 3/4] mm/hugetlb: fix potential wrong gbl_reserve value for hugetlb_acct_memory() Miaohe Lin
  2021-04-02  9:32 ` [PATCH 4/4] mm/hugetlb: handle the error case in hugetlb_fix_reserve_counts() Miaohe Lin
  3 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2021-04-02  9:32 UTC (permalink / raw)
  To: akpm, mike.kravetz
  Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm, linmiaohe

It's guaranteed that the vma is associated with a resv_map, i.e. either
VM_MAYSHARE or HPAGE_RESV_OWNER is set, when the code reaches here, or we
would have returned via the !resv check above. So ret must be less than 0
in the 'else' case. Simplify the return code to make this clear.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a03a50b7c410..b7864abded3d 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2183,7 +2183,7 @@ static long __vma_reservation_common(struct hstate *h,
 			return 1;
 	}
 	else
-		return ret < 0 ? ret : 0;
+		return ret;
 }
 
 static long vma_needs_reservation(struct hstate *h,
-- 
2.19.1



* [PATCH 3/4] mm/hugetlb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()
  2021-04-02  9:32 [PATCH 0/4] Cleanup and fixup for hugetlb Miaohe Lin
  2021-04-02  9:32 ` [PATCH 1/4] mm/hugetlb: remove redundant VM_BUG_ON() in region_add() Miaohe Lin
  2021-04-02  9:32 ` [PATCH 2/4] mm/hugetlb: simplify the return code of __vma_reservation_common() Miaohe Lin
@ 2021-04-02  9:32 ` Miaohe Lin
  2021-04-07  2:49   ` Mike Kravetz
  2021-04-02  9:32 ` [PATCH 4/4] mm/hugetlb: handle the error case in hugetlb_fix_reserve_counts() Miaohe Lin
  3 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2021-04-02  9:32 UTC (permalink / raw)
  To: akpm, mike.kravetz
  Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm, linmiaohe

The resv_map could be NULL since this routine can be called in the evict
inode path for all hugetlbfs inodes. In that case chg would be 0 and
chg - freed would go negative, which is unexpected for
hugepage_subpool_put_pages() and hugetlb_acct_memory().

Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 4 +++-
 1 file changed, 3 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index b7864abded3d..bdff8d23803f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5413,6 +5413,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 	long chg = 0;
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	long gbl_reserve;
+	long delta;
 
 	/*
 	 * Since this routine can be called in the evict inode path for all
@@ -5437,7 +5438,8 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 	 * If the subpool has a minimum size, the number of global
 	 * reservations to be released may be adjusted.
 	 */
-	gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed));
+	delta = chg > 0 ? chg - freed : freed;
+	gbl_reserve = hugepage_subpool_put_pages(spool, delta);
 	hugetlb_acct_memory(h, -gbl_reserve);
 
 	return 0;
-- 
2.19.1



* [PATCH 4/4] mm/hugetlb: handle the error case in hugetlb_fix_reserve_counts()
  2021-04-02  9:32 [PATCH 0/4] Cleanup and fixup for hugetlb Miaohe Lin
                   ` (2 preceding siblings ...)
  2021-04-02  9:32 ` [PATCH 3/4] mm/hugetlb: fix potential wrong gbl_reserve value for hugetlb_acct_memory() Miaohe Lin
@ 2021-04-02  9:32 ` Miaohe Lin
  2021-04-08 23:25   ` Mike Kravetz
  3 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2021-04-02  9:32 UTC (permalink / raw)
  To: akpm, mike.kravetz
  Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm, linmiaohe

A rare out of memory error can prevent removal of the reserve map region
for a page. hugetlb_fix_reserve_counts() handles this rare case to avoid
being left with incorrect counts. Unfortunately, hugepage_subpool_get_pages()
and hugetlb_acct_memory() could fail as well. We should correctly handle
these error cases.

Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/hugetlb.c | 11 +++++++++--
 1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index bdff8d23803f..ca5464ed04b7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -745,13 +745,20 @@ void hugetlb_fix_reserve_counts(struct inode *inode)
 {
 	struct hugepage_subpool *spool = subpool_inode(inode);
 	long rsv_adjust;
+	bool reserved = false;
 
 	rsv_adjust = hugepage_subpool_get_pages(spool, 1);
-	if (rsv_adjust) {
+	if (rsv_adjust > 0) {
 		struct hstate *h = hstate_inode(inode);
 
-		hugetlb_acct_memory(h, 1);
+		if (!hugetlb_acct_memory(h, 1))
+			reserved = true;
+	} else if (!rsv_adjust) {
+		reserved = true;
 	}
+
+	if (!reserved)
+		pr_warn("hugetlb: fix reserve count failed\n");
 }
 
 /*
-- 
2.19.1



* Re: [PATCH 1/4] mm/hugetlb: remove redundant VM_BUG_ON() in region_add()
  2021-04-02  9:32 ` [PATCH 1/4] mm/hugetlb: remove redundant VM_BUG_ON() in region_add() Miaohe Lin
@ 2021-04-07  0:16   ` Mike Kravetz
  0 siblings, 0 replies; 27+ messages in thread
From: Mike Kravetz @ 2021-04-07  0:16 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm

On 4/2/21 2:32 AM, Miaohe Lin wrote:
> The same VM_BUG_ON() check is already done in the callee. Remove this extra
> one to simplify the code slightly.
> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks,
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

-- 
Mike Kravetz

> ---
>  mm/hugetlb.c | 1 -
>  1 file changed, 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c22111f3da20..a03a50b7c410 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -556,7 +556,6 @@ static long region_add(struct resv_map *resv, long f, long t,
>  	resv->adds_in_progress -= in_regions_needed;
>  
>  	spin_unlock(&resv->lock);
> -	VM_BUG_ON(add < 0);
>  	return add;
>  }
>  
> 


* Re: [PATCH 2/4] mm/hugetlb: simplify the return code of __vma_reservation_common()
  2021-04-02  9:32 ` [PATCH 2/4] mm/hugetlb: simplify the return code of __vma_reservation_common() Miaohe Lin
@ 2021-04-07  0:53   ` Mike Kravetz
  2021-04-07  2:05     ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Kravetz @ 2021-04-07  0:53 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm

On 4/2/21 2:32 AM, Miaohe Lin wrote:
> It's guaranteed that the vma is associated with a resv_map, i.e. either
> VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
> have returned via !resv check above. So ret must be less than 0 in the
> 'else' case. Simplify the return code to make this clear.

I believe we still need that ternary operator in the return statement.
Why?

There are two basic types of mappings to be concerned with:
shared and private.
For private mappings, a task can 'own' the mapping as indicated by
HPAGE_RESV_OWNER.  Or, it may not own the mapping.  The most common way
to create a non-owner private mapping is to have a task with a private
mapping fork.  The parent process will have HPAGE_RESV_OWNER set, the
child process will not.  The idea is that since the child has a COW copy
of the mapping it should not consume reservations made by the parent.
Only the parent (HPAGE_RESV_OWNER) is allowed to consume the
reservations.
Hope that makes sense?
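If it helps, here is a rough userspace illustration of that scenario (an
untested sketch; it assumes 2MB default huge pages are configured and
available, and is not something any of these patches depend on):

	#define _GNU_SOURCE
	#include <stdio.h>
	#include <unistd.h>
	#include <sys/mman.h>

	int main(void)
	{
		size_t len = 2UL * 1024 * 1024;	/* assumed huge page size */
		char *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);

		if (p == MAP_FAILED) {
			perror("mmap");
			return 1;
		}

		/* The parent vma is the HPAGE_RESV_OWNER; this fault consumes
		 * the reservation made at mmap() time. */
		p[0] = 1;

		if (fork() == 0) {
			/* The child's vma had its reserve info cleared at
			 * fork, so this COW write needs a fresh huge page
			 * with no reservation behind it and may fail under
			 * memory pressure. */
			p[0] = 2;
			_exit(0);
		}
		return 0;
	}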

> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/hugetlb.c | 2 +-
>  1 file changed, 1 insertion(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a03a50b7c410..b7864abded3d 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2183,7 +2183,7 @@ static long __vma_reservation_common(struct hstate *h,
>  			return 1;
>  	}
>  	else

This else also handles the case !HPAGE_RESV_OWNER.  In this case, we
never want to indicate reservations are available.  The ternary makes
sure a positive value is never returned.

-- 
Mike Kravetz

> -		return ret < 0 ? ret : 0;
> +		return ret;
>  }
>  
>  static long vma_needs_reservation(struct hstate *h,
> 


* Re: [PATCH 2/4] mm/hugetlb: simplify the return code of __vma_reservation_common()
  2021-04-07  0:53   ` Mike Kravetz
@ 2021-04-07  2:05     ` Miaohe Lin
  2021-04-07  2:37       ` Mike Kravetz
  0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2021-04-07  2:05 UTC (permalink / raw)
  To: Mike Kravetz, akpm; +Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm

Hi:
On 2021/4/7 8:53, Mike Kravetz wrote:
> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>> It's guaranteed that the vma is associated with a resv_map, i.e. either
>> VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
>> have returned via !resv check above. So ret must be less than 0 in the
>> 'else' case. Simplify the return code to make this clear.
> 
> I believe we still neeed that ternary operator in the return statement.
> Why?
> 
> There are two basic types of mappings to be concerned with:
> shared and private.
> For private mappings, a task can 'own' the mapping as indicated by
> HPAGE_RESV_OWNER.  Or, it may not own the mapping.  The most common way
> to create a non-owner private mapping is to have a task with a private
> mapping fork.  The parent process will have HPAGE_RESV_OWNER set, the
> child process will not.  The idea is that since the child has a COW copy
> of the mapping it should not consume reservations made by the parent.

The child process will not have HPAGE_RESV_OWNER set because at fork time, we do:
		/*
		 * Clear hugetlb-related page reserves for children. This only
		 * affects MAP_PRIVATE mappings. Faults generated by the child
		 * are not guaranteed to succeed, even if read-only
		 */
		if (is_vm_hugetlb_page(tmp))
			reset_vma_resv_huge_pages(tmp);
i.e. we have vma->vm_private_data = (void *)0; for the child process, and
vma_resv_map() will return NULL in this case.
Or have I missed something?

> Only the parent (HPAGE_RESV_OWNER) is allowed to consume the
> reservations.
> Hope that makens sense?
> 
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>  mm/hugetlb.c | 2 +-
>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index a03a50b7c410..b7864abded3d 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -2183,7 +2183,7 @@ static long __vma_reservation_common(struct hstate *h,
>>  			return 1;
>>  	}
>>  	else
> 
> This else also handles the case !HPAGE_RESV_OWNER.  In this case, we

IMO, for the case !HPAGE_RESV_OWNER, we won't reach here. What do you think?

> never want to indicate reservations are available.  The ternary makes
> sure a positive value is never returned.
> 

Many thanks for review and reply! :)


* Re: [PATCH 2/4] mm/hugetlb: simplify the return code of __vma_reservation_common()
  2021-04-07  2:05     ` Miaohe Lin
@ 2021-04-07  2:37       ` Mike Kravetz
  2021-04-07  3:09         ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Kravetz @ 2021-04-07  2:37 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm

On 4/6/21 7:05 PM, Miaohe Lin wrote:
> Hi:
> On 2021/4/7 8:53, Mike Kravetz wrote:
>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>> It's guaranteed that the vma is associated with a resv_map, i.e. either
>>> VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
>>> have returned via !resv check above. So ret must be less than 0 in the
>>> 'else' case. Simplify the return code to make this clear.
>>
>> I believe we still neeed that ternary operator in the return statement.
>> Why?
>>
>> There are two basic types of mappings to be concerned with:
>> shared and private.
>> For private mappings, a task can 'own' the mapping as indicated by
>> HPAGE_RESV_OWNER.  Or, it may not own the mapping.  The most common way
>> to create a non-owner private mapping is to have a task with a private
>> mapping fork.  The parent process will have HPAGE_RESV_OWNER set, the
>> child process will not.  The idea is that since the child has a COW copy
>> of the mapping it should not consume reservations made by the parent.
> 
> The child process will not have HPAGE_RESV_OWNER set because at fork time, we do:
> 		/*
> 		 * Clear hugetlb-related page reserves for children. This only
> 		 * affects MAP_PRIVATE mappings. Faults generated by the child
> 		 * are not guaranteed to succeed, even if read-only
> 		 */
> 		if (is_vm_hugetlb_page(tmp))
> 			reset_vma_resv_huge_pages(tmp);
> i.e. we have vma->vm_private_data = (void *)0; for child process and vma_resv_map() will
> return NULL in this case.
> Or am I missed something?
> 
>> Only the parent (HPAGE_RESV_OWNER) is allowed to consume the
>> reservations.
>> Hope that makens sense?
>>
>>>
>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>> ---
>>>  mm/hugetlb.c | 2 +-
>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>
>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>> index a03a50b7c410..b7864abded3d 100644
>>> --- a/mm/hugetlb.c
>>> +++ b/mm/hugetlb.c
>>> @@ -2183,7 +2183,7 @@ static long __vma_reservation_common(struct hstate *h,
>>>  			return 1;
>>>  	}
>>>  	else
>>
>> This else also handles the case !HPAGE_RESV_OWNER.  In this case, we
> 
> IMO, for the case !HPAGE_RESV_OWNER, we won't reach here. What do you think?
> 

I think you are correct.

However, if this is true we should be able to simplify the code even
further.  There is no need to check for HPAGE_RESV_OWNER because we know
it must be set.  Correct?  If so, the code could look something like:

	if (vma->vm_flags & VM_MAYSHARE)
		return ret;

	/*
	 * We know this is a private mapping with HPAGE_RESV_OWNER.
	 * ... (add that existing comment here)
	 */

	if (ret > 0)
		return 0;
	if (ret == 0)
		return 1;
	return ret;

-- 
Mike Kravetz


* Re: [PATCH 3/4] mm/hugetlb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()
  2021-04-02  9:32 ` [PATCH 3/4] mm/hugetlb: fix potential wrong gbl_reserve value for hugetlb_acct_memory() Miaohe Lin
@ 2021-04-07  2:49   ` Mike Kravetz
  2021-04-07  7:24     ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Kravetz @ 2021-04-07  2:49 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm

On 4/2/21 2:32 AM, Miaohe Lin wrote:
> The resv_map could be NULL since this routine can be called in the evict
> inode path for all hugetlbfs inodes. So we could have chg = 0 and this
> would result in a negative value when chg - freed. This is unexpected for
> hugepage_subpool_put_pages() and hugetlb_acct_memory().

I am not sure if this is possible.

It is true that resv_map could be NULL.  However, I believe resv map
can only be NULL for inodes that are not regular or link inodes.  This
is the inode creation code in hugetlbfs_get_inode().

       /*
         * Reserve maps are only needed for inodes that can have associated
         * page allocations.
         */
        if (S_ISREG(mode) || S_ISLNK(mode)) {
                resv_map = resv_map_alloc();
                if (!resv_map)
                        return NULL;
        }

If resv_map is NULL, then no hugetlb pages can be allocated/associated
with the file.  As a result, remove_inode_hugepages will never find any
huge pages associated with the inode and the passed value 'freed' will
always be zero.
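
In terms of the local variables in hugetlb_unreserve_pages(), the values
would end up roughly like this (just a sketch to show the arithmetic, not
the exact code):

	/* inode without a resv_map, i.e. not S_IFREG/S_IFLNK */
	chg   = 0;	/* no reserve map entries to delete */
	freed = 0;	/* remove_inode_hugepages() finds nothing */
	gbl_reserve = hugepage_subpool_put_pages(spool, chg - freed); /* 0 */
	hugetlb_acct_memory(h, -gbl_reserve);	/* effectively a no-op */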

Does that sound correct?

-- 
Mike Kravetz

> 
> Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/hugetlb.c | 4 +++-
>  1 file changed, 3 insertions(+), 1 deletion(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index b7864abded3d..bdff8d23803f 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5413,6 +5413,7 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
>  	long chg = 0;
>  	struct hugepage_subpool *spool = subpool_inode(inode);
>  	long gbl_reserve;
> +	long delta;
>  
>  	/*
>  	 * Since this routine can be called in the evict inode path for all
> @@ -5437,7 +5438,8 @@ long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
>  	 * If the subpool has a minimum size, the number of global
>  	 * reservations to be released may be adjusted.
>  	 */
> -	gbl_reserve = hugepage_subpool_put_pages(spool, (chg - freed));
> +	delta = chg > 0 ? chg - freed : freed;
> +	gbl_reserve = hugepage_subpool_put_pages(spool, delta);
>  	hugetlb_acct_memory(h, -gbl_reserve);
>  
>  	return 0;
> 


* Re: [PATCH 2/4] mm/hugetlb: simplify the return code of __vma_reservation_common()
  2021-04-07  2:37       ` Mike Kravetz
@ 2021-04-07  3:09         ` Miaohe Lin
  2021-04-07 21:23           ` Mike Kravetz
  0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2021-04-07  3:09 UTC (permalink / raw)
  To: Mike Kravetz, akpm; +Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm

On 2021/4/7 10:37, Mike Kravetz wrote:
> On 4/6/21 7:05 PM, Miaohe Lin wrote:
>> Hi:
>> On 2021/4/7 8:53, Mike Kravetz wrote:
>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>> It's guaranteed that the vma is associated with a resv_map, i.e. either
>>>> VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
>>>> have returned via !resv check above. So ret must be less than 0 in the
>>>> 'else' case. Simplify the return code to make this clear.
>>>
>>> I believe we still neeed that ternary operator in the return statement.
>>> Why?
>>>
>>> There are two basic types of mappings to be concerned with:
>>> shared and private.
>>> For private mappings, a task can 'own' the mapping as indicated by
>>> HPAGE_RESV_OWNER.  Or, it may not own the mapping.  The most common way
>>> to create a non-owner private mapping is to have a task with a private
>>> mapping fork.  The parent process will have HPAGE_RESV_OWNER set, the
>>> child process will not.  The idea is that since the child has a COW copy
>>> of the mapping it should not consume reservations made by the parent.
>>
>> The child process will not have HPAGE_RESV_OWNER set because at fork time, we do:
>> 		/*
>> 		 * Clear hugetlb-related page reserves for children. This only
>> 		 * affects MAP_PRIVATE mappings. Faults generated by the child
>> 		 * are not guaranteed to succeed, even if read-only
>> 		 */
>> 		if (is_vm_hugetlb_page(tmp))
>> 			reset_vma_resv_huge_pages(tmp);
>> i.e. we have vma->vm_private_data = (void *)0; for child process and vma_resv_map() will
>> return NULL in this case.
>> Or am I missed something?
>>
>>> Only the parent (HPAGE_RESV_OWNER) is allowed to consume the
>>> reservations.
>>> Hope that makens sense?
>>>
>>>>
>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>> ---
>>>>  mm/hugetlb.c | 2 +-
>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>
>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>> index a03a50b7c410..b7864abded3d 100644
>>>> --- a/mm/hugetlb.c
>>>> +++ b/mm/hugetlb.c
>>>> @@ -2183,7 +2183,7 @@ static long __vma_reservation_common(struct hstate *h,
>>>>  			return 1;
>>>>  	}
>>>>  	else
>>>
>>> This else also handles the case !HPAGE_RESV_OWNER.  In this case, we
>>
>> IMO, for the case !HPAGE_RESV_OWNER, we won't reach here. What do you think?
>>
> 
> I think you are correct.
> 
> However, if this is true we should be able to simply the code even
> further.  There is no need to check for HPAGE_RESV_OWNER because we know
> it must be set.  Correct?  If so, the code could look something like:
> 
> 	if (vma->vm_flags & VM_MAYSHARE)
> 		return ret;
> 
> 	/* We know private mapping with HPAGE_RESV_OWNER */
> 	 * ...						 *
> 	 * Add that existing comment                     */
> 
> 	if (ret > 0)
> 		return 0;
> 	if (ret == 0)
> 		return 1;
> 	return ret;
> 

Many thanks for the good suggestion! Do you mean something like this?

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index a03a50b7c410..9b4c05699a90 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -2163,27 +2163,26 @@ static long __vma_reservation_common(struct hstate *h,

        if (vma->vm_flags & VM_MAYSHARE)
                return ret;
-       else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) && ret >= 0) {
-               /*
-                * In most cases, reserves always exist for private mappings.
-                * However, a file associated with mapping could have been
-                * hole punched or truncated after reserves were consumed.
-                * As subsequent fault on such a range will not use reserves.
-                * Subtle - The reserve map for private mappings has the
-                * opposite meaning than that of shared mappings.  If NO
-                * entry is in the reserve map, it means a reservation exists.
-                * If an entry exists in the reserve map, it means the
-                * reservation has already been consumed.  As a result, the
-                * return value of this routine is the opposite of the
-                * value returned from reserve map manipulation routines above.
-                */
-               if (ret)
-                       return 0;
-               else
-                       return 1;
-       }
-       else
-               return ret < 0 ? ret : 0;
+       /*
+        * We know private mapping must have HPAGE_RESV_OWNER set.
+        *
+        * In most cases, reserves always exist for private mappings.
+        * However, a file associated with mapping could have been
+        * hole punched or truncated after reserves were consumed.
+        * As subsequent fault on such a range will not use reserves.
+        * Subtle - The reserve map for private mappings has the
+        * opposite meaning than that of shared mappings.  If NO
+        * entry is in the reserve map, it means a reservation exists.
+        * If an entry exists in the reserve map, it means the
+        * reservation has already been consumed.  As a result, the
+        * return value of this routine is the opposite of the
+        * value returned from reserve map manipulation routines above.
+        */
+       if (ret > 0)
+               return 0;
+       if (ret == 0)
+               return 1;
+       return ret;
 }


* Re: [PATCH 3/4] mm/hugetlb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()
  2021-04-07  2:49   ` Mike Kravetz
@ 2021-04-07  7:24     ` Miaohe Lin
  2021-04-07 20:53       ` Mike Kravetz
  0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2021-04-07  7:24 UTC (permalink / raw)
  To: Mike Kravetz, akpm; +Cc: n-horiguchi, linux-kernel, linux-mm

Hi:
On 2021/4/7 10:49, Mike Kravetz wrote:
> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>> The resv_map could be NULL since this routine can be called in the evict
>> inode path for all hugetlbfs inodes. So we could have chg = 0 and this
>> would result in a negative value when chg - freed. This is unexpected for
>> hugepage_subpool_put_pages() and hugetlb_acct_memory().
> 
> I am not sure if this is possible.
> 
> It is true that resv_map could be NULL.  However, I believe resv map
> can only be NULL for inodes that are not regular or link inodes.  This
> is the inode creation code in hugetlbfs_get_inode().
> 
>        /*
>          * Reserve maps are only needed for inodes that can have associated
>          * page allocations.
>          */
>         if (S_ISREG(mode) || S_ISLNK(mode)) {
>                 resv_map = resv_map_alloc();
>                 if (!resv_map)
>                         return NULL;
>         }
> 

Agree.

> If resv_map is NULL, then no hugetlb pages can be allocated/associated
> with the file.  As a result, remove_inode_hugepages will never find any
> huge pages associated with the inode and the passed value 'freed' will
> always be zero.
> 

But I am confused now. AFAICS, remove_inode_hugepages() searches the address_space of
the inode to remove the hugepages, but does not care whether the inode has an associated
resv_map. How are hugetlb pages prevented from being allocated/associated with the file
when resv_map is NULL? Could you please explain this a bit more?

Many thanks.

> Does that sound correct?
>


* Re: [PATCH 3/4] mm/hugetlb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()
  2021-04-07  7:24     ` Miaohe Lin
@ 2021-04-07 20:53       ` Mike Kravetz
  2021-04-08  3:24         ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Kravetz @ 2021-04-07 20:53 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: n-horiguchi, linux-kernel, linux-mm

On 4/7/21 12:24 AM, Miaohe Lin wrote:
> Hi:
> On 2021/4/7 10:49, Mike Kravetz wrote:
>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>> The resv_map could be NULL since this routine can be called in the evict
>>> inode path for all hugetlbfs inodes. So we could have chg = 0 and this
>>> would result in a negative value when chg - freed. This is unexpected for
>>> hugepage_subpool_put_pages() and hugetlb_acct_memory().
>>
>> I am not sure if this is possible.
>>
>> It is true that resv_map could be NULL.  However, I believe resv map
>> can only be NULL for inodes that are not regular or link inodes.  This
>> is the inode creation code in hugetlbfs_get_inode().
>>
>>        /*
>>          * Reserve maps are only needed for inodes that can have associated
>>          * page allocations.
>>          */
>>         if (S_ISREG(mode) || S_ISLNK(mode)) {
>>                 resv_map = resv_map_alloc();
>>                 if (!resv_map)
>>                         return NULL;
>>         }
>>
> 
> Agree.
> 
>> If resv_map is NULL, then no hugetlb pages can be allocated/associated
>> with the file.  As a result, remove_inode_hugepages will never find any
>> huge pages associated with the inode and the passed value 'freed' will
>> always be zero.
>>
> 
> But I am confused now. AFAICS, remove_inode_hugepages() searches the address_space of
> the inode to remove the hugepages while does not care if inode has associated resv_map.
> How does it prevent hugetlb pages from being allocated/associated with the file if
> resv_map is NULL? Could you please explain this more?
> 

Recall that there are only two ways to get huge pages associated with
a hugetlbfs file: fallocate and mmap/write fault.  Directly writing to
hugetlbfs files is not supported.

If you take a closer look at hugetlbfs_get_inode, it has that code to
allocate the resv map mentioned above as well as the following:

		switch (mode & S_IFMT) {
		default:
			init_special_inode(inode, mode, dev);
			break;
		case S_IFREG:
			inode->i_op = &hugetlbfs_inode_operations;
			inode->i_fop = &hugetlbfs_file_operations;
			break;
		case S_IFDIR:
			inode->i_op = &hugetlbfs_dir_inode_operations;
			inode->i_fop = &simple_dir_operations;

			/* directory inodes start off with i_nlink == 2 (for "." entry) */
			inc_nlink(inode);
			break;
		case S_IFLNK:
			inode->i_op = &page_symlink_inode_operations;
			inode_nohighmem(inode);
			break;
		}

Notice that only S_IFREG inodes will have i_fop == &hugetlbfs_file_operations.
hugetlbfs_file_operations contains the hugetlbfs-specific mmap and fallocate
routines.  Hence, only files with S_IFREG inodes can potentially have
associated huge pages.  S_IFLNK inodes can as well via file linking.
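
For reference, those two entry points are (abridged; the other members and
their exact ordering are omitted here):

	const struct file_operations hugetlbfs_file_operations = {
		.mmap		= hugetlbfs_file_mmap,
		.fallocate	= hugetlbfs_fallocate,
		/* ... */
	};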

If an inode is not S_ISREG(mode) || S_ISLNK(mode), then it will not have
a resv_map.  In addition, it will not have hugetlbfs_file_operations and
can not have associated huge pages.

I looked at this closely when adding commits
58b6e5e8f1ad hugetlbfs: fix memory leak for resv_map
f27a5136f70a hugetlbfs: always use address space in inode for resv_map pointer

I may not be remembering all of the details correctly.  Commit f27a5136f70a
added the comment that resv_map could be NULL to hugetlb_unreserve_pages.
-- 
Mike Kravetz


* Re: [PATCH 2/4] mm/hugetlb: simplify the return code of __vma_reservation_common()
  2021-04-07  3:09         ` Miaohe Lin
@ 2021-04-07 21:23           ` Mike Kravetz
  2021-04-08  2:44             ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Kravetz @ 2021-04-07 21:23 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm

On 4/6/21 8:09 PM, Miaohe Lin wrote:
> On 2021/4/7 10:37, Mike Kravetz wrote:
>> On 4/6/21 7:05 PM, Miaohe Lin wrote:
>>> Hi:
>>> On 2021/4/7 8:53, Mike Kravetz wrote:
>>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>>> It's guaranteed that the vma is associated with a resv_map, i.e. either
>>>>> VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
>>>>> have returned via !resv check above. So ret must be less than 0 in the
>>>>> 'else' case. Simplify the return code to make this clear.
>>>>
>>>> I believe we still neeed that ternary operator in the return statement.
>>>> Why?
>>>>
>>>> There are two basic types of mappings to be concerned with:
>>>> shared and private.
>>>> For private mappings, a task can 'own' the mapping as indicated by
>>>> HPAGE_RESV_OWNER.  Or, it may not own the mapping.  The most common way
>>>> to create a non-owner private mapping is to have a task with a private
>>>> mapping fork.  The parent process will have HPAGE_RESV_OWNER set, the
>>>> child process will not.  The idea is that since the child has a COW copy
>>>> of the mapping it should not consume reservations made by the parent.
>>>
>>> The child process will not have HPAGE_RESV_OWNER set because at fork time, we do:
>>> 		/*
>>> 		 * Clear hugetlb-related page reserves for children. This only
>>> 		 * affects MAP_PRIVATE mappings. Faults generated by the child
>>> 		 * are not guaranteed to succeed, even if read-only
>>> 		 */
>>> 		if (is_vm_hugetlb_page(tmp))
>>> 			reset_vma_resv_huge_pages(tmp);
>>> i.e. we have vma->vm_private_data = (void *)0; for child process and vma_resv_map() will
>>> return NULL in this case.
>>> Or am I missed something?
>>>
>>>> Only the parent (HPAGE_RESV_OWNER) is allowed to consume the
>>>> reservations.
>>>> Hope that makens sense?
>>>>
>>>>>
>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>> ---
>>>>>  mm/hugetlb.c | 2 +-
>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>
>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>> index a03a50b7c410..b7864abded3d 100644
>>>>> --- a/mm/hugetlb.c
>>>>> +++ b/mm/hugetlb.c
>>>>> @@ -2183,7 +2183,7 @@ static long __vma_reservation_common(struct hstate *h,
>>>>>  			return 1;
>>>>>  	}
>>>>>  	else
>>>>
>>>> This else also handles the case !HPAGE_RESV_OWNER.  In this case, we
>>>
>>> IMO, for the case !HPAGE_RESV_OWNER, we won't reach here. What do you think?
>>>
>>
>> I think you are correct.
>>
>> However, if this is true we should be able to simply the code even
>> further.  There is no need to check for HPAGE_RESV_OWNER because we know
>> it must be set.  Correct?  If so, the code could look something like:
>>
>> 	if (vma->vm_flags & VM_MAYSHARE)
>> 		return ret;
>>
>> 	/* We know private mapping with HPAGE_RESV_OWNER */
>> 	 * ...						 *
>> 	 * Add that existing comment                     */
>>
>> 	if (ret > 0)
>> 		return 0;
>> 	if (ret == 0)
>> 		return 1;
>> 	return ret;
>>
> 
> Many thanks for good suggestion! What do you mean is this ?

I think the below changes would work fine.

However, this patch/discussion has made me ask the question.  Do we need
the HPAGE_RESV_OWNER flag?  Is the following true?
!(vm_flags & VM_MAYSHARE) && vma_resv_map()  ===> HPAGE_RESV_OWNER
!(vm_flags & VM_MAYSHARE) && !vma_resv_map() ===> !HPAGE_RESV_OWNER
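IOW, something like this hypothetical helper (not something I am proposing
to add, just spelling out the equivalence; assume it would sit next to
vma_resv_map() in mm/hugetlb.c):

	static bool vma_owns_private_resv(struct vm_area_struct *vma)
	{
		/* child vmas have vm_private_data cleared at fork time */
		return !(vma->vm_flags & VM_MAYSHARE) && vma_resv_map(vma);
	}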

I am not suggesting we eliminate the flag and make corresponding
changes.  Just curious if you believe we 'could' remove the flag and
depend on the above conditions.

One reason for NOT removing the flag is that the flag itself and its
supporting code and comments help explain what happens with hugetlb
reserves for COW mappings.  That code is hard to understand, and the
existing code and comments around HPAGE_RESV_OWNER help with
understanding.

-- 
Mike Kravetz

> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index a03a50b7c410..9b4c05699a90 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -2163,27 +2163,26 @@ static long __vma_reservation_common(struct hstate *h,
> 
>         if (vma->vm_flags & VM_MAYSHARE)
>                 return ret;
> -       else if (is_vma_resv_set(vma, HPAGE_RESV_OWNER) && ret >= 0) {
> -               /*
> -                * In most cases, reserves always exist for private mappings.
> -                * However, a file associated with mapping could have been
> -                * hole punched or truncated after reserves were consumed.
> -                * As subsequent fault on such a range will not use reserves.
> -                * Subtle - The reserve map for private mappings has the
> -                * opposite meaning than that of shared mappings.  If NO
> -                * entry is in the reserve map, it means a reservation exists.
> -                * If an entry exists in the reserve map, it means the
> -                * reservation has already been consumed.  As a result, the
> -                * return value of this routine is the opposite of the
> -                * value returned from reserve map manipulation routines above.
> -                */
> -               if (ret)
> -                       return 0;
> -               else
> -                       return 1;
> -       }
> -       else
> -               return ret < 0 ? ret : 0;
> +       /*
> +        * We know private mapping must have HPAGE_RESV_OWNER set.
> +        *
> +        * In most cases, reserves always exist for private mappings.
> +        * However, a file associated with mapping could have been
> +        * hole punched or truncated after reserves were consumed.
> +        * As subsequent fault on such a range will not use reserves.
> +        * Subtle - The reserve map for private mappings has the
> +        * opposite meaning than that of shared mappings.  If NO
> +        * entry is in the reserve map, it means a reservation exists.
> +        * If an entry exists in the reserve map, it means the
> +        * reservation has already been consumed.  As a result, the
> +        * return value of this routine is the opposite of the
> +        * value returned from reserve map manipulation routines above.
> +        */
> +       if (ret > 0)
> +               return 0;
> +       if (ret == 0)
> +               return 1;
> +       return ret;
>  }
> 


* Re: [PATCH 2/4] mm/hugetlb: simplify the return code of __vma_reservation_common()
  2021-04-07 21:23           ` Mike Kravetz
@ 2021-04-08  2:44             ` Miaohe Lin
  2021-04-08 22:40               ` Mike Kravetz
  0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2021-04-08  2:44 UTC (permalink / raw)
  To: Mike Kravetz, akpm; +Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm

On 2021/4/8 5:23, Mike Kravetz wrote:
> On 4/6/21 8:09 PM, Miaohe Lin wrote:
>> On 2021/4/7 10:37, Mike Kravetz wrote:
>>> On 4/6/21 7:05 PM, Miaohe Lin wrote:
>>>> Hi:
>>>> On 2021/4/7 8:53, Mike Kravetz wrote:
>>>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>>>> It's guaranteed that the vma is associated with a resv_map, i.e. either
>>>>>> VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
>>>>>> have returned via !resv check above. So ret must be less than 0 in the
>>>>>> 'else' case. Simplify the return code to make this clear.
>>>>>
>>>>> I believe we still neeed that ternary operator in the return statement.
>>>>> Why?
>>>>>
>>>>> There are two basic types of mappings to be concerned with:
>>>>> shared and private.
>>>>> For private mappings, a task can 'own' the mapping as indicated by
>>>>> HPAGE_RESV_OWNER.  Or, it may not own the mapping.  The most common way
>>>>> to create a non-owner private mapping is to have a task with a private
>>>>> mapping fork.  The parent process will have HPAGE_RESV_OWNER set, the
>>>>> child process will not.  The idea is that since the child has a COW copy
>>>>> of the mapping it should not consume reservations made by the parent.
>>>>
>>>> The child process will not have HPAGE_RESV_OWNER set because at fork time, we do:
>>>> 		/*
>>>> 		 * Clear hugetlb-related page reserves for children. This only
>>>> 		 * affects MAP_PRIVATE mappings. Faults generated by the child
>>>> 		 * are not guaranteed to succeed, even if read-only
>>>> 		 */
>>>> 		if (is_vm_hugetlb_page(tmp))
>>>> 			reset_vma_resv_huge_pages(tmp);
>>>> i.e. we have vma->vm_private_data = (void *)0; for child process and vma_resv_map() will
>>>> return NULL in this case.
>>>> Or am I missed something?
>>>>
>>>>> Only the parent (HPAGE_RESV_OWNER) is allowed to consume the
>>>>> reservations.
>>>>> Hope that makens sense?
>>>>>
>>>>>>
>>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>>> ---
>>>>>>  mm/hugetlb.c | 2 +-
>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>
>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>> index a03a50b7c410..b7864abded3d 100644
>>>>>> --- a/mm/hugetlb.c
>>>>>> +++ b/mm/hugetlb.c
>>>>>> @@ -2183,7 +2183,7 @@ static long __vma_reservation_common(struct hstate *h,
>>>>>>  			return 1;
>>>>>>  	}
>>>>>>  	else
>>>>>
>>>>> This else also handles the case !HPAGE_RESV_OWNER.  In this case, we
>>>>
>>>> IMO, for the case !HPAGE_RESV_OWNER, we won't reach here. What do you think?
>>>>
>>>
>>> I think you are correct.
>>>
>>> However, if this is true we should be able to simply the code even
>>> further.  There is no need to check for HPAGE_RESV_OWNER because we know
>>> it must be set.  Correct?  If so, the code could look something like:
>>>
>>> 	if (vma->vm_flags & VM_MAYSHARE)
>>> 		return ret;
>>>
>>> 	/* We know private mapping with HPAGE_RESV_OWNER */
>>> 	 * ...						 *
>>> 	 * Add that existing comment                     */
>>>
>>> 	if (ret > 0)
>>> 		return 0;
>>> 	if (ret == 0)
>>> 		return 1;
>>> 	return ret;
>>>
>>
>> Many thanks for good suggestion! What do you mean is this ?
> 
> I think the below changes would work fine.
> 
> However, this patch/discussion has made me ask the question.  Do we need
> the HPAGE_RESV_OWNER flag?  Is the followng true?
> !(vm_flags & VM_MAYSHARE) && vma_resv_map()  ===> HPAGE_RESV_OWNER
> !(vm_flags & VM_MAYSHARE) && !vma_resv_map() ===> !HPAGE_RESV_OWNER
> 

I agree with you.

HPAGE_RESV_OWNER is set in hugetlb_reserve_pages() and there's no way to clear it
in the owner process. The child process inherits neither HPAGE_RESV_OWNER nor the
resv_map. So a !HPAGE_RESV_OWNER vma knows nothing about the resv_map.

IMO, in the !(vm_flags & VM_MAYSHARE) case, we must have:
	!!vma_resv_map() == !!HPAGE_RESV_OWNER

> I am not suggesting we eliminate the flag and make corresponding
> changes.  Just curious if you believe we 'could' remove the flag and
> depend on the above conditions.
> 
> One reason for NOT removing the flag is that that flag itself and
> supporting code and commnets help explain what happens with hugetlb
> reserves for COW mappings.  That code is hard to understand and the
> existing code and coments around HPAGE_RESV_OWNER help with
> understanding.

Agree. This code took me several days to understand...

> 

Thanks.


* Re: [PATCH 3/4] mm/hugetlb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()
  2021-04-07 20:53       ` Mike Kravetz
@ 2021-04-08  3:24         ` Miaohe Lin
  2021-04-08  3:26           ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2021-04-08  3:24 UTC (permalink / raw)
  To: Mike Kravetz, akpm; +Cc: n-horiguchi, linux-kernel, linux-mm

On 2021/4/8 4:53, Mike Kravetz wrote:
> On 4/7/21 12:24 AM, Miaohe Lin wrote:
>> Hi:
>> On 2021/4/7 10:49, Mike Kravetz wrote:
>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>> The resv_map could be NULL since this routine can be called in the evict
>>>> inode path for all hugetlbfs inodes. So we could have chg = 0 and this
>>>> would result in a negative value when chg - freed. This is unexpected for
>>>> hugepage_subpool_put_pages() and hugetlb_acct_memory().
>>>
>>> I am not sure if this is possible.
>>>
>>> It is true that resv_map could be NULL.  However, I believe resv map
>>> can only be NULL for inodes that are not regular or link inodes.  This
>>> is the inode creation code in hugetlbfs_get_inode().
>>>
>>>        /*
>>>          * Reserve maps are only needed for inodes that can have associated
>>>          * page allocations.
>>>          */
>>>         if (S_ISREG(mode) || S_ISLNK(mode)) {
>>>                 resv_map = resv_map_alloc();
>>>                 if (!resv_map)
>>>                         return NULL;
>>>         }
>>>
>>
>> Agree.
>>
>>> If resv_map is NULL, then no hugetlb pages can be allocated/associated
>>> with the file.  As a result, remove_inode_hugepages will never find any
>>> huge pages associated with the inode and the passed value 'freed' will
>>> always be zero.
>>>
>>
>> But I am confused now. AFAICS, remove_inode_hugepages() searches the address_space of
>> the inode to remove the hugepages while does not care if inode has associated resv_map.
>> How does it prevent hugetlb pages from being allocated/associated with the file if
>> resv_map is NULL? Could you please explain this more?
>>
> 
> Recall that there are only two ways to get huge pages associated with
> a hugetlbfs file: fallocate and mmap/write fault.  Directly writing to
> hugetlbfs files is not supported.
> 
> If you take a closer look at hugetlbfs_get_inode, it has that code to
> allocate the resv map mentioned above as well as the following:
> 
> 		switch (mode & S_IFMT) {
> 		default:
> 			init_special_inode(inode, mode, dev);
> 			break;
> 		case S_IFREG:
> 			inode->i_op = &hugetlbfs_inode_operations;
> 			inode->i_fop = &hugetlbfs_file_operations;
> 			break;
> 		case S_IFDIR:
> 			inode->i_op = &hugetlbfs_dir_inode_operations;
> 			inode->i_fop = &simple_dir_operations;
> 
> 			/* directory inodes start off with i_nlink == 2 (for "." entry) */
> 			inc_nlink(inode);
> 			break;
> 		case S_IFLNK:
> 			inode->i_op = &page_symlink_inode_operations;
> 			inode_nohighmem(inode);
> 			break;
> 		}
> 
> Notice that only S_IFREG inodes will have i_fop == &hugetlbfs_file_operations.
> hugetlbfs_file_operations contain the hugetlbfs specific mmap and fallocate
> routines.  Hence, only files with S_IFREG inodes can potentially have
> associated huge pages.  S_IFLNK inodes can as well via file linking.
> 
> If an inode is not S_ISREG(mode) || S_ISLNK(mode), then it will not have
> a resv_map.  In addition, it will not have hugetlbfs_file_operations and
> can not have associated huge pages.
> 

Many thanks for the detailed and patient explanation! :) I think I've got the idea!

> I looked at this closely when adding commits
> 58b6e5e8f1ad hugetlbfs: fix memory leak for resv_map
> f27a5136f70a hugetlbfs: always use address space in inode for resv_map pointer
> 
> I may not be remembering all of the details correctly.  Commit f27a5136f70a
> added the comment that resv_map could be NULL to hugetlb_unreserve_pages.
> 

Since we must have freed == 0 whenever chg == 0, should we make this assumption explicit
with something like below?

WARN_ON(chg < freed);
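
For example, on top of the current code in hugetlb_unreserve_pages() it
would look roughly like this (sketch only, ignoring the delta change from
this patch for a moment):

	/*
	 * If the subpool has a minimum size, the number of global
	 * reservations to be released may be adjusted.
	 *
	 * !resv_map implies freed == 0, so chg - freed can not go negative.
	 */
	WARN_ON(chg < freed);
	gbl_reserve = hugepage_subpool_put_pages(spool, chg - freed);
	hugetlb_acct_memory(h, -gbl_reserve);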

Thanks again!


* Re: [PATCH 3/4] mm/hugetlb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()
  2021-04-08  3:24         ` Miaohe Lin
@ 2021-04-08  3:26           ` Miaohe Lin
  2021-04-08 22:53             ` Mike Kravetz
  0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2021-04-08  3:26 UTC (permalink / raw)
  To: Mike Kravetz, akpm; +Cc: n-horiguchi, linux-kernel, linux-mm

On 2021/4/8 11:24, Miaohe Lin wrote:
> On 2021/4/8 4:53, Mike Kravetz wrote:
>> On 4/7/21 12:24 AM, Miaohe Lin wrote:
>>> Hi:
>>> On 2021/4/7 10:49, Mike Kravetz wrote:
>>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>>> The resv_map could be NULL since this routine can be called in the evict
>>>>> inode path for all hugetlbfs inodes. So we could have chg = 0 and this
>>>>> would result in a negative value when chg - freed. This is unexpected for
>>>>> hugepage_subpool_put_pages() and hugetlb_acct_memory().
>>>>
>>>> I am not sure if this is possible.
>>>>
>>>> It is true that resv_map could be NULL.  However, I believe resv map
>>>> can only be NULL for inodes that are not regular or link inodes.  This
>>>> is the inode creation code in hugetlbfs_get_inode().
>>>>
>>>>        /*
>>>>          * Reserve maps are only needed for inodes that can have associated
>>>>          * page allocations.
>>>>          */
>>>>         if (S_ISREG(mode) || S_ISLNK(mode)) {
>>>>                 resv_map = resv_map_alloc();
>>>>                 if (!resv_map)
>>>>                         return NULL;
>>>>         }
>>>>
>>>
>>> Agree.
>>>
>>>> If resv_map is NULL, then no hugetlb pages can be allocated/associated
>>>> with the file.  As a result, remove_inode_hugepages will never find any
>>>> huge pages associated with the inode and the passed value 'freed' will
>>>> always be zero.
>>>>
>>>
>>> But I am confused now. AFAICS, remove_inode_hugepages() searches the address_space of
>>> the inode to remove the hugepages while does not care if inode has associated resv_map.
>>> How does it prevent hugetlb pages from being allocated/associated with the file if
>>> resv_map is NULL? Could you please explain this more?
>>>
>>
>> Recall that there are only two ways to get huge pages associated with
>> a hugetlbfs file: fallocate and mmap/write fault.  Directly writing to
>> hugetlbfs files is not supported.
>>
>> If you take a closer look at hugetlbfs_get_inode, it has that code to
>> allocate the resv map mentioned above as well as the following:
>>
>> 		switch (mode & S_IFMT) {
>> 		default:
>> 			init_special_inode(inode, mode, dev);
>> 			break;
>> 		case S_IFREG:
>> 			inode->i_op = &hugetlbfs_inode_operations;
>> 			inode->i_fop = &hugetlbfs_file_operations;
>> 			break;
>> 		case S_IFDIR:
>> 			inode->i_op = &hugetlbfs_dir_inode_operations;
>> 			inode->i_fop = &simple_dir_operations;
>>
>> 			/* directory inodes start off with i_nlink == 2 (for "." entry) */
>> 			inc_nlink(inode);
>> 			break;
>> 		case S_IFLNK:
>> 			inode->i_op = &page_symlink_inode_operations;
>> 			inode_nohighmem(inode);
>> 			break;
>> 		}
>>
>> Notice that only S_IFREG inodes will have i_fop == &hugetlbfs_file_operations.
>> hugetlbfs_file_operations contain the hugetlbfs specific mmap and fallocate
>> routines.  Hence, only files with S_IFREG inodes can potentially have
>> associated huge pages.  S_IFLNK inodes can as well via file linking.
>>
>> If an inode is not S_ISREG(mode) || S_ISLNK(mode), then it will not have
>> a resv_map.  In addition, it will not have hugetlbfs_file_operations and
>> can not have associated huge pages.
>>
> 
> Many many thanks for detailed and patient explanation! :) I think I have got the idea!
> 
>> I looked at this closely when adding commits
>> 58b6e5e8f1ad hugetlbfs: fix memory leak for resv_map
>> f27a5136f70a hugetlbfs: always use address space in inode for resv_map pointer
>>
>> I may not be remembering all of the details correctly.  Commit f27a5136f70a
>> added the comment that resv_map could be NULL to hugetlb_unreserve_pages.
>>
> 
> Since we must have freed == 0 while chg == 0. Should we make this assumption explict
> by something like below?
> 
> WARN_ON(chg < freed);
> 

Or should we just add a comment to avoid confusion?

> Thanks again!
> 


* Re: [PATCH 2/4] mm/hugetlb: simplify the return code of __vma_reservation_common()
  2021-04-08  2:44             ` Miaohe Lin
@ 2021-04-08 22:40               ` Mike Kravetz
  2021-04-09  2:52                 ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Kravetz @ 2021-04-08 22:40 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm

On 4/7/21 7:44 PM, Miaohe Lin wrote:
> On 2021/4/8 5:23, Mike Kravetz wrote:
>> On 4/6/21 8:09 PM, Miaohe Lin wrote:
>>> On 2021/4/7 10:37, Mike Kravetz wrote:
>>>> On 4/6/21 7:05 PM, Miaohe Lin wrote:
>>>>> Hi:
>>>>> On 2021/4/7 8:53, Mike Kravetz wrote:
>>>>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>>>>> It's guaranteed that the vma is associated with a resv_map, i.e. either
>>>>>>> VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
>>>>>>> have returned via !resv check above. So ret must be less than 0 in the
>>>>>>> 'else' case. Simplify the return code to make this clear.
>>>>>>
>>>>>> I believe we still neeed that ternary operator in the return statement.
>>>>>> Why?
>>>>>>
>>>>>> There are two basic types of mappings to be concerned with:
>>>>>> shared and private.
>>>>>> For private mappings, a task can 'own' the mapping as indicated by
>>>>>> HPAGE_RESV_OWNER.  Or, it may not own the mapping.  The most common way
>>>>>> to create a non-owner private mapping is to have a task with a private
>>>>>> mapping fork.  The parent process will have HPAGE_RESV_OWNER set, the
>>>>>> child process will not.  The idea is that since the child has a COW copy
>>>>>> of the mapping it should not consume reservations made by the parent.
>>>>>
>>>>> The child process will not have HPAGE_RESV_OWNER set because at fork time, we do:
>>>>> 		/*
>>>>> 		 * Clear hugetlb-related page reserves for children. This only
>>>>> 		 * affects MAP_PRIVATE mappings. Faults generated by the child
>>>>> 		 * are not guaranteed to succeed, even if read-only
>>>>> 		 */
>>>>> 		if (is_vm_hugetlb_page(tmp))
>>>>> 			reset_vma_resv_huge_pages(tmp);
>>>>> i.e. we have vma->vm_private_data = (void *)0; for child process and vma_resv_map() will
>>>>> return NULL in this case.
>>>>> Or am I missed something?
>>>>>
>>>>>> Only the parent (HPAGE_RESV_OWNER) is allowed to consume the
>>>>>> reservations.
>>>>>> Hope that makens sense?
>>>>>>
>>>>>>>
>>>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>>>> ---
>>>>>>>  mm/hugetlb.c | 2 +-
>>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>
>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>> index a03a50b7c410..b7864abded3d 100644
>>>>>>> --- a/mm/hugetlb.c
>>>>>>> +++ b/mm/hugetlb.c
>>>>>>> @@ -2183,7 +2183,7 @@ static long __vma_reservation_common(struct hstate *h,
>>>>>>>  			return 1;
>>>>>>>  	}
>>>>>>>  	else
>>>>>>
>>>>>> This else also handles the case !HPAGE_RESV_OWNER.  In this case, we
>>>>>
>>>>> IMO, for the case !HPAGE_RESV_OWNER, we won't reach here. What do you think?
>>>>>
>>>>
>>>> I think you are correct.
>>>>
>>>> However, if this is true we should be able to simply the code even
>>>> further.  There is no need to check for HPAGE_RESV_OWNER because we know
>>>> it must be set.  Correct?  If so, the code could look something like:
>>>>
>>>> 	if (vma->vm_flags & VM_MAYSHARE)
>>>> 		return ret;
>>>>
>>>> 	/* We know private mapping with HPAGE_RESV_OWNER */
>>>> 	 * ...						 *
>>>> 	 * Add that existing comment                     */
>>>>
>>>> 	if (ret > 0)
>>>> 		return 0;
>>>> 	if (ret == 0)
>>>> 		return 1;
>>>> 	return ret;
>>>>
>>>
>>> Many thanks for good suggestion! What do you mean is this ?
>>
>> I think the below changes would work fine.
>>
>> However, this patch/discussion has made me ask the question.  Do we need
>> the HPAGE_RESV_OWNER flag?  Is the followng true?
>> !(vm_flags & VM_MAYSHARE) && vma_resv_map()  ===> HPAGE_RESV_OWNER
>> !(vm_flags & VM_MAYSHARE) && !vma_resv_map() ===> !HPAGE_RESV_OWNER
>>
> 
> I agree with you.
> 
> HPAGE_RESV_OWNER is set in hugetlb_reserve_pages() and there's no way to clear it
> in the owner process. The child process can not inherit both HPAGE_RESV_OWNER and
> resv_map. So for !HPAGE_RESV_OWNER vma, it knows nothing about resv_map.
> 
> IMO, in !(vm_flags & VM_MAYSHARE) case, we must have:
> 	!!vma_resv_map() == !!HPAGE_RESV_OWNER
> 
>> I am not suggesting we eliminate the flag and make corresponding
>> changes.  Just curious if you believe we 'could' remove the flag and
>> depend on the above conditions.
>>
>> One reason for NOT removing the flag is that that flag itself and
>> supporting code and commnets help explain what happens with hugetlb
>> reserves for COW mappings.  That code is hard to understand and the
>> existing code and coments around HPAGE_RESV_OWNER help with
>> understanding.
> 
> Agree. These codes took me several days to understand...
> 

Please prepare v2 with the changes to remove the HPAGE_RESV_OWNER check
and move the large comment.


I would prefer to leave other places that mention HPAGE_RESV_OWNER
unchanged.

Thanks,
-- 
Mike Kravetz


* Re: [PATCH 3/4] mm/hugetlb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()
  2021-04-08  3:26           ` Miaohe Lin
@ 2021-04-08 22:53             ` Mike Kravetz
  2021-04-09  3:01               ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Kravetz @ 2021-04-08 22:53 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: n-horiguchi, linux-kernel, linux-mm

On 4/7/21 8:26 PM, Miaohe Lin wrote:
> On 2021/4/8 11:24, Miaohe Lin wrote:
>> On 2021/4/8 4:53, Mike Kravetz wrote:
>>> On 4/7/21 12:24 AM, Miaohe Lin wrote:
>>>> Hi:
>>>> On 2021/4/7 10:49, Mike Kravetz wrote:
>>>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>>>> The resv_map could be NULL since this routine can be called in the evict
>>>>>> inode path for all hugetlbfs inodes. So we could have chg = 0, and then
>>>>>> chg - freed would be negative. This is unexpected for
>>>>>> hugepage_subpool_put_pages() and hugetlb_acct_memory().
>>>>>
>>>>> I am not sure if this is possible.
>>>>>
>>>>> It is true that resv_map could be NULL.  However, I believe resv map
>>>>> can only be NULL for inodes that are not regular or link inodes.  This
>>>>> is the inode creation code in hugetlbfs_get_inode().
>>>>>
>>>>>        /*
>>>>>          * Reserve maps are only needed for inodes that can have associated
>>>>>          * page allocations.
>>>>>          */
>>>>>         if (S_ISREG(mode) || S_ISLNK(mode)) {
>>>>>                 resv_map = resv_map_alloc();
>>>>>                 if (!resv_map)
>>>>>                         return NULL;
>>>>>         }
>>>>>
>>>>
>>>> Agree.
>>>>
>>>>> If resv_map is NULL, then no hugetlb pages can be allocated/associated
>>>>> with the file.  As a result, remove_inode_hugepages will never find any
>>>>> huge pages associated with the inode and the passed value 'freed' will
>>>>> always be zero.
>>>>>
>>>>
>>>> But I am confused now. AFAICS, remove_inode_hugepages() searches the address_space of
>>>> the inode to remove the hugepages but does not care whether the inode has an associated
>>>> resv_map. How does it prevent hugetlb pages from being allocated/associated with the file
>>>> if resv_map is NULL? Could you please explain this more?
>>>>
>>>
>>> Recall that there are only two ways to get huge pages associated with
>>> a hugetlbfs file: fallocate and mmap/write fault.  Directly writing to
>>> hugetlbfs files is not supported.
>>>
>>> If you take a closer look at hugetlbfs_get_inode, it has that code to
>>> allocate the resv map mentioned above as well as the following:
>>>
>>> 		switch (mode & S_IFMT) {
>>> 		default:
>>> 			init_special_inode(inode, mode, dev);
>>> 			break;
>>> 		case S_IFREG:
>>> 			inode->i_op = &hugetlbfs_inode_operations;
>>> 			inode->i_fop = &hugetlbfs_file_operations;
>>> 			break;
>>> 		case S_IFDIR:
>>> 			inode->i_op = &hugetlbfs_dir_inode_operations;
>>> 			inode->i_fop = &simple_dir_operations;
>>>
>>> 			/* directory inodes start off with i_nlink == 2 (for "." entry) */
>>> 			inc_nlink(inode);
>>> 			break;
>>> 		case S_IFLNK:
>>> 			inode->i_op = &page_symlink_inode_operations;
>>> 			inode_nohighmem(inode);
>>> 			break;
>>> 		}
>>>
>>> Notice that only S_IFREG inodes will have i_fop == &hugetlbfs_file_operations.
>>> hugetlbfs_file_operations contains the hugetlbfs-specific mmap and fallocate
>>> routines.  Hence, only files with S_IFREG inodes can potentially have
>>> associated huge pages.  S_IFLNK inodes can as well via file linking.
>>>
>>> If an inode is not S_ISREG(mode) || S_ISLNK(mode), then it will not have
>>> a resv_map.  In addition, it will not have hugetlbfs_file_operations and
>>> can not have associated huge pages.
>>>
>>
>> Many, many thanks for the detailed and patient explanation! :) I think I have got the idea!
>>
>>> I looked at this closely when adding commits
>>> 58b6e5e8f1ad hugetlbfs: fix memory leak for resv_map
>>> f27a5136f70a hugetlbfs: always use address space in inode for resv_map pointer
>>>
>>> I may not be remembering all of the details correctly.  Commit f27a5136f70a
>>> added the comment that resv_map could be NULL to hugetlb_unreserve_pages.
>>>
>>
>> Since we must have freed == 0 when chg == 0, should we make this assumption explicit
>> with something like the below?
>>
>> WARN_ON(chg < freed);
>>
> 
> Or just a comment to avoid confusion?
> 

Yes, add a comment to hugetlb_unreserve_pages saying that !resv_map
implies freed == 0.

It would also be helpful to check for (chg - freed) == 0 and skip the
calls to hugepage_subpool_put_pages() and hugetlb_acct_memory().  Both
of those routines may perform an unnecessary lock/unlock cycle in this
case.

A simple
	if (chg == freed)
		return 0;
before the call to hugepage_subpool_put_pages would work.
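
A minimal sketch of what that would look like in hugetlb_unreserve_pages(),
using the variable names discussed in this thread (chg, freed, spool,
gbl_reserve); it illustrates the idea only and is not the actual patch. As
the follow-up further down notes, the explicit check was later judged
unnecessary, so only the comment survives:

	/*
	 * A NULL resv_map means remove_inode_hugepages() can never have
	 * found any huge pages for this inode, so freed == 0 and chg
	 * stays 0 as well.
	 */
	if (chg == freed)
		return 0;	/* nothing to hand back to the pools */

	gbl_reserve = hugepage_subpool_put_pages(spool, chg - freed);
	hugetlb_acct_memory(h, -gbl_reserve);
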
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts()
  2021-04-02  9:32 ` [PATCH 4/4] mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts() Miaohe Lin
@ 2021-04-08 23:25   ` Mike Kravetz
  2021-04-09  3:17     ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Kravetz @ 2021-04-08 23:25 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm

On 4/2/21 2:32 AM, Miaohe Lin wrote:
> A rare out of memory error would prevent removal of the reserve map region
> for a page. hugetlb_fix_reserve_counts() handles this rare case to avoid
> being left with incorrect counts. Unfortunately, hugepage_subpool_get_pages()
> and hugetlb_acct_memory() could possibly fail too. We should handle these
> cases correctly.

Yes, this is a potential issue.

The 'good news' is that hugetlb_fix_reserve_counts() is unlikely to ever
be called.  To do so would imply we could not allocate a region entry
which is only 6 words in size.  We also keep a 'cache' of entries so we
may not even need to allocate.

But, as mentioned it is a potential issue.
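
For context, the region entry being referred to is struct file_region; its
layout at the time is roughly the following (reconstructed from memory, so
treat the exact fields as an approximation):

	struct file_region {
		struct list_head link;	/* linked into the resv_map's region list */
		long from;
		long to;
	#ifdef CONFIG_CGROUP_HUGETLB
		/* hugetlb cgroup accounting for reserved entries */
		struct page_counter *reservation_counter;
		struct cgroup_subsys_state *css;
	#endif
	};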

> Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")

This is likely going to make this get picked up by stable releases.
That is unfortunate since, as mentioned above, this is mostly theoretical.

> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/hugetlb.c | 11 +++++++++--
>  1 file changed, 9 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index bdff8d23803f..ca5464ed04b7 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -745,13 +745,20 @@ void hugetlb_fix_reserve_counts(struct inode *inode)
>  {
>  	struct hugepage_subpool *spool = subpool_inode(inode);
>  	long rsv_adjust;
> +	bool reserved = false;
>  
>  	rsv_adjust = hugepage_subpool_get_pages(spool, 1);
> -	if (rsv_adjust) {
> +	if (rsv_adjust > 0) {
>  		struct hstate *h = hstate_inode(inode);
>  
> -		hugetlb_acct_memory(h, 1);
> +		if (!hugetlb_acct_memory(h, 1))
> +			reserved = true;
> +	} else if (!rsv_adjust) {
> +		reserved = true;
>  	}
> +
> +	if (!reserved)
> +		pr_warn("hugetlb: fix reserve count failed\n");

We should expand this warning message a bit to indicate what this may
mean to the user.  Add something like:
	"Huge Page Reserved count may go negative".
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/4] mm/hugeltb: simplify the return code of __vma_reservation_common()
  2021-04-08 22:40               ` Mike Kravetz
@ 2021-04-09  2:52                 ` Miaohe Lin
  0 siblings, 0 replies; 27+ messages in thread
From: Miaohe Lin @ 2021-04-09  2:52 UTC (permalink / raw)
  To: Mike Kravetz, akpm; +Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm

On 2021/4/9 6:40, Mike Kravetz wrote:
> On 4/7/21 7:44 PM, Miaohe Lin wrote:
>> On 2021/4/8 5:23, Mike Kravetz wrote:
>>> On 4/6/21 8:09 PM, Miaohe Lin wrote:
>>>> On 2021/4/7 10:37, Mike Kravetz wrote:
>>>>> On 4/6/21 7:05 PM, Miaohe Lin wrote:
>>>>>> Hi:
>>>>>> On 2021/4/7 8:53, Mike Kravetz wrote:
>>>>>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>>>>>> It's guaranteed that the vma is associated with a resv_map, i.e. either
>>>>>>>> VM_MAYSHARE or HPAGE_RESV_OWNER, when the code reaches here or we would
>>>>>>>> have returned via !resv check above. So ret must be less than 0 in the
>>>>>>>> 'else' case. Simplify the return code to make this clear.
>>>>>>>
>>>>>>> I believe we still need that ternary operator in the return statement.
>>>>>>> Why?
>>>>>>>
>>>>>>> There are two basic types of mappings to be concerned with:
>>>>>>> shared and private.
>>>>>>> For private mappings, a task can 'own' the mapping as indicated by
>>>>>>> HPAGE_RESV_OWNER.  Or, it may not own the mapping.  The most common way
>>>>>>> to create a non-owner private mapping is to have a task with a private
>>>>>>> mapping fork.  The parent process will have HPAGE_RESV_OWNER set, the
>>>>>>> child process will not.  The idea is that since the child has a COW copy
>>>>>>> of the mapping it should not consume reservations made by the parent.
>>>>>>
>>>>>> The child process will not have HPAGE_RESV_OWNER set because at fork time, we do:
>>>>>> 		/*
>>>>>> 		 * Clear hugetlb-related page reserves for children. This only
>>>>>> 		 * affects MAP_PRIVATE mappings. Faults generated by the child
>>>>>> 		 * are not guaranteed to succeed, even if read-only
>>>>>> 		 */
>>>>>> 		if (is_vm_hugetlb_page(tmp))
>>>>>> 			reset_vma_resv_huge_pages(tmp);
>>>>>> i.e. we have vma->vm_private_data = (void *)0; for child process and vma_resv_map() will
>>>>>> return NULL in this case.
>>>>>> Or have I missed something?
>>>>>>
>>>>>>> Only the parent (HPAGE_RESV_OWNER) is allowed to consume the
>>>>>>> reservations.
>>>>>>> Hope that makes sense?
>>>>>>>
>>>>>>>>
>>>>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>>>>> ---
>>>>>>>>  mm/hugetlb.c | 2 +-
>>>>>>>>  1 file changed, 1 insertion(+), 1 deletion(-)
>>>>>>>>
>>>>>>>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>>>>>>>> index a03a50b7c410..b7864abded3d 100644
>>>>>>>> --- a/mm/hugetlb.c
>>>>>>>> +++ b/mm/hugetlb.c
>>>>>>>> @@ -2183,7 +2183,7 @@ static long __vma_reservation_common(struct hstate *h,
>>>>>>>>  			return 1;
>>>>>>>>  	}
>>>>>>>>  	else
>>>>>>>
>>>>>>> This else also handles the case !HPAGE_RESV_OWNER.  In this case, we
>>>>>>
>>>>>> IMO, for the case !HPAGE_RESV_OWNER, we won't reach here. What do you think?
>>>>>>
>>>>>
>>>>> I think you are correct.
>>>>>
>>>>> However, if this is true, we should be able to simplify the code even
>>>>> further.  There is no need to check for HPAGE_RESV_OWNER because we know
>>>>> it must be set.  Correct?  If so, the code could look something like:
>>>>>
>>>>> 	if (vma->vm_flags & VM_MAYSHARE)
>>>>> 		return ret;
>>>>>
>>>>> 	/* We know private mapping with HPAGE_RESV_OWNER.
>>>>> 	 * ...
>>>>> 	 * (Add that existing comment here.)
>>>>> 	 */
>>>>>
>>>>> 	if (ret > 0)
>>>>> 		return 0;
>>>>> 	if (ret == 0)
>>>>> 		return 1;
>>>>> 	return ret;
>>>>>
>>>>
>>>> Many thanks for the good suggestion! Do you mean something like this?
>>>
>>> I think the below changes would work fine.
>>>
>>> However, this patch/discussion has made me ask the question.  Do we need
>>> the HPAGE_RESV_OWNER flag?  Is the following true?
>>> !(vm_flags & VM_MAYSHARE) && vma_resv_map()  ===> HPAGE_RESV_OWNER
>>> !(vm_flags & VM_MAYSHARE) && !vma_resv_map() ===> !HPAGE_RESV_OWNER
>>>
>>
>> I agree with you.
>>
>> HPAGE_RESV_OWNER is set in hugetlb_reserve_pages() and there's no way to clear it
>> in the owner process. The child process inherits neither HPAGE_RESV_OWNER nor the
>> resv_map, so a !HPAGE_RESV_OWNER vma knows nothing about the resv_map.
>>
>> IMO, in the !(vm_flags & VM_MAYSHARE) case, we must have:
>> 	!!vma_resv_map() == !!HPAGE_RESV_OWNER
>>
>>> I am not suggesting we eliminate the flag and make corresponding
>>> changes.  Just curious if you believe we 'could' remove the flag and
>>> depend on the above conditions.
>>>
>>> One reason for NOT removing the flag is that the flag itself and its
>>> supporting code and comments help explain what happens with hugetlb
>>> reserves for COW mappings.  That code is hard to understand, and the
>>> existing code and comments around HPAGE_RESV_OWNER help with
>>> understanding.
>>
>> Agree. This code took me several days to understand...
>>
> 
> Please prepare v2 with the changes to remove the HPAGE_RESV_OWNER check
> and move the large comment.
> 

Sure. Will do. Thanks.

> 
> I would prefer to leave other places that mention HPAGE_RESV_OWNER
> unchanged.
> 
> Thanks,
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] mm/hugeltb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()
  2021-04-08 22:53             ` Mike Kravetz
@ 2021-04-09  3:01               ` Miaohe Lin
  2021-04-09  4:37                 ` Mike Kravetz
  0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2021-04-09  3:01 UTC (permalink / raw)
  To: Mike Kravetz, akpm; +Cc: n-horiguchi, linux-kernel, linux-mm

On 2021/4/9 6:53, Mike Kravetz wrote:
> On 4/7/21 8:26 PM, Miaohe Lin wrote:
>> On 2021/4/8 11:24, Miaohe Lin wrote:
>>> On 2021/4/8 4:53, Mike Kravetz wrote:
>>>> On 4/7/21 12:24 AM, Miaohe Lin wrote:
>>>>> Hi:
>>>>> On 2021/4/7 10:49, Mike Kravetz wrote:
>>>>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>>>>> The resv_map could be NULL since this routine can be called in the evict
>>>>>>> inode path for all hugetlbfs inodes. So we could have chg = 0, and then
>>>>>>> chg - freed would be negative. This is unexpected for
>>>>>>> hugepage_subpool_put_pages() and hugetlb_acct_memory().
>>>>>>
>>>>>> I am not sure if this is possible.
>>>>>>
>>>>>> It is true that resv_map could be NULL.  However, I believe resv map
>>>>>> can only be NULL for inodes that are not regular or link inodes.  This
>>>>>> is the inode creation code in hugetlbfs_get_inode().
>>>>>>
>>>>>>        /*
>>>>>>          * Reserve maps are only needed for inodes that can have associated
>>>>>>          * page allocations.
>>>>>>          */
>>>>>>         if (S_ISREG(mode) || S_ISLNK(mode)) {
>>>>>>                 resv_map = resv_map_alloc();
>>>>>>                 if (!resv_map)
>>>>>>                         return NULL;
>>>>>>         }
>>>>>>
>>>>>
>>>>> Agree.
>>>>>
>>>>>> If resv_map is NULL, then no hugetlb pages can be allocated/associated
>>>>>> with the file.  As a result, remove_inode_hugepages will never find any
>>>>>> huge pages associated with the inode and the passed value 'freed' will
>>>>>> always be zero.
>>>>>>
>>>>>
>>>>> But I am confused now. AFAICS, remove_inode_hugepages() searches the address_space of
>>>>> the inode to remove the hugepages but does not care whether the inode has an associated
>>>>> resv_map. How does it prevent hugetlb pages from being allocated/associated with the file
>>>>> if resv_map is NULL? Could you please explain this more?
>>>>>
>>>>
>>>> Recall that there are only two ways to get huge pages associated with
>>>> a hugetlbfs file: fallocate and mmap/write fault.  Directly writing to
>>>> hugetlbfs files is not supported.
>>>>
>>>> If you take a closer look at hugetlbfs_get_inode, it has that code to
>>>> allocate the resv map mentioned above as well as the following:
>>>>
>>>> 		switch (mode & S_IFMT) {
>>>> 		default:
>>>> 			init_special_inode(inode, mode, dev);
>>>> 			break;
>>>> 		case S_IFREG:
>>>> 			inode->i_op = &hugetlbfs_inode_operations;
>>>> 			inode->i_fop = &hugetlbfs_file_operations;
>>>> 			break;
>>>> 		case S_IFDIR:
>>>> 			inode->i_op = &hugetlbfs_dir_inode_operations;
>>>> 			inode->i_fop = &simple_dir_operations;
>>>>
>>>> 			/* directory inodes start off with i_nlink == 2 (for "." entry) */
>>>> 			inc_nlink(inode);
>>>> 			break;
>>>> 		case S_IFLNK:
>>>> 			inode->i_op = &page_symlink_inode_operations;
>>>> 			inode_nohighmem(inode);
>>>> 			break;
>>>> 		}
>>>>
>>>> Notice that only S_IFREG inodes will have i_fop == &hugetlbfs_file_operations.
>>>> hugetlbfs_file_operations contains the hugetlbfs-specific mmap and fallocate
>>>> routines.  Hence, only files with S_IFREG inodes can potentially have
>>>> associated huge pages.  S_IFLNK inodes can as well via file linking.
>>>>
>>>> If an inode is not S_ISREG(mode) || S_ISLNK(mode), then it will not have
>>>> a resv_map.  In addition, it will not have hugetlbfs_file_operations and
>>>> can not have associated huge pages.
>>>>
>>>
>>> Many, many thanks for the detailed and patient explanation! :) I think I have got the idea!
>>>
>>>> I looked at this closely when adding commits
>>>> 58b6e5e8f1ad hugetlbfs: fix memory leak for resv_map
>>>> f27a5136f70a hugetlbfs: always use address space in inode for resv_map pointer
>>>>
>>>> I may not be remembering all of the details correctly.  Commit f27a5136f70a
>>>> added the comment that resv_map could be NULL to hugetlb_unreserve_pages.
>>>>
>>>
>>> Since we must have freed == 0 when chg == 0, should we make this assumption explicit
>>> with something like the below?
>>>
>>> WARN_ON(chg < freed);
>>>
>>
>> Or just a comment to avoid confusion?
>>
> 
> Yes, add a comment to hugetlb_unreserve_pages saying that !resv_map
> implies freed == 0.
> 

Sounds good!

> It would also be helpful to check for (chg - freed) == 0 and skip the
> calls to hugepage_subpool_put_pages() and hugetlb_acct_memory().  Both
> of those routines may perform an unnecessary lock/unlock cycle in this
> case.
> 
> A simple
> 	if (chg == freed)
> 		return 0;
> before the call to hugepage_subpool_put_pages would work.

This may not really be helpful because hugepage_subpool_put_pages() and hugetlb_acct_memory()
both already handle the delta == 0 case without an unnecessary lock/unlock cycle.
Does this make sense to you? If so, I will prepare v2 with the changes to add a comment
to hugetlb_unreserve_pages() __without__ the check for (chg - freed) == 0.

Many thanks!

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts()
  2021-04-08 23:25   ` Mike Kravetz
@ 2021-04-09  3:17     ` Miaohe Lin
  2021-04-09  5:04       ` Andrew Morton
  0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2021-04-09  3:17 UTC (permalink / raw)
  To: Mike Kravetz, akpm; +Cc: n-horiguchi, hillf.zj, linux-kernel, linux-mm

On 2021/4/9 7:25, Mike Kravetz wrote:
> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>> A rare out of memory error would prevent removal of the reserve map region
>> for a page. hugetlb_fix_reserve_counts() handles this rare case to avoid
>> being left with incorrect counts. Unfortunately, hugepage_subpool_get_pages()
>> and hugetlb_acct_memory() could possibly fail too. We should handle these
>> cases correctly.
> 
> Yes, this is a potential issue.
> 
> The 'good news' is that hugetlb_fix_reserve_counts() is unlikely to ever
> be called.  To do so would imply we could not allocate a region entry
> which is only 6 words in size.  We also keep a 'cache' of entries so we
> may not even need to allocate.
> 
> But, as mentioned it is a potential issue.

Yes, a potential *theoretical* issue.

> 
>> Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
> 
> This is likely going to make this get picked up by stable releases.
> That is unfortunate since, as mentioned above, this is mostly theoretical.
> 

I will drop this. This is not worth backporting.

>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>  mm/hugetlb.c | 11 +++++++++--
>>  1 file changed, 9 insertions(+), 2 deletions(-)
>>
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index bdff8d23803f..ca5464ed04b7 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -745,13 +745,20 @@ void hugetlb_fix_reserve_counts(struct inode *inode)
>>  {
>>  	struct hugepage_subpool *spool = subpool_inode(inode);
>>  	long rsv_adjust;
>> +	bool reserved = false;
>>  
>>  	rsv_adjust = hugepage_subpool_get_pages(spool, 1);
>> -	if (rsv_adjust) {
>> +	if (rsv_adjust > 0) {
>>  		struct hstate *h = hstate_inode(inode);
>>  
>> -		hugetlb_acct_memory(h, 1);
>> +		if (!hugetlb_acct_memory(h, 1))
>> +			reserved = true;
>> +	} else if (!rsv_adjust) {
>> +		reserved = true;
>>  	}
>> +
>> +	if (!reserved)
>> +		pr_warn("hugetlb: fix reserve count failed\n");
> 
> We should expand this warning message a bit to indicate what this may
> mean to the user.  Add something like:
> 	"Huge Page Reserved count may go negative".
> 

Will add it in v2. Many thanks for the review and the nice suggestion! :)

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] mm/hugeltb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()
  2021-04-09  3:01               ` Miaohe Lin
@ 2021-04-09  4:37                 ` Mike Kravetz
  2021-04-09  6:36                   ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Kravetz @ 2021-04-09  4:37 UTC (permalink / raw)
  To: Miaohe Lin, akpm; +Cc: n-horiguchi, linux-kernel, linux-mm

On 4/8/21 8:01 PM, Miaohe Lin wrote:
> On 2021/4/9 6:53, Mike Kravetz wrote:
>>
>> Yes, add a comment to hugetlb_unreserve_pages saying that !resv_map
>> implies freed == 0.
>>
> 
> Sounds good!
> 
>> It would also be helpful to check for (chg - freed) == 0 and skip the
>> calls to hugepage_subpool_put_pages() and hugetlb_acct_memory().  Both
>> of those routines may perform an unnecessary lock/unlock cycle in this
>> case.
>>
>> A simple
>> 	if (chg == freed)
>> 		return 0;
>> before the call to hugepage_subpool_put_pages would work.
> 
> This may not really be helpful because hugepage_subpool_put_pages() and hugetlb_acct_memory()
> both already handle the delta == 0 case without an unnecessary lock/unlock cycle.
> Does this make sense to you? If so, I will prepare v2 with the changes to add a comment
> to hugetlb_unreserve_pages() __without__ the check for (chg - freed) == 0.

Sorry, I forgot about the recent changes to check for delta == 0.
No need for the check here, just the comment.
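
So what remains of this patch is the comment alone, along these lines (the
wording is only a suggestion; the final text is up to the v2 patch):

	/*
	 * If the resv_map was NULL, no huge pages could ever have been
	 * allocated for this inode, so remove_inode_hugepages() found
	 * nothing and 'freed' is 0. chg is 0 as well, so chg - freed
	 * cannot go negative here.
	 */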
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts()
  2021-04-09  3:17     ` Miaohe Lin
@ 2021-04-09  5:04       ` Andrew Morton
  2021-04-09  7:07         ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Andrew Morton @ 2021-04-09  5:04 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: Mike Kravetz, n-horiguchi, hillf.zj, linux-kernel, linux-mm

On Fri, 9 Apr 2021 11:17:49 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:

> On 2021/4/9 7:25, Mike Kravetz wrote:
> > On 4/2/21 2:32 AM, Miaohe Lin wrote:
> >> A rare out of memory error would prevent removal of the reserve map region
> >> for a page. hugetlb_fix_reserve_counts() handles this rare case to avoid
> >> being left with incorrect counts. Unfortunately, hugepage_subpool_get_pages()
> >> and hugetlb_acct_memory() could possibly fail too. We should handle these
> >> cases correctly.
> > 
> > Yes, this is a potential issue.
> > 
> > The 'good news' is that hugetlb_fix_reserve_counts() is unlikely to ever
> > be called.  To do so would imply we could not allocate a region entry
> > which is only 6 words in size.  We also keep a 'cache' of entries so we
> > may not even need to allocate.
> > 
> > But, as mentioned it is a potential issue.
> 
> Yes, a potential *theoretical* issue.
> 
> > 
> >> Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
> > 
> > This is likely going to make this get picked up by stable releases.
> > That is unfortunate since, as mentioned above, this is mostly theoretical.
> > 
> 
> I will drop this. This is not worth backporting.
> 

-stable have been asked not to backport MM patches unless the patches
include "cc:stable".  i.e., no making our backporting decisions for us,
please.
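
In other words, the backporting decision is carried by the tags in the patch
itself. A changelog that really should reach -stable would mark this
explicitly, for example (an illustrative trailer only, not part of this
series):

	Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
	Cc: stable@vger.kernel.org
	Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>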


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 3/4] mm/hugeltb: fix potential wrong gbl_reserve value for hugetlb_acct_memory()
  2021-04-09  4:37                 ` Mike Kravetz
@ 2021-04-09  6:36                   ` Miaohe Lin
  0 siblings, 0 replies; 27+ messages in thread
From: Miaohe Lin @ 2021-04-09  6:36 UTC (permalink / raw)
  To: Mike Kravetz, akpm; +Cc: n-horiguchi, linux-kernel, linux-mm

On 2021/4/9 12:37, Mike Kravetz wrote:
> On 4/8/21 8:01 PM, Miaohe Lin wrote:
>> On 2021/4/9 6:53, Mike Kravetz wrote:
>>>
>>> Yes, add a comment to hugetlb_unreserve_pages saying that !resv_map
>>> implies freed == 0.
>>>
>>
>> Sounds good!
>>
>>> It would also be helpful to check for (chg - freed) == 0 and skip the
>>> calls to hugepage_subpool_put_pages() and hugetlb_acct_memory().  Both
>>> of those routines may perform an unnecessary lock/unlock cycle in this
>>> case.
>>>
>>> A simple
>>> 	if (chg == freed)
>>> 		return 0;
>>> before the call to hugepage_subpool_put_pages would work.
>>
>> This may not really be helpful because hugepage_subpool_put_pages() and hugetlb_acct_memory()
>> both already handle the delta == 0 case without an unnecessary lock/unlock cycle.
>> Does this make sense to you? If so, I will prepare v2 with the changes to add a comment
>> to hugetlb_unreserve_pages() __without__ the check for (chg - freed) == 0.
> 
> Sorry, I forgot about the recent changes to check for delta == 0.
> No need for the check here, just the comment.
> 

That's all right. Will add the comment in V2.
Thanks.

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts()
  2021-04-09  5:04       ` Andrew Morton
@ 2021-04-09  7:07         ` Miaohe Lin
  0 siblings, 0 replies; 27+ messages in thread
From: Miaohe Lin @ 2021-04-09  7:07 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Mike Kravetz, n-horiguchi, hillf.zj, linux-kernel, linux-mm

On 2021/4/9 13:04, Andrew Morton wrote:
> On Fri, 9 Apr 2021 11:17:49 +0800 Miaohe Lin <linmiaohe@huawei.com> wrote:
> 
>> On 2021/4/9 7:25, Mike Kravetz wrote:
>>> On 4/2/21 2:32 AM, Miaohe Lin wrote:
>>>> A rare out of memory error would prevent removal of the reserve map region
>>>> for a page. hugetlb_fix_reserve_counts() handles this rare case to avoid
>>>> being left with incorrect counts. Unfortunately, hugepage_subpool_get_pages()
>>>> and hugetlb_acct_memory() could possibly fail too. We should handle these
>>>> cases correctly.
>>>
>>> Yes, this is a potential issue.
>>>
>>> The 'good news' is that hugetlb_fix_reserve_counts() is unlikely to ever
>>> be called.  To do so would imply we could not allocate a region entry
>>> which is only 6 words in size.  We also keep a 'cache' of entries so we
>>> may not even need to allocate.
>>>
>>> But, as mentioned it is a potential issue.
>>
>> Yes, a potential *theoretical* issue.
>>
>>>
>>>> Fixes: b5cec28d36f5 ("hugetlbfs: truncate_hugepages() takes a range of pages")
>>>
>>> This is likely going to make this get picked up by stable releases.
>>> That is unfortunate since, as mentioned above, this is mostly theoretical.
>>>
>>
>> I will drop this. This is not worth backporting.
>>
> 
> -stable have been asked not to backport MM patches unless the patches
> include "cc:stable".  i.e., no making our backporting decisions for us,
> please.
> 

Sorry about that! I will retain the Fixes tag.
Many thanks for pointing this out.

> .
> 

^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2021-04-09  7:07 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-04-02  9:32 [PATCH 0/4] Cleanup and fixup for hugetlb Miaohe Lin
2021-04-02  9:32 ` [PATCH 1/4] mm/hugeltb: remove redundant VM_BUG_ON() in region_add() Miaohe Lin
2021-04-07  0:16   ` Mike Kravetz
2021-04-02  9:32 ` [PATCH 2/4] mm/hugeltb: simplify the return code of __vma_reservation_common() Miaohe Lin
2021-04-07  0:53   ` Mike Kravetz
2021-04-07  2:05     ` Miaohe Lin
2021-04-07  2:37       ` Mike Kravetz
2021-04-07  3:09         ` Miaohe Lin
2021-04-07 21:23           ` Mike Kravetz
2021-04-08  2:44             ` Miaohe Lin
2021-04-08 22:40               ` Mike Kravetz
2021-04-09  2:52                 ` Miaohe Lin
2021-04-02  9:32 ` [PATCH 3/4] mm/hugeltb: fix potential wrong gbl_reserve value for hugetlb_acct_memory() Miaohe Lin
2021-04-07  2:49   ` Mike Kravetz
2021-04-07  7:24     ` Miaohe Lin
2021-04-07 20:53       ` Mike Kravetz
2021-04-08  3:24         ` Miaohe Lin
2021-04-08  3:26           ` Miaohe Lin
2021-04-08 22:53             ` Mike Kravetz
2021-04-09  3:01               ` Miaohe Lin
2021-04-09  4:37                 ` Mike Kravetz
2021-04-09  6:36                   ` Miaohe Lin
2021-04-02  9:32 ` [PATCH 4/4] mm/hugeltb: handle the error case in hugetlb_fix_reserve_counts() Miaohe Lin
2021-04-08 23:25   ` Mike Kravetz
2021-04-09  3:17     ` Miaohe Lin
2021-04-09  5:04       ` Andrew Morton
2021-04-09  7:07         ` Miaohe Lin
