* [bug report] mm/hugetlb: various bugs with avoid_reserve case in alloc_huge_page()
@ 2022-08-17  8:31 Miaohe Lin
  2022-08-18 22:43 ` Mike Kravetz
  0 siblings, 1 reply; 5+ messages in thread
From: Miaohe Lin @ 2022-08-17  8:31 UTC (permalink / raw)
  To: Andrew Morton, mike.kravetz, Muchun Song, Linux-MM, linux-kernel

Hi all:
    While investigating the mm/hugetlb.c code again, I found a few possible issues
with the avoid_reserve case. (The relevant code is really hard for me to follow.) Please take
a look at the analysis below:

1.avoid_reserve issue with h->resv_huge_pages in alloc_huge_page.
    Assume:
	h->free_huge_pages 60
	h->resv_huge_pages 30
	spool->rsv_hpages  30

    When avoid_reserve is true, after alloc_huge_page(), we will have:
	spool->rsv_hpages  29 /* hugepage_subpool_get_pages decreases it. */
	h->free_huge_pages 59
	h->resv_huge_pages 30 /* rsv_hpages is used, but *h->resv_huge_pages is not modified accordingly*. */

    If the hugetlb page is freed later, we will have:
	spool->rsv_hpages  30 /* hugepage_subpool_put_pages increases it. */
	h->free_huge_pages 60
	h->resv_huge_pages 31 /* *increased wrongly* due to hugepage_subpool_put_pages(spool, 1) == 0. */
			   ^^

2.avoid_reserve issue with hugetlb rsvd cgroup charge for private mappings in alloc_huge_page.

    In general, if hugetlb pages are reserved, corresponding rsvd counters are charged in resv_maps
for private mappings. Otherwise they're charged in individual hugetlb pages. When alloc_huge_page()
is called with avoid_reserve == true, hugetlb_cgroup_charge_cgroup_rsvd() will be called to charge
the newly allocated hugetlb page even if there is a reservation for this page in the resv_map. Then
vma_commit_reservation() is called to indicate that the reservation is consumed. So the reservation
*cannot be used and is thus leaked* from now on, because vma_needs_reservation() always returns 1 for it.

3.avoid_reserve issue with restore_reserve_on_error

    There's an assumption in restore_reserve_on_error(): if HPageRestoreReserve is not set, either there
is an entry in the reserve map added by alloc_huge_page(), or HPageRestoreReserve would have been set on the
page. But this assumption *does not hold for avoid_reserve*: HPageRestoreReserve won't be set even if there
is already an entry in the reserve map in the avoid_reserve case. So avoid_reserve should be considered in this
function, i.e. we need *a reliable way* to determine whether the entry was added by alloc_huge_page().

Are the above issues possible? Or am I missing something? These issues don't look easy to fix to me.
Any thoughts? Any response would be appreciated!

Thanks!
Miaohe Lin

^ permalink raw reply	[flat|nested] 5+ messages in thread

* Re: [bug report] mm/hugetlb: various bugs with avoid_reserve case in alloc_huge_page()
  2022-08-17  8:31 [bug report] mm/hugetlb: various bugs with avoid_reserve case in alloc_huge_page() Miaohe Lin
@ 2022-08-18 22:43 ` Mike Kravetz
  2022-08-19  7:20   ` Miaohe Lin
  0 siblings, 1 reply; 5+ messages in thread
From: Mike Kravetz @ 2022-08-18 22:43 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: Andrew Morton, Muchun Song, Linux-MM, linux-kernel

On 08/17/22 16:31, Miaohe Lin wrote:
> Hi all:
>     While investigating the mm/hugetlb.c code again, I found a few possible issues
> with the avoid_reserve case. (The relevant code is really hard for me to follow.) Please take
> a look at the analysis below:

Thank you for taking a close look at this code!

I agree that the code is hard to follow.  I have spent many hours/days/weeks
chasing down the cause of incorrect reservation counts.  I imagine there could
be more issues, especially when you add the uncommon avoid_reserve and
MAP_NORESERVE processing.

> 1.avoid_reserve issue with h->resv_huge_pages in alloc_huge_page.

Did you actually see this issue, or is it just based on code inspection?
I tried to recreate, but could not.  When looking closer, this may not
even be possible.

>     Assume:
> 	h->free_huge_pages 60
> 	h->resv_huge_pages 30
> 	spool->rsv_hpages  30

OK.

> 
>     When avoid_reserve is true, after alloc_huge_page(), we will have:

Take a close look at the calling paths for alloc_huge_page when avoid_reserve
is true.  There are only two such call paths.
1) copy_hugetlb_page_range - We allocate pages in the 'early COW' processing.
   In such cases, the pages are private and not associated with a file, or
   filesystem or subpool (spool).  Therefore, there should be no spool
   modifications.
2) hugetlb_wp (formerly called hugetlb_cow) - Again, we are allocating a
   private page and should not be modifying spool.

If the above is correct, then we never modify spool->rsv_hpages, which is what
would lead to the inconsistent results.

It is confusing that MAP_NORESERVE does not imply avoid_reserve will be
passed to alloc_huge_page.

> 	spool->rsv_hpages  29 /* hugepage_subpool_get_pages decreases it. */
> 	h->free_huge_pages 59
> 	h->resv_huge_pages 30 /* rsv_hpages is used, but *h->resv_huge_pages is not modified accordingly*. */
> 
>     If the hugetlb page is freed later, we will have:
> 	spool->rsv_hpages  30 /* hugepage_subpool_put_pages increases it. */
> 	h->free_huge_pages 60
> 	h->resv_huge_pages 31 /* *increased wrongly* due to hugepage_subpool_put_pages(spool, 1) == 0. */
> 			   ^^
> 

I'll take a closer look at 2 and 3 when we determine if 1 is a possible
issue or not.
-- 
Mike Kravetz

> 2.avoid_reserve issue with hugetlb rsvd cgroup charge for private mappings in alloc_huge_page.
> 
>     In general, if hugetlb pages are reserved, corresponding rsvd counters are charged in resv_maps
> for private mappings. Otherwise they're charged in individual hugetlb pages. When alloc_huge_page()
> is called with avoid_reserve == true, hugetlb_cgroup_charge_cgroup_rsvd() will be called to charge
> the newly allocated hugetlb page even if there is a reservation for this page in the resv_map. Then
> vma_commit_reservation() is called to indicate that the reservation is consumed. So the reservation
> *cannot be used and is thus leaked* from now on, because vma_needs_reservation() always returns 1 for it.
> 
> 3.avoid_reserve issue with restore_reserve_on_error
> 
>     There's an assumption in restore_reserve_on_error(): if HPageRestoreReserve is not set, either there
> is an entry in the reserve map added by alloc_huge_page(), or HPageRestoreReserve would have been set on the
> page. But this assumption *does not hold for avoid_reserve*: HPageRestoreReserve won't be set even if there
> is already an entry in the reserve map in the avoid_reserve case. So avoid_reserve should be considered in this
> function, i.e. we need *a reliable way* to determine whether the entry was added by alloc_huge_page().
> 
> Are the above issues possible? Or am I missing something? These issues don't look easy to fix to me.
> Any thoughts? Any response would be appreciated!
> 
> Thanks!
> Miaohe Lin


* Re: [bug report] mm/hugetlb: various bugs with avoid_reserve case in alloc_huge_page()
  2022-08-18 22:43 ` Mike Kravetz
@ 2022-08-19  7:20   ` Miaohe Lin
  2022-08-19 19:11     ` Mike Kravetz
  0 siblings, 1 reply; 5+ messages in thread
From: Miaohe Lin @ 2022-08-19  7:20 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Andrew Morton, Muchun Song, Linux-MM, linux-kernel

On 2022/8/19 6:43, Mike Kravetz wrote:
> On 08/17/22 16:31, Miaohe Lin wrote:
>> Hi all:
>>     While investigating the mm/hugetlb.c code again, I found a few possible issues
>> with the avoid_reserve case. (The relevant code is really hard for me to follow.) Please take
>> a look at the analysis below:
> 
> Thank you for taking a close look at this code!
> 
> I agree that the code is hard to follow.  I have spent many hours/days/weeks
> chasing down the cause of incorrect reservation counts.  I imagine there could
> be more issues, especially when you add the uncommon avoid_reserve and
> MAP_NORESERVE processing.

Many thanks for your time and reply, Mike!

> 
>> 1.avoid_reserve issue with h->resv_huge_pages in alloc_huge_page.
> 
> Did you actually see this issue, or is it just based on code inspection?

No, it's based on code inspection. ;)

> I tried to recreate, but could not.  When looking closer, this may not
> even be possible.
> 
>>     Assume:
>> 	h->free_huge_pages 60
>> 	h->resv_huge_pages 30
>> 	spool->rsv_hpages  30
> 
> OK.
> 
>>
>>     When avoid_reserve is true, after alloc_huge_page(), we will have:
> 
> Take a close look at the calling paths for alloc_huge_page when avoid_reserve
> is true.  There are only two such call paths.
> 1) copy_hugetlb_page_range - We allocate pages in the 'early COW' processing.
>    In such cases, the pages are private and not associated with a file, or
>    filesystem or subpool (spool).  Therefore, there should be no spool
>    modifications.

Agree.

> 2) hugetlb_wp (formerly called hugetlb_cow) - Again, we are allocating a
>    private page and should not be modifying spool.

Agree.

> 
> If the above is correct, then we never modify spool->rsv_hpages, which is what
> would lead to the inconsistent results.

I failed to verify whether spool would be modified in the avoid_reserve case. Sorry about that.

> 
> It is confusing that MAP_NORESERVE does not imply avoid_reserve will be
> passed to alloc_huge_page.

avoid_reserve was introduced by commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a process that
called mmap(MAP_PRIVATE) on hugetlbfs will succeed") to guarantee that such COW faults succeed.
It seems to have nothing to do with MAP_NORESERVE.

> 
>> 	spool->rsv_hpages  29 /* hugepage_subpool_get_pages decreases it. */
>> 	h->free_huge_pages 59
>> 	h->resv_huge_pages 30 /* rsv_hpages is used, but *h->resv_huge_pages is not modified accordingly*. */
>>
>>     If the hugetlb page is freed later, we will have:
>> 	spool->rsv_hpages  30 /* hugepage_subpool_put_pages increases it. */
>> 	h->free_huge_pages 60
>> 	h->resv_huge_pages 31 /* *increased wrongly* due to hugepage_subpool_put_pages(spool, 1) == 0. */
>> 			   ^^
>>
> 
> I'll take a closer look at 2 and 3 when we determine if 1 is a possible
> issue or not.

I want to propose removing the avoid_reserve code. When called from case 1) or 2) above, vma_needs_reservation()
will always return 1, as there's no reservation for the page. Also, hugepage_subpool_get_pages() will always return 1,
as the vma is not associated with a spool. So when avoid_reserve == true, map_chg and gbl_chg must be 1 and
vma_has_reserves() will always return "false". As a result, passing in avoid_reserve == true actually does nothing,
so it can simply be removed. Or am I missing something again?

If the avoid_reserve code can be removed, issues 2 and 3 below won't be possible, as they rely on avoid_reserve doing its work.

Thanks!
Miaohe Lin


* Re: [bug report] mm/hugetlb: various bugs with avoid_reserve case in alloc_huge_page()
  2022-08-19  7:20   ` Miaohe Lin
@ 2022-08-19 19:11     ` Mike Kravetz
  2022-08-22  3:13       ` Miaohe Lin
  0 siblings, 1 reply; 5+ messages in thread
From: Mike Kravetz @ 2022-08-19 19:11 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: Andrew Morton, Muchun Song, Linux-MM, linux-kernel

On 08/19/22 15:20, Miaohe Lin wrote:
> On 2022/8/19 6:43, Mike Kravetz wrote:
> > On 08/17/22 16:31, Miaohe Lin wrote:
> >> Hi all:
> >>     While investigating the mm/hugetlb.c code again, I found a few possible issues
> >> with the avoid_reserve case. (The relevant code is really hard for me to follow.) Please take
> >> a look at the analysis below:
> > 
> > Thank you for taking a close look at this code!
> > 
> > I agree that the code is hard to follow.  I have spent many hours/days/weeks
> > chasing down the cause of incorrect reservation counts.  I imagine there could
> > be more issues, especially when you add the uncommon avoid_reserve and
> > MAP_NORESERVE processing.
> 
> Many thanks for your time and reply, Mike!

Well, hugetlb reservations interrupted my sleep again :)  See below.

> > 
> >> 1.avoid_reserve issue with h->resv_huge_pages in alloc_huge_page.
> > 
> > Did you actually see this issue, or is it just based on code inspection?
> 
> No, it's based on code inspection. ;)
> 
> > I tried to recreate, but could not.  When looking closer, this may not
> > even be possible.
> > 
> >>     Assume:
> >> 	h->free_huge_pages 60
> >> 	h->resv_huge_pages 30
> >> 	spool->rsv_hpages  30
> > 
> > OK.
> > 
> >>
> >>     When avoid_reserve is true, after alloc_huge_page(), we will have:
> > 
> > Take a close look at the calling paths for alloc_huge_page when avoid_reserve
> > is true.  There are only two such call paths.
> > 1) copy_hugetlb_page_range - We allocate pages in the 'early COW' processing.
> >    In such cases, the pages are private and not associated with a file, or
> >    filesystem or subpool (spool).  Therefore, there should be no spool
> >    modifications.
> 
> Agree.
> 
> > 2) hugetlb_wp (formerly called hugetlb_cow) - Again, we are allocating a
> >    private page and should not be modifying spool.
> 
> Agree.
> 
> > 
> > If the above is correct, then we never modify spool->rsv_hpages, which is what
> > would lead to the inconsistent results.
> 
> I failed to verify whether spool would be modified in the avoid_reserve case. Sorry about that.
> 

That is how it SHOULD work.  However, there is a problem with a MAP_PRIVATE
mapping of a hugetlb file.  In this case, subpool_vma will return the
subpool associated with the file, and we will end up with a leaked reservation
as in your example.  I have verified with test code.

The first thought I had is that something like the following should be added.

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 474bfbe9929e..5aa19574e890 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -254,7 +258,9 @@ static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
 
 static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
 {
-	return subpool_inode(file_inode(vma->vm_file));
+	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))
+		return subpool_inode(file_inode(vma->vm_file));
+	return NULL;		/* no subpool for private mappings */
 }
 
 /* Helper that removes a struct file_region from the resv_map cache and returns


That certainly addresses the MAP_PRIVATE mapping of a hugetlb file issue.
I will collect up patches for issues we discover and submit together.

> > It is confusing that MAP_NORESERVE does not imply avoid_reserve will be
> > passed to alloc_huge_page.
> 
> avoid_reserve was introduced by commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a process that
> called mmap(MAP_PRIVATE) on hugetlbfs will succeed") to guarantee that such COW faults succeed.
> It seems to have nothing to do with MAP_NORESERVE.
> 
> > 
> >> 	spool->rsv_hpages  29 /* hugepage_subpool_get_pages decreases it. */
> >> 	h->free_huge_pages 59
> >> 	h->resv_huge_pages 30 /* rsv_hpages is used, but *h->resv_huge_pages is not modified accordingly*. */
> >>
> >>     If the hugetlb page is freed later, we will have:
> >> 	spool->rsv_hpages  30 /* hugepage_subpool_put_pages increases it. */
> >> 	h->free_huge_pages 60
> >> 	h->resv_huge_pages 31 /* *increased wrongly* due to hugepage_subpool_put_pages(spool, 1) == 0. */
> >> 			   ^^
> >>
> > 
> > I'll take a closer look at 2 and 3 when we determine if 1 is a possible
> > issue or not.
> 
> I want to propose removing the avoid_reserve code. When called from case 1) or 2) above, vma_needs_reservation()
> will always return 1, as there's no reservation for the page. Also, hugepage_subpool_get_pages() will always return 1,
> as the vma is not associated with a spool. So when avoid_reserve == true, map_chg and gbl_chg must be 1 and
> vma_has_reserves() will always return "false". As a result, passing in avoid_reserve == true actually does nothing,
> so it can simply be removed. Or am I missing something again?

I will take a closer look.  But, at a high level if avoid_reserve == true and
all pages are reserved we must fail the allocation or attempt dynamic
allocation if overcommit is allowed.  So, it seems we at least need the
flag to make this decision.
-- 
Mike Kravetz

> 
> If the avoid_reserve code can be removed, issues 2 and 3 below won't be possible, as they rely on avoid_reserve doing its work.
> 
> Thanks!
> Miaohe Lin


* Re: [bug report] mm/hugetlb: various bugs with avoid_reserve case in alloc_huge_page()
  2022-08-19 19:11     ` Mike Kravetz
@ 2022-08-22  3:13       ` Miaohe Lin
  0 siblings, 0 replies; 5+ messages in thread
From: Miaohe Lin @ 2022-08-22  3:13 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Andrew Morton, Muchun Song, Linux-MM, linux-kernel

On 2022/8/20 3:11, Mike Kravetz wrote:
> On 08/19/22 15:20, Miaohe Lin wrote:
>> On 2022/8/19 6:43, Mike Kravetz wrote:
>>> On 08/17/22 16:31, Miaohe Lin wrote:
>>>> Hi all:
>>>>     While investigating the mm/hugetlb.c code again, I found a few possible issues
>>>> with the avoid_reserve case. (The relevant code is really hard for me to follow.) Please take
>>>> a look at the analysis below:
>>>
>>> Thank you for taking a close look at this code!
>>>
>>> I agree that the code is hard to follow.  I have spent many hours/days/weeks
>>> chasing down the cause of incorrect reservation counts.  I imagine there could
>>> be more issues, especially when you add the uncommon avoid_reserve and
>>> MAP_NORESERVE processing.
>>
>> Many thanks for your time and reply, Mike!
> 
> Well, hugetlb reservations interrupted my sleep again :)  See below.

Another day spent scratching my head over hugetlb reservations. :)

> 
>>>
>>>> 1.avoid_reserve issue with h->resv_huge_pages in alloc_huge_page.
>>>
>>> Did you actually see this issue, or is it just based on code inspection?
>>
>> No, it's based on code inspection. ;)
>>
>>> I tried to recreate, but could not.  When looking closer, this may not
>>> even be possible.
>>>
>>>>     Assume:
>>>> 	h->free_huge_pages 60
>>>> 	h->resv_huge_pages 30
>>>> 	spool->rsv_hpages  30
>>>
>>> OK.
>>>
>>>>
>>>>     When avoid_reserve is true, after alloc_huge_page(), we will have:
>>>
>>> Take a close look at the calling paths for alloc_huge_page when avoid_reserve
>>> is true.  There are only two such call paths.
>>> 1) copy_hugetlb_page_range - We allocate pages in the 'early COW' processing.
>>>    In such cases, the pages are private and not associated with a file, or
>>>    filesystem or subpool (spool).  Therefore, there should be no spool
>>>    modifications.
>>
>> Agree.
>>
>>> 2) hugetlb_wp (formerly called hugetlb_cow) - Again, we are allocating a
>>>    private page and should not be modifying spool.
>>
>> Agree.
>>
>>>
>>> If the above is correct, then we never modify spool->rsv_hpages, which is what
>>> would lead to the inconsistent results.
>>
>> I failed to verify whether spool would be modified in the avoid_reserve case. Sorry about that.
>>
> 
> That is how it SHOULD work.  However, there is a problem with a MAP_PRIVATE
> mapping of a hugetlb file.  In this case, subpool_vma will return the
> subpool associated with the file, and we will end up with a leaked reservation
> as in your example.  I have verified with test code.

Thanks for your time and verification.

> 
> The first thought I had is that something like the following should be added.
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 474bfbe9929e..5aa19574e890 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -254,7 +258,9 @@ static inline struct hugepage_subpool *subpool_inode(struct inode *inode)
>  
>  static inline struct hugepage_subpool *subpool_vma(struct vm_area_struct *vma)
>  {
> -	return subpool_inode(file_inode(vma->vm_file));
> +	if (vma->vm_flags & (VM_MAYSHARE | VM_SHARED))

Maybe checking for VM_MAYSHARE is enough? It seems VM_SHARED can't be set while VM_MAYSHARE isn't set.
But this change looks good to me as a fix for the discussed issue.

> +		return subpool_inode(file_inode(vma->vm_file));
> +	return NULL;		/* no subpool for private mappings */
>  }
>  
>  /* Helper that removes a struct file_region from the resv_map cache and returns
> 
> 
> That certainly addresses the MAP_PRIVATE mapping of a hugetlb file issue.
> I will collect up patches for issues we discover and submit together.

Thanks for doing this.
BTW: IIUC, if a MAP_PRIVATE mapping never consumes a reservation, issues 2 and 3 shouldn't be possible.

> 
>>> It is confusing that MAP_NORESERVE does not imply avoid_reserve will be
>>> passed to alloc_huge_page.
>>
>> avoid_reserve was introduced by commit 04f2cbe35699 ("hugetlb: guarantee that COW faults for a process that
>> called mmap(MAP_PRIVATE) on hugetlbfs will succeed") to guarantee that such COW faults succeed.
>> It seems to have nothing to do with MAP_NORESERVE.
>>
>>>
>>>> 	spool->rsv_hpages  29 /* hugepage_subpool_get_pages decreases it. */
>>>> 	h->free_huge_pages 59
>>>> 	h->resv_huge_pages 30 /* rsv_hpages is used, but *h->resv_huge_pages is not modified accordingly*. */
>>>>
>>>>     If the hugetlb page is freed later, we will have:
>>>> 	spool->rsv_hpages  30 /* hugepage_subpool_put_pages increases it. */
>>>> 	h->free_huge_pages 60
>>>> 	h->resv_huge_pages 31 /* *increased wrongly* due to hugepage_subpool_put_pages(spool, 1) == 0. */
>>>> 			   ^^
>>>>
>>>
>>> I'll take a closer look at 2 and 3 when we determine if 1 is a possible
>>> issue or not.
>>
>> I want to propose removing the avoid_reserve code. When called from case 1) or 2) above, vma_needs_reservation()
>> will always return 1, as there's no reservation for the page. Also, hugepage_subpool_get_pages() will always return 1,
>> as the vma is not associated with a spool. So when avoid_reserve == true, map_chg and gbl_chg must be 1 and
>> vma_has_reserves() will always return "false". As a result, passing in avoid_reserve == true actually does nothing,
>> so it can simply be removed. Or am I missing something again?
> 
> I will take a closer look.  But, at a high level if avoid_reserve == true and
> all pages are reserved we must fail the allocation or attempt dynamic
> allocation if overcommit is allowed.  So, it seems we at least need the
> flag to make this decision.

Yes, you're right.

Thanks,
Miaohe Lin


