Re: [PATCH] mm/hugetlb: expand restore_reserve_on_error functionality

From: Mike Kravetz <mike.kravetz@oracle.com>
To: Mina Almasry <almasrymina@google.com>
Cc: Linux-MM <linux-mm@kvack.org>,
	lkml <linux-kernel@vger.kernel.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	Peter Xu <peterx@redhat.com>,
	Muchun Song <songmuchun@bytedance.com>,
	Michal Hocko <mhocko@suse.com>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	stable@vger.kernel.org
Subject: Re: [PATCH] mm/hugetlb: expand restore_reserve_on_error functionality
Date: Wed, 2 Jun 2021 18:07:47 -0700	[thread overview]
Message-ID: <2436bc51-cccb-85dc-423c-0a9c45d80988@oracle.com> (raw)
In-Reply-To: <CAHS8izPD4mY_5MgAfNnjCQPN3Xc3Ttn8KDnT4Ub3XwgiiGbSjw@mail.gmail.com>

On 6/2/21 5:38 PM, Mina Almasry wrote:
> On Wed, Jun 2, 2021 at 4:50 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> The routine restore_reserve_on_error is called to restore reservation
>> information when an error occurs after page allocation.  The routine
>> alloc_huge_page modifies the mapping reserve map and potentially the
>> reserve count during allocation.  If code calling alloc_huge_page
>> encounters an error after allocation and needs to free the page, the
>> reservation information needs to be adjusted.
>>
>> Currently, restore_reserve_on_error only takes action on pages for which
>> the reserve count was adjusted(HPageRestoreReserve flag).  There is
>> nothing wrong with these adjustments.  However, alloc_huge_page ALWAYS
>> modifies the reserve map during allocation even if the reserve count is
>> not adjusted.  This can cause issues as observed during development of
>> this patch [1].
>>
>> One specific series of operations causing an issue is:
>> - Create a shared hugetlb mapping
>>   Reservations for all pages created by default
>> - Fault in a page in the mapping
>>   Reservation exists so reservation count is decremented
>> - Punch a hole in the file/mapping at index previously faulted
>>   Reservation and any associated pages will be removed
>> - Allocate a page to fill the hole
>>   No reservation entry, so reserve count unmodified
>>   Reservation entry added to map by alloc_huge_page
>> - Error after allocation and before instantiating the page
>>   Reservation entry remains in map
>> - Allocate a page to fill the hole
>>   Reservation entry exists, so decrement reservation count
>> This will cause a reservation count underflow as the reservation count
>> was decremented twice for the same index.
>>
>> A user would observe a very large number for HugePages_Rsvd in
>> /proc/meminfo.  This would also likely cause subsequent allocations of
>> hugetlb pages to fail as it would 'appear' that all pages are reserved.
>>
>> This sequence of operations is unlikely to happen, however they were
>> easily reproduced and observed using hacked up code as described in [1].
>>
>> Address the issue by having the routine restore_reserve_on_error take
>> action on pages where HPageRestoreReserve is not set.  In this case, we
>> need to remove any reserve map entry created by alloc_huge_page.  A new
>> helper routine vma_del_reservation assists with this operation.
>>
>> There are three callers of alloc_huge_page which do not currently call
>> restore_reserve_on error before freeing a page on error paths.  Add
>> those missing calls.
>>
>> [1] https://lore.kernel.org/linux-mm/20210528005029.88088-1-almasrymina@google.com/
>> Fixes: 96b96a96ddee ("mm/hugetlb: fix huge page reservation leak in private mapping error paths"
>> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
>> Cc: <stable@vger.kernel.org>
>> ---
>>  fs/hugetlbfs/inode.c    |   1 +
>>  include/linux/hugetlb.h |   2 +
>>  mm/hugetlb.c            | 120 ++++++++++++++++++++++++++++++++--------
>>  3 files changed, 100 insertions(+), 23 deletions(-)
>>
>> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
>> index 21f7a5c92e8a..926eeb9bf4eb 100644
>> --- a/fs/hugetlbfs/inode.c
>> +++ b/fs/hugetlbfs/inode.c
>> @@ -735,6 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>>                 __SetPageUptodate(page);
>>                 error = huge_add_to_page_cache(page, mapping, index);
>>                 if (unlikely(error)) {
>> +                       restore_reserve_on_error(h, &pseudo_vma, addr, page);
>>                         put_page(page);
>>                         mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>>                         goto out;
>> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
>> index 44721df20e89..b375b31f60c4 100644
>> --- a/include/linux/hugetlb.h
>> +++ b/include/linux/hugetlb.h
>> @@ -627,6 +627,8 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
>>                                 unsigned long address);
>>  int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>>                         pgoff_t idx);
>> +void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
>> +                               unsigned long address, struct page *page);
>>
>>  /* arch callback */
>>  int __init __alloc_bootmem_huge_page(struct hstate *h);
>> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
>> index 9a616b7a8563..36b691c3eabf 100644
>> --- a/mm/hugetlb.c
>> +++ b/mm/hugetlb.c
>> @@ -2259,12 +2259,18 @@ static void return_unused_surplus_pages(struct hstate *h,
>>   * be restored when a newly allocated huge page must be freed.  It is
>>   * to be called after calling vma_needs_reservation to determine if a
>>   * reservation exists.
>> + *
>> + * vma_del_reservation is used in error paths where an entry in the reserve
>> + * map was created during huge page allocation and must be removed.  It is to
>> + * be called after calling vma_needs_reservation to determine if a reservation
>> + * exists.
>>   */
>>  enum vma_resv_mode {
>>         VMA_NEEDS_RESV,
>>         VMA_COMMIT_RESV,
>>         VMA_END_RESV,
>>         VMA_ADD_RESV,
>> +       VMA_DEL_RESV,
>>  };
>>  static long __vma_reservation_common(struct hstate *h,
>>                                 struct vm_area_struct *vma, unsigned long addr,
>> @@ -2308,11 +2314,21 @@ static long __vma_reservation_common(struct hstate *h,
>>                         ret = region_del(resv, idx, idx + 1);
>>                 }
>>                 break;
>> +       case VMA_DEL_RESV:
>> +               if (vma->vm_flags & VM_MAYSHARE) {
>> +                       region_abort(resv, idx, idx + 1, 1);
>> +                       ret = region_del(resv, idx, idx + 1);
>> +               } else {
>> +                       ret = region_add(resv, idx, idx + 1, 1, NULL, NULL);
>> +                       /* region_add calls of range 1 should never fail. */
>> +                       VM_BUG_ON(ret < 0);
>> +               }
>> +               break;
> 
> I haven't tested, but at first glance I don't think this quite works,
> no? The thing is that alloc_huge_page() does a
> vma_commit_reservation() which does add_region() regardless if
> vm_flags & VM_MAYSHARE or not, so to unroll that we need a function
> that does a region_del() regardless if vm_flags & VM_MAYSHARE or not.
> I wonder if the root of the bug is the unconditional region_add()
> vma_commit_reservation() does.

Do give it a test.  I beleive I tested in the same way you tested, but
could be mistaken.

We may not want to always unroll.  vma_del_reservation is only called in
the path where HPageRestoreReserve is not set.  So, no reservation entry
was present in the reserve map before the allocation.  alloc_huge_page
likely added an entry via vma_commit_reservation.  Since there was
an error and we will be freeing the page, we need to make sure no
reservation exists.  As you know, making sure a reservation does not
exist is different for shared and private mappings.
- For shared mappings, we need to make sure there is not an entry in the
  reserve map.
- For private mappings, we need to make sure there is an entry in the
  reserve map.
That is why vma_del_reservation does a region_del for shared mappings
and a region_add for private mappings.

> Also, I believe hugetlb_unreserve_pages() calls region_del() directly,
> and doesn't go through the vma_*_reservation() interface. If you're
> adding a delete function it may be nice to convert that to use the new
> function as well.
> 
> I'm happy to take this fix over and submit a v2, since I think I
> understand the problem and can readily reproduce the issue (I just add
> the warning and some prints and run the userfaultfd tests).

Do run your test.  I am interested to see if you experience issues.
-- 
Mike Kravetz