linux-kernel.vger.kernel.org archive mirror
* [PATCH v3] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY
@ 2021-05-21  7:44 Mina Almasry
  2021-05-22 21:19 ` Andrew Morton
  2021-05-24 18:07 ` Mike Kravetz
  0 siblings, 2 replies; 16+ messages in thread
From: Mina Almasry @ 2021-05-21  7:44 UTC (permalink / raw)
  Cc: Mina Almasry, Axel Rasmussen, Peter Xu, linux-mm, Mike Kravetz,
	Andrew Morton, linux-kernel

The userfaultfd hugetlb tests detect a resv_huge_pages underflow. This
happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
an index for which we already have a page in the cache. When this
happens, we allocate a second page, double consuming the reservation,
and then fail to insert the page into the cache and return -EEXIST.

To fix this, we first check if there exists a page in the cache which already
consumed the reservation, and return -EEXIST immediately if so.

Secondly, if we fail to copy the page contents while holding the
hugetlb_fault_mutex, we will drop the mutex and return to the caller
after allocating a page that consumed a reservation. In this case there
may be a fault that double consumes the reservation. To handle this, we
free the allocated page, fix the reservations, and allocate a temporary
hugetlb page and return that to the caller. When the caller does the
copy outside of the lock, we again check the cache, and allocate a page
consuming the reservation, and copy over the contents.
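The double consumption and the fix can be sketched as a small userspace
model (hypothetical helper names and a toy counter, not the kernel code):

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: resv stands in for resv_huge_pages, cached[] for whether
 * the page cache already holds a page at a given index. */
struct sim { unsigned long resv; bool cached[16]; };

/* Old behaviour: allocate first (consuming a reservation), then fail
 * the cache insert with -EEXIST.  Across two calls for the same index
 * the single reservation is consumed twice. */
static int uffdio_copy_old(struct sim *s, int idx)
{
	s->resv--;			/* alloc_huge_page() consumes reserve */
	if (s->cached[idx])
		return -17;		/* -EEXIST, but reserve already gone */
	s->cached[idx] = true;
	return 0;
}

/* Fixed behaviour: check the cache before touching the reserve count. */
static int uffdio_copy_new(struct sim *s, int idx)
{
	if (s->cached[idx])
		return -17;		/* -EEXIST, reserve untouched */
	s->resv--;
	s->cached[idx] = true;
	return 0;
}
```

With one reservation available, calling the old variant twice for the same
index wraps the unsigned counter, while the fixed variant leaves it at zero.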

Test:
Hacked the code locally such that resv_huge_pages underflows produce
a warning and the copy_huge_page_from_user() always fails, then:

./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
./tools/testing/selftests/vm/userfaultfd hugetlb 10
	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success

Both tests succeed and produce no warnings. After the test runs, the
number of free/resv hugepages is correct.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org

---
 include/linux/hugetlb.h |   4 ++
 mm/hugetlb.c            | 103 ++++++++++++++++++++++++++++++++++++----
 mm/migrate.c            |  39 +++------------
 3 files changed, 103 insertions(+), 43 deletions(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b92f25ccef58..427974510965 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -194,6 +194,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
 bool is_hugetlb_entry_migration(pte_t pte);
 void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);

+void hugetlb_copy_page(struct page *dst, struct page *src);
+
 #else /* !CONFIG_HUGETLB_PAGE */

 static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
@@ -379,6 +381,8 @@ static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,

 static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }

+static inline void hugetlb_copy_page(struct page *dst, struct page *src) { }
+
 #endif /* !CONFIG_HUGETLB_PAGE */
 /*
  * hugepages at page global directory. If arch support
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 629aa4c2259c..cb041c97a558 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -81,6 +81,45 @@ struct mutex *hugetlb_fault_mutex_table ____cacheline_aligned_in_smp;
 /* Forward declaration */
 static int hugetlb_acct_memory(struct hstate *h, long delta);

+/*
+ * Gigantic pages are so large that we do not guarantee that page++ pointer
+ * arithmetic will work across the entire page.  We need something more
+ * specialized.
+ */
+static void __copy_gigantic_page(struct page *dst, struct page *src,
+				 int nr_pages)
+{
+	int i;
+	struct page *dst_base = dst;
+	struct page *src_base = src;
+
+	for (i = 0; i < nr_pages;) {
+		cond_resched();
+		copy_highpage(dst, src);
+
+		i++;
+		dst = mem_map_next(dst, dst_base, i);
+		src = mem_map_next(src, src_base, i);
+	}
+}
+
+void hugetlb_copy_page(struct page *dst, struct page *src)
+{
+	int i;
+	struct hstate *h = page_hstate(src);
+	int nr_pages = pages_per_huge_page(h);
+
+	if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
+		__copy_gigantic_page(dst, src, nr_pages);
+		return;
+	}
+
+	for (i = 0; i < nr_pages; i++) {
+		cond_resched();
+		copy_highpage(dst + i, src + i);
+	}
+}
+
 static inline bool subpool_is_free(struct hugepage_subpool *spool)
 {
 	if (spool->count)
@@ -4868,19 +4907,20 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    struct page **pagep)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
-	struct address_space *mapping;
-	pgoff_t idx;
+	struct hstate *h = hstate_vma(dst_vma);
+	struct address_space *mapping = dst_vma->vm_file->f_mapping;
+	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
 	unsigned long size;
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
-	struct hstate *h = hstate_vma(dst_vma);
 	pte_t _dst_pte;
 	spinlock_t *ptl;
-	int ret;
+	int ret = -ENOMEM;
 	struct page *page;
 	int writable;
-
-	mapping = dst_vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
+	struct mempolicy *mpol;
+	nodemask_t *nodemask;
+	gfp_t gfp_mask = htlb_alloc_mask(h);
+	int node = huge_node(dst_vma, dst_addr, gfp_mask, &mpol, &nodemask);

 	if (is_continue) {
 		ret = -EFAULT;
@@ -4888,7 +4928,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		if (!page)
 			goto out;
 	} else if (!*pagep) {
-		ret = -ENOMEM;
+		/* If a page already exists, then it's UFFDIO_COPY for
+		 * a non-missing case. Return -EEXIST.
+		 */
+		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+			ret = -EEXIST;
+			goto out;
+		}
+
 		page = alloc_huge_page(dst_vma, dst_addr, 0);
 		if (IS_ERR(page))
 			goto out;
@@ -4900,12 +4947,48 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		/* fallback to copy_from_user outside mmap_lock */
 		if (unlikely(ret)) {
 			ret = -ENOENT;
+			/* Free the allocated page which may have
+			 * consumed a reservation.
+			 */
+			restore_reserve_on_error(h, dst_vma, dst_addr, page);
+			if (!HPageRestoreReserve(page)) {
+				if (unlikely(hugetlb_unreserve_pages(
+					    mapping->host, idx, idx + 1, 1)))
+					hugetlb_fix_reserve_counts(
+						mapping->host);
+			}
+			put_page(page);
+
+			/* Allocate a temporary page to hold the copied
+			 * contents.
+			 */
+			page = alloc_migrate_huge_page(h, gfp_mask, node,
+						       nodemask);
+			if (IS_ERR(page)) {
+				ret = -ENOMEM;
+				goto out;
+			}
 			*pagep = page;
-			/* don't free the page */
+			/* Set the outparam pagep and return to the caller to
+			 * copy the contents outside the lock. Don't free the
+			 * page.
+			 */
 			goto out;
 		}
 	} else {
-		page = *pagep;
+		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+			put_page(*pagep);
+			ret = -EEXIST;
+			goto out;
+		}
+
+		page = alloc_huge_page(dst_vma, dst_addr, 0);
+		if (IS_ERR(page)) {
+			ret = -ENOMEM;
+			goto out;
+		}
+		__copy_gigantic_page(page, *pagep, pages_per_huge_page(h));
+		put_page(*pagep);
 		*pagep = NULL;
 	}

diff --git a/mm/migrate.c b/mm/migrate.c
index 6b37d00890ca..d3437f9a608d 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -528,28 +528,6 @@ int migrate_huge_page_move_mapping(struct address_space *mapping,
 	return MIGRATEPAGE_SUCCESS;
 }

-/*
- * Gigantic pages are so large that we do not guarantee that page++ pointer
- * arithmetic will work across the entire page.  We need something more
- * specialized.
- */
-static void __copy_gigantic_page(struct page *dst, struct page *src,
-				int nr_pages)
-{
-	int i;
-	struct page *dst_base = dst;
-	struct page *src_base = src;
-
-	for (i = 0; i < nr_pages; ) {
-		cond_resched();
-		copy_highpage(dst, src);
-
-		i++;
-		dst = mem_map_next(dst, dst_base, i);
-		src = mem_map_next(src, src_base, i);
-	}
-}
-
 static void copy_huge_page(struct page *dst, struct page *src)
 {
 	int i;
@@ -557,19 +535,14 @@ static void copy_huge_page(struct page *dst, struct page *src)

 	if (PageHuge(src)) {
 		/* hugetlbfs page */
-		struct hstate *h = page_hstate(src);
-		nr_pages = pages_per_huge_page(h);
-
-		if (unlikely(nr_pages > MAX_ORDER_NR_PAGES)) {
-			__copy_gigantic_page(dst, src, nr_pages);
-			return;
-		}
-	} else {
-		/* thp page */
-		BUG_ON(!PageTransHuge(src));
-		nr_pages = thp_nr_pages(src);
+		hugetlb_copy_page(dst, src);
+		return;
 	}

+	/* thp page */
+	BUG_ON(!PageTransHuge(src));
+	nr_pages = thp_nr_pages(src);
+
 	for (i = 0; i < nr_pages; i++) {
 		cond_resched();
 		copy_highpage(dst + i, src + i);
--
2.31.1.818.g46aad6cb9e-goog


* Re: [PATCH v3] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY
  2021-05-21  7:44 [PATCH v3] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY Mina Almasry
@ 2021-05-22 21:19 ` Andrew Morton
  2021-05-22 21:32   ` Mina Almasry
  2021-05-24 18:07 ` Mike Kravetz
  1 sibling, 1 reply; 16+ messages in thread
From: Andrew Morton @ 2021-05-22 21:19 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Axel Rasmussen, Peter Xu, linux-mm, Mike Kravetz, linux-kernel

On Fri, 21 May 2021 00:44:33 -0700 Mina Almasry <almasrymina@google.com> wrote:

> The userfaultfd hugetlb tests detect a resv_huge_pages underflow. This
> happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
> an index for which we already have a page in the cache. When this
> happens, we allocate a second page, double consuming the reservation,
> and then fail to insert the page into the cache and return -EEXIST.
> 
> To fix this, we first if there exists a page in the cache which already

                       ^ check

> consumed the reservation, and return -EEXIST immediately if so.
> 
> Secondly, if we fail to copy the page contents while holding the
> hugetlb_fault_mutex, we will drop the mutex and return to the caller
> after allocating a page that consumed a reservation. In this case there
> may be a fault that double consumes the reservation. To handle this, we
> free the allocated page, fix the reservations, and allocate a temporary
> hugetlb page and return that to the caller. When the caller does the
> copy outside of the lock, we again check the cache, and allocate a page
> consuming the reservation, and copy over the contents.
> 
> Test:
> Hacked the code locally such that resv_huge_pages underflows produce
> a warning and the copy_huge_page_from_user() always fails, then:
> 
> ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
> 	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> ./tools/testing/selftests/vm/userfaultfd hugetlb 10
> 	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> 
> Both tests succeed and produce no warnings. After the test runs
> number of free/resv hugepages is correct.
>
> ...
>
>  include/linux/hugetlb.h |   4 ++
>  mm/hugetlb.c            | 103 ++++++++++++++++++++++++++++++++++++----
>  mm/migrate.c            |  39 +++------------
>  3 files changed, 103 insertions(+), 43 deletions(-)

I'm assuming we want this in -stable?

Are we able to identify a Fixes: for this?

It's a large change.  Can we come up with some smaller and easier to
review and integrate version which we can feed into 5.13 and -stable
and do the fancier version for 5.14?

If you don't think -stable needs this then this version will be OK as-is.


* Re: [PATCH v3] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY
  2021-05-22 21:19 ` Andrew Morton
@ 2021-05-22 21:32   ` Mina Almasry
  0 siblings, 0 replies; 16+ messages in thread
From: Mina Almasry @ 2021-05-22 21:32 UTC (permalink / raw)
  To: Andrew Morton; +Cc: Axel Rasmussen, Peter Xu, Linux-MM, Mike Kravetz, open list

On Sat, May 22, 2021 at 2:19 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Fri, 21 May 2021 00:44:33 -0700 Mina Almasry <almasrymina@google.com> wrote:
>
> > The userfaultfd hugetlb tests detect a resv_huge_pages underflow. This
> > happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
> > an index for which we already have a page in the cache. When this
> > happens, we allocate a second page, double consuming the reservation,
> > and then fail to insert the page into the cache and return -EEXIST.
> >
> > To fix this, we first if there exists a page in the cache which already
>
>                        ^ check
>
> > consumed the reservation, and return -EEXIST immediately if so.
> >
> > Secondly, if we fail to copy the page contents while holding the
> > hugetlb_fault_mutex, we will drop the mutex and return to the caller
> > after allocating a page that consumed a reservation. In this case there
> > may be a fault that double consumes the reservation. To handle this, we
> > free the allocated page, fix the reservations, and allocate a temporary
> > hugetlb page and return that to the caller. When the caller does the
> > copy outside of the lock, we again check the cache, and allocate a page
> > consuming the reservation, and copy over the contents.
> >
> > Test:
> > Hacked the code locally such that resv_huge_pages underflows produce
> > a warning and the copy_huge_page_from_user() always fails, then:
> >
> > ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
> >       2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> > ./tools/testing/selftests/vm/userfaultfd hugetlb 10
> >       2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> >
> > Both tests succeed and produce no warnings. After the test runs
> > number of free/resv hugepages is correct.
> >
> > ...
> >
> >  include/linux/hugetlb.h |   4 ++
> >  mm/hugetlb.c            | 103 ++++++++++++++++++++++++++++++++++++----
> >  mm/migrate.c            |  39 +++------------
> >  3 files changed, 103 insertions(+), 43 deletions(-)
>
> I'm assuming we want this in -stable?
>

Umm I'll yield to Mike. This is a transient underflow; not actually
THAT serious of an issue. Sorry, I'll clarify that in the commit
message for the next version.

> Are we able to identify a Fixes: for this?
>

No, this issue has been latent for some time. It repros as far back
as 5.11 at least, which is why maybe it's not serious enough to
require -stable.

> It's a large change.  Can we come up with some smaller and easier to
> review and integrate version which we can feed into 5.13 and -stable
> and do the fancier version for 5.14?
>

Yes. If we only do the hugetlbfs_pagecache_present() check then that
gets us some 90% of the way there, the rest of the patch addresses an
unlikely race.

> If you don't think -stable needs this then this version will be OK as-is.


* Re: [PATCH v3] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY
  2021-05-21  7:44 [PATCH v3] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY Mina Almasry
  2021-05-22 21:19 ` Andrew Morton
@ 2021-05-24 18:07 ` Mike Kravetz
  2021-05-25  0:11   ` Mina Almasry
  1 sibling, 1 reply; 16+ messages in thread
From: Mike Kravetz @ 2021-05-24 18:07 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Axel Rasmussen, Peter Xu, linux-mm, Andrew Morton, linux-kernel

On 5/21/21 12:44 AM, Mina Almasry wrote:
> The userfaultfd hugetlb tests detect a resv_huge_pages underflow. This
> happens when hugetlb_mcopy_atomic_pte() is called with !is_continue on
> an index for which we already have a page in the cache. When this
> happens, we allocate a second page, double consuming the reservation,
> and then fail to insert the page into the cache and return -EEXIST.
> 
> To fix this, we first if there exists a page in the cache which already
> consumed the reservation, and return -EEXIST immediately if so.

Thanks Mina.

Andrew brought up the possibility of two separate fixes:
1) A simple one to address the common cause of the underflow
2) A more extensive modification to address all the error path code

I think this is a good idea.  You mentioned that the underflow is
transient and will correct itself.  That is true, however code assumes
the reserve count is an unsigned value always >= the number of free
pages.  If code like the following is run while in this transient state,
we will run into more serious issues:

	/*
	 * A child process with MAP_PRIVATE mappings created by their parent
	 * have no page reserves. This check ensures that reservations are
	 * not "stolen". The child may still get SIGKILLed
	 */
	if (!vma_has_reserves(vma, chg) &&
			h->free_huge_pages - h->resv_huge_pages == 0)
		goto err;

	/* If reserves cannot be used, ensure enough pages are in the pool */
	if (avoid_reserve && h->free_huge_pages - h->resv_huge_pages == 0)
		goto err;
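The danger in those guards comes from unsigned arithmetic: both counters
are unsigned, so if resv_huge_pages transiently exceeds free_huge_pages
the subtraction wraps to a huge value instead of comparing as negative,
and the "== 0" checks are silently skipped. A minimal illustration:

```c
#include <assert.h>

/* Both counters are unsigned long in struct hstate.  When a transient
 * underflow makes the reserve count exceed the free count, the
 * difference wraps around rather than going negative, so a guard of
 * the form (free - resv == 0) is not taken when it should be. */
static unsigned long avail(unsigned long free_huge_pages,
			   unsigned long resv_huge_pages)
{
	return free_huge_pages - resv_huge_pages;
}
```

When resv exceeds free by one, the result is ULONG_MAX, not -1 or 0.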

For this reason, I think the simple change sent to stable would be a
good idea.

Do note that the call to hugetlbfs_pagecache_present() which would be
used in the stable fix is not 100% correct.  See below.

> Secondly, if we fail to copy the page contents while holding the
> hugetlb_fault_mutex, we will drop the mutex and return to the caller
> after allocating a page that consumed a reservation. In this case there
> may be a fault that double consumes the reservation. To handle this, we
> free the allocated page, fix the reservations, and allocate a temporary
> hugetlb page and return that to the caller. When the caller does the
> copy outside of the lock, we again check the cache, and allocate a page
> consuming the reservation, and copy over the contents.
> 
> Test:
> Hacked the code locally such that resv_huge_pages underflows produce
> a warning and the copy_huge_page_from_user() always fails, then:
> 
> ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10
> 	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> ./tools/testing/selftests/vm/userfaultfd hugetlb 10
> 	2 /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> 
> Both tests succeed and produce no warnings. After the test runs
> number of free/resv hugepages is correct.
> 
> Signed-off-by: Mina Almasry <almasrymina@google.com>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Peter Xu <peterx@redhat.com>
> Cc: linux-mm@kvack.org
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Andrew Morton <akpm@linux-foundation.org>
> Cc: linux-mm@kvack.org
> Cc: linux-kernel@vger.kernel.org
> 
> ---
>  include/linux/hugetlb.h |   4 ++
>  mm/hugetlb.c            | 103 ++++++++++++++++++++++++++++++++++++----
>  mm/migrate.c            |  39 +++------------
>  3 files changed, 103 insertions(+), 43 deletions(-)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index b92f25ccef58..427974510965 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -194,6 +194,8 @@ unsigned long hugetlb_change_protection(struct vm_area_struct *vma,
>  bool is_hugetlb_entry_migration(pte_t pte);
>  void hugetlb_unshare_all_pmds(struct vm_area_struct *vma);
> 
> +void hugetlb_copy_page(struct page *dst, struct page *src);
> +
>  #else /* !CONFIG_HUGETLB_PAGE */
> 
>  static inline void reset_vma_resv_huge_pages(struct vm_area_struct *vma)
> @@ -379,6 +381,8 @@ static inline vm_fault_t hugetlb_fault(struct mm_struct *mm,
> 
>  static inline void hugetlb_unshare_all_pmds(struct vm_area_struct *vma) { }
> 
> +static inline void hugetlb_copy_page(struct page *dst, struct page *src);
> +
>  #endif /* !CONFIG_HUGETLB_PAGE */
>  /*
>   * hugepages at page global directory. If arch support

How about just making the existing routine copy_huge_page() callable from
outside migrate.c and avoid all this code movement?  If we want to do
any code movement, I would suggest moving copy_huge_page and routines it
depends on to mm/huge_memory.c.  That seems like a more appropriate
location.  But, I am fine with leaving them in migrate.c

> index 629aa4c2259c..cb041c97a558 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
< snip code dealing with movement of copy routines >
> @@ -4868,19 +4907,20 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  			    struct page **pagep)
>  {
>  	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
> -	struct address_space *mapping;
> -	pgoff_t idx;
> +	struct hstate *h = hstate_vma(dst_vma);
> +	struct address_space *mapping = dst_vma->vm_file->f_mapping;
> +	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
>  	unsigned long size;
>  	int vm_shared = dst_vma->vm_flags & VM_SHARED;
> -	struct hstate *h = hstate_vma(dst_vma);
>  	pte_t _dst_pte;
>  	spinlock_t *ptl;
> -	int ret;
> +	int ret = -ENOMEM;
>  	struct page *page;
>  	int writable;
> -
> -	mapping = dst_vma->vm_file->f_mapping;
> -	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
> +	struct mempolicy *mpol;
> +	nodemask_t *nodemask;
> +	gfp_t gfp_mask = htlb_alloc_mask(h);
> +	int node = huge_node(dst_vma, dst_addr, gfp_mask, &mpol, &nodemask);
> 
>  	if (is_continue) {
>  		ret = -EFAULT;
> @@ -4888,7 +4928,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		if (!page)
>  			goto out;
>  	} else if (!*pagep) {
> -		ret = -ENOMEM;
> +		/* If a page already exists, then it's UFFDIO_COPY for
> +		 * a non-missing case. Return -EEXIST.
> +		 */
> +		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {

We only want to call hugetlbfs_pagecache_present() if vm_shared.
Something like:

		if (vm_shared &&
		    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {

This could be a private mapping of a file in which case we do not care
about the contents of the cache/file.  In fact, it could provide a false
positive.

> +			ret = -EEXIST;
> +			goto out;
> +		}
> +
>  		page = alloc_huge_page(dst_vma, dst_addr, 0);
>  		if (IS_ERR(page))
>  			goto out;
> @@ -4900,12 +4947,48 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>  		/* fallback to copy_from_user outside mmap_lock */
>  		if (unlikely(ret)) {
>  			ret = -ENOENT;
> +			/* Free the allocated page which may have
> +			 * consumed a reservation.
> +			 */
> +			restore_reserve_on_error(h, dst_vma, dst_addr, page);

Good!

> +			if (!HPageRestoreReserve(page)) {
> +				if (unlikely(hugetlb_unreserve_pages(
> +					    mapping->host, idx, idx + 1, 1)))
> +					hugetlb_fix_reserve_counts(
> +						mapping->host);
> +			}

I do not understand the need to call hugetlb_unreserve_pages().  The
call to restore_reserve_on_error 'should' fix up the reserve map to
align with restoring the reserve count in put_page/free_huge_page.
Can you explain why that is there?

> +			put_page(page);
> +
> +			/* Allocate a temporary page to hold the copied
> +			 * contents.
> +			 */
> +			page = alloc_migrate_huge_page(h, gfp_mask, node,
> +						       nodemask);

I would suggest calling alloc_huge_page_vma to allocate the page here.
alloc_huge_page_vma will allocate a huge page from the pool if one is
available, and fall back to the buddy allocator.  This is the way the
migration code works.

alloc_huge_page_vma has the advantage that it 'can' work for gigantic
pages if some are available in the pool.  Note that alloc_migrate_huge_page
immediately checks for and fails if a gigantic page is needed.

If we use alloc_migrate_huge_page, the userfaultfd selftest you are
using will fail when tested with 'hacked code' to ALWAYS take this
error path.  That is because we will need two huge pages (one
temporary) and only one is available.  I believe this is OK.  As
mentioned, this is how the migration code works.

> +			if (IS_ERR(page)) {
> +				ret = -ENOMEM;
> +				goto out;
> +			}
>  			*pagep = page;
> -			/* don't free the page */
> +			/* Set the outparam pagep and return to the caller to
> +			 * copy the contents outside the lock. Don't free the
> +			 * page.
> +			 */
>  			goto out;
>  		}
>  	} else {
> -		page = *pagep;
> +		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {

Again, check for vm_shared before calling hugetlbfs_pagecache_present.

> +			put_page(*pagep);
> +			ret = -EEXIST;
> +			goto out;

We need to set *pagep = NULL before returning.  The calling routine
__mcopy_atomic_hugetlb will BUG() if we return with an error code other
than -ENOENT and *pagep not NULL.

> +		}
> +
> +		page = alloc_huge_page(dst_vma, dst_addr, 0);
> +		if (IS_ERR(page)) {
> +			ret = -ENOMEM;

Again, clear *pagep before returning.

> +			goto out;
> +		}
> +		__copy_gigantic_page(page, *pagep, pages_per_huge_page(h));

See previous discussion about copy routine.

> +		put_page(*pagep);
>  		*pagep = NULL;
>  	}

Later in hugetlb_mcopy_atomic_pte, we will update the page cache and
page table.  There are error checks when performing these operations.
It is unlikely we will ever hit one of these errors.  However, if we
do encounter an error, the code should call restore_reserve_on_error
before the put_page.  Perhaps something like:

@@ -4996,6 +4996,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (vm_shared || is_continue)
 		unlock_page(page);
 out_release_nounlock:
+	restore_reserve_on_error(h, dst_vma, dst_addr, page);
 	put_page(page);
 	goto out;
 }


> 
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 6b37d00890ca..d3437f9a608d 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
< snip code dealing with movement of copy routines >

With changes like those above in hugetlb_mcopy_atomic_pte, we should be
able to remove the following code in __mcopy_atomic_hugetlb.  We can do
this because any page returned/passed to __mcopy_atomic_hugetlb will NOT
have consumed a reserve.  It would be a nice cleanup as that big comment
explains how we currently guess what is the right thing to do in this
situation.

diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 63a73e164d55..e13a0492b7ba 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -346,54 +346,8 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 out_unlock:
 	mmap_read_unlock(dst_mm);
 out:
-	if (page) {
-		/*
-		 * We encountered an error and are about to free a newly
-		 * allocated huge page.
-		 *
-		 * Reservation handling is very subtle, and is different for
-		 * private and shared mappings.  See the routine
-		 * restore_reserve_on_error for details.  Unfortunately, we
-		 * can not call restore_reserve_on_error now as it would
-		 * require holding mmap_lock.
-		 *
-		 * If a reservation for the page existed in the reservation
-		 * map of a private mapping, the map was modified to indicate
-		 * the reservation was consumed when the page was allocated.
-		 * We clear the HPageRestoreReserve flag now so that the global
-		 * reserve count will not be incremented in free_huge_page.
-		 * The reservation map will still indicate the reservation
-		 * was consumed and possibly prevent later page allocation.
-		 * This is better than leaking a global reservation.  If no
-		 * reservation existed, it is still safe to clear
-		 * HPageRestoreReserve as no adjustments to reservation counts
-		 * were made during allocation.
-		 *
-		 * The reservation map for shared mappings indicates which
-		 * pages have reservations.  When a huge page is allocated
-		 * for an address with a reservation, no change is made to
-		 * the reserve map.  In this case HPageRestoreReserve will be
-		 * set to indicate that the global reservation count should be
-		 * incremented when the page is freed.  This is the desired
-		 * behavior.  However, when a huge page is allocated for an
-		 * address without a reservation a reservation entry is added
-		 * to the reservation map, and HPageRestoreReserve will not be
-		 * set. When the page is freed, the global reserve count will
-		 * NOT be incremented and it will appear as though we have
-		 * leaked reserved page.  In this case, set HPageRestoreReserve
-		 * so that the global reserve count will be incremented to
-		 * match the reservation map entry which was created.
-		 *
-		 * Note that vm_alloc_shared is based on the flags of the vma
-		 * for which the page was originally allocated.  dst_vma could
-		 * be different or NULL on error.
-		 */
-		if (vm_alloc_shared)
-			SetHPageRestoreReserve(page);
-		else
-			ClearHPageRestoreReserve(page);
+	if (page)
 		put_page(page);
-	}
 	BUG_ON(copied < 0);
 	BUG_ON(err > 0);
 	BUG_ON(!copied && !err);

-- 
Mike Kravetz


* Re: [PATCH v3] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY
  2021-05-24 18:07 ` Mike Kravetz
@ 2021-05-25  0:11   ` Mina Almasry
  2021-05-25  0:45     ` Mike Kravetz
  0 siblings, 1 reply; 16+ messages in thread
From: Mina Almasry @ 2021-05-25  0:11 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Axel Rasmussen, Peter Xu, Linux-MM, Andrew Morton, open list

> > +                     if (!HPageRestoreReserve(page)) {
> > +                             if (unlikely(hugetlb_unreserve_pages(
> > +                                         mapping->host, idx, idx + 1, 1)))
> > +                                     hugetlb_fix_reserve_counts(
> > +                                             mapping->host);
> > +                     }
>
> I do not understand the need to call hugetlb_unreserve_pages().  The
> call to restore_reserve_on_error 'should' fix up the reserve map to
> align with restoring the reserve count in put_page/free_huge_page.
> Can you explain why that is there?
>

AFAICT here is what happens for a given index *without* the call to
hugetlb_unreserve_pages():

1. hugetlb_no_page() allocates a page consuming the reservation,
resv_huge_pages decrements.
2. remove_inode_hugepages() does remove_huge_page() and
hugetlb_unreserve_pages(). This removes the entry from the resv_map,
but does NOT increment back the resv_huge_pages. Because we removed
the entry, it looks like we have no reservation for this index.
free_huge_page() gets called on this page, and resv_huge_pages is not
incremented, I'm not sure why. This page should have come from the
reserves.
3. hugetlb_mcopy_atomic_pte() gets called for this index. Because of
the prior call to hugetlb_unreserve_pages(), there is no entry in the
resv_map for this index, which means it looks like we don't have a
reservation for this index. We allocate a page outside the reserves
(deferred_reservation=1, HPageRestoreReserve=0), add an entry into
resv_map, and don't modify resv_huge_pages.
4. The copy fails and we deallocate the page, since
HPageRestoreReserve==0 for this page, restore_reserve_on_error() does
nothing.
5. hugetlb_mcopy_atomic_pte() gets recalled with the temporary page,
and we allocate another page. Now, since we added an entry in the
resv_map in the previous allocation, it looks like we have a
reservation for this allocation. We allocate a page with
deferred_reserve=0 && HPageRestoreReserve=1, we decrement
resv_huge_pages. Boom, we decremented resv_huge_pages twice for this
index, never incremented it.

To fix this, in step 4, when I deallocate a page, I check
HPageRestoreReserve(page). If HPageRestoreReserve=0, then this
reservation was consumed and deallocated before, and so I need to
remove the entry from the resv_map.
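The five steps above can be modelled with a toy state machine (a userspace
sketch with made-up names, not kernel code), where has_entry stands in for
the resv_map entry at one index and resv for resv_huge_pages:

```c
#include <assert.h>
#include <stdbool.h>

struct state { bool has_entry; long resv; };

/* Step 3: allocate without a resv_map entry -> an entry is added but
 * the reserve count is untouched (the page came from outside the
 * reserves). */
static void alloc_no_reserve(struct state *s)
{
	s->has_entry = true;
}

/* Step 4: the copy fails and the page is freed.  Without the fix the
 * stale resv_map entry is left behind; with the fix (the
 * hugetlb_unreserve_pages() call in the patch) it is removed. */
static void free_on_error(struct state *s, bool fixed)
{
	if (fixed)
		s->has_entry = false;
}

/* Step 5: reallocate.  An existing resv_map entry now looks like a
 * reservation, so resv_huge_pages is decremented. */
static void realloc_page(struct state *s)
{
	if (s->has_entry)
		s->resv--;
}
```

Running the sequence without the fix leaves resv one lower than it should
be, which is the decrement-without-matching-increment Mina describes.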


* Re: [PATCH v3] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY
  2021-05-25  0:11   ` Mina Almasry
@ 2021-05-25  0:45     ` Mike Kravetz
  2021-05-25 23:31       ` [PATCH 0/2] Track reserve map changes to restore on error Mike Kravetz
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Kravetz @ 2021-05-25  0:45 UTC (permalink / raw)
  To: Mina Almasry; +Cc: Axel Rasmussen, Peter Xu, Linux-MM, Andrew Morton, open list

On 5/24/21 5:11 PM, Mina Almasry wrote:
>>> +                     if (!HPageRestoreReserve(page)) {
>>> +                             if (unlikely(hugetlb_unreserve_pages(
>>> +                                         mapping->host, idx, idx + 1, 1)))
>>> +                                     hugetlb_fix_reserve_counts(
>>> +                                             mapping->host);
>>> +                     }
>>
>> I do not understand the need to call hugetlb_unreserve_pages().  The
>> call to restore_reserve_on_error 'should' fix up the reserve map to
>> align with restoring the reserve count in put_page/free_huge_page.
>> Can you explain why that is there?
>>
> 
> AFAICT here is what happens for a given index *without* the call to
> hugetlb_unreserve_pages():
> 
> 1. hugetlb_no_page() allocates a page consuming the reservation,
> resv_huge_pages decrements.

Ok

> 2. remove_inode_hugepages() does remove_huge_page() and
> hugetlb_unreserve_pages(). This removes the entry from the resv_map,
> but does NOT increment back the resv_huge_pages. Because we removed
> the entry, it looks like we have no reservation for this index.
> free_huge_page() gets called on this page, and resv_huge_pages is not
> incremented, I'm not sure why. This page should have come from the
> reserves.

This corresponds to a 'hole punch' operation.  When hugetlbfs hole punch
was added, a design decision was made to not try to restore reservations.
IIRC, this is because the hole punch is associated with the file and
reservations are associated with mappings.  Trying to 'go back' to
associated mappings and determine if a reservation should be restored
would be a difficult exercise.

> 3. hugetlb_mcopy_atomic_pte() gets called for this index. Because of
> the prior call to hugetlb_unreserve_pages(), there is no entry in the
> resv_map for this index, which means it looks like we don't have a
> reservation for this index. We allocate a page outside the reserves
> (deferred_reserve=1, HPageRestoreReserve=0), add an entry into
> resv_map, and don't modify resv_huge_pages.

Ok

> 4. The copy fails and we deallocate the page, since
> HPageRestoreReserve==0 for this page, restore_reserve_on_error() does
> nothing.

Yes.  And this may be something we want to fix in the more general case.
i.e. I believe this can happen in code paths other than
hugetlb_mcopy_atomic_pte().  Rather than add special code here, I'll look
into updating restore_reserve_on_error to handle this.  Currently
restore_reserve_on_error only handles the case where HPageRestoreReserve
flags is set.

Thanks for explaining this!  I forgot about this 'special case' where
there is not an entry in the reserve map for a shared mapping.

-- 
Mike Kravetz

> 5. hugetlb_mcopy_atomic_pte() gets called again with the temporary page,
> and we allocate another page. Now, since we added an entry in the
> resv_map in the previous allocation, it looks like we have a
> reservation for this allocation. We allocate a page with
> deferred_reserve=0 && HPageRestoreReserve=1, we decrement
> resv_huge_pages. Boom, we decremented resv_huge_pages twice for this
> index, never incremented it.
> 
> To fix this, in step 4, when I deallocate a page, I check
> HPageRestoreReserve(page). If HPageRestoreReserve=0, then this
> reservation was consumed and deallocated before, and so I need to
> remove the entry from the resv_map.


* [PATCH 0/2] Track reserve map changes to restore on error
  2021-05-25  0:45     ` Mike Kravetz
@ 2021-05-25 23:31       ` Mike Kravetz
  2021-05-25 23:31         ` [PATCH 1/2] hugetlb: rename HPageRestoreReserve flag to HPageRestoreRsvCnt Mike Kravetz
                           ` (2 more replies)
  0 siblings, 3 replies; 16+ messages in thread
From: Mike Kravetz @ 2021-05-25 23:31 UTC (permalink / raw)
  To: Mina Almasry, Linux-MM, open list
  Cc: Axel Rasmussen, Peter Xu, Andrew Morton, Mike Kravetz

Here is a modification to the reservation tracking for fixup on errors.
It is a more general change, but should work for the hugetlb_mcopy_atomic_pte
case as well.

Perhaps use this as a prerequisite for your fix(es)?  Pretty sure this
will eliminate the need for the call to hugetlb_unreserve_pages.

Mike Kravetz (2):
  hugetlb: rename HPageRestoreReserve flag to HPageRestoreRsvCnt
  hugetlb: add new hugetlb specific flag HPG_restore_rsv_map

 fs/hugetlbfs/inode.c    |   3 +-
 include/linux/hugetlb.h |  17 +++++--
 mm/hugetlb.c            | 108 ++++++++++++++++++++++++++++------------
 mm/userfaultfd.c        |  14 +++---
 4 files changed, 99 insertions(+), 43 deletions(-)

-- 
2.31.1



* [PATCH 1/2] hugetlb: rename HPageRestoreReserve flag to HPageRestoreRsvCnt
  2021-05-25 23:31       ` [PATCH 0/2] Track reserve map changes to restore on error Mike Kravetz
@ 2021-05-25 23:31         ` Mike Kravetz
  2021-05-27  2:49           ` Mina Almasry
  2021-05-25 23:31         ` [PATCH 2/2] hugetlb: add new hugetlb specific flag HPG_restore_rsv_map Mike Kravetz
  2021-05-26  3:19         ` [External] [PATCH 0/2] Track reserve map changes to restore on error Muchun Song
  2 siblings, 1 reply; 16+ messages in thread
From: Mike Kravetz @ 2021-05-25 23:31 UTC (permalink / raw)
  To: Mina Almasry, Linux-MM, open list
  Cc: Axel Rasmussen, Peter Xu, Andrew Morton, Mike Kravetz

The hugetlb page specific flag HPageRestoreReserve is used to indicate
that a reservation was consumed and should be restored on error.  More
specifically, this is related to the global reservation count.

It is also possible to have a hugetlb page allocated where the global
count is not modified, but the reservation map is modified.  In such
cases, the reserve map needs modification in error paths.  Code
performing such modifications will be introduced in a subsequent patch.

Rename the flag HPageRestoreReserve to HPageRestoreRsvCnt so that it
is more clearly associated with the global reservation count.  No
functional changes in this patch, just renaming.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    |  2 +-
 include/linux/hugetlb.h |  6 +++---
 mm/hugetlb.c            | 32 ++++++++++++++++----------------
 mm/userfaultfd.c        | 14 +++++++-------
 4 files changed, 27 insertions(+), 27 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index 55efd3dd04f6..bb4de5dcd652 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -529,7 +529,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
 			 * the subpool and global reserve usage count can need
 			 * to be adjusted.
 			 */
-			VM_BUG_ON(HPageRestoreReserve(page));
+			VM_BUG_ON(HPageRestoreRsvCnt(page));
 			remove_huge_page(page);
 			freed++;
 			if (!truncate_op) {
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index 03ca83db0a3e..e5e363fa5d02 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -511,7 +511,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
  * of the hugetlb head page.  Functions created via the below macros should be
  * used to manipulate these flags.
  *
- * HPG_restore_reserve - Set when a hugetlb page consumes a reservation at
+ * HPG_restore_rsv_cnt - Set when a hugetlb page consumes a reservation at
  *	allocation time.  Cleared when page is fully instantiated.  Free
  *	routine checks flag to restore a reservation on error paths.
  *	Synchronization:  Examined or modified by code that knows it has
@@ -535,7 +535,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
  * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
  */
 enum hugetlb_page_flags {
-	HPG_restore_reserve = 0,
+	HPG_restore_rsv_cnt = 0,
 	HPG_migratable,
 	HPG_temporary,
 	HPG_freed,
@@ -581,7 +581,7 @@ static inline void ClearHPage##uname(struct page *page)		\
 /*
  * Create functions associated with hugetlb page flags
  */
-HPAGEFLAG(RestoreReserve, restore_reserve)
+HPAGEFLAG(RestoreRsvCnt, restore_rsv_cnt)
 HPAGEFLAG(Migratable, migratable)
 HPAGEFLAG(Temporary, temporary)
 HPAGEFLAG(Freed, freed)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index c3b2a8a494d6..2a8cea253388 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1165,7 +1165,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
 	nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
 	page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
 	if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
-		SetHPageRestoreReserve(page);
+		SetHPageRestoreRsvCnt(page);
 		h->resv_huge_pages--;
 	}
 
@@ -1549,11 +1549,11 @@ void free_huge_page(struct page *page)
 
 	hugetlb_set_page_subpool(page, NULL);
 	page->mapping = NULL;
-	restore_reserve = HPageRestoreReserve(page);
-	ClearHPageRestoreReserve(page);
+	restore_reserve = HPageRestoreRsvCnt(page);
+	ClearHPageRestoreRsvCnt(page);
 
 	/*
-	 * If HPageRestoreReserve was set on page, page allocation consumed a
+	 * If HPageRestoreRsvCnt was set on page, page allocation consumed a
 	 * reservation.  If the page was associated with a subpool, there
 	 * would have been a page reserved in the subpool before allocation
 	 * via hugepage_subpool_get_pages().  Since we are 'restoring' the
@@ -2364,9 +2364,9 @@ static long vma_add_reservation(struct hstate *h,
  * specific error paths, a huge page was allocated (via alloc_huge_page)
  * and is about to be freed.  If a reservation for the page existed,
  * alloc_huge_page would have consumed the reservation and set
- * HPageRestoreReserve in the newly allocated page.  When the page is freed
+ * HPageRestoreRsvCnt in the newly allocated page.  When the page is freed
  * via free_huge_page, the global reservation count will be incremented if
- * HPageRestoreReserve is set.  However, free_huge_page can not adjust the
+ * HPageRestoreRsvCnt is set.  However, free_huge_page can not adjust the
  * reserve map.  Adjust the reserve map here to be consistent with global
  * reserve count adjustments to be made by free_huge_page.
  */
@@ -2374,13 +2374,13 @@ static void restore_reserve_on_error(struct hstate *h,
 			struct vm_area_struct *vma, unsigned long address,
 			struct page *page)
 {
-	if (unlikely(HPageRestoreReserve(page))) {
+	if (unlikely(HPageRestoreRsvCnt(page))) {
 		long rc = vma_needs_reservation(h, vma, address);
 
 		if (unlikely(rc < 0)) {
 			/*
 			 * Rare out of memory condition in reserve map
-			 * manipulation.  Clear HPageRestoreReserve so that
+			 * manipulation.  Clear HPageRestoreRsvCnt so that
 			 * global reserve count will not be incremented
 			 * by free_huge_page.  This will make it appear
 			 * as though the reservation for this page was
@@ -2389,7 +2389,7 @@ static void restore_reserve_on_error(struct hstate *h,
 			 * is better than inconsistent global huge page
 			 * accounting of reserve counts.
 			 */
-			ClearHPageRestoreReserve(page);
+			ClearHPageRestoreRsvCnt(page);
 		} else if (rc) {
 			rc = vma_add_reservation(h, vma, address);
 			if (unlikely(rc < 0))
@@ -2397,7 +2397,7 @@ static void restore_reserve_on_error(struct hstate *h,
 				 * See above comment about rare out of
 				 * memory condition.
 				 */
-				ClearHPageRestoreReserve(page);
+				ClearHPageRestoreRsvCnt(page);
 		} else
 			vma_end_reservation(h, vma, address);
 	}
@@ -2602,7 +2602,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 		if (!page)
 			goto out_uncharge_cgroup;
 		if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
-			SetHPageRestoreReserve(page);
+			SetHPageRestoreRsvCnt(page);
 			h->resv_huge_pages--;
 		}
 		spin_lock_irq(&hugetlb_lock);
@@ -4052,7 +4052,7 @@ hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr
 	set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1));
 	hugepage_add_new_anon_rmap(new_page, vma, addr);
 	hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
-	ClearHPageRestoreReserve(new_page);
+	ClearHPageRestoreRsvCnt(new_page);
 	SetHPageMigratable(new_page);
 }
 
@@ -4525,7 +4525,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	spin_lock(ptl);
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
-		ClearHPageRestoreReserve(new_page);
+		ClearHPageRestoreRsvCnt(new_page);
 
 		/* Break COW */
 		huge_ptep_clear_flush(vma, haddr, ptep);
@@ -4592,7 +4592,7 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 
 	if (err)
 		return err;
-	ClearHPageRestoreReserve(page);
+	ClearHPageRestoreRsvCnt(page);
 
 	/*
 	 * set page dirty so that it will not be removed from cache/file
@@ -4775,7 +4775,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		goto backout;
 
 	if (anon_rmap) {
-		ClearHPageRestoreReserve(page);
+		ClearHPageRestoreRsvCnt(page);
 		hugepage_add_new_anon_rmap(page, vma, haddr);
 	} else
 		page_dup_rmap(page, true);
@@ -5096,7 +5096,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (vm_shared) {
 		page_dup_rmap(page, true);
 	} else {
-		ClearHPageRestoreReserve(page);
+		ClearHPageRestoreRsvCnt(page);
 		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
 	}
 
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 9ce5a3793ad4..58c706697b17 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -432,27 +432,27 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		 * If a reservation for the page existed in the reservation
 		 * map of a private mapping, the map was modified to indicate
 		 * the reservation was consumed when the page was allocated.
-		 * We clear the HPageRestoreReserve flag now so that the global
+		 * We clear the HPageRestoreRsvCnt flag now so that the global
 		 * reserve count will not be incremented in free_huge_page.
 		 * The reservation map will still indicate the reservation
 		 * was consumed and possibly prevent later page allocation.
 		 * This is better than leaking a global reservation.  If no
 		 * reservation existed, it is still safe to clear
-		 * HPageRestoreReserve as no adjustments to reservation counts
+		 * HPageRestoreRsvCnt as no adjustments to reservation counts
 		 * were made during allocation.
 		 *
 		 * The reservation map for shared mappings indicates which
 		 * pages have reservations.  When a huge page is allocated
 		 * for an address with a reservation, no change is made to
-		 * the reserve map.  In this case HPageRestoreReserve will be
+		 * the reserve map.  In this case HPageRestoreRsvCnt will be
 		 * set to indicate that the global reservation count should be
 		 * incremented when the page is freed.  This is the desired
 		 * behavior.  However, when a huge page is allocated for an
 		 * address without a reservation a reservation entry is added
-		 * to the reservation map, and HPageRestoreReserve will not be
+		 * to the reservation map, and HPageRestoreRsvCnt will not be
 		 * set. When the page is freed, the global reserve count will
 		 * NOT be incremented and it will appear as though we have
-		 * leaked reserved page.  In this case, set HPageRestoreReserve
+		 * leaked reserved page.  In this case, set HPageRestoreRsvCnt
 		 * so that the global reserve count will be incremented to
 		 * match the reservation map entry which was created.
 		 *
@@ -461,9 +461,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
 		 * be different or NULL on error.
 		 */
 		if (vm_alloc_shared)
-			SetHPageRestoreReserve(page);
+			SetHPageRestoreRsvCnt(page);
 		else
-			ClearHPageRestoreReserve(page);
+			ClearHPageRestoreRsvCnt(page);
 		put_page(page);
 	}
 	BUG_ON(copied < 0);
-- 
2.31.1



* [PATCH 2/2] hugetlb: add new hugetlb specific flag HPG_restore_rsv_map
  2021-05-25 23:31       ` [PATCH 0/2] Track reserve map changes to restore on error Mike Kravetz
  2021-05-25 23:31         ` [PATCH 1/2] hugetlb: rename HPageRestoreReserve flag to HPageRestoreRsvCnt Mike Kravetz
@ 2021-05-25 23:31         ` Mike Kravetz
  2021-05-27  2:58           ` Mina Almasry
  2021-05-26  3:19         ` [External] [PATCH 0/2] Track reserve map changes to restore on error Muchun Song
  2 siblings, 1 reply; 16+ messages in thread
From: Mike Kravetz @ 2021-05-25 23:31 UTC (permalink / raw)
  To: Mina Almasry, Linux-MM, open list
  Cc: Axel Rasmussen, Peter Xu, Andrew Morton, Mike Kravetz

When a hugetlb page is allocated via alloc_huge_page, the reserve map
as well as global reservation count may be modified.  In case of error
after allocation, the count and map should be restored to their previous
state if possible.  The flag HPageRestoreRsvCnt indicates the global
count was modified.  Add a new flag HPG_restore_rsv_map to indicate the
reserve map was modified.  Note that during hugetlb page allocation the
global count and reserve map could be modified independently.
Therefore, two specific flags are needed.

The routine restore_reserve_on_error is called to restore reserve data
on error paths.  Modify the routine to check for the HPG_restore_rsv_map
flag and adjust the reserve map accordingly.

Add missing calls to restore_reserve_on_error to error paths of code
calling alloc_huge_page.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 fs/hugetlbfs/inode.c    |  1 +
 include/linux/hugetlb.h | 11 ++++++
 mm/hugetlb.c            | 82 +++++++++++++++++++++++++++++++----------
 3 files changed, 75 insertions(+), 19 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index bb4de5dcd652..9d846a2edc4b 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -735,6 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
 		__SetPageUptodate(page);
 		error = huge_add_to_page_cache(page, mapping, index);
 		if (unlikely(error)) {
+			restore_reserve_on_error(h, &pseudo_vma, addr, page);
 			put_page(page);
 			mutex_unlock(&hugetlb_fault_mutex_table[hash]);
 			goto out;
diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index e5e363fa5d02..da2251b0c609 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -517,6 +517,13 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
  *	Synchronization:  Examined or modified by code that knows it has
  *	the only reference to page.  i.e. After allocation but before use
  *	or when the page is being freed.
+ * HPG_restore_rsv_map - Set when a hugetlb page allocation results in adding
+ *	an entry to the reserve map.  This can happen without adjustment of
+ *	the global reserve count.  Cleared when page is fully instantiated.
+ *	Error paths (restore_reserve_on_error) check this flag to make
+ *	adjustments to the reserve map.
+ *	Synchronization:  Examined or modified by code that knows it has
+ *	the only reference to page.  i.e. After allocation but before use.
  * HPG_migratable  - Set after a newly allocated page is added to the page
  *	cache and/or page tables.  Indicates the page is a candidate for
  *	migration.
@@ -536,6 +543,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
  */
 enum hugetlb_page_flags {
 	HPG_restore_rsv_cnt = 0,
+	HPG_restore_rsv_map,
 	HPG_migratable,
 	HPG_temporary,
 	HPG_freed,
@@ -582,6 +590,7 @@ static inline void ClearHPage##uname(struct page *page)		\
  * Create functions associated with hugetlb page flags
  */
 HPAGEFLAG(RestoreRsvCnt, restore_rsv_cnt)
+HPAGEFLAG(RestoreRsvMap, restore_rsv_map)
 HPAGEFLAG(Migratable, migratable)
 HPAGEFLAG(Temporary, temporary)
 HPAGEFLAG(Freed, freed)
@@ -633,6 +642,8 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
 				unsigned long address);
 int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 			pgoff_t idx);
+void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
+		unsigned long address, struct page *page);
 
 /* arch callback */
 int __init __alloc_bootmem_huge_page(struct hstate *h);
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 2a8cea253388..1c3a68d70ab5 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1551,6 +1551,7 @@ void free_huge_page(struct page *page)
 	page->mapping = NULL;
 	restore_reserve = HPageRestoreRsvCnt(page);
 	ClearHPageRestoreRsvCnt(page);
+	ClearHPageRestoreRsvMap(page);
 
 	/*
 	 * If HPageRestoreRsvCnt was set on page, page allocation consumed a
@@ -2360,24 +2361,26 @@ static long vma_add_reservation(struct hstate *h,
 }
 
 /*
- * This routine is called to restore a reservation on error paths.  In the
- * specific error paths, a huge page was allocated (via alloc_huge_page)
- * and is about to be freed.  If a reservation for the page existed,
- * alloc_huge_page would have consumed the reservation and set
- * HPageRestoreRsvCnt in the newly allocated page.  When the page is freed
- * via free_huge_page, the global reservation count will be incremented if
- * HPageRestoreRsvCnt is set.  However, free_huge_page can not adjust the
- * reserve map.  Adjust the reserve map here to be consistent with global
- * reserve count adjustments to be made by free_huge_page.
+ * This routine is called to restore reservation data on error paths.
+ * It handles two specific cases for pages allocated via alloc_huge_page:
+ * 1) A reservation was in place and page consumed the reservation.
+ *    HPageRestoreRsvCnt is set in the page.
+ * 2) No reservation was in place for the page, so HPageRestoreRsvCnt is
+ *    not set.  However, the reserve map was updated.
+ * In case 1, free_huge_page will increment the global reserve count.  But,
+ * free_huge_page does not have enough context to adjust the reservation map.
+ * This case deals primarily with private mappings.  Adjust the reserve map
+ * here to be consistent with global reserve count adjustments to be made
+ * by free_huge_page.
+ * In case 2, simply undo any reserve map modifications done by alloc_huge_page.
  */
-static void restore_reserve_on_error(struct hstate *h,
-			struct vm_area_struct *vma, unsigned long address,
-			struct page *page)
+void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
+				unsigned long address, struct page *page)
 {
 	if (unlikely(HPageRestoreRsvCnt(page))) {
 		long rc = vma_needs_reservation(h, vma, address);
 
-		if (unlikely(rc < 0)) {
+		if (unlikely(rc < 0))
 			/*
 			 * Rare out of memory condition in reserve map
 			 * manipulation.  Clear HPageRestoreRsvCnt so that
@@ -2390,16 +2393,47 @@ static void restore_reserve_on_error(struct hstate *h,
 			 * accounting of reserve counts.
 			 */
 			ClearHPageRestoreRsvCnt(page);
-		} else if (rc) {
-			rc = vma_add_reservation(h, vma, address);
-			if (unlikely(rc < 0))
+		else if (rc)
+			vma_add_reservation(h, vma, address);
+		else
+			vma_end_reservation(h, vma, address);
+	} else if (unlikely(HPageRestoreRsvMap(page))) {
+		struct resv_map *resv = vma_resv_map(vma);
+		pgoff_t idx = vma_hugecache_offset(h, vma, address);
+		long rc;
+
+		/*
+		 * This handles the specific case where the reserve count
+		 * was not updated during the page allocation process, but
+		 * the reserve map was updated.  We need to undo the reserve
+		 * map update.
+		 *
+		 * The presence of an entry in the reserve map has opposite
+		 * meanings for shared and private mappings.
+		 */
+		if (vma->vm_flags & VM_MAYSHARE) {
+			rc = region_del(resv, idx, idx + 1);
+			if (rc < 0)
+				/*
+				 * Rare out of memory condition.  Since we can
+				 * not delete the reserve entry, set
+				 * HPageRestoreRsvCnt so that the global count
+				 * will be consistent with the reserve map.
+				 */
+				SetHPageRestoreRsvCnt(page);
+		} else {
+			rc = vma_needs_reservation(h, vma, address);
+			if (rc < 0)
 				/*
 				 * See above comment about rare out of
 				 * memory condition.
 				 */
-				ClearHPageRestoreRsvCnt(page);
-		} else
-			vma_end_reservation(h, vma, address);
+				SetHPageRestoreRsvCnt(page);
+			else if (rc)
+				vma_add_reservation(h, vma, address);
+			else
+				vma_end_reservation(h, vma, address);
+		}
 	}
 }
 
@@ -2641,6 +2675,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
 			hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
 					pages_per_huge_page(h), page);
 	}
+	if (map_commit)
+		SetHPageRestoreRsvMap(page);
 	return page;
 
 out_uncharge_cgroup:
@@ -4053,6 +4089,7 @@ hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr
 	hugepage_add_new_anon_rmap(new_page, vma, addr);
 	hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
 	ClearHPageRestoreRsvCnt(new_page);
+	ClearHPageRestoreRsvMap(new_page);
 	SetHPageMigratable(new_page);
 }
 
@@ -4174,6 +4211,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
 				spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
 				entry = huge_ptep_get(src_pte);
 				if (!pte_same(src_pte_old, entry)) {
+					restore_reserve_on_error(h, vma, addr,
+								new);
 					put_page(new);
 					/* dst_entry won't change as in child */
 					goto again;
@@ -4526,6 +4565,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
 	ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
 	if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
 		ClearHPageRestoreRsvCnt(new_page);
+		ClearHPageRestoreRsvMap(new_page);
 
 		/* Break COW */
 		huge_ptep_clear_flush(vma, haddr, ptep);
@@ -4593,6 +4633,7 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
 	if (err)
 		return err;
 	ClearHPageRestoreRsvCnt(page);
+	ClearHPageRestoreRsvMap(page);
 
 	/*
 	 * set page dirty so that it will not be removed from cache/file
@@ -4776,6 +4817,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 
 	if (anon_rmap) {
 		ClearHPageRestoreRsvCnt(page);
+		ClearHPageRestoreRsvMap(page);
 		hugepage_add_new_anon_rmap(page, vma, haddr);
 	} else
 		page_dup_rmap(page, true);
@@ -5097,6 +5139,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 		page_dup_rmap(page, true);
 	} else {
 		ClearHPageRestoreRsvCnt(page);
+		ClearHPageRestoreRsvMap(page);
 		hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
 	}
 
@@ -5133,6 +5176,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	if (vm_shared || is_continue)
 		unlock_page(page);
 out_release_nounlock:
+	restore_reserve_on_error(h, dst_vma, dst_addr, page);
 	put_page(page);
 	goto out;
 }
-- 
2.31.1



* Re: [External] [PATCH 0/2] Track reserve map changes to restore on error
  2021-05-25 23:31       ` [PATCH 0/2] Track reserve map changes to restore on error Mike Kravetz
  2021-05-25 23:31         ` [PATCH 1/2] hugetlb: rename HPageRestoreReserve flag to HPageRestoreRsvCnt Mike Kravetz
  2021-05-25 23:31         ` [PATCH 2/2] hugetlb: add new hugetlb specific flag HPG_restore_rsv_map Mike Kravetz
@ 2021-05-26  3:19         ` Muchun Song
  2021-05-26 17:17           ` Mike Kravetz
  2 siblings, 1 reply; 16+ messages in thread
From: Muchun Song @ 2021-05-26  3:19 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Mina Almasry, Linux-MM, open list, Axel Rasmussen, Peter Xu,
	Andrew Morton

On Wed, May 26, 2021 at 7:31 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> Here is a modification to the reservation tracking for fixup on errors.
> It is a more general change, but should work for the hugetlb_mcopy_atomic_pte
> case as well.
>
> Perhaps use this as a prerequisite for your fix(es)?  Pretty sure this
> will eliminate the need for the call to hugetlb_unreserve_pages.

Hi Mike,

It seems like this series is fixing a bug, right? Maybe a link should be
placed in the cover letter so that readers can know what issue
we are facing.

Thanks.

>
> Mike Kravetz (2):
>   hugetlb: rename HPageRestoreReserve flag to HPageRestoreRsvCnt
>   hugetlb: add new hugetlb specific flag HPG_restore_rsv_map
>
>  fs/hugetlbfs/inode.c    |   3 +-
>  include/linux/hugetlb.h |  17 +++++--
>  mm/hugetlb.c            | 108 ++++++++++++++++++++++++++++------------
>  mm/userfaultfd.c        |  14 +++---
>  4 files changed, 99 insertions(+), 43 deletions(-)
>
> --
> 2.31.1
>


* Re: [External] [PATCH 0/2] Track reserve map changes to restore on error
  2021-05-26  3:19         ` [External] [PATCH 0/2] Track reserve map changes to restore on error Muchun Song
@ 2021-05-26 17:17           ` Mike Kravetz
  2021-05-26 23:19             ` Mina Almasry
  0 siblings, 1 reply; 16+ messages in thread
From: Mike Kravetz @ 2021-05-26 17:17 UTC (permalink / raw)
  To: Muchun Song
  Cc: Mina Almasry, Linux-MM, open list, Axel Rasmussen, Peter Xu,
	Andrew Morton

On 5/25/21 8:19 PM, Muchun Song wrote:
> On Wed, May 26, 2021 at 7:31 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>
>> Here is a modification to the reservation tracking for fixup on errors.
>> It is a more general change, but should work for the hugetlb_mcopy_atomic_pte
>> case as well.
>>
>> Perhaps use this as a prerequisite for your fix(es)?  Pretty sure this
>> will eliminate the need for the call to hugetlb_unreserve_pages.
> 
> Hi Mike,
> 
> It seems like this series is fixing a bug, right? Maybe a link should be
> placed in the cover letter so that readers can know what issue
> we are facing.
> 

Thanks Muchun,

I wanted to first see if these patches would work in the code Mina is
modifying.  If this works for Mina, then a more formal patch and request
for inclusion will be sent.

I believe this issue has existed since the introduction of hugetlb
reservations in v2.6.18.  Since the bug only shows up when we take error
paths, the issue may not have been observed.  Mina found a similar issue
in an error path which could also expose this issue.
-- 
Mike Kravetz


* Re: [External] [PATCH 0/2] Track reserve map changes to restore on error
  2021-05-26 17:17           ` Mike Kravetz
@ 2021-05-26 23:19             ` Mina Almasry
  2021-05-27  2:48               ` Mina Almasry
  0 siblings, 1 reply; 16+ messages in thread
From: Mina Almasry @ 2021-05-26 23:19 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Linux-MM, open list, Axel Rasmussen, Peter Xu,
	Andrew Morton

On Wed, May 26, 2021 at 10:17 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 5/25/21 8:19 PM, Muchun Song wrote:
> > On Wed, May 26, 2021 at 7:31 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >>
> >> Here is a modification to the reservation tracking for fixup on errors.
> >> It is a more general change, but should work for the hugetlb_mcopy_atomic_pte
> >> case as well.
> >>
> >> Perhaps use this as a prerequisite for your fix(es)?  Pretty sure this
> >> will eliminate the need for the call to hugetlb_unreserve_pages.
> >
> > Hi Mike,
> >
> > It seems like this series is fixing a bug, right? Maybe a link should be
> > placed in the cover letter so that readers can know what issue
> > we are facing.
> >
>
> Thanks Muchun,
>
> I wanted to first see if these patches would work in the code Mina is
> modifying.  If this works for Mina, then a more formal patch and request
> for inclusion will be sent.
>

So a quick test: I applied my patches and yours on top of linus/master,
removed the hugetlb_unreserve_pages() call that triggered this
conversation, and ran the userfaultfd test. resv_huge_pages underflows
again, so on the surface this doesn't quite seem to work as is.

Not quite sure what to do off the top of my head. I think I will try
to debug why the 3 patches don't work together and I will fix either
your patch or mine. I haven't taken a deep look yet; I just ran a
quick test.

> I believe this issue has existed since the introduction of hugetlb
> reservations in v2.6.18.  Since the bug only shows up when we take error
> paths, the issue may not have been observed.  Mina found a similar issue
> in an error path which could also expose this issue.
> --
> Mike Kravetz


* Re: [External] [PATCH 0/2] Track reserve map changes to restore on error
  2021-05-26 23:19             ` Mina Almasry
@ 2021-05-27  2:48               ` Mina Almasry
  2021-05-27 16:08                 ` Mike Kravetz
  0 siblings, 1 reply; 16+ messages in thread
From: Mina Almasry @ 2021-05-27  2:48 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Muchun Song, Linux-MM, open list, Axel Rasmussen, Peter Xu,
	Andrew Morton

On Wed, May 26, 2021 at 4:19 PM Mina Almasry <almasrymina@google.com> wrote:
>
> On Wed, May 26, 2021 at 10:17 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 5/25/21 8:19 PM, Muchun Song wrote:
> > > On Wed, May 26, 2021 at 7:31 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> > >>
> > >> Here is a modification to the reservation tracking for fixup on errors.
> > >> It is a more general change, but should work for the hugetlb_mcopy_atomic_pte
> > >> case as well.
> > >>
> > >> Perhaps use this as a prerequisite for your fix(es)?  Pretty sure this
> > >> will eliminate the need for the call to hugetlb_unreserve_pages.
> > >
> > > Hi Mike,
> > >
> > > It seems like someone is fixing a bug, right? Maybe a link should be
> > > placed in the cover letter so that someone can know what issue
> > > we are facing.
> > >
> >
> > Thanks Muchun,
> >
> > I wanted to first see if these patches would work in the code Mina is
> > modifying.  If this works for Mina, then a more formal patch and request
> > for inclusion will be sent.
> >
>
> So a quick test: I applied my patches and yours on top of linus/master,
> removed the hugetlb_unreserve_pages() call that triggered this
> conversation, and ran the userfaultfd test. resv_huge_pages underflows
> again, so on the surface this doesn't quite work as is.
>
> Not quite sure what to do off the top of my head. I think I will try
> to debug why the 3 patches don't work together and I will fix either
> your patch or mine. I haven't taken a deep look yet; I just ran a
> quick test.
>

OK, found the issue. With the setup I described above, the
hugetlb_shared test case passes:

./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10 2
/tmp/kokonut_test/huge/userfaultfd_test && echo test success

The non-shared test case is the one that underflows:

./tools/testing/selftests/vm/userfaultfd hugetlb 10 2
/tmp/kokonut_test/huge/userfaultfd_test && echo test success

I've debugged a bit, and this messy hunk 'fixes' the underflow in the
non-shared case. (Sorry for the messiness.)

@@ -2329,17 +2340,14 @@ void restore_reserve_on_error(struct hstate
*h, struct vm_area_struct *vma,
                                 */
                                SetHPageRestoreRsvCnt(page);
                } else {
-                       rc = vma_needs_reservation(h, vma, address);
-                       if (rc < 0)
-                               /*
-                                * See above comment about rare out of
-                                * memory condition.
-                                */
-                               SetHPageRestoreRsvCnt(page);
-                       else if (rc)
-                               vma_add_reservation(h, vma, address);
-                       else
-                               vma_end_reservation(h, vma, address);
+                       resv = inode_resv_map(vma->vm_file->f_mapping->host);
+                       if (resv) {
+                               int chg = region_del(resv, idx, idx+1);
+                               VM_BUG_ON(chg);
+                       }

The reason is that on page allocation we region_add() an entry
into the resv_map regardless of whether this is a shared mapping or
not (vma_needs_reservation() + vma_commit_reservation(), which amounts
to a region_add() at the end of the day).

To roll back this change on error, we need a region_del() to undo the region_add().

The code removed above never ends up calling region_del(), because
vma_needs_reservation() returns 0: region_chg() sees there is already
an entry in the resv_map and returns 0.

The VM_BUG_ON() is just because I'm not sure how to handle that error.

> > I believe this issue has existed since the introduction of hugetlb
> > reservations in v2.6.18.  Since the bug only shows up when we take error
> > paths, the issue may not have been observed.  Mina found a similar issue
> > in an error path which could also expose this issue.
> > --
> > Mike Kravetz

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 1/2] hugetlb: rename HPageRestoreReserve flag to HPageRestoreRsvCnt
  2021-05-25 23:31         ` [PATCH 1/2] hugetlb: rename HPageRestoreReserve flag to HPageRestoreRsvCnt Mike Kravetz
@ 2021-05-27  2:49           ` Mina Almasry
  0 siblings, 0 replies; 16+ messages in thread
From: Mina Almasry @ 2021-05-27  2:49 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Linux-MM, open list, Axel Rasmussen, Peter Xu, Andrew Morton

On Tue, May 25, 2021 at 4:31 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> The hugetlb page specific flag HPageRestoreReserve is used to indicate
> that a reservation was consumed and should be restored on error.  More
> specifically, this is related to the global reservation count.
>
> It is also possible to have a hugetlb page allocated where the global
> count is not modified, but the reservation map is modified.  In such
> cases, the reserve map needs modification in error paths.  Code
> performing such modifications will be introduced in a subsequent patch.
>
> Rename the flag HPageRestoreReserve to HPageRestoreRsvCnt so that it
> is more clearly associated with the global reservation count.  No
> functional changes in this patch, just renaming.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  fs/hugetlbfs/inode.c    |  2 +-
>  include/linux/hugetlb.h |  6 +++---
>  mm/hugetlb.c            | 32 ++++++++++++++++----------------
>  mm/userfaultfd.c        | 14 +++++++-------
>  4 files changed, 27 insertions(+), 27 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index 55efd3dd04f6..bb4de5dcd652 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -529,7 +529,7 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart,
>                          * the subpool and global reserve usage count can need
>                          * to be adjusted.
>                          */
> -                       VM_BUG_ON(HPageRestoreReserve(page));
> +                       VM_BUG_ON(HPageRestoreRsvCnt(page));
>                         remove_huge_page(page);
>                         freed++;
>                         if (!truncate_op) {
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index 03ca83db0a3e..e5e363fa5d02 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -511,7 +511,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   * of the hugetlb head page.  Functions created via the below macros should be
>   * used to manipulate these flags.
>   *
> - * HPG_restore_reserve - Set when a hugetlb page consumes a reservation at
> + * HPG_restore_rsv_cnt - Set when a hugetlb page consumes a reservation at
>   *     allocation time.  Cleared when page is fully instantiated.  Free
>   *     routine checks flag to restore a reservation on error paths.
>   *     Synchronization:  Examined or modified by code that knows it has
> @@ -535,7 +535,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   * HPG_vmemmap_optimized - Set when the vmemmap pages of the page are freed.
>   */
>  enum hugetlb_page_flags {
> -       HPG_restore_reserve = 0,
> +       HPG_restore_rsv_cnt = 0,
>         HPG_migratable,
>         HPG_temporary,
>         HPG_freed,
> @@ -581,7 +581,7 @@ static inline void ClearHPage##uname(struct page *page)             \
>  /*
>   * Create functions associated with hugetlb page flags
>   */
> -HPAGEFLAG(RestoreReserve, restore_reserve)
> +HPAGEFLAG(RestoreRsvCnt, restore_rsv_cnt)
>  HPAGEFLAG(Migratable, migratable)
>  HPAGEFLAG(Temporary, temporary)
>  HPAGEFLAG(Freed, freed)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index c3b2a8a494d6..2a8cea253388 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1165,7 +1165,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
>         nid = huge_node(vma, address, gfp_mask, &mpol, &nodemask);
>         page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
>         if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
> -               SetHPageRestoreReserve(page);
> +               SetHPageRestoreRsvCnt(page);
>                 h->resv_huge_pages--;
>         }
>
> @@ -1549,11 +1549,11 @@ void free_huge_page(struct page *page)
>
>         hugetlb_set_page_subpool(page, NULL);
>         page->mapping = NULL;
> -       restore_reserve = HPageRestoreReserve(page);
> -       ClearHPageRestoreReserve(page);
> +       restore_reserve = HPageRestoreRsvCnt(page);
> +       ClearHPageRestoreRsvCnt(page);
>
>         /*
> -        * If HPageRestoreReserve was set on page, page allocation consumed a
> +        * If HPageRestoreRsvCnt was set on page, page allocation consumed a
>          * reservation.  If the page was associated with a subpool, there
>          * would have been a page reserved in the subpool before allocation
>          * via hugepage_subpool_get_pages().  Since we are 'restoring' the
> @@ -2364,9 +2364,9 @@ static long vma_add_reservation(struct hstate *h,
>   * specific error paths, a huge page was allocated (via alloc_huge_page)
>   * and is about to be freed.  If a reservation for the page existed,
>   * alloc_huge_page would have consumed the reservation and set
> - * HPageRestoreReserve in the newly allocated page.  When the page is freed
> + * HPageRestoreRsvCnt in the newly allocated page.  When the page is freed
>   * via free_huge_page, the global reservation count will be incremented if
> - * HPageRestoreReserve is set.  However, free_huge_page can not adjust the
> + * HPageRestoreRsvCnt is set.  However, free_huge_page can not adjust the
>   * reserve map.  Adjust the reserve map here to be consistent with global
>   * reserve count adjustments to be made by free_huge_page.
>   */
> @@ -2374,13 +2374,13 @@ static void restore_reserve_on_error(struct hstate *h,
>                         struct vm_area_struct *vma, unsigned long address,
>                         struct page *page)
>  {
> -       if (unlikely(HPageRestoreReserve(page))) {
> +       if (unlikely(HPageRestoreRsvCnt(page))) {
>                 long rc = vma_needs_reservation(h, vma, address);
>
>                 if (unlikely(rc < 0)) {
>                         /*
>                          * Rare out of memory condition in reserve map
> -                        * manipulation.  Clear HPageRestoreReserve so that
> +                        * manipulation.  Clear HPageRestoreRsvCnt so that
>                          * global reserve count will not be incremented
>                          * by free_huge_page.  This will make it appear
>                          * as though the reservation for this page was
> @@ -2389,7 +2389,7 @@ static void restore_reserve_on_error(struct hstate *h,
>                          * is better than inconsistent global huge page
>                          * accounting of reserve counts.
>                          */
> -                       ClearHPageRestoreReserve(page);
> +                       ClearHPageRestoreRsvCnt(page);
>                 } else if (rc) {
>                         rc = vma_add_reservation(h, vma, address);
>                         if (unlikely(rc < 0))
> @@ -2397,7 +2397,7 @@ static void restore_reserve_on_error(struct hstate *h,
>                                  * See above comment about rare out of
>                                  * memory condition.
>                                  */
> -                               ClearHPageRestoreReserve(page);
> +                               ClearHPageRestoreRsvCnt(page);
>                 } else
>                         vma_end_reservation(h, vma, address);
>         }
> @@ -2602,7 +2602,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>                 if (!page)
>                         goto out_uncharge_cgroup;
>                 if (!avoid_reserve && vma_has_reserves(vma, gbl_chg)) {
> -                       SetHPageRestoreReserve(page);
> +                       SetHPageRestoreRsvCnt(page);
>                         h->resv_huge_pages--;
>                 }
>                 spin_lock_irq(&hugetlb_lock);
> @@ -4052,7 +4052,7 @@ hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr
>         set_huge_pte_at(vma->vm_mm, addr, ptep, make_huge_pte(vma, new_page, 1));
>         hugepage_add_new_anon_rmap(new_page, vma, addr);
>         hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
> -       ClearHPageRestoreReserve(new_page);
> +       ClearHPageRestoreRsvCnt(new_page);
>         SetHPageMigratable(new_page);
>  }
>
> @@ -4525,7 +4525,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>         spin_lock(ptl);
>         ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
>         if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
> -               ClearHPageRestoreReserve(new_page);
> +               ClearHPageRestoreRsvCnt(new_page);
>
>                 /* Break COW */
>                 huge_ptep_clear_flush(vma, haddr, ptep);
> @@ -4592,7 +4592,7 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>
>         if (err)
>                 return err;
> -       ClearHPageRestoreReserve(page);
> +       ClearHPageRestoreRsvCnt(page);
>
>         /*
>          * set page dirty so that it will not be removed from cache/file
> @@ -4775,7 +4775,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>                 goto backout;
>
>         if (anon_rmap) {
> -               ClearHPageRestoreReserve(page);
> +               ClearHPageRestoreRsvCnt(page);
>                 hugepage_add_new_anon_rmap(page, vma, haddr);
>         } else
>                 page_dup_rmap(page, true);
> @@ -5096,7 +5096,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>         if (vm_shared) {
>                 page_dup_rmap(page, true);
>         } else {
> -               ClearHPageRestoreReserve(page);
> +               ClearHPageRestoreRsvCnt(page);
>                 hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
>         }
>
> diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> index 9ce5a3793ad4..58c706697b17 100644
> --- a/mm/userfaultfd.c
> +++ b/mm/userfaultfd.c
> @@ -432,27 +432,27 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>                  * If a reservation for the page existed in the reservation
>                  * map of a private mapping, the map was modified to indicate
>                  * the reservation was consumed when the page was allocated.
> -                * We clear the HPageRestoreReserve flag now so that the global
> +                * We clear the HPageRestoreRsvCnt flag now so that the global
>                  * reserve count will not be incremented in free_huge_page.
>                  * The reservation map will still indicate the reservation
>                  * was consumed and possibly prevent later page allocation.
>                  * This is better than leaking a global reservation.  If no
>                  * reservation existed, it is still safe to clear
> -                * HPageRestoreReserve as no adjustments to reservation counts
> +                * HPageRestoreRsvCnt as no adjustments to reservation counts
>                  * were made during allocation.
>                  *
>                  * The reservation map for shared mappings indicates which
>                  * pages have reservations.  When a huge page is allocated
>                  * for an address with a reservation, no change is made to
> -                * the reserve map.  In this case HPageRestoreReserve will be
> +                * the reserve map.  In this case HPageRestoreRsvCnt will be
>                  * set to indicate that the global reservation count should be
>                  * incremented when the page is freed.  This is the desired
>                  * behavior.  However, when a huge page is allocated for an
>                  * address without a reservation a reservation entry is added
> -                * to the reservation map, and HPageRestoreReserve will not be
> +                * to the reservation map, and HPageRestoreRsvCnt will not be
>                  * set. When the page is freed, the global reserve count will
>                  * NOT be incremented and it will appear as though we have
> -                * leaked reserved page.  In this case, set HPageRestoreReserve
> +                * leaked reserved page.  In this case, set HPageRestoreRsvCnt
>                  * so that the global reserve count will be incremented to
>                  * match the reservation map entry which was created.
>                  *
> @@ -461,9 +461,9 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm,
>                  * be different or NULL on error.
>                  */
>                 if (vm_alloc_shared)
> -                       SetHPageRestoreReserve(page);
> +                       SetHPageRestoreRsvCnt(page);
>                 else
> -                       ClearHPageRestoreReserve(page);
> +                       ClearHPageRestoreRsvCnt(page);
>                 put_page(page);
>         }
>         BUG_ON(copied < 0);
> --
> 2.31.1
>

Reviewed-by: Mina Almasry <almasrymina@google.com>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [PATCH 2/2] hugetlb: add new hugetlb specific flag HPG_restore_rsv_map
  2021-05-25 23:31         ` [PATCH 2/2] hugetlb: add new hugetlb specific flag HPG_restore_rsv_map Mike Kravetz
@ 2021-05-27  2:58           ` Mina Almasry
  0 siblings, 0 replies; 16+ messages in thread
From: Mina Almasry @ 2021-05-27  2:58 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Linux-MM, open list, Axel Rasmussen, Peter Xu, Andrew Morton

On Tue, May 25, 2021 at 4:31 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> When a hugetlb page is allocated via alloc_huge_page, the reserve map
> as well as global reservation count may be modified.  In case of error
> after allocation, the count and map should be restored to their previous
> state if possible.  The flag HPageRestoreRsvCnt indicates the global
> count was modified.  Add a new flag HPG_restore_rsv_map to indicate the
> reserve map was modified.  Note that during hugetlb page allocation the
> global count and reserve map could be modified independently.
> Therefore, two specific flags are needed.
>
> The routine restore_reserve_on_error is called to restore reserve data
> on error paths.  Modify the routine to check for the HPG_restore_rsv_map
> flag and adjust the reserve map accordingly.
>

Shouldn't there be an equivalent function that fixes the reservation on
page freeing? restore_reserve_on_put_page() or something? I'm confused
that we need to restore the reservation on error, but seemingly there is
no function that restores the reservation on normal page freeing.

> Add missing calls to restore_reserve_on_error to error paths of code
> calling alloc_huge_page.
>

Would it be a good idea to add a comment above alloc_huge_page() noting
that to unroll it we need to put_page() *and* call restore_reserve_on_error()?

> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  fs/hugetlbfs/inode.c    |  1 +
>  include/linux/hugetlb.h | 11 ++++++
>  mm/hugetlb.c            | 82 +++++++++++++++++++++++++++++++----------
>  3 files changed, 75 insertions(+), 19 deletions(-)
>
> diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
> index bb4de5dcd652..9d846a2edc4b 100644
> --- a/fs/hugetlbfs/inode.c
> +++ b/fs/hugetlbfs/inode.c
> @@ -735,6 +735,7 @@ static long hugetlbfs_fallocate(struct file *file, int mode, loff_t offset,
>                 __SetPageUptodate(page);
>                 error = huge_add_to_page_cache(page, mapping, index);
>                 if (unlikely(error)) {
> +                       restore_reserve_on_error(h, &pseudo_vma, addr, page);
>                         put_page(page);
>                         mutex_unlock(&hugetlb_fault_mutex_table[hash]);
>                         goto out;
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index e5e363fa5d02..da2251b0c609 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -517,6 +517,13 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   *     Synchronization:  Examined or modified by code that knows it has
>   *     the only reference to page.  i.e. After allocation but before use
>   *     or when the page is being freed.
> + * HPG_restore_rsv_map - Set when a hugetlb page allocation results in adding
> + *     an entry to the reserve map.  This can happen without adjustment of
> + *     the global reserve count.  Cleared when page is fully instantiated.
> + *     Error paths (restore_reserve_on_error) check this flag to make
> + *     adjustments to the reserve map.
> + *     Synchronization:  Examined or modified by code that knows it has
> + *     the only reference to page.  i.e. After allocation but before use.
>   * HPG_migratable  - Set after a newly allocated page is added to the page
>   *     cache and/or page tables.  Indicates the page is a candidate for
>   *     migration.
> @@ -536,6 +543,7 @@ unsigned long hugetlb_get_unmapped_area(struct file *file, unsigned long addr,
>   */
>  enum hugetlb_page_flags {
>         HPG_restore_rsv_cnt = 0,
> +       HPG_restore_rsv_map,
>         HPG_migratable,
>         HPG_temporary,
>         HPG_freed,
> @@ -582,6 +590,7 @@ static inline void ClearHPage##uname(struct page *page)             \
>   * Create functions associated with hugetlb page flags
>   */
>  HPAGEFLAG(RestoreRsvCnt, restore_rsv_cnt)
> +HPAGEFLAG(RestoreRsvMap, restore_rsv_map)
>  HPAGEFLAG(Migratable, migratable)
>  HPAGEFLAG(Temporary, temporary)
>  HPAGEFLAG(Freed, freed)
> @@ -633,6 +642,8 @@ struct page *alloc_huge_page_vma(struct hstate *h, struct vm_area_struct *vma,
>                                 unsigned long address);
>  int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>                         pgoff_t idx);
> +void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
> +               unsigned long address, struct page *page);
>
>  /* arch callback */
>  int __init __alloc_bootmem_huge_page(struct hstate *h);
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 2a8cea253388..1c3a68d70ab5 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1551,6 +1551,7 @@ void free_huge_page(struct page *page)
>         page->mapping = NULL;
>         restore_reserve = HPageRestoreRsvCnt(page);
>         ClearHPageRestoreRsvCnt(page);
> +       ClearHPageRestoreRsvMap(page);
>
>         /*
>          * If HPageRestoreRsvCnt was set on page, page allocation consumed a
> @@ -2360,24 +2361,26 @@ static long vma_add_reservation(struct hstate *h,
>  }
>
>  /*
> - * This routine is called to restore a reservation on error paths.  In the
> - * specific error paths, a huge page was allocated (via alloc_huge_page)
> - * and is about to be freed.  If a reservation for the page existed,
> - * alloc_huge_page would have consumed the reservation and set
> - * HPageRestoreRsvCnt in the newly allocated page.  When the page is freed
> - * via free_huge_page, the global reservation count will be incremented if
> - * HPageRestoreRsvCnt is set.  However, free_huge_page can not adjust the
> - * reserve map.  Adjust the reserve map here to be consistent with global
> - * reserve count adjustments to be made by free_huge_page.
> + * This routine is called to restore reservation data on error paths.
> + * It handles two specific cases for pages allocated via alloc_huge_page:
> + * 1) A reservation was in place and page consumed the reservation.
> + *    HPageRestoreRsvCnt is set in the page.
> + * 2) No reservation was in place for the page, so HPageRestoreRsvCnt is
> + *    not set.  However, the reserve map was updated.
> + * In case 1, free_huge_page will increment the global reserve count.  But,
> + * free_huge_page does not have enough context to adjust the reservation map.
> + * This case deals primarily with private mappings.  Adjust the reserve map
> + * here to be consistent with global reserve count adjustments to be made
> + * by free_huge_page.
> + * In case 2, simply undo any reserve map modifications done by alloc_huge_page.
>   */
> -static void restore_reserve_on_error(struct hstate *h,
> -                       struct vm_area_struct *vma, unsigned long address,
> -                       struct page *page)
> +void restore_reserve_on_error(struct hstate *h, struct vm_area_struct *vma,
> +                               unsigned long address, struct page *page)
>  {
>         if (unlikely(HPageRestoreRsvCnt(page))) {
>                 long rc = vma_needs_reservation(h, vma, address);
>
> -               if (unlikely(rc < 0)) {
> +               if (unlikely(rc < 0))
>                         /*
>                          * Rare out of memory condition in reserve map
>                          * manipulation.  Clear HPageRestoreRsvCnt so that
> @@ -2390,16 +2393,47 @@ static void restore_reserve_on_error(struct hstate *h,
>                          * accounting of reserve counts.
>                          */
>                         ClearHPageRestoreRsvCnt(page);
> -               } else if (rc) {
> -                       rc = vma_add_reservation(h, vma, address);
> -                       if (unlikely(rc < 0))
> +               else if (rc)
> +                       vma_add_reservation(h, vma, address);
> +               else
> +                       vma_end_reservation(h, vma, address);
> +       } else if (unlikely(HPageRestoreRsvMap(page))) {
> +               struct resv_map *resv = vma_resv_map(vma);
> +               pgoff_t idx = vma_hugecache_offset(h, vma, address);
> +               long rc;
> +
> +               /*
> +                * This handles the specific case where the reserve count
> +                * was not updated during the page allocation process, but
> +                * the reserve map was updated.  We need to undo the reserve
> +                * map update.
> +                *
> +                * The presence of an entry in the reserve map has opposite
> +                * meanings for shared and private mappings.
> +                */
> +               if (vma->vm_flags & VM_MAYSHARE) {
> +                       rc = region_del(resv, idx, idx + 1);
> +                       if (rc < 0)
> +                               /*
> +                                * Rare out of memory condition.  Since we can
> +                                * not delete the reserve entry, set
> +                                * HPageRestoreRsvCnt so that the global count
> +                                * will be consistent with the reserve map.
> +                                */
> +                               SetHPageRestoreRsvCnt(page);
> +               } else {
> +                       rc = vma_needs_reservation(h, vma, address);
> +                       if (rc < 0)
>                                 /*
>                                  * See above comment about rare out of
>                                  * memory condition.
>                                  */
> -                               ClearHPageRestoreRsvCnt(page);
> -               } else
> -                       vma_end_reservation(h, vma, address);
> +                               SetHPageRestoreRsvCnt(page);
> +                       else if (rc)
> +                               vma_add_reservation(h, vma, address);
> +                       else
> +                               vma_end_reservation(h, vma, address);
> +               }

As I mentioned in the other email, this call sequence does not result
in the region_del() call that we really need here. Calling
region_del() directly would be one fix; another would be to call
vma_end_reservation() even if !rc. Not sure which is more semantically
correct. hugetlb_unreserve_pages() calls region_del()
indiscriminately.

>         }
>  }
>
> @@ -2641,6 +2675,8 @@ struct page *alloc_huge_page(struct vm_area_struct *vma,
>                         hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h),
>                                         pages_per_huge_page(h), page);
>         }
> +       if (map_commit)
> +               SetHPageRestoreRsvMap(page);
>         return page;
>
>  out_uncharge_cgroup:
> @@ -4053,6 +4089,7 @@ hugetlb_install_page(struct vm_area_struct *vma, pte_t *ptep, unsigned long addr
>         hugepage_add_new_anon_rmap(new_page, vma, addr);
>         hugetlb_count_add(pages_per_huge_page(hstate_vma(vma)), vma->vm_mm);
>         ClearHPageRestoreRsvCnt(new_page);
> +       ClearHPageRestoreRsvMap(new_page);
>         SetHPageMigratable(new_page);
>  }
>
> @@ -4174,6 +4211,8 @@ int copy_hugetlb_page_range(struct mm_struct *dst, struct mm_struct *src,
>                                 spin_lock_nested(src_ptl, SINGLE_DEPTH_NESTING);
>                                 entry = huge_ptep_get(src_pte);
>                                 if (!pte_same(src_pte_old, entry)) {
> +                                       restore_reserve_on_error(h, vma, addr,
> +                                                               new);
>                                         put_page(new);
>                                         /* dst_entry won't change as in child */
>                                         goto again;
> @@ -4526,6 +4565,7 @@ static vm_fault_t hugetlb_cow(struct mm_struct *mm, struct vm_area_struct *vma,
>         ptep = huge_pte_offset(mm, haddr, huge_page_size(h));
>         if (likely(ptep && pte_same(huge_ptep_get(ptep), pte))) {
>                 ClearHPageRestoreRsvCnt(new_page);
> +               ClearHPageRestoreRsvMap(new_page);
>
>                 /* Break COW */
>                 huge_ptep_clear_flush(vma, haddr, ptep);
> @@ -4593,6 +4633,7 @@ int huge_add_to_page_cache(struct page *page, struct address_space *mapping,
>         if (err)
>                 return err;
>         ClearHPageRestoreRsvCnt(page);
> +       ClearHPageRestoreRsvMap(page);
>
>         /*
>          * set page dirty so that it will not be removed from cache/file
> @@ -4776,6 +4817,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>
>         if (anon_rmap) {
>                 ClearHPageRestoreRsvCnt(page);
> +               ClearHPageRestoreRsvMap(page);
>                 hugepage_add_new_anon_rmap(page, vma, haddr);
>         } else
>                 page_dup_rmap(page, true);
> @@ -5097,6 +5139,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>                 page_dup_rmap(page, true);
>         } else {
>                 ClearHPageRestoreRsvCnt(page);
> +               ClearHPageRestoreRsvMap(page);
>                 hugepage_add_new_anon_rmap(page, dst_vma, dst_addr);
>         }
>
> @@ -5133,6 +5176,7 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
>         if (vm_shared || is_continue)
>                 unlock_page(page);
>  out_release_nounlock:
> +       restore_reserve_on_error(h, dst_vma, dst_addr, page);
>         put_page(page);
>         goto out;
>  }
> --
> 2.31.1
>

^ permalink raw reply	[flat|nested] 16+ messages in thread

* Re: [External] [PATCH 0/2] Track reserve map changes to restore on error
  2021-05-27  2:48               ` Mina Almasry
@ 2021-05-27 16:08                 ` Mike Kravetz
  0 siblings, 0 replies; 16+ messages in thread
From: Mike Kravetz @ 2021-05-27 16:08 UTC (permalink / raw)
  To: Mina Almasry
  Cc: Muchun Song, Linux-MM, open list, Axel Rasmussen, Peter Xu,
	Andrew Morton

On 5/26/21 7:48 PM, Mina Almasry wrote:
> On Wed, May 26, 2021 at 4:19 PM Mina Almasry <almasrymina@google.com> wrote:
>>
>> On Wed, May 26, 2021 at 10:17 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>
>>> On 5/25/21 8:19 PM, Muchun Song wrote:
>>>> On Wed, May 26, 2021 at 7:31 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>>>>>
>>>>> Here is a modification to the reservation tracking for fixup on errors.
>>>>> It is a more general change, but should work for the hugetlb_mcopy_pte_atomic
>>>>> case as well.
>>>>>
>>>>> Perhaps use this as a prerequisite for your fix(es)?  Pretty sure this
>>>>> will eliminate the need for the call to hugetlb_unreserve_pages.
>>>>
>>>> Hi Mike,
>>>>
>>>> It seems like this series is fixing a bug, right? Maybe a link should be
>>>> placed in the cover letter so that readers can know what issue
>>>> we are facing.
>>>>
>>>
>>> Thanks Muchun,
>>>
>>> I wanted to first see if these patches would work in the code Mina is
>>> modifying.  If this works for Mina, then a more formal patch and request
>>> for inclusion will be sent.
>>>
>>
>> So a quick test: I applied my patch and yours on top of linus/master,
>> removed the hugetlb_unreserve_pages() call that triggered this
>> conversation, and ran the userfaultfd test. resv_huge_pages underflows
>> again, so on the surface this doesn't quite work as is.
>>
>> Not quite sure what to do off the top of my head. I think I will try
>> to debug why the 3 patches don't work together and I will fix either
>> your patch or mine. I haven't taken a deep look yet; I just ran a
>> quick test.
>>
> 
> Ok found the issue. With the setup I described above, the
> hugetlb_shared test case passes:
> 
> ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 10 2
> /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> 
> The non-shared test case is the one that underflows:
> 
> ./tools/testing/selftests/vm/userfaultfd hugetlb 10 2
> /tmp/kokonut_test/huge/userfaultfd_test && echo test success
> 
> I've debugged a bit, and this messy hunk 'fixes' the underflow with
> the non-shared case. (Sorry for the messiness).
> 
> @@ -2329,17 +2340,14 @@ void restore_reserve_on_error(struct hstate
> *h, struct vm_area_struct *vma,
>                                  */
>                                 SetHPageRestoreRsvCnt(page);
>                 } else {
> -                       rc = vma_needs_reservation(h, vma, address);
> -                       if (rc < 0)
> -                               /*
> -                                * See above comment about rare out of
> -                                * memory condition.
> -                                */
> -                               SetHPageRestoreRsvCnt(page);
> -                       else if (rc)
> -                               vma_add_reservation(h, vma, address);
> -                       else
> -                               vma_end_reservation(h, vma, address);
> +                       resv = inode_resv_map(vma->vm_file->f_mapping->host);
> +                       if (resv) {
> +                               int chg = region_del(resv, idx, idx+1);
> +                               VM_BUG_ON(chg);
> +                       }
> 
> The reason is that on page allocation we region_add() an entry into
> the resv_map regardless of whether this is a shared mapping or
> not (vma_needs_reservation() + vma_commit_reservation(), which amounts
> to region_add() at the end of the day).
> 
> To unroll back this change on error, we need to region_del() the region_add().
> 
> The code removed above never ends up calling region_del():
> vma_needs_reservation() returns 0 because region_chg() sees there is
> already an entry in the resv_map.
> 
> The VM_BUG_ON() is just because I'm not sure how to handle that error.
> 

Thanks Mina!

Yes, that new block of code in restore_reserve_on_error is incorrect for
the private mapping case.  Since alloc_huge_page does the region_add for
both shared and private mappings, it seems we should just do the
region_del for both.  I'll update this patch to fix this and take your
other comments into account.

-- 
Mike Kravetz


Thread overview: 16+ messages
2021-05-21  7:44 [PATCH v3] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY Mina Almasry
2021-05-22 21:19 ` Andrew Morton
2021-05-22 21:32   ` Mina Almasry
2021-05-24 18:07 ` Mike Kravetz
2021-05-25  0:11   ` Mina Almasry
2021-05-25  0:45     ` Mike Kravetz
2021-05-25 23:31       ` [PATCH 0/2] Track reserve map changes to restore on error Mike Kravetz
2021-05-25 23:31         ` [PATCH 1/2] hugetlb: rename HPageRestoreReserve flag to HPageRestoreRsvCnt Mike Kravetz
2021-05-27  2:49           ` Mina Almasry
2021-05-25 23:31         ` [PATCH 2/2] hugetlb: add new hugetlb specific flag HPG_restore_rsv_map Mike Kravetz
2021-05-27  2:58           ` Mina Almasry
2021-05-26  3:19         ` [External] [PATCH 0/2] Track reserve map changes to restore on error Muchun Song
2021-05-26 17:17           ` Mike Kravetz
2021-05-26 23:19             ` Mina Almasry
2021-05-27  2:48               ` Mina Almasry
2021-05-27 16:08                 ` Mike Kravetz
