linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/3] hugetlb: fix potential ref counting races
@ 2021-07-10  0:24 Mike Kravetz
  2021-07-10  0:24 ` [PATCH 1/3] hugetlb: simplify prep_compound_gigantic_page ref count racing code Mike Kravetz
                   ` (3 more replies)
  0 siblings, 4 replies; 10+ messages in thread
From: Mike Kravetz @ 2021-07-10  0:24 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Michal Hocko, Oscar Salvador, David Hildenbrand,
	Matthew Wilcox, Naoya Horiguchi, Mina Almasry, Andrew Morton,
	Mike Kravetz

When Muchun Song brought up a potential issue with hugetlb ref counting[1],
I started looking closer at the code.  hugetlbfs is the only code with it's
own specialized compound page destructor and taking special action when ref
counts drop to zero.  Potential races happen in this unique handling of ref
counts.  The following patches address these races when creating and
destroying hugetlb pages.

These potential races have likely existed since the creation of
hugetlbfs.  They certainly have been around for more than 10 years.
However, I am unaware of anyone actually hitting these races.  It is
VERY unlikely than anyone will actually hit these races, but they do
exist.

I could not think of an easy (or difficult) way to force these races.
Therefore, testing consisted of adding code to randomly increase ref
counts in strategic places.  In this way, I was able to exercise all the
race handling code paths.

[1] https://lore.kernel.org/linux-mm/CAMZfGtVMn3daKrJwZMaVOGOaJU+B4dS--x_oPmGQMD=c=QNGEg@mail.gmail.com/

Mike Kravetz (3):
  hugetlb: simplify prep_compound_gigantic_page ref count racing code
  hugetlb: drop ref count earlier after page allocation
  hugetlb: before freeing hugetlb page set dtor to appropriate value

 mm/hugetlb.c | 137 ++++++++++++++++++++++++++++++++++++++-------------
 1 file changed, 104 insertions(+), 33 deletions(-)

-- 
2.31.1



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 1/3] hugetlb: simplify prep_compound_gigantic_page ref count racing code
  2021-07-10  0:24 [PATCH 0/3] hugetlb: fix potential ref counting races Mike Kravetz
@ 2021-07-10  0:24 ` Mike Kravetz
  2021-07-13  6:31   ` [External] " Muchun Song
  2021-07-10  0:24 ` [PATCH 2/3] hugetlb: drop ref count earlier after page allocation Mike Kravetz
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 10+ messages in thread
From: Mike Kravetz @ 2021-07-10  0:24 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Michal Hocko, Oscar Salvador, David Hildenbrand,
	Matthew Wilcox, Naoya Horiguchi, Mina Almasry, Andrew Morton,
	Mike Kravetz

Code in prep_compound_gigantic_page waits for a rcu grace period if it
notices a temporarily inflated ref count on a tail page.  This was due
to the identified potential race with speculative page cache references
which could only last for a rcu grace period.  This is overly complicated
as this situation is VERY unlikely to ever happen.  Instead, just quickly
return an error.

Also, only print a warning in prep_compound_gigantic_page instead of
multiple callers.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 15 +++++----------
 1 file changed, 5 insertions(+), 10 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 924553aa8f78..e59ebba63da7 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1657,16 +1657,12 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
 		 * cache adding could take a ref on a 'to be' tail page.
 		 * We need to respect any increased ref count, and only set
 		 * the ref count to zero if count is currently 1.  If count
-		 * is not 1, we call synchronize_rcu in the hope that a rcu
-		 * grace period will cause ref count to drop and then retry.
-		 * If count is still inflated on retry we return an error and
-		 * must discard the pages.
+		 * is not 1, we return an error and caller must discard the
+		 * pages.
 		 */
 		if (!page_ref_freeze(p, 1)) {
-			pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n");
-			synchronize_rcu();
-			if (!page_ref_freeze(p, 1))
-				goto out_error;
+			pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
+			goto out_error;
 		}
 		set_page_count(p, 0);
 		set_compound_head(p, page);
@@ -1830,7 +1826,6 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
 				retry = true;
 				goto retry;
 			}
-			pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
 			return NULL;
 		}
 	}
@@ -2828,8 +2823,8 @@ static void __init gather_bootmem_prealloc(void)
 			prep_new_huge_page(h, page, page_to_nid(page));
 			put_page(page); /* add to the hugepage allocator */
 		} else {
+			/* VERY unlikely inflated ref count on a tail page */
 			free_gigantic_page(page, huge_page_order(h));
-			pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
 		}
 
 		/*
-- 
2.31.1



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 2/3] hugetlb: drop ref count earlier after page allocation
  2021-07-10  0:24 [PATCH 0/3] hugetlb: fix potential ref counting races Mike Kravetz
  2021-07-10  0:24 ` [PATCH 1/3] hugetlb: simplify prep_compound_gigantic_page ref count racing code Mike Kravetz
@ 2021-07-10  0:24 ` Mike Kravetz
  2021-07-10  0:24 ` [PATCH 3/3] hugetlb: before freeing hugetlb page set dtor to appropriate value Mike Kravetz
  2021-07-27 23:41 ` [PATCH 0/3] hugetlb: fix potential ref counting races Mike Kravetz
  3 siblings, 0 replies; 10+ messages in thread
From: Mike Kravetz @ 2021-07-10  0:24 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Michal Hocko, Oscar Salvador, David Hildenbrand,
	Matthew Wilcox, Naoya Horiguchi, Mina Almasry, Andrew Morton,
	Mike Kravetz

When discussing the possibility of inflated page ref counts, Muuchun
Song pointed out this potential issue [1].  It is true that any code
could potentially take a reference on a compound page after allocation
and before it is converted to and put into use as a hugetlb page.
Specifically, this could be done by any users of get_page_unless_zero.

There are three areas of concern within hugetlb code.
1) When adding pages to the pool.
   In this case, new pages are allocated added to the pool by calling
   put_page to invoke the hugetlb destructor (free_huge_page).  If there
   is an inflated ref count on the page, it will not be immediately added
   to the free list.  It will only be added to the free list when the
   temporary ref count is dropped.  This is deemed acceptable and will
   not be addressed.
2) A page is allocated for immediate use normally as a surplus page or
   migration target.  In this case, the user of the page will also hold
   a reference.  There is no issue as this is just like normal page ref
   counting.
3) A page is allocated and MUST be added to the free list to satisfy a
   reservation.  One such example is gather_surplus_pages as pointed out
   by Muchun in [1].  More specifically, this case covers callers of
   enqueue_huge_page where the page reference count must be zero.  This
   patch covers this third case.

Three routines call enqueue_huge_page when the page reference count
could potentially be inflated.  They are: gather_surplus_pages,
alloc_and_dissolve_huge_page and add_hugetlb_page.

add_hugetlb_page is called on error paths when a huge page can not be
freed due to the inability to allocate vmemmap pages.  In this case, the
temporairly inflated ref count is not an issue.  When the ref is dropped
the appropriate action will be taken.  Instead of VM_BUG_ON if the ref
count does not drop to zero, simply return.

In gather_surplus_pages and alloc_and_dissolve_huge_page the caller
expects a page (or pages) to be put on the free lists.  In this case we
must ensure there are no temporary ref counts.  We do this by calling
put_page_testzero() earlier and not using pages without a zero ref
count.  The temporary page flag (HPageTemporary) is used in such cases
so that as soon as the inflated ref count is dropped the page will be
freed.

[1] https://lore.kernel.org/linux-mm/CAMZfGtVMn3daKrJwZMaVOGOaJU+B4dS--x_oPmGQMD=c=QNGEg@mail.gmail.com/
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 100 +++++++++++++++++++++++++++++++++++++++------------
 1 file changed, 78 insertions(+), 22 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index e59ebba63da7..3132c7395743 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1072,6 +1072,8 @@ static void enqueue_huge_page(struct hstate *h, struct page *page)
 	int nid = page_to_nid(page);
 
 	lockdep_assert_held(&hugetlb_lock);
+	VM_BUG_ON_PAGE(page_count(page), page);
+
 	list_move(&page->lru, &h->hugepage_freelists[nid]);
 	h->free_huge_pages++;
 	h->free_huge_pages_node[nid]++;
@@ -1399,11 +1401,20 @@ static void add_hugetlb_page(struct hstate *h, struct page *page,
 	SetHPageVmemmapOptimized(page);
 
 	/*
-	 * This page is now managed by the hugetlb allocator and has
-	 * no users -- drop the last reference.
+	 * This page is about to be managed by the hugetlb allocator and
+	 * should have no users.  Drop our reference, and check for others
+	 * just in case.
 	 */
 	zeroed = put_page_testzero(page);
-	VM_BUG_ON_PAGE(!zeroed, page);
+	if (!zeroed)
+		/*
+		 * It is VERY unlikely soneone else has taken a ref on
+		 * the page.  In this case, we simply return as the
+		 * hugetlb destructor (free_huge_page) will be called
+		 * when this other ref is dropped.
+		 */
+		return;
+
 	arch_clear_hugepage_flags(page);
 	enqueue_huge_page(h, page);
 }
@@ -2015,9 +2026,10 @@ int dissolve_free_huge_pages(unsigned long start_pfn, unsigned long end_pfn)
  * Allocates a fresh surplus page from the page allocator.
  */
 static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
-		int nid, nodemask_t *nmask)
+		int nid, nodemask_t *nmask, bool zero_ref)
 {
 	struct page *page = NULL;
+	bool retry = false;
 
 	if (hstate_is_gigantic(h))
 		return NULL;
@@ -2027,6 +2039,7 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 		goto out_unlock;
 	spin_unlock_irq(&hugetlb_lock);
 
+retry:
 	page = alloc_fresh_huge_page(h, gfp_mask, nid, nmask, NULL);
 	if (!page)
 		return NULL;
@@ -2044,11 +2057,35 @@ static struct page *alloc_surplus_huge_page(struct hstate *h, gfp_t gfp_mask,
 		spin_unlock_irq(&hugetlb_lock);
 		put_page(page);
 		return NULL;
-	} else {
-		h->surplus_huge_pages++;
-		h->surplus_huge_pages_node[page_to_nid(page)]++;
 	}
 
+	if (zero_ref) {
+		/*
+		 * Caller requires a page with zero ref count.
+		 * We will drop ref count here.  If someone else is holding
+		 * a ref, the page will be freed when they drop it.  Abuse
+		 * temporary page flag to accomplish this.
+		 */
+		SetHPageTemporary(page);
+		if (!put_page_testzero(page)) {
+			/*
+			 * Unexpected inflated ref count on freshly allocated
+			 * huge.  Retry once.
+			 */
+			pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n");
+			spin_unlock_irq(&hugetlb_lock);
+			if (retry)
+				return NULL;
+
+			retry = true;
+			goto retry;
+		}
+		ClearHPageTemporary(page);
+	}
+
+	h->surplus_huge_pages++;
+	h->surplus_huge_pages_node[page_to_nid(page)]++;
+
 out_unlock:
 	spin_unlock_irq(&hugetlb_lock);
 
@@ -2090,7 +2127,7 @@ struct page *alloc_buddy_huge_page_with_mpol(struct hstate *h,
 	nodemask_t *nodemask;
 
 	nid = huge_node(vma, addr, gfp_mask, &mpol, &nodemask);
-	page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask);
+	page = alloc_surplus_huge_page(h, gfp_mask, nid, nodemask, false);
 	mpol_cond_put(mpol);
 
 	return page;
@@ -2162,7 +2199,7 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 	spin_unlock_irq(&hugetlb_lock);
 	for (i = 0; i < needed; i++) {
 		page = alloc_surplus_huge_page(h, htlb_alloc_mask(h),
-				NUMA_NO_NODE, NULL);
+				NUMA_NO_NODE, NULL, true);
 		if (!page) {
 			alloc_ok = false;
 			break;
@@ -2203,24 +2240,20 @@ static int gather_surplus_pages(struct hstate *h, long delta)
 
 	/* Free the needed pages to the hugetlb pool */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru) {
-		int zeroed;
-
 		if ((--needed) < 0)
 			break;
-		/*
-		 * This page is now managed by the hugetlb allocator and has
-		 * no users -- drop the buddy allocator's reference.
-		 */
-		zeroed = put_page_testzero(page);
-		VM_BUG_ON_PAGE(!zeroed, page);
+		/* Add the page to the hugetlb allocator */
 		enqueue_huge_page(h, page);
 	}
 free:
 	spin_unlock_irq(&hugetlb_lock);
 
-	/* Free unnecessary surplus pages to the buddy allocator */
+	/*
+	 * Free unnecessary surplus pages to the buddy allocator.
+	 * Pages have no ref count, call free_huge_page directly.
+	 */
 	list_for_each_entry_safe(page, tmp, &surplus_list, lru)
-		put_page(page);
+		free_huge_page(page);
 	spin_lock_irq(&hugetlb_lock);
 
 	return ret;
@@ -2529,6 +2562,7 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 {
 	gfp_t gfp_mask = htlb_alloc_mask(h) | __GFP_THISNODE;
 	int nid = page_to_nid(old_page);
+	bool alloc_retry = false;
 	struct page *new_page;
 	int ret = 0;
 
@@ -2539,9 +2573,30 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 	 * the pool.  This simplifies and let us do most of the processing
 	 * under the lock.
 	 */
+alloc_retry:
 	new_page = alloc_buddy_huge_page(h, gfp_mask, nid, NULL, NULL);
 	if (!new_page)
 		return -ENOMEM;
+	/*
+	 * If all goes well, this page will be directly added to the free
+	 * list in the pool.  For this the ref count needs to be zero.
+	 * Attempt to drop now, and retry once if needed.  It is VERY
+	 * unlikely there is another ref on the page.
+	 *
+	 * If someone else has a reference to the page, it will be freed
+	 * when they drop their ref.  Abuse temporary page flag to accomplish
+	 * this.  Retry once if there is an inflated ref count.
+	 */
+	SetHPageTemporary(new_page);
+	if (!put_page_testzero(new_page)) {
+		if (alloc_retry)
+			return -EBUSY;
+
+		alloc_retry = true;
+		goto alloc_retry;
+	}
+	ClearHPageTemporary(new_page);
+
 	__prep_new_huge_page(h, new_page);
 
 retry:
@@ -2581,11 +2636,10 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 		remove_hugetlb_page(h, old_page, false);
 
 		/*
-		 * Reference count trick is needed because allocator gives us
-		 * referenced page but the pool requires pages with 0 refcount.
+		 * Ref count on new page is already zero as it was dropped
+		 * earlier.  It can be directly added to the pool free list.
 		 */
 		__prep_account_new_huge_page(h, nid);
-		page_ref_dec(new_page);
 		enqueue_huge_page(h, new_page);
 
 		/*
@@ -2599,6 +2653,8 @@ static int alloc_and_dissolve_huge_page(struct hstate *h, struct page *old_page,
 
 free_new:
 	spin_unlock_irq(&hugetlb_lock);
+	/* Page has a zero ref count, but needs a ref to be freed */
+	set_page_refcounted(new_page);
 	update_and_free_page(h, new_page, false);
 
 	return ret;
-- 
2.31.1



^ permalink raw reply	[flat|nested] 10+ messages in thread

* [PATCH 3/3] hugetlb: before freeing hugetlb page set dtor to appropriate value
  2021-07-10  0:24 [PATCH 0/3] hugetlb: fix potential ref counting races Mike Kravetz
  2021-07-10  0:24 ` [PATCH 1/3] hugetlb: simplify prep_compound_gigantic_page ref count racing code Mike Kravetz
  2021-07-10  0:24 ` [PATCH 2/3] hugetlb: drop ref count earlier after page allocation Mike Kravetz
@ 2021-07-10  0:24 ` Mike Kravetz
  2021-07-14 10:57   ` [External] " Muchun Song
  2021-07-27 23:41 ` [PATCH 0/3] hugetlb: fix potential ref counting races Mike Kravetz
  3 siblings, 1 reply; 10+ messages in thread
From: Mike Kravetz @ 2021-07-10  0:24 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Michal Hocko, Oscar Salvador, David Hildenbrand,
	Matthew Wilcox, Naoya Horiguchi, Mina Almasry, Andrew Morton,
	Mike Kravetz

When removing a hugetlb page from the pool the ref count is set to
one (as the free page has no ref count) and compound page destructor
is set to NULL_COMPOUND_DTOR.  Since a subsequent call to free the
hugetlb page will call __free_pages for non-gigantic pages and
free_gigantic_page for gigantic pages the destructor is not used.

However, consider the following race with code taking a speculative
reference on the page:

Thread 0				Thread 1
--------				--------
remove_hugetlb_page
  set_page_refcounted(page);
  set_compound_page_dtor(page,
           NULL_COMPOUND_DTOR);
					get_page_unless_zero(page)
__update_and_free_page
  __free_pages(page,
           huge_page_order(h));

		/* Note that __free_pages() will simply drop
		   the reference to the page. */

					put_page(page)
					  __put_compound_page()
					    destroy_compound_page
					      NULL_COMPOUND_DTOR
						BUG: kernel NULL pointer
						dereference, address:
						0000000000000000

To address this race, set the dtor to the normal compound page dtor
for non-gigantic pages.  The dtor for gigantic pages does not matter
as gigantic pages are changed from a compound page to 'just a group of
pages' before freeing.  Hence, the destructor is not used.

Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/hugetlb.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 3132c7395743..fa8ec2072949 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1370,8 +1370,28 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
 		h->surplus_huge_pages_node[nid]--;
 	}
 
+	/*
+	 * Very subtle
+	 *
+	 * For non-gigantic pages set the destructor to the normal compound
+	 * page dtor.  This is needed in case someone takes an additional
+	 * temporary ref to the page, and freeing is delayed until they drop
+	 * their reference.
+	 *
+	 * For gigantic pages set the destructor to the null dtor.  This
+	 * destructor will never be called.  Before freeing the gigantic
+	 * page destroy_compound_gigantic_page will turn the compound page
+	 * into a simple group of pages.  After this the destructor does not
+	 * apply.
+	 *
+	 * This handles the case where more than one ref is held when and
+	 * after update_and_free_page is called.
+	 */
 	set_page_refcounted(page);
-	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
+	if (hstate_is_gigantic(h))
+		set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
+	else
+		set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
 
 	h->nr_huge_pages--;
 	h->nr_huge_pages_node[nid]--;
-- 
2.31.1



^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [External] [PATCH 1/3] hugetlb: simplify prep_compound_gigantic_page ref count racing code
  2021-07-10  0:24 ` [PATCH 1/3] hugetlb: simplify prep_compound_gigantic_page ref count racing code Mike Kravetz
@ 2021-07-13  6:31   ` Muchun Song
  0 siblings, 0 replies; 10+ messages in thread
From: Muchun Song @ 2021-07-13  6:31 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux Memory Management List, LKML, Michal Hocko, Oscar Salvador,
	David Hildenbrand, Matthew Wilcox, Naoya Horiguchi, Mina Almasry,
	Andrew Morton

On Sat, Jul 10, 2021 at 8:25 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> Code in prep_compound_gigantic_page waits for a rcu grace period if it
> notices a temporarily inflated ref count on a tail page.  This was due
> to the identified potential race with speculative page cache references
> which could only last for a rcu grace period.  This is overly complicated
> as this situation is VERY unlikely to ever happen.  Instead, just quickly
> return an error.

Right. The race is very very small. IMHO, that does not complicate
the code is the right thing to do.

>
> Also, only print a warning in prep_compound_gigantic_page instead of
> multiple callers.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 15 +++++----------
>  1 file changed, 5 insertions(+), 10 deletions(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 924553aa8f78..e59ebba63da7 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1657,16 +1657,12 @@ static bool prep_compound_gigantic_page(struct page *page, unsigned int order)
>                  * cache adding could take a ref on a 'to be' tail page.
>                  * We need to respect any increased ref count, and only set
>                  * the ref count to zero if count is currently 1.  If count
> -                * is not 1, we call synchronize_rcu in the hope that a rcu
> -                * grace period will cause ref count to drop and then retry.
> -                * If count is still inflated on retry we return an error and
> -                * must discard the pages.
> +                * is not 1, we return an error and caller must discard the
> +                * pages.

Shall we add more details about why we discard the pages?

Thanks.

>                  */
>                 if (!page_ref_freeze(p, 1)) {
> -                       pr_info("HugeTLB unexpected inflated ref count on freshly allocated page\n");
> -                       synchronize_rcu();
> -                       if (!page_ref_freeze(p, 1))
> -                               goto out_error;
> +                       pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
> +                       goto out_error;
>                 }
>                 set_page_count(p, 0);
>                 set_compound_head(p, page);
> @@ -1830,7 +1826,6 @@ static struct page *alloc_fresh_huge_page(struct hstate *h,
>                                 retry = true;
>                                 goto retry;
>                         }
> -                       pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
>                         return NULL;
>                 }
>         }
> @@ -2828,8 +2823,8 @@ static void __init gather_bootmem_prealloc(void)
>                         prep_new_huge_page(h, page, page_to_nid(page));
>                         put_page(page); /* add to the hugepage allocator */
>                 } else {
> +                       /* VERY unlikely inflated ref count on a tail page */
>                         free_gigantic_page(page, huge_page_order(h));
> -                       pr_warn("HugeTLB page can not be used due to unexpected inflated ref count\n");
>                 }
>
>                 /*
> --
> 2.31.1
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [External] [PATCH 3/3] hugetlb: before freeing hugetlb page set dtor to appropriate value
  2021-07-10  0:24 ` [PATCH 3/3] hugetlb: before freeing hugetlb page set dtor to appropriate value Mike Kravetz
@ 2021-07-14 10:57   ` Muchun Song
  2021-07-14 17:39     ` Mike Kravetz
  0 siblings, 1 reply; 10+ messages in thread
From: Muchun Song @ 2021-07-14 10:57 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux Memory Management List, LKML, Michal Hocko, Oscar Salvador,
	David Hildenbrand, Matthew Wilcox, Naoya Horiguchi, Mina Almasry,
	Andrew Morton

On Sat, Jul 10, 2021 at 8:25 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> When removing a hugetlb page from the pool the ref count is set to
> one (as the free page has no ref count) and compound page destructor
> is set to NULL_COMPOUND_DTOR.  Since a subsequent call to free the
> hugetlb page will call __free_pages for non-gigantic pages and
> free_gigantic_page for gigantic pages the destructor is not used.
>
> However, consider the following race with code taking a speculative
> reference on the page:
>
> Thread 0                                Thread 1
> --------                                --------
> remove_hugetlb_page
>   set_page_refcounted(page);
>   set_compound_page_dtor(page,
>            NULL_COMPOUND_DTOR);
>                                         get_page_unless_zero(page)
> __update_and_free_page
>   __free_pages(page,
>            huge_page_order(h));
>
>                 /* Note that __free_pages() will simply drop
>                    the reference to the page. */
>
>                                         put_page(page)
>                                           __put_compound_page()
>                                             destroy_compound_page
>                                               NULL_COMPOUND_DTOR
>                                                 BUG: kernel NULL pointer
>                                                 dereference, address:
>                                                 0000000000000000
>
> To address this race, set the dtor to the normal compound page dtor
> for non-gigantic pages.  The dtor for gigantic pages does not matter
> as gigantic pages are changed from a compound page to 'just a group of
> pages' before freeing.  Hence, the destructor is not used.
>
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> ---
>  mm/hugetlb.c | 22 +++++++++++++++++++++-
>  1 file changed, 21 insertions(+), 1 deletion(-)
>
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 3132c7395743..fa8ec2072949 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1370,8 +1370,28 @@ static void remove_hugetlb_page(struct hstate *h, struct page *page,
>                 h->surplus_huge_pages_node[nid]--;
>         }
>
> +       /*
> +        * Very subtle
> +        *
> +        * For non-gigantic pages set the destructor to the normal compound
> +        * page dtor.  This is needed in case someone takes an additional
> +        * temporary ref to the page, and freeing is delayed until they drop
> +        * their reference.
> +        *
> +        * For gigantic pages set the destructor to the null dtor.  This
> +        * destructor will never be called.  Before freeing the gigantic
> +        * page destroy_compound_gigantic_page will turn the compound page
> +        * into a simple group of pages.  After this the destructor does not
> +        * apply.
> +        *
> +        * This handles the case where more than one ref is held when and
> +        * after update_and_free_page is called.
> +        */
>         set_page_refcounted(page);
> -       set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> +       if (hstate_is_gigantic(h))
> +               set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> +       else
> +               set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);

Hi Mike,

The race is really subtle. But we also should remove the WARN from
free_contig_range, right? Because the refcount of the head page of
the gigantic page can be greater than one, but free_contig_range has
the following warning.

WARN(count != 0, "%lu pages are still in use!\n", count);

Thanks.

>
>         h->nr_huge_pages--;
>         h->nr_huge_pages_node[nid]--;
> --
> 2.31.1
>


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [External] [PATCH 3/3] hugetlb: before freeing hugetlb page set dtor to appropriate value
  2021-07-14 10:57   ` [External] " Muchun Song
@ 2021-07-14 17:39     ` Mike Kravetz
  2021-07-15  3:14       ` Muchun Song
  0 siblings, 1 reply; 10+ messages in thread
From: Mike Kravetz @ 2021-07-14 17:39 UTC (permalink / raw)
  To: Muchun Song
  Cc: Linux Memory Management List, LKML, Michal Hocko, Oscar Salvador,
	David Hildenbrand, Matthew Wilcox, Naoya Horiguchi, Mina Almasry,
	Andrew Morton

On 7/14/21 3:57 AM, Muchun Song wrote:
> On Sat, Jul 10, 2021 at 8:25 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>> +       /*
>> +        * Very subtle
>> +        *
>> +        * For non-gigantic pages set the destructor to the normal compound
>> +        * page dtor.  This is needed in case someone takes an additional
>> +        * temporary ref to the page, and freeing is delayed until they drop
>> +        * their reference.
>> +        *
>> +        * For gigantic pages set the destructor to the null dtor.  This
>> +        * destructor will never be called.  Before freeing the gigantic
>> +        * page destroy_compound_gigantic_page will turn the compound page
>> +        * into a simple group of pages.  After this the destructor does not
>> +        * apply.
>> +        *
>> +        * This handles the case where more than one ref is held when and
>> +        * after update_and_free_page is called.
>> +        */
>>         set_page_refcounted(page);
>> -       set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
>> +       if (hstate_is_gigantic(h))
>> +               set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
>> +       else
>> +               set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
> 
> Hi Mike,
> 
> The race is really subtle. But we also should remove the WARN from
> free_contig_range, right? Because the refcount of the head page of
> the gigantic page can be greater than one, but free_contig_range has
> the following warning.
> 
> WARN(count != 0, "%lu pages are still in use!\n", count);
> 

I did hit that warning in my testing and thought about removing it.
However, I decided to keep it because non-hugetlb code also makes use of
alloc_contig_range/free_contig_range and it might be useful in those
cases.

My 'guess' is that the warning was added not because of temporary ref
count increases but rather to point out any code that forgot to drop a
reference.

BTW - It is not just the 'head' page which could trigger this warning, but
any 'tail' page as well.  That is because we do not call free_contig_range
with a compound page, but rather a group of pages all with ref count of
at least one.

I'm happy to remove the warning if people do not think it is generally
useful.
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [External] [PATCH 3/3] hugetlb: before freeing hugetlb page set dtor to appropriate value
  2021-07-14 17:39     ` Mike Kravetz
@ 2021-07-15  3:14       ` Muchun Song
  0 siblings, 0 replies; 10+ messages in thread
From: Muchun Song @ 2021-07-15  3:14 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux Memory Management List, LKML, Michal Hocko, Oscar Salvador,
	David Hildenbrand, Matthew Wilcox, Naoya Horiguchi, Mina Almasry,
	Andrew Morton

On Thu, Jul 15, 2021 at 1:39 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 7/14/21 3:57 AM, Muchun Song wrote:
> > On Sat, Jul 10, 2021 at 8:25 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >> +       /*
> >> +        * Very subtle
> >> +        *
> >> +        * For non-gigantic pages set the destructor to the normal compound
> >> +        * page dtor.  This is needed in case someone takes an additional
> >> +        * temporary ref to the page, and freeing is delayed until they drop
> >> +        * their reference.
> >> +        *
> >> +        * For gigantic pages set the destructor to the null dtor.  This
> >> +        * destructor will never be called.  Before freeing the gigantic
> >> +        * page destroy_compound_gigantic_page will turn the compound page
> >> +        * into a simple group of pages.  After this the destructor does not
> >> +        * apply.
> >> +        *
> >> +        * This handles the case where more than one ref is held when and
> >> +        * after update_and_free_page is called.
> >> +        */
> >>         set_page_refcounted(page);
> >> -       set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> >> +       if (hstate_is_gigantic(h))
> >> +               set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
> >> +       else
> >> +               set_compound_page_dtor(page, COMPOUND_PAGE_DTOR);
> >
> > Hi Mike,
> >
> > The race is really subtle. But we also should remove the WARN from
> > free_contig_range, right? Because the refcount of the head page of
> > the gigantic page can be greater than one, but free_contig_range has
> > the following warning.
> >
> > WARN(count != 0, "%lu pages are still in use!\n", count);
> >
>
> I did hit that warning in my testing and thought about removing it.
> However, I decided to keep it because non-hugetlb code also makes use of
> alloc_contig_range/free_contig_range and it might be useful in those
> cases.
>
> My 'guess' is that the warning was added not because of temporary ref
> count increases but rather to point out any code that forgot to drop a
> reference.

Got it. At least this patch looks good to me. So

Reviewed-by: Muchun Song <songmuchun@bytedance.com>

>
> BTW - It is not just the 'head' page which could trigger this warning, but
> any 'tail' page as well.  That is because we do not call free_contig_range
> with a compound page, but rather a group of pages all with ref count of
> at least one.

Right.

>
> I'm happy to remove the warning if people do not think it is generally
> useful.

For me, I suggest removing it. If someone has any ideas, please
let us know.

> --
> Mike Kravetz


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/3] hugetlb: fix potential ref counting races
  2021-07-10  0:24 [PATCH 0/3] hugetlb: fix potential ref counting races Mike Kravetz
                   ` (2 preceding siblings ...)
  2021-07-10  0:24 ` [PATCH 3/3] hugetlb: before freeing hugetlb page set dtor to appropriate value Mike Kravetz
@ 2021-07-27 23:41 ` Mike Kravetz
  2021-07-28  4:03   ` Muchun Song
  3 siblings, 1 reply; 10+ messages in thread
From: Mike Kravetz @ 2021-07-27 23:41 UTC (permalink / raw)
  To: linux-mm, linux-kernel
  Cc: Muchun Song, Michal Hocko, Oscar Salvador, David Hildenbrand,
	Matthew Wilcox, Naoya Horiguchi, Mina Almasry, Andrew Morton

Any additional comments on these patches/this approach?

The first patch addressing this issue actually went into the 5.14 merge
window as commit 7118fc2906e2 ("hugetlb: address ref count racing in
prep_compound_gigantic_page").

All this code is very tricky and subtle.  It addresses potential issues
discovered by code analysis.  I do not believe the races have ever been
experienced in practice.  If anyone has suggestions for a simpler or
alternative approach, I would love to hear them.
-- 
Mike Kravetz


On 7/9/21 5:24 PM, Mike Kravetz wrote:
> When Muchun Song brought up a potential issue with hugetlb ref counting[1],
> I started looking closer at the code.  hugetlbfs is the only code with it's
> own specialized compound page destructor and taking special action when ref
> counts drop to zero.  Potential races happen in this unique handling of ref
> counts.  The following patches address these races when creating and
> destroying hugetlb pages.
> 
> These potential races have likely existed since the creation of
> hugetlbfs.  They certainly have been around for more than 10 years.
> However, I am unaware of anyone actually hitting these races.  It is
> VERY unlikely than anyone will actually hit these races, but they do
> exist.
> 
> I could not think of an easy (or difficult) way to force these races.
> Therefore, testing consisted of adding code to randomly increase ref
> counts in strategic places.  In this way, I was able to exercise all the
> race handling code paths.
> 
> [1] https://lore.kernel.org/linux-mm/CAMZfGtVMn3daKrJwZMaVOGOaJU+B4dS--x_oPmGQMD=c=QNGEg@mail.gmail.com/
> 
> Mike Kravetz (3):
>   hugetlb: simplify prep_compound_gigantic_page ref count racing code
>   hugetlb: drop ref count earlier after page allocation
>   hugetlb: before freeing hugetlb page set dtor to appropriate value
> 
>  mm/hugetlb.c | 137 ++++++++++++++++++++++++++++++++++++++-------------
>  1 file changed, 104 insertions(+), 33 deletions(-)
> 


^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [PATCH 0/3] hugetlb: fix potential ref counting races
  2021-07-27 23:41 ` [PATCH 0/3] hugetlb: fix potential ref counting races Mike Kravetz
@ 2021-07-28  4:03   ` Muchun Song
  0 siblings, 0 replies; 10+ messages in thread
From: Muchun Song @ 2021-07-28  4:03 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Linux Memory Management List, LKML, Michal Hocko, Oscar Salvador,
	David Hildenbrand, Matthew Wilcox, Naoya Horiguchi, Mina Almasry,
	Andrew Morton

On Wed, Jul 28, 2021 at 7:41 AM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> Any additional comments on these patches/this approach?
>
> The first patch addressing this issue actually went into the 5.14 merge
> window as commit 7118fc2906e2 ("hugetlb: address ref count racing in
> prep_compound_gigantic_page").
>
> All this code is very tricky and subtle.  It addresses potential issues
> discovered by code analysis.  I do not believe the races have ever been
> experienced in practice.  If anyone has suggestions for a simpler or
> alternative approach, I would love to hear them.

Hi Mike,

I agree with you that this code is very tricky and subtle. I have looked
at this patch set. For me, I cannot figure out a better solution for this
race.

--
Thanks,
Muchun

> --
> Mike Kravetz
>
>
> On 7/9/21 5:24 PM, Mike Kravetz wrote:
> > When Muchun Song brought up a potential issue with hugetlb ref counting[1],
> > I started looking closer at the code.  hugetlbfs is the only code with it's
> > own specialized compound page destructor and taking special action when ref
> > counts drop to zero.  Potential races happen in this unique handling of ref
> > counts.  The following patches address these races when creating and
> > destroying hugetlb pages.
> >
> > These potential races have likely existed since the creation of
> > hugetlbfs.  They certainly have been around for more than 10 years.
> > However, I am unaware of anyone actually hitting these races.  It is
> > VERY unlikely than anyone will actually hit these races, but they do
> > exist.
> >
> > I could not think of an easy (or difficult) way to force these races.
> > Therefore, testing consisted of adding code to randomly increase ref
> > counts in strategic places.  In this way, I was able to exercise all the
> > race handling code paths.
> >
> > [1] https://lore.kernel.org/linux-mm/CAMZfGtVMn3daKrJwZMaVOGOaJU+B4dS--x_oPmGQMD=c=QNGEg@mail.gmail.com/
> >
> > Mike Kravetz (3):
> >   hugetlb: simplify prep_compound_gigantic_page ref count racing code
> >   hugetlb: drop ref count earlier after page allocation
> >   hugetlb: before freeing hugetlb page set dtor to appropriate value
> >
> >  mm/hugetlb.c | 137 ++++++++++++++++++++++++++++++++++++++-------------
> >  1 file changed, 104 insertions(+), 33 deletions(-)
> >


^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2021-07-28  4:04 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-07-10  0:24 [PATCH 0/3] hugetlb: fix potential ref counting races Mike Kravetz
2021-07-10  0:24 ` [PATCH 1/3] hugetlb: simplify prep_compound_gigantic_page ref count racing code Mike Kravetz
2021-07-13  6:31   ` [External] " Muchun Song
2021-07-10  0:24 ` [PATCH 2/3] hugetlb: drop ref count earlier after page allocation Mike Kravetz
2021-07-10  0:24 ` [PATCH 3/3] hugetlb: before freeing hugetlb page set dtor to appropriate value Mike Kravetz
2021-07-14 10:57   ` [External] " Muchun Song
2021-07-14 17:39     ` Mike Kravetz
2021-07-15  3:14       ` Muchun Song
2021-07-27 23:41 ` [PATCH 0/3] hugetlb: fix potential ref counting races Mike Kravetz
2021-07-28  4:03   ` Muchun Song

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).