* [PATCH v5 0/2] hwpoison: fix race with hugetlb page allocation
@ 2021-05-18 23:12 Naoya Horiguchi
  2021-05-18 23:12 ` [PATCH v5 1/2] mm,hwpoison: " Naoya Horiguchi
  2021-05-18 23:12 ` [PATCH v5 2/2] mm,hwpoison: make get_hwpoison_page call get_any_page() Naoya Horiguchi
  0 siblings, 2 replies; 9+ messages in thread
From: Naoya Horiguchi @ 2021-05-18 23:12 UTC (permalink / raw)
  To: Oscar Salvador, Muchun Song, linux-mm
  Cc: Andrew Morton, Mike Kravetz, Michal Hocko, Tony Luck,
	Naoya Horiguchi, linux-kernel

Hi,

I updated the patchset again based on the discussion over v4 [1].
I rebased onto v5.13-rc2 and applied the suggested fixes and code adjustments.

Thanks,
Naoya Horiguchi

[1] https://lore.kernel.org/linux-mm/20210517045401.2506032-1-nao.horiguchi@gmail.com/T/#u
---
Summary:

Naoya Horiguchi (2):
      mm,hwpoison: fix race with hugetlb page allocation
      mm,hwpoison: make get_hwpoison_page call get_any_page()

 include/linux/hugetlb.h |   6 +++
 mm/hugetlb.c            |  17 +++++++
 mm/memory-failure.c     | 118 +++++++++++++++++++++++++++---------------------
 3 files changed, 90 insertions(+), 51 deletions(-)


* [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-05-18 23:12 [PATCH v5 0/2] hwpoison: fix race with hugetlb page allocation Naoya Horiguchi
@ 2021-05-18 23:12 ` Naoya Horiguchi
  2021-05-19 22:32   ` Mike Kravetz
  2021-05-18 23:12 ` [PATCH v5 2/2] mm,hwpoison: make get_hwpoison_page call get_any_page() Naoya Horiguchi
  1 sibling, 1 reply; 9+ messages in thread
From: Naoya Horiguchi @ 2021-05-18 23:12 UTC (permalink / raw)
  To: Oscar Salvador, Muchun Song, linux-mm
  Cc: Andrew Morton, Mike Kravetz, Michal Hocko, Tony Luck,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

When hugetlb page fault (under overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:

    CPU0:                           CPU1:

                                    gather_surplus_pages()
                                      page = alloc_surplus_huge_page()
    memory_failure_hugetlb()
      get_hwpoison_page(page)
        __get_hwpoison_page(page)
          get_page_unless_zero(page)
                                      zero = put_page_testzero(page)
                                      VM_BUG_ON_PAGE(!zero, page)
                                      enqueue_huge_page(h, page)
      put_page(page)

__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is wrong because there's
a time window where compound pages have non-zero refcount during
initialization.  So make __get_hwpoison_page() check page status a bit
more for hugetlb pages.

Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
Cc: stable@vger.kernel.org # 5.12+
---
 include/linux/hugetlb.h |  6 ++++++
 mm/hugetlb.c            | 15 +++++++++++++++
 mm/memory-failure.c     |  8 +++++++-
 3 files changed, 28 insertions(+), 1 deletion(-)

diff --git v5.13-rc2/include/linux/hugetlb.h v5.13-rc2_patched/include/linux/hugetlb.h
index b92f25ccef58..790ae618548d 100644
--- v5.13-rc2/include/linux/hugetlb.h
+++ v5.13-rc2_patched/include/linux/hugetlb.h
@@ -149,6 +149,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
 long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 						long freed);
 bool isolate_huge_page(struct page *page, struct list_head *list);
+int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
 void putback_active_hugepage(struct page *page);
 void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
 void free_huge_page(struct page *page);
@@ -339,6 +340,11 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
 	return false;
 }
 
+static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+{
+	return 0;
+}
+
 static inline void putback_active_hugepage(struct page *page)
 {
 }
diff --git v5.13-rc2/mm/hugetlb.c v5.13-rc2_patched/mm/hugetlb.c
index 95918f410c0f..f138bae3e302 100644
--- v5.13-rc2/mm/hugetlb.c
+++ v5.13-rc2_patched/mm/hugetlb.c
@@ -5847,6 +5847,21 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
 	return ret;
 }
 
+int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+{
+	int ret = 0;
+
+	*hugetlb = false;
+	spin_lock_irq(&hugetlb_lock);
+	if (PageHeadHuge(page)) {
+		*hugetlb = true;
+		if (HPageFreed(page) || HPageMigratable(page))
+			ret = get_page_unless_zero(page);
+	}
+	spin_unlock_irq(&hugetlb_lock);
+	return ret;
+}
+
 void putback_active_hugepage(struct page *page)
 {
 	spin_lock_irq(&hugetlb_lock);
diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c
index 85ad98c00fd9..353c6177e489 100644
--- v5.13-rc2/mm/memory-failure.c
+++ v5.13-rc2_patched/mm/memory-failure.c
@@ -959,8 +959,14 @@ static int page_action(struct page_state *ps, struct page *p,
 static int __get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
+	int ret = 0;
+	bool hugetlb = false;
+
+	ret = get_hwpoison_huge_page(head, &hugetlb);
+	if (hugetlb)
+		return ret;
 
-	if (!PageHuge(head) && PageTransHuge(head)) {
+	if (PageTransHuge(head)) {
 		/*
 		 * Non anonymous thp exists only in allocation/free time. We
 		 * can't handle such a case correctly, so let's give it up.
-- 
2.25.1



* [PATCH v5 2/2] mm,hwpoison: make get_hwpoison_page call get_any_page()
  2021-05-18 23:12 [PATCH v5 0/2] hwpoison: fix race with hugetlb page allocation Naoya Horiguchi
  2021-05-18 23:12 ` [PATCH v5 1/2] mm,hwpoison: " Naoya Horiguchi
@ 2021-05-18 23:12 ` Naoya Horiguchi
  1 sibling, 0 replies; 9+ messages in thread
From: Naoya Horiguchi @ 2021-05-18 23:12 UTC (permalink / raw)
  To: Oscar Salvador, Muchun Song, linux-mm
  Cc: Andrew Morton, Mike Kravetz, Michal Hocko, Tony Luck,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

__get_hwpoison_page() could fail to grab the refcount due to some race
condition, so it's helpful to handle such failures by retrying.
get_any_page() already has the retry logic, so make get_hwpoison_page()
call get_any_page() when called from memory_failure().

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/hugetlb.c        |   2 +
 mm/memory-failure.c | 110 ++++++++++++++++++++++++--------------------
 2 files changed, 62 insertions(+), 50 deletions(-)

diff --git v5.13-rc2/mm/hugetlb.c v5.13-rc2_patched/mm/hugetlb.c
index f138bae3e302..6aef717cd8df 100644
--- v5.13-rc2/mm/hugetlb.c
+++ v5.13-rc2_patched/mm/hugetlb.c
@@ -5857,6 +5857,8 @@ int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
 		*hugetlb = true;
 		if (HPageFreed(page) || HPageMigratable(page))
 			ret = get_page_unless_zero(page);
+		else
+			ret = -EBUSY;
 	}
 	spin_unlock_irq(&hugetlb_lock);
 	return ret;
diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c
index 353c6177e489..5e7d9c21f01b 100644
--- v5.13-rc2/mm/memory-failure.c
+++ v5.13-rc2_patched/mm/memory-failure.c
@@ -949,13 +949,6 @@ static int page_action(struct page_state *ps, struct page *p,
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
 }
 
-/**
- * __get_hwpoison_page() - Get refcount for memory error handling:
- * @page:	raw error page (hit by memory error)
- *
- * Return: return 0 if failed to grab the refcount, otherwise true (some
- * non-zero value.)
- */
 static int __get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
@@ -992,15 +985,6 @@ static int __get_hwpoison_page(struct page *page)
 	return 0;
 }
 
-/*
- * Safely get reference count of an arbitrary page.
- *
- * Returns 0 for a free page, 1 for an in-use page,
- * -EIO for a page-type we cannot handle and -EBUSY if we raced with an
- * allocation.
- * We only incremented refcount in case the page was already in-use and it
- * is a known type we can handle.
- */
 static int get_any_page(struct page *p, unsigned long flags)
 {
 	int ret = 0, pass = 0;
@@ -1010,50 +994,76 @@ static int get_any_page(struct page *p, unsigned long flags)
 		count_increased = true;
 
 try_again:
-	if (!count_increased && !__get_hwpoison_page(p)) {
-		if (page_count(p)) {
-			/* We raced with an allocation, retry. */
-			if (pass++ < 3)
-				goto try_again;
-			ret = -EBUSY;
-		} else if (!PageHuge(p) && !is_free_buddy_page(p)) {
-			/* We raced with put_page, retry. */
+	if (!count_increased) {
+		ret = __get_hwpoison_page(p);
+		if (!ret) {
+			if (page_count(p)) {
+				/* We raced with an allocation, retry. */
+				if (pass++ < 3)
+					goto try_again;
+				ret = -EBUSY;
+			} else if (!PageHuge(p) && !is_free_buddy_page(p)) {
+				/* We raced with put_page, retry. */
+				if (pass++ < 3)
+					goto try_again;
+				ret = -EIO;
+			}
+			goto out;
+		} else if (ret == -EBUSY) {
+			/* We raced with freeing huge page to buddy, retry. */
 			if (pass++ < 3)
 				goto try_again;
-			ret = -EIO;
 		}
+	}
+
+	if (PageHuge(p) || PageLRU(p) || __PageMovable(p)) {
+		ret = 1;
 	} else {
-		if (PageHuge(p) || PageLRU(p) || __PageMovable(p)) {
-			ret = 1;
-		} else {
-			/*
-			 * A page we cannot handle. Check whether we can turn
-			 * it into something we can handle.
-			 */
-			if (pass++ < 3) {
-				put_page(p);
-				shake_page(p, 1);
-				count_increased = false;
-				goto try_again;
-			}
+		/*
+		 * A page we cannot handle. Check whether we can turn
+		 * it into something we can handle.
+		 */
+		if (pass++ < 3) {
 			put_page(p);
-			ret = -EIO;
+			shake_page(p, 1);
+			count_increased = false;
+			goto try_again;
 		}
+		put_page(p);
+		ret = -EIO;
 	}
-
+out:
 	return ret;
 }
 
-static int get_hwpoison_page(struct page *p, unsigned long flags,
-			     enum mf_flags ctxt)
+/**
+ * get_hwpoison_page() - Get refcount for memory error handling
+ * @p:		Raw error page (hit by memory error)
+ * @flags:	Flags controlling behavior of error handling
+ *
+ * get_hwpoison_page() takes a page refcount on an error page to handle the
+ * memory error on it, after checking that the error page is in a well-defined
+ * state (defined as a page type on which we can successfully handle the memory
+ * error, such as LRU pages and hugetlb pages).
+ *
+ * Memory error handling can be triggered at any time on any type of page, so
+ * it is prone to races with the typical memory management lifecycle (like
+ * allocation and freeing).  To avoid such races, get_hwpoison_page() takes
+ * extra care with the error page's state (as done in __get_hwpoison_page())
+ * and has some retry logic in get_any_page().
+ *
+ * Return: 0 on failure,
+ *         1 on success for in-use pages in a well-defined state,
+ *         -EIO for pages on which we can not handle memory errors,
+ *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
+ *         operations like allocation and free.
+ */
+static int get_hwpoison_page(struct page *p, unsigned long flags)
 {
 	int ret;
 
 	zone_pcp_disable(page_zone(p));
-	if (ctxt == MF_SOFT_OFFLINE)
-		ret = get_any_page(p, flags);
-	else
-		ret = __get_hwpoison_page(p);
+	ret = get_any_page(p, flags);
 	zone_pcp_enable(page_zone(p));
 
 	return ret;
@@ -1239,7 +1249,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 
 	num_poisoned_pages_inc();
 
-	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p, flags, 0)) {
+	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p, flags)) {
 		/*
 		 * Check "filter hit" and "race with other subpage."
 		 */
@@ -1453,7 +1463,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
-	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p, flags, 0)) {
+	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p, flags)) {
 		if (is_free_buddy_page(p)) {
 			if (take_page_off_buddy(p)) {
 				page_ref_inc(p);
@@ -1741,7 +1751,7 @@ int unpoison_memory(unsigned long pfn)
 		return 0;
 	}
 
-	if (!get_hwpoison_page(p, flags, 0)) {
+	if (!get_hwpoison_page(p, flags)) {
 		if (TestClearPageHWPoison(p))
 			num_poisoned_pages_dec();
 		unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
@@ -1957,7 +1967,7 @@ int soft_offline_page(unsigned long pfn, int flags)
 
 retry:
 	get_online_mems();
-	ret = get_hwpoison_page(page, flags, MF_SOFT_OFFLINE);
+	ret = get_hwpoison_page(page, flags);
 	put_online_mems();
 
 	if (ret > 0) {
-- 
2.25.1



* Re: [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-05-18 23:12 ` [PATCH v5 1/2] mm,hwpoison: " Naoya Horiguchi
@ 2021-05-19 22:32   ` Mike Kravetz
  2021-05-20  7:17     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 9+ messages in thread
From: Mike Kravetz @ 2021-05-19 22:32 UTC (permalink / raw)
  To: Naoya Horiguchi, Oscar Salvador, Muchun Song, linux-mm
  Cc: Andrew Morton, Michal Hocko, Tony Luck, Naoya Horiguchi, linux-kernel

On 5/18/21 4:12 PM, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> When hugetlb page fault (under overcommitting situation) and
> memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
> 
>     CPU0:                           CPU1:
> 
>                                     gather_surplus_pages()
>                                       page = alloc_surplus_huge_page()
>     memory_failure_hugetlb()
>       get_hwpoison_page(page)
>         __get_hwpoison_page(page)
>           get_page_unless_zero(page)
>                                       zero = put_page_testzero(page)
>                                       VM_BUG_ON_PAGE(!zero, page)
>                                       enqueue_huge_page(h, page)
>       put_page(page)
> 
> __get_hwpoison_page() only checks the page refcount before taking an
> additional one for memory error handling, which is wrong because there's
> a time window where compound pages have non-zero refcount during
> initialization.  So make __get_hwpoison_page() check page status a bit
> more for hugetlb pages.
> 
> Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Reported-by: Muchun Song <songmuchun@bytedance.com>
> Cc: stable@vger.kernel.org # 5.12+
> ---
>  include/linux/hugetlb.h |  6 ++++++
>  mm/hugetlb.c            | 15 +++++++++++++++
>  mm/memory-failure.c     |  8 +++++++-
>  3 files changed, 28 insertions(+), 1 deletion(-)
> 
> diff --git v5.13-rc2/include/linux/hugetlb.h v5.13-rc2_patched/include/linux/hugetlb.h
> index b92f25ccef58..790ae618548d 100644
> --- v5.13-rc2/include/linux/hugetlb.h
> +++ v5.13-rc2_patched/include/linux/hugetlb.h
> @@ -149,6 +149,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
>  long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
>  						long freed);
>  bool isolate_huge_page(struct page *page, struct list_head *list);
> +int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
>  void putback_active_hugepage(struct page *page);
>  void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
>  void free_huge_page(struct page *page);
> @@ -339,6 +340,11 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
>  	return false;
>  }
>  
> +static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
> +{
> +	return 0;
> +}
> +
>  static inline void putback_active_hugepage(struct page *page)
>  {
>  }
> diff --git v5.13-rc2/mm/hugetlb.c v5.13-rc2_patched/mm/hugetlb.c
> index 95918f410c0f..f138bae3e302 100644
> --- v5.13-rc2/mm/hugetlb.c
> +++ v5.13-rc2_patched/mm/hugetlb.c
> @@ -5847,6 +5847,21 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
>  	return ret;
>  }
>  
> +int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
> +{
> +	int ret = 0;
> +
> +	*hugetlb = false;
> +	spin_lock_irq(&hugetlb_lock);
> +	if (PageHeadHuge(page)) {
> +		*hugetlb = true;
> +		if (HPageFreed(page) || HPageMigratable(page))
> +			ret = get_page_unless_zero(page);
> +	}
> +	spin_unlock_irq(&hugetlb_lock);
> +	return ret;
> +}
> +
>  void putback_active_hugepage(struct page *page)
>  {
>  	spin_lock_irq(&hugetlb_lock);
> diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c
> index 85ad98c00fd9..353c6177e489 100644
> --- v5.13-rc2/mm/memory-failure.c
> +++ v5.13-rc2_patched/mm/memory-failure.c
> @@ -959,8 +959,14 @@ static int page_action(struct page_state *ps, struct page *p,
>  static int __get_hwpoison_page(struct page *page)
>  {
>  	struct page *head = compound_head(page);
> +	int ret = 0;
> +	bool hugetlb = false;
> +
> +	ret = get_hwpoison_huge_page(head, &hugetlb);
> +	if (hugetlb)
> +		return ret;
>  

Hello Naoya,

Thanks for your continued efforts.  However, I believe the race still
exists.  Unless I am mistaken, it is possible that the page is in the hugetlb
allocation path and racing with __get_hwpoison_page() as follows:

    CPU0:                           CPU1:

                                    gather_surplus_pages()
                                      page = alloc_surplus_huge_page()
				      	page = alloc_fresh_huge_page()
				      	  page = alloc_buddy_huge_page()
    memory_failure_hugetlb()
      get_hwpoison_page(page)
        __get_hwpoison_page(page)
	  get_hwpoison_huge_page()
	    /* Note that PageHuge()
	       is false, so hugetlb
	       not set */
	  PageTransHuge(head) false
					  prep_new_huge_page(page)
					  /* Now PageHuge() becomes true */
          get_page_unless_zero(page)

I am not sure if it is possible to handle this race in the memory error
code.  I can not think of a way to avoid potentially incrementing the
ref count on a hugetlb page as it is being created.  There is nothing
synchronizing this in the hugetlb code.
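
For reference, my reading of that window in v5.13-rc2 mm/hugetlb.c
(simplified, error paths omitted) is:

        page = alloc_buddy_huge_page(...);   /* compound page, page_count() == 1,
                                                PageHuge() still false */
        prep_new_huge_page(h, page, nid);    /* hugetlb destructor set, so
                                                PageHuge() becomes true;
                                                refcount is still 1 */

so a concurrent get_page_unless_zero() can succeed anywhere in that window.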

When Muchun first proposed a fix to the race, the idea was to catch the
race in the hugetlb code.  Michal suggested that the memory error code
be more careful in modifying ref counts.  I would wait a bit to see if
someone has a good idea how this can be done.  We 'may' need to revisit
the approach suggested by Muchun.
-- 
Mike Kravetz


* Re: [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-05-19 22:32   ` Mike Kravetz
@ 2021-05-20  7:17     ` HORIGUCHI NAOYA(堀口 直也)
  2021-05-25  7:36       ` Oscar Salvador
  0 siblings, 1 reply; 9+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-05-20  7:17 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Naoya Horiguchi, Oscar Salvador, Muchun Song, linux-mm,
	Andrew Morton, Michal Hocko, Tony Luck, linux-kernel

On Wed, May 19, 2021 at 03:32:17PM -0700, Mike Kravetz wrote:
> On 5/18/21 4:12 PM, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > When hugetlb page fault (under overcommitting situation) and
> > memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
> > 
> >     CPU0:                           CPU1:
> > 
> >                                     gather_surplus_pages()
> >                                       page = alloc_surplus_huge_page()
> >     memory_failure_hugetlb()
> >       get_hwpoison_page(page)
> >         __get_hwpoison_page(page)
> >           get_page_unless_zero(page)
> >                                       zero = put_page_testzero(page)
> >                                       VM_BUG_ON_PAGE(!zero, page)
> >                                       enqueue_huge_page(h, page)
> >       put_page(page)
> > 
> > __get_hwpoison_page() only checks the page refcount before taking an
> > additional one for memory error handling, which is wrong because there's
> > a time window where compound pages have non-zero refcount during
> > initialization.  So make __get_hwpoison_page() check page status a bit
> > more for hugetlb pages.
> > 
> > Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Reported-by: Muchun Song <songmuchun@bytedance.com>
> > Cc: stable@vger.kernel.org # 5.12+
> > ---
> >  include/linux/hugetlb.h |  6 ++++++
> >  mm/hugetlb.c            | 15 +++++++++++++++
> >  mm/memory-failure.c     |  8 +++++++-
> >  3 files changed, 28 insertions(+), 1 deletion(-)
> > 
> > diff --git v5.13-rc2/include/linux/hugetlb.h v5.13-rc2_patched/include/linux/hugetlb.h
> > index b92f25ccef58..790ae618548d 100644
> > --- v5.13-rc2/include/linux/hugetlb.h
> > +++ v5.13-rc2_patched/include/linux/hugetlb.h
> > @@ -149,6 +149,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
> >  long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
> >  						long freed);
> >  bool isolate_huge_page(struct page *page, struct list_head *list);
> > +int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
> >  void putback_active_hugepage(struct page *page);
> >  void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
> >  void free_huge_page(struct page *page);
> > @@ -339,6 +340,11 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
> >  	return false;
> >  }
> >  
> > +static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
> > +{
> > +	return 0;
> > +}
> > +
> >  static inline void putback_active_hugepage(struct page *page)
> >  {
> >  }
> > diff --git v5.13-rc2/mm/hugetlb.c v5.13-rc2_patched/mm/hugetlb.c
> > index 95918f410c0f..f138bae3e302 100644
> > --- v5.13-rc2/mm/hugetlb.c
> > +++ v5.13-rc2_patched/mm/hugetlb.c
> > @@ -5847,6 +5847,21 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
> >  	return ret;
> >  }
> >  
> > +int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
> > +{
> > +	int ret = 0;
> > +
> > +	*hugetlb = false;
> > +	spin_lock_irq(&hugetlb_lock);
> > +	if (PageHeadHuge(page)) {
> > +		*hugetlb = true;
> > +		if (HPageFreed(page) || HPageMigratable(page))
> > +			ret = get_page_unless_zero(page);
> > +	}
> > +	spin_unlock_irq(&hugetlb_lock);
> > +	return ret;
> > +}
> > +
> >  void putback_active_hugepage(struct page *page)
> >  {
> >  	spin_lock_irq(&hugetlb_lock);
> > diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c
> > index 85ad98c00fd9..353c6177e489 100644
> > --- v5.13-rc2/mm/memory-failure.c
> > +++ v5.13-rc2_patched/mm/memory-failure.c
> > @@ -959,8 +959,14 @@ static int page_action(struct page_state *ps, struct page *p,
> >  static int __get_hwpoison_page(struct page *page)
> >  {
> >  	struct page *head = compound_head(page);
> > +	int ret = 0;
> > +	bool hugetlb = false;
> > +
> > +	ret = get_hwpoison_huge_page(head, &hugetlb);
> > +	if (hugetlb)
> > +		return ret;
> >  
> 
> Hello Naoya,
> 
> Thanks for your continued efforts.  However, I believe the race still
> exists.  Unless I am mistaken, it is possible that the page is in the hugetlb
> allocation path and racing with __get_hwpoison_page() as follows:

Hi Mike, thank you for the investigation.

> 
>     CPU0:                           CPU1:
> 
>                                     gather_surplus_pages()
>                                       page = alloc_surplus_huge_page()
> 				      	page = alloc_fresh_huge_page()
> 				      	  page = alloc_buddy_huge_page()
>     memory_failure_hugetlb()
>       get_hwpoison_page(page)
>         __get_hwpoison_page(page)
> 	  get_hwpoison_huge_page()
> 	    /* Note that PageHuge()
> 	       is false, so hugetlb
> 	       not set */
> 	  PageTransHuge(head) false

It seems that PageTransHuge() returns true in this race condition because it
simply checks the PG_head flag.  But anyway, get_page_unless_zero() is called
for a hugetlb page in this race, which is problematic.
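
For reference, with CONFIG_TRANSPARENT_HUGEPAGE enabled, the v5.13 definition
(include/linux/huge_mm.h) only tests the head flag:

        static inline int PageTransHuge(struct page *page)
        {
                VM_BUG_ON_PAGE(PageTail(page), page);
                return PageHead(page);
        }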

> 					  prep_new_huge_page(page)
> 					  /* Now PageHuge() becomes true */
>           get_page_unless_zero(page)
> 
> I am not sure if it is possible to handle this race in the memory error
> code. 

I think that __get_hwpoison_page() might not be calling get_page_unless_zero()
properly.  Looking at the other callers of this function, most(*) of them call
it in a context where the given page is pinned and in use, and
get_page_unless_zero() is used to detect a race with page freeing.  In the
current version, __get_hwpoison_page() calls get_page_unless_zero() without
honoring that assumption, which might be the root cause of the race with
hugetlb page allocation.

# (*) It seems to me that do_migrate_range() might have the same issue
# around get_page_unless_zero().

So I think of inserting the check to comply with the assumption of
get_hwpoison_huge_page() like below:

        ret = get_hwpoison_huge_page(head, &hugetlb);
        if (hugetlb)
                return ret;

        if (!PageLRU(head) && !__PageMovable(head))
                return 0;

        if (PageTransHuge(head)) {
                ...
        }

        if (get_page_unless_zero(head)) {
                ...
        }

        return 0;

The newly added checks should prevent the above race; get_any_page() should
then retry and properly grab the page once it is a stable hugetlb page.
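
For reference, this dovetails with the retry loop in patch 2/2, where
__get_hwpoison_page() returning 0 while page_count() is raised is treated as
"raced with an allocation" and retried:

        ret = __get_hwpoison_page(p);
        if (!ret) {
                if (page_count(p)) {
                        /* We raced with an allocation, retry. */
                        if (pass++ < 3)
                                goto try_again;
                        ret = -EBUSY;
                }
                ...
        }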

> I can not think of a way to avoid potentially incrementing the
> ref count on a hugetlb page as it is being created.  There is nothing
> synchronizing this in the hugetlb code.
> 
> When Muchun first proposed a fix to the race, the idea was to catch the
> race in the hugetlb code.  Michal suggested that the memory error code
> be more careful in modifying ref counts.  I would wait a bit to see if
> someone has a good idea how this can be done.  We 'may' need to revisit
> the approach suggested by Muchun.

If the above approach is still broken, let's revisit Muchun's approach.

Thanks,
Naoya Horiguchi


* Re: [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-05-20  7:17     ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-05-25  7:36       ` Oscar Salvador
  2021-05-25  8:07         ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 9+ messages in thread
From: Oscar Salvador @ 2021-05-25  7:36 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Mike Kravetz, Naoya Horiguchi, Muchun Song, linux-mm,
	Andrew Morton, Michal Hocko, Tony Luck, linux-kernel

On Thu, May 20, 2021 at 07:17:17AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> So I think of inserting the check to comply with the assumption of
> get_hwpoison_huge_page() like below:
> 
>         ret = get_hwpoison_huge_page(head, &hugetlb);
>         if (hugetlb)
>                 return ret;
> 
>         if (!PageLRU(head) && !__PageMovable(head))
>                 return 0;
> 
>         if (PageTransHuge(head)) {
>                 ...
>         }
> 
>         if (get_page_unless_zero(head)) {
>                 ...
>         }
> 
>         return 0;

Hi Naoya,

would you mind posting a complete draft of what it would look like?
I am having a hard time picturing it.


Thanks

-- 
Oscar Salvador
SUSE L3


* Re: [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-05-25  7:36       ` Oscar Salvador
@ 2021-05-25  8:07         ` HORIGUCHI NAOYA(堀口 直也)
  2021-05-25  9:09           ` Oscar Salvador
  0 siblings, 1 reply; 9+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-05-25  8:07 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, Naoya Horiguchi, Muchun Song, linux-mm,
	Andrew Morton, Michal Hocko, Tony Luck, linux-kernel

On Tue, May 25, 2021 at 09:36:05AM +0200, Oscar Salvador wrote:
> On Thu, May 20, 2021 at 07:17:17AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > So I think of inserting the check to comply with the assumption of
> > get_hwpoison_huge_page() like below:
> > 
> >         ret = get_hwpoison_huge_page(head, &hugetlb);
> >         if (hugetlb)
> >                 return ret;
> > 
> >         if (!PageLRU(head) && !__PageMovable(head))
> >                 return 0;
> > 
> >         if (PageTransHuge(head)) {
> >                 ...
> >         }
> > 
> >         if (get_page_unless_zero(head)) {
> >                 ...
> >         }
> > 
> >         return 0;
> 
> Hi Naoya,
> 
> would you mind posting a complete draft of what it would look like?
> I am having a hard time picturing it.

OK, here's the current draft.

Thanks,
Naoya Horiguchi

---
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Tue, 18 May 2021 23:49:18 +0900
Subject: [PATCH] mm,hwpoison: fix race with hugetlb page allocation

When hugetlb page fault (under overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:

    CPU0:                           CPU1:

                                    gather_surplus_pages()
                                      page = alloc_surplus_huge_page()
    memory_failure_hugetlb()
      get_hwpoison_page(page)
        __get_hwpoison_page(page)
          get_page_unless_zero(page)
                                      zero = put_page_testzero(page)
                                      VM_BUG_ON_PAGE(!zero, page)
                                      enqueue_huge_page(h, page)
      put_page(page)

__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is wrong because there's
a time window where compound pages have non-zero refcount during
initialization.  So make __get_hwpoison_page() check page status a bit
more for hugetlb pages.

Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
Cc: stable@vger.kernel.org # 5.12+
---
 include/linux/hugetlb.h |  6 ++++++
 mm/hugetlb.c            | 15 +++++++++++++++
 mm/memory-failure.c     | 11 ++++++++++-
 3 files changed, 31 insertions(+), 1 deletion(-)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index b92f25ccef58..790ae618548d 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -149,6 +149,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
 long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 						long freed);
 bool isolate_huge_page(struct page *page, struct list_head *list);
+int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
 void putback_active_hugepage(struct page *page);
 void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
 void free_huge_page(struct page *page);
@@ -339,6 +340,11 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
 	return false;
 }
 
+static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+{
+	return 0;
+}
+
 static inline void putback_active_hugepage(struct page *page)
 {
 }
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 95918f410c0f..f138bae3e302 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5847,6 +5847,21 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
 	return ret;
 }
 
+int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+{
+	int ret = 0;
+
+	*hugetlb = false;
+	spin_lock_irq(&hugetlb_lock);
+	if (PageHeadHuge(page)) {
+		*hugetlb = true;
+		if (HPageFreed(page) || HPageMigratable(page))
+			ret = get_page_unless_zero(page);
+	}
+	spin_unlock_irq(&hugetlb_lock);
+	return ret;
+}
+
 void putback_active_hugepage(struct page *page)
 {
 	spin_lock_irq(&hugetlb_lock);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 85ad98c00fd9..4c264c4090d7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -959,8 +959,17 @@ static int page_action(struct page_state *ps, struct page *p,
 static int __get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
+	int ret = 0;
+	bool hugetlb = false;
+
+	ret = get_hwpoison_huge_page(head, &hugetlb);
+	if (hugetlb)
+		return ret;
+
+	if (!PageLRU(head) && !__PageMovable(head))
+		return 0;
 
-	if (!PageHuge(head) && PageTransHuge(head)) {
+	if (PageTransHuge(head)) {
 		/*
 		 * Non anonymous thp exists only in allocation/free time. We
 		 * can't handle such a case correctly, so let's give it up.
-- 
2.25.1


* Re: [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-05-25  8:07         ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-05-25  9:09           ` Oscar Salvador
  2021-05-25 13:08             ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 9+ messages in thread
From: Oscar Salvador @ 2021-05-25  9:09 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Mike Kravetz, Naoya Horiguchi, Muchun Song, linux-mm,
	Andrew Morton, Michal Hocko, Tony Luck, linux-kernel

On Tue, May 25, 2021 at 08:07:07AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> OK, here's the current draft.
> 
> Thanks,
> Naoya Horiguchi
> 
> ---
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Tue, 18 May 2021 23:49:18 +0900
> Subject: [PATCH] mm,hwpoison: fix race with hugetlb page allocation
> 
> When hugetlb page fault (under overcommitting situation) and
> memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
> 
>     CPU0:                           CPU1:
> 
>                                     gather_surplus_pages()
>                                       page = alloc_surplus_huge_page()
>     memory_failure_hugetlb()
>       get_hwpoison_page(page)
>         __get_hwpoison_page(page)
>           get_page_unless_zero(page)
>                                       zero = put_page_testzero(page)
>                                       VM_BUG_ON_PAGE(!zero, page)
>                                       enqueue_huge_page(h, page)
>       put_page(page)
> 
> __get_hwpoison_page() only checks the page refcount before taking an
> additional one for memory error handling, which is wrong because there's
> a time window where compound pages have non-zero refcount during
> initialization.  So make __get_hwpoison_page() check page status a bit
> more for hugetlb pages.

I think that this changelog would benefit from some information about the new
!PageLRU && !__PageMovable check.

>  static int __get_hwpoison_page(struct page *page)
>  {
>  	struct page *head = compound_head(page);
> +	int ret = 0;
> +	bool hugetlb = false;
> +
> +	ret = get_hwpoison_huge_page(head, &hugetlb);
> +	if (hugetlb)
> +		return ret;
> +
> +	if (!PageLRU(head) && !__PageMovable(head))
> +		return 0;

This definitely needs a comment hinting the reader why we need to check for this.
AFAICS, this is to close the race where a page is about to be a hugetlb page soon,
so we do not go for get_page_unless_zero(), right?

From soft_offline_page's POV I __guess__ that's fine because we only deal with
pages we know about.
But what about memory_failure()? I think memory_failure() is less picky about that,
so it is okay to not take a refcount on that case?

-- 
Oscar Salvador
SUSE L3


* Re: [PATCH v5 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-05-25  9:09           ` Oscar Salvador
@ 2021-05-25 13:08             ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 9+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-05-25 13:08 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Mike Kravetz, Naoya Horiguchi, Muchun Song, linux-mm,
	Andrew Morton, Michal Hocko, Tony Luck, linux-kernel

On Tue, May 25, 2021 at 11:09:18AM +0200, Oscar Salvador wrote:
> On Tue, May 25, 2021 at 08:07:07AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > OK, here's the current draft.
> > 
> > Thanks,
> > Naoya Horiguchi
> > 
> > ---
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Date: Tue, 18 May 2021 23:49:18 +0900
> > Subject: [PATCH] mm,hwpoison: fix race with hugetlb page allocation
> > 
> > When hugetlb page fault (under overcommitting situation) and
> > memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
> > 
> >     CPU0:                           CPU1:
> > 
> >                                     gather_surplus_pages()
> >                                       page = alloc_surplus_huge_page()
> >     memory_failure_hugetlb()
> >       get_hwpoison_page(page)
> >         __get_hwpoison_page(page)
> >           get_page_unless_zero(page)
> >                                       zero = put_page_testzero(page)
> >                                       VM_BUG_ON_PAGE(!zero, page)
> >                                       enqueue_huge_page(h, page)
> >       put_page(page)
> > 
> > __get_hwpoison_page() only checks the page refcount before taking an
> > additional one for memory error handling, which is wrong because there's
> > a time window where compound pages have non-zero refcount during
> > initialization.  So make __get_hwpoison_page() check page status a bit
> > more for hugetlb pages.
> 
> I think that this changelog would benefit from some information about the new
> !PageLRU && !__PageMovable check.

OK, I'll update the changelog to mention this check.

> >  static int __get_hwpoison_page(struct page *page)
> >  {
> >  	struct page *head = compound_head(page);
> > +	int ret = 0;
> > +	bool hugetlb = false;
> > +
> > +	ret = get_hwpoison_huge_page(head, &hugetlb);
> > +	if (hugetlb)
> > +		return ret;
> > +
> > +	if (!PageLRU(head) && !__PageMovable(head))
> > +		return 0;
> 
> This definitely needs a comment hinting the reader why we need to check for this.
> AFAICS, this is to close the race where a page is about to be a hugetlb page soon,
> so we do not go for get_page_unless_zero(), right?

Right, I can't find any other reliable check showing that a page is not in
the middle of hugetlb page initialization.  I'll state why this check is needed.
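
Maybe something along these lines (just a draft of the wording, to be refined):

        /*
         * A page whose refcount we can safely take here is either a hugetlb
         * page in a well-defined state (handled above) or an LRU / movable
         * page.  Anything else may be a compound page still under
         * initialization (e.g. in the middle of hugetlb allocation), where
         * taking an extra refcount races with the allocator, so refuse it
         * here and let the caller retry later.
         */
        if (!PageLRU(head) && !__PageMovable(head))
                return 0;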

> 
> From soft_offline_page's POV I __guess__ that's fine because we only deal with
> pages we know about.
> But what about memory_failure()? I think memory_failure() is less picky about that,
> so it is okay to not take a refcount on that case?

Actually the coverage of page types for which memory errors can be handled
is the same between memory_failure() and soft_offline_page().  One
memory_failure()-specific role is to print out logs about error events even
if they can't be successfully handled; that could be affected by this change,
so we might need to adjust how memory_failure() uses the return value of
get_hwpoison_page().
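
Just to illustrate the kind of adjustment I mean (rough and untested, the
details need more thought):

        res = get_hwpoison_page(p, flags);
        if (res < 0) {
                /* still print a log even though handling is given up */
                action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
                return res;
        }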

Thanks,
Naoya Horiguchi

