linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v6 0/2] hwpoison: fix race with hugetlb page allocation
@ 2021-06-03 23:36 Naoya Horiguchi
  2021-06-03 23:36 ` [PATCH v6 1/2] mm,hwpoison: " Naoya Horiguchi
  2021-06-03 23:36 ` [PATCH v6 2/2] mm,hwpoison: make get_hwpoison_page() call get_any_page() Naoya Horiguchi
  0 siblings, 2 replies; 12+ messages in thread
From: Naoya Horiguchi @ 2021-06-03 23:36 UTC (permalink / raw)
  To: Oscar Salvador, Muchun Song, Mike Kravetz, linux-mm
  Cc: Andrew Morton, Michal Hocko, Tony Luck, Naoya Horiguchi, linux-kernel

Hi,

Here's the v6 of the fix of the race between get_hwpoison_page and hugetlb
page allocation.

Thanks,
Naoya Horiguchi

v5: https://lore.kernel.org/linux-mm/20210518231259.2553203-1-nao.horiguchi@gmail.com/T/#u
---
Summary:

Naoya Horiguchi (2):
      mm,hwpoison: fix race with hugetlb page allocation
      mm,hwpoison: make get_hwpoison_page() call get_any_page()

 include/linux/hugetlb.h |   6 ++
 mm/hugetlb.c            |  17 ++++
 mm/memory-failure.c     | 219 +++++++++++++++++++++++++++++-------------------
 3 files changed, 157 insertions(+), 85 deletions(-)


^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-06-03 23:36 [PATCH v6 0/2] hwpoison: fix race with hugetlb page allocation Naoya Horiguchi
@ 2021-06-03 23:36 ` Naoya Horiguchi
  2021-06-04 23:55   ` Mike Kravetz
  2021-08-12  4:28   ` Luck, Tony
  2021-06-03 23:36 ` [PATCH v6 2/2] mm,hwpoison: make get_hwpoison_page() call get_any_page() Naoya Horiguchi
  1 sibling, 2 replies; 12+ messages in thread
From: Naoya Horiguchi @ 2021-06-03 23:36 UTC (permalink / raw)
  To: Oscar Salvador, Muchun Song, Mike Kravetz, linux-mm
  Cc: Andrew Morton, Michal Hocko, Tony Luck, Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

When hugetlb page fault (under overcommitting situation) and
memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:

    CPU0:                           CPU1:

                                    gather_surplus_pages()
                                      page = alloc_surplus_huge_page()
    memory_failure_hugetlb()
      get_hwpoison_page(page)
        __get_hwpoison_page(page)
          get_page_unless_zero(page)
                                      zero = put_page_testzero(page)
                                      VM_BUG_ON_PAGE(!zero, page)
                                      enqueue_huge_page(h, page)
      put_page(page)

__get_hwpoison_page() only checks the page refcount before taking an
additional one for memory error handling, which is not enough because
there's a time window where compound pages have non-zero refcount during
hugetlb page initialization.

So make __get_hwpoison_page() check page status a bit more for hugetlb
pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags
under hugetlb_lock makes sure that the hugetlb page is not transitive.
It's notable that another new function, HWPoisonHandlable(), is helpful
to prevent a race against other transitive page states (like a generic
compound page just before PageHuge becomes true).

Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reported-by: Muchun Song <songmuchun@bytedance.com>
Cc: stable@vger.kernel.org # 5.12+
---
ChangeLog v6:
- defined HWPoisonHandlable(),
- updated comment and patch description.
---
 include/linux/hugetlb.h |  6 ++++++
 mm/hugetlb.c            | 15 +++++++++++++++
 mm/memory-failure.c     | 29 +++++++++++++++++++++++++++--
 3 files changed, 48 insertions(+), 2 deletions(-)

diff --git v5.13-rc2/include/linux/hugetlb.h v5.13-rc2_patched/include/linux/hugetlb.h
index b92f25ccef58..790ae618548d 100644
--- v5.13-rc2/include/linux/hugetlb.h
+++ v5.13-rc2_patched/include/linux/hugetlb.h
@@ -149,6 +149,7 @@ bool hugetlb_reserve_pages(struct inode *inode, long from, long to,
 long hugetlb_unreserve_pages(struct inode *inode, long start, long end,
 						long freed);
 bool isolate_huge_page(struct page *page, struct list_head *list);
+int get_hwpoison_huge_page(struct page *page, bool *hugetlb);
 void putback_active_hugepage(struct page *page);
 void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason);
 void free_huge_page(struct page *page);
@@ -339,6 +340,11 @@ static inline bool isolate_huge_page(struct page *page, struct list_head *list)
 	return false;
 }
 
+static inline int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+{
+	return 0;
+}
+
 static inline void putback_active_hugepage(struct page *page)
 {
 }
diff --git v5.13-rc2/mm/hugetlb.c v5.13-rc2_patched/mm/hugetlb.c
index 470f7b5b437e..0c1d87b1e303 100644
--- v5.13-rc2/mm/hugetlb.c
+++ v5.13-rc2_patched/mm/hugetlb.c
@@ -5847,6 +5847,21 @@ bool isolate_huge_page(struct page *page, struct list_head *list)
 	return ret;
 }
 
+int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
+{
+	int ret = 0;
+
+	*hugetlb = false;
+	spin_lock_irq(&hugetlb_lock);
+	if (PageHeadHuge(page)) {
+		*hugetlb = true;
+		if (HPageFreed(page) || HPageMigratable(page))
+			ret = get_page_unless_zero(page);
+	}
+	spin_unlock_irq(&hugetlb_lock);
+	return ret;
+}
+
 void putback_active_hugepage(struct page *page)
 {
 	spin_lock_irq(&hugetlb_lock);
diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c
index 86e9a2217716..0caedbeeed77 100644
--- v5.13-rc2/mm/memory-failure.c
+++ v5.13-rc2_patched/mm/memory-failure.c
@@ -1092,6 +1092,17 @@ static int page_action(struct page_state *ps, struct page *p,
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
 }
 
+/*
+ * Return true if a page type of a given page is supported by hwpoison
+ * mechanism (while handling could fail), otherwise false.  This function
+ * does not return true for hugetlb or device memory pages, so it's assumed
+ * to be called only in the context where we never have such pages.
+ */
+static inline bool HWPoisonHandlable(struct page *page)
+{
+	return PageLRU(page) || __PageMovable(page);
+}
+
 /**
  * __get_hwpoison_page() - Get refcount for memory error handling:
  * @page:	raw error page (hit by memory error)
@@ -1102,8 +1113,22 @@ static int page_action(struct page_state *ps, struct page *p,
 static int __get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
+	int ret = 0;
+	bool hugetlb = false;
+
+	ret = get_hwpoison_huge_page(head, &hugetlb);
+	if (hugetlb)
+		return ret;
+
+	/*
+	 * This check prevents from calling get_hwpoison_unless_zero()
+	 * for any unsupported type of page in order to reduce the risk of
+	 * unexpected races caused by taking a page refcount.
+	 */
+	if (!HWPoisonHandlable(head))
+		return 0;
 
-	if (!PageHuge(head) && PageTransHuge(head)) {
+	if (PageTransHuge(head)) {
 		/*
 		 * Non anonymous thp exists only in allocation/free time. We
 		 * can't handle such a case correctly, so let's give it up.
@@ -1160,7 +1185,7 @@ static int get_any_page(struct page *p, unsigned long flags)
 			ret = -EIO;
 		}
 	} else {
-		if (PageHuge(p) || PageLRU(p) || __PageMovable(p)) {
+		if (PageHuge(p) || HWPoisonHandlable(p)) {
 			ret = 1;
 		} else {
 			/*
-- 
2.25.1



^ permalink raw reply	[flat|nested] 12+ messages in thread

* [PATCH v6 2/2] mm,hwpoison: make get_hwpoison_page() call get_any_page()
  2021-06-03 23:36 [PATCH v6 0/2] hwpoison: fix race with hugetlb page allocation Naoya Horiguchi
  2021-06-03 23:36 ` [PATCH v6 1/2] mm,hwpoison: " Naoya Horiguchi
@ 2021-06-03 23:36 ` Naoya Horiguchi
  1 sibling, 0 replies; 12+ messages in thread
From: Naoya Horiguchi @ 2021-06-03 23:36 UTC (permalink / raw)
  To: Oscar Salvador, Muchun Song, Mike Kravetz, linux-mm
  Cc: Andrew Morton, Michal Hocko, Tony Luck, Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

__get_hwpoison_page() could fail to grab refcount by some race
condition, so it's helpful if we can handle it by retrying.  We already
have retry logic, so make get_hwpoison_page() call get_any_page() when
called from memory_failure().

As a result, get_hwpoison_page() can return negative values (i.e. error
code), so some callers are also changed to handle error cases.
soft_offline_page() does nothing for -EBUSY because that's enough and
users in userspace can easily handle it.  unpoison_memory() is also
unchanged because it's broken and need thorough fixes (will be done
later).

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
ChangeLog v6:
- some callers of get_hwpoison_page() have code for negative return values.
---
 mm/hugetlb.c        |   2 +
 mm/memory-failure.c | 194 +++++++++++++++++++++++++-------------------
 2 files changed, 111 insertions(+), 85 deletions(-)

diff --git v5.13-rc2/mm/hugetlb.c v5.13-rc2_patched/mm/hugetlb.c
index 0c1d87b1e303..5480cdcf62cd 100644
--- v5.13-rc2/mm/hugetlb.c
+++ v5.13-rc2_patched/mm/hugetlb.c
@@ -5857,6 +5857,8 @@ int get_hwpoison_huge_page(struct page *page, bool *hugetlb)
 		*hugetlb = true;
 		if (HPageFreed(page) || HPageMigratable(page))
 			ret = get_page_unless_zero(page);
+		else
+			ret = -EBUSY;
 	}
 	spin_unlock_irq(&hugetlb_lock);
 	return ret;
diff --git v5.13-rc2/mm/memory-failure.c v5.13-rc2_patched/mm/memory-failure.c
index 0caedbeeed77..359ca13b3cad 100644
--- v5.13-rc2/mm/memory-failure.c
+++ v5.13-rc2_patched/mm/memory-failure.c
@@ -1103,13 +1103,6 @@ static inline bool HWPoisonHandlable(struct page *page)
 	return PageLRU(page) || __PageMovable(page);
 }
 
-/**
- * __get_hwpoison_page() - Get refcount for memory error handling:
- * @page:	raw error page (hit by memory error)
- *
- * Return: return 0 if failed to grab the refcount, otherwise true (some
- * non-zero value.)
- */
 static int __get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
@@ -1154,15 +1147,6 @@ static int __get_hwpoison_page(struct page *page)
 	return 0;
 }
 
-/*
- * Safely get reference count of an arbitrary page.
- *
- * Returns 0 for a free page, 1 for an in-use page,
- * -EIO for a page-type we cannot handle and -EBUSY if we raced with an
- * allocation.
- * We only incremented refcount in case the page was already in-use and it
- * is a known type we can handle.
- */
 static int get_any_page(struct page *p, unsigned long flags)
 {
 	int ret = 0, pass = 0;
@@ -1172,50 +1156,77 @@ static int get_any_page(struct page *p, unsigned long flags)
 		count_increased = true;
 
 try_again:
-	if (!count_increased && !__get_hwpoison_page(p)) {
-		if (page_count(p)) {
-			/* We raced with an allocation, retry. */
-			if (pass++ < 3)
-				goto try_again;
-			ret = -EBUSY;
-		} else if (!PageHuge(p) && !is_free_buddy_page(p)) {
-			/* We raced with put_page, retry. */
+	if (!count_increased) {
+		ret = __get_hwpoison_page(p);
+		if (!ret) {
+			if (page_count(p)) {
+				/* We raced with an allocation, retry. */
+				if (pass++ < 3)
+					goto try_again;
+				ret = -EBUSY;
+			} else if (!PageHuge(p) && !is_free_buddy_page(p)) {
+				/* We raced with put_page, retry. */
+				if (pass++ < 3)
+					goto try_again;
+				ret = -EIO;
+			}
+			goto out;
+		} else if (ret == -EBUSY) {
+			/* We raced with freeing huge page to buddy, retry. */
 			if (pass++ < 3)
 				goto try_again;
-			ret = -EIO;
+			goto out;
 		}
+	}
+
+	if (PageHuge(p) || HWPoisonHandlable(p)) {
+		ret = 1;
 	} else {
-		if (PageHuge(p) || HWPoisonHandlable(p)) {
-			ret = 1;
-		} else {
-			/*
-			 * A page we cannot handle. Check whether we can turn
-			 * it into something we can handle.
-			 */
-			if (pass++ < 3) {
-				put_page(p);
-				shake_page(p, 1);
-				count_increased = false;
-				goto try_again;
-			}
+		/*
+		 * A page we cannot handle. Check whether we can turn
+		 * it into something we can handle.
+		 */
+		if (pass++ < 3) {
 			put_page(p);
-			ret = -EIO;
+			shake_page(p, 1);
+			count_increased = false;
+			goto try_again;
 		}
+		put_page(p);
+		ret = -EIO;
 	}
-
+out:
 	return ret;
 }
 
-static int get_hwpoison_page(struct page *p, unsigned long flags,
-			     enum mf_flags ctxt)
+/**
+ * get_hwpoison_page() - Get refcount for memory error handling
+ * @p:		Raw error page (hit by memory error)
+ * @flags:	Flags controlling behavior of error handling
+ *
+ * get_hwpoison_page() takes a page refcount of an error page to handle memory
+ * error on it, after checking that the error page is in a well-defined state
+ * (defined as a page-type we can successfully handle the memor error on it,
+ * such as LRU page and hugetlb page).
+ *
+ * Memory error handling could be triggered at any time on any type of page,
+ * so it's prone to race with typical memory management lifecycle (like
+ * allocation and free).  So to avoid such races, get_hwpoison_page() takes
+ * extra care for the error page's state (as done in __get_hwpoison_page()),
+ * and has some retry logic in get_any_page().
+ *
+ * Return: 0 on failure,
+ *         1 on success for in-use pages in a well-defined state,
+ *         -EIO for pages on which we can not handle memory errors,
+ *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
+ *         operations like allocation and free.
+ */
+static int get_hwpoison_page(struct page *p, unsigned long flags)
 {
 	int ret;
 
 	zone_pcp_disable(page_zone(p));
-	if (ctxt == MF_SOFT_OFFLINE)
-		ret = get_any_page(p, flags);
-	else
-		ret = __get_hwpoison_page(p);
+	ret = get_any_page(p, flags);
 	zone_pcp_enable(page_zone(p));
 
 	return ret;
@@ -1404,27 +1415,33 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 
 	num_poisoned_pages_inc();
 
-	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p, flags, 0)) {
-		/*
-		 * Check "filter hit" and "race with other subpage."
-		 */
-		lock_page(head);
-		if (PageHWPoison(head)) {
-			if ((hwpoison_filter(p) && TestClearPageHWPoison(p))
-			    || (p != head && TestSetPageHWPoison(head))) {
-				num_poisoned_pages_dec();
-				unlock_page(head);
-				return 0;
+	if (!(flags & MF_COUNT_INCREASED)) {
+		res = get_hwpoison_page(p, flags);
+		if (!res) {
+			/*
+			 * Check "filter hit" and "race with other subpage."
+			 */
+			lock_page(head);
+			if (PageHWPoison(head)) {
+				if ((hwpoison_filter(p) && TestClearPageHWPoison(p))
+				    || (p != head && TestSetPageHWPoison(head))) {
+					num_poisoned_pages_dec();
+					unlock_page(head);
+					return 0;
+				}
 			}
+			unlock_page(head);
+			res = MF_FAILED;
+			if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) {
+				page_ref_inc(p);
+				res = MF_RECOVERED;
+			}
+			action_result(pfn, MF_MSG_FREE_HUGE, res);
+			return res == MF_RECOVERED ? 0 : -EBUSY;
+		} else if (res < 0) {
+			action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
+			return -EBUSY;
 		}
-		unlock_page(head);
-		res = MF_FAILED;
-		if (!dissolve_free_huge_page(p) && take_page_off_buddy(p)) {
-			page_ref_inc(p);
-			res = MF_RECOVERED;
-		}
-		action_result(pfn, MF_MSG_FREE_HUGE, res);
-		return res == MF_RECOVERED ? 0 : -EBUSY;
 	}
 
 	lock_page(head);
@@ -1627,28 +1644,35 @@ int memory_failure(unsigned long pfn, int flags)
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
-	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p, flags, 0)) {
-		if (is_free_buddy_page(p)) {
-			if (take_page_off_buddy(p)) {
-				page_ref_inc(p);
-				res = MF_RECOVERED;
-			} else {
-				/* We lost the race, try again */
-				if (retry) {
-					ClearPageHWPoison(p);
-					num_poisoned_pages_dec();
-					retry = false;
-					goto try_again;
+	if (!(flags & MF_COUNT_INCREASED)) {
+		res = get_hwpoison_page(p, flags);
+		if (!res) {
+			if (is_free_buddy_page(p)) {
+				if (take_page_off_buddy(p)) {
+					page_ref_inc(p);
+					res = MF_RECOVERED;
+				} else {
+					/* We lost the race, try again */
+					if (retry) {
+						ClearPageHWPoison(p);
+						num_poisoned_pages_dec();
+						retry = false;
+						goto try_again;
+					}
+					res = MF_FAILED;
 				}
-				res = MF_FAILED;
+				action_result(pfn, MF_MSG_BUDDY, res);
+				res = res == MF_RECOVERED ? 0 : -EBUSY;
+			} else {
+				action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
+				res = -EBUSY;
 			}
-			action_result(pfn, MF_MSG_BUDDY, res);
-			res = res == MF_RECOVERED ? 0 : -EBUSY;
-		} else {
-			action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
+			goto unlock_mutex;
+		} else if (res < 0) {
+			action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
 			res = -EBUSY;
+			goto unlock_mutex;
 		}
-		goto unlock_mutex;
 	}
 
 	if (PageTransHuge(hpage)) {
@@ -1919,7 +1943,7 @@ int unpoison_memory(unsigned long pfn)
 		return 0;
 	}
 
-	if (!get_hwpoison_page(p, flags, 0)) {
+	if (!get_hwpoison_page(p, flags)) {
 		if (TestClearPageHWPoison(p))
 			num_poisoned_pages_dec();
 		unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
@@ -2135,7 +2159,7 @@ int soft_offline_page(unsigned long pfn, int flags)
 
 retry:
 	get_online_mems();
-	ret = get_hwpoison_page(page, flags, MF_SOFT_OFFLINE);
+	ret = get_hwpoison_page(page, flags);
 	put_online_mems();
 
 	if (ret > 0) {
-- 
2.25.1



^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-06-03 23:36 ` [PATCH v6 1/2] mm,hwpoison: " Naoya Horiguchi
@ 2021-06-04 23:55   ` Mike Kravetz
  2021-08-12  4:28   ` Luck, Tony
  1 sibling, 0 replies; 12+ messages in thread
From: Mike Kravetz @ 2021-06-04 23:55 UTC (permalink / raw)
  To: Naoya Horiguchi, Oscar Salvador, Muchun Song, linux-mm
  Cc: Andrew Morton, Michal Hocko, Tony Luck, Naoya Horiguchi, linux-kernel

On 6/3/21 4:36 PM, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> When hugetlb page fault (under overcommitting situation) and
> memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
> 
>     CPU0:                           CPU1:
> 
>                                     gather_surplus_pages()
>                                       page = alloc_surplus_huge_page()
>     memory_failure_hugetlb()
>       get_hwpoison_page(page)
>         __get_hwpoison_page(page)
>           get_page_unless_zero(page)
>                                       zero = put_page_testzero(page)
>                                       VM_BUG_ON_PAGE(!zero, page)
>                                       enqueue_huge_page(h, page)
>       put_page(page)
> 
> __get_hwpoison_page() only checks the page refcount before taking an
> additional one for memory error handling, which is not enough because
> there's a time window where compound pages have non-zero refcount during
> hugetlb page initialization.
> 
> So make __get_hwpoison_page() check page status a bit more for hugetlb
> pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags
> under hugetlb_lock makes sure that the hugetlb page is not transitive.
> It's notable that another new function, HWPoisonHandlable(), is helpful
> to prevent a race against other transitive page states (like a generic
> compound page just before PageHuge becomes true).

Thanks!

I believe this is good enough to prevent the race.  Since access to the
page is not synchronized in any way, it would still be "theoretically
possible" to race.  For example,

    CPU0:                           CPU1:

    HWPoisonHandlable true
    				    page freed to buddy
				    page allocated from buddy
				    page added to hugetlb pool
				    alloc_surplus_huge_page

But, this is very Very VERY unlikely to ever happen.

> 
> Fixes: ead07f6a867b ("mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling")
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Reported-by: Muchun Song <songmuchun@bytedance.com>
> Cc: stable@vger.kernel.org # 5.12+
> ---
> ChangeLog v6:
> - defined HWPoisonHandlable(),
> - updated comment and patch description.
> ---
>  include/linux/hugetlb.h |  6 ++++++
>  mm/hugetlb.c            | 15 +++++++++++++++
>  mm/memory-failure.c     | 29 +++++++++++++++++++++++++++--
>  3 files changed, 48 insertions(+), 2 deletions(-)

Acked-by: Mike Kravetz <mike.kravetz@oracle.com>
-- 
Mike Kravetz


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-06-03 23:36 ` [PATCH v6 1/2] mm,hwpoison: " Naoya Horiguchi
  2021-06-04 23:55   ` Mike Kravetz
@ 2021-08-12  4:28   ` Luck, Tony
  2021-08-12  9:03     ` HORIGUCHI NAOYA(堀口 直也)
  1 sibling, 1 reply; 12+ messages in thread
From: Luck, Tony @ 2021-08-12  4:28 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Oscar Salvador, Muchun Song, Mike Kravetz, linux-mm,
	Andrew Morton, Michal Hocko, Naoya Horiguchi, linux-kernel

On Fri, Jun 04, 2021 at 08:36:31AM +0900, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> When hugetlb page fault (under overcommitting situation) and
> memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
> 
>     CPU0:                           CPU1:
> 
>                                     gather_surplus_pages()
>                                       page = alloc_surplus_huge_page()
>     memory_failure_hugetlb()
>       get_hwpoison_page(page)
>         __get_hwpoison_page(page)
>           get_page_unless_zero(page)
>                                       zero = put_page_testzero(page)
>                                       VM_BUG_ON_PAGE(!zero, page)
>                                       enqueue_huge_page(h, page)
>       put_page(page)
> 
> __get_hwpoison_page() only checks the page refcount before taking an
> additional one for memory error handling, which is not enough because
> there's a time window where compound pages have non-zero refcount during
> hugetlb page initialization.
> 
> So make __get_hwpoison_page() check page status a bit more for hugetlb
> pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags
> under hugetlb_lock makes sure that the hugetlb page is not transitive.
> It's notable that another new function, HWPoisonHandlable(), is helpful
> to prevent a race against other transitive page states (like a generic
> compound page just before PageHuge becomes true).

I'm seeing some strange results when doing a simple injection/recovery.

Current upstream often fails to offline the page with messages like:

	"high-order kernel page"
or
	"unknown page"

Things were working in v5.12. Broken in v5.13.

Bisect says that:

25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation")

is the culprit (though it is possible that there is more than one
issue ... failure symptoms changed a bit during the bisection).

This commit doesn't revert automatically from upstream. But it
does revert from v5.13. Running with this reverted from v5.13
gives kernel that recovers normally[1] from hundreds of consecutive
error injections.

-Tony

[1] Almost normally. My test catches SIGBUS and prints the virtual
address from the siginfo_t structure. Sometimes the address is correct
other times it is NULL.


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-08-12  4:28   ` Luck, Tony
@ 2021-08-12  9:03     ` HORIGUCHI NAOYA(堀口 直也)
  2021-08-12 15:25       ` Luck, Tony
  0 siblings, 1 reply; 12+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-08-12  9:03 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Naoya Horiguchi, Oscar Salvador, Muchun Song, Mike Kravetz,
	linux-mm, Andrew Morton, Michal Hocko, linux-kernel

On Wed, Aug 11, 2021 at 09:28:13PM -0700, Luck, Tony wrote:
> On Fri, Jun 04, 2021 at 08:36:31AM +0900, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > When hugetlb page fault (under overcommitting situation) and
> > memory_failure() race, VM_BUG_ON_PAGE() is triggered by the following race:
> > 
> >     CPU0:                           CPU1:
> > 
> >                                     gather_surplus_pages()
> >                                       page = alloc_surplus_huge_page()
> >     memory_failure_hugetlb()
> >       get_hwpoison_page(page)
> >         __get_hwpoison_page(page)
> >           get_page_unless_zero(page)
> >                                       zero = put_page_testzero(page)
> >                                       VM_BUG_ON_PAGE(!zero, page)
> >                                       enqueue_huge_page(h, page)
> >       put_page(page)
> > 
> > __get_hwpoison_page() only checks the page refcount before taking an
> > additional one for memory error handling, which is not enough because
> > there's a time window where compound pages have non-zero refcount during
> > hugetlb page initialization.
> > 
> > So make __get_hwpoison_page() check page status a bit more for hugetlb
> > pages with get_hwpoison_huge_page(). Checking hugetlb-specific flags
> > under hugetlb_lock makes sure that the hugetlb page is not transitive.
> > It's notable that another new function, HWPoisonHandlable(), is helpful
> > to prevent a race against other transitive page states (like a generic
> > compound page just before PageHuge becomes true).
> 
> I'm seeing some strange results when doing a simple injection/recovery.
> 
> Current upstream often fails to offline the page with messages like:
> 
> 	"high-order kernel page"
> or
> 	"unknown page"
> 
> Things were working in v5.12. Broken in v5.13.
> 
> Bisect says that:
> 
> 25182f05ffed ("mm,hwpoison: fix race with hugetlb page allocation")
> 
> is the culprit (though it is possible that there is more than one
> issue ... failure symptoms changed a bit during the bisection).

Sorry for the failures. I think that the following patch (and dependencies)
should solve the issue.
https://lore.kernel.org/linux-mm/20210614021212.223326-6-nao.horiguchi@gmail.com/.
I'll submit the update (maybe the patchset will be smaller by feedbacks)
later soon.

Thanks,
Naoya Horiguchi

> 
> This commit doesn't revert automatically from upstream. But it
> does revert from v5.13. Running with this reverted from v5.13
> gives kernel that recovers normally[1] from hundreds of consecutive
> error injections.
> 
> -Tony
> 
> [1] Almost normally. My test catches SIGBUS and prints the virtual
> address from the siginfo_t structure. Sometimes the address is correct
> other times it is NULL.

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-08-12  9:03     ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-08-12 15:25       ` Luck, Tony
  2021-08-13  6:29         ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 12+ messages in thread
From: Luck, Tony @ 2021-08-12 15:25 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, Oscar Salvador, Muchun Song, Mike Kravetz,
	linux-mm, Andrew Morton, Michal Hocko, linux-kernel

On Thu, Aug 12, 2021 at 09:03:04AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> Sorry for the failures. I think that the following patch (and dependencies)
> should solve the issue.
> https://lore.kernel.org/linux-mm/20210614021212.223326-6-nao.horiguchi@gmail.com/.
> I'll submit the update (maybe the patchset will be smaller by feedbacks)
> later soon.

I was uncertain about which dependencies you meant. So I followed
the advice in the cover letter for the patch series containing that
patch and did:

$ git fetch https://github.com/nhoriguchi/linux hwpoison

This kernel still has some odd issues and the poison page
did not get unmapped from my test user application.

See git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
for my test program. In this case I was just running with default settings
to inject an error into a user data page and then consume it.

Here's the dmesg output. There are multiple calls to memory_failure()
because the poison address is signalled both by the memory controller
(CMCI with UCNA signature) and the DCU (#MC with SRAR signature).

Note that the first message says: "recovery action for unknown page: Ignored"

[   70.331253] EINJ: Error INJection is initialized.
[   76.949490] process '/aegl/ras-tools/einj_mem_uc' started with executable stack
[   77.481846] Disabling lock debugging due to kernel taint
[   77.482004] mce: [Hardware Error]: Machine check events logged
[   77.487176] mce: Uncorrected hardware memory error in user-access at 7e025e400
[   77.493225] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
[   77.508704] {1}[Hardware Error]: event severity: recoverable
[   77.514361] {1}[Hardware Error]:  Error 0, type: recoverable
[   77.520011] {1}[Hardware Error]:  fru_text: Card01, ChnG, DIMM0
[   77.525921] {1}[Hardware Error]:   section_type: memory error
[   77.531659] {1}[Hardware Error]:   error_status: 0x0000000000000400
[   77.537914] {1}[Hardware Error]:   physical_address: 0x00000007e025e400
[   77.544518] {1}[Hardware Error]:   node: 0 card: 6 module: 0 rank: 1 bank: 8 device: 0 row: 15105 column: 896
[   77.554503] {1}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
[   77.561548] {1}[Hardware Error]:   DIMM location: NODE 3 CPU0_DIMM_G1
[   77.568135] Memory failure: 0x7e025e: recovery action for unknown page: Ignored
[   77.575445] Memory failure: 0x7e025e: already hardware poisoned
[   77.627894] EDAC skx MC3: HANDLING MCE MEMORY ERROR
[   77.633600] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 25: 0xac00000200a00090
[   77.633601] EDAC skx MC3: TSC 0x18d57ce28f4f25
[   77.633602] EDAC skx MC3: ADDR 0x7e025e400
[   77.633603] EDAC skx MC3: MISC 0x9000201d809c086
[   77.633603] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1628780833 SOCKET 0 APIC 0x0
[   77.633608] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7e025e offset:0x400 grain:32 -  err_code:0x00a0:0x0090  SystemAddress:0x7e025e400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0xec04bc00 ChannelId:0x0 RankAddress:0x76025c00 PhysicalRankId:0x1 DimmSlotId:0x0 Row:0x3b01 Column:0x380 Bank:0x0 BankGroup:0x2 ChipSelect:0x1 ChipId:0x0)
[   77.633611] Memory failure: 0x7e025e: Sending SIGBUS to einj_mem_uc:12283 due to hardware memory corruption
[   77.668827] mce: [Hardware Error]: Machine check events logged
[   77.678605] Memory failure: 0x7e025e: already hardware poisoned
[   77.736685] EDAC skx MC3: HANDLING MCE MEMORY ERROR
[   77.742392] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
[   77.742394] EDAC skx MC3: TSC 0x0
[   77.742394] EDAC skx MC3: ADDR 0x7e025e400
[   77.742395] EDAC skx MC3: MISC 0x0
[   77.742395] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1628780833 SOCKET 0 APIC 0x0
[   77.742397] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7e025e offset:0x400 grain:32 -  err_code:0x0000:0x009f  SystemAddress:0x7e025e400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0xec04bc00 ChannelId:0x0 RankAddress:0x76025c00 PhysicalRankId:0x1 DimmSlotId:0x0 Row:0x3b01 Column:0x380 Bank:0x0 BankGroup:0x2 ChipSelect:0x1 ChipId:0x0)
[   77.777612] Memory failure: 0x7e025e: already hardware poisoned

-Tony


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-08-12 15:25       ` Luck, Tony
@ 2021-08-13  6:29         ` HORIGUCHI NAOYA(堀口 直也)
  2021-08-13 15:07           ` Luck, Tony
  0 siblings, 1 reply; 12+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-08-13  6:29 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Naoya Horiguchi, Oscar Salvador, Muchun Song, Mike Kravetz,
	linux-mm, Andrew Morton, Michal Hocko, linux-kernel

On Thu, Aug 12, 2021 at 08:25:48AM -0700, Luck, Tony wrote:
> On Thu, Aug 12, 2021 at 09:03:04AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > Sorry for the failures. I think that the following patch (and dependencies)
> > should solve the issue.
> > https://lore.kernel.org/linux-mm/20210614021212.223326-6-nao.horiguchi@gmail.com/.
> > I'll submit the update (maybe the patchset will be smaller by feedbacks)
> > later soon.
>
> I was uncertain about which dependencies you meant. So I followed
> the advice in the cover letter for the patch series containing that
> patch and did:
>
> $ git fetch https://github.com/nhoriguchi/linux hwpoison
>
> This kernel still has some odd issues and the poison page
> did not get unmapped from my test user application.
>
> See git://git.kernel.org/pub/scm/linux/kernel/git/aegl/ras-tools.git
> for my test program. In this case I was just running with default settings
> to inject an error into a user data page and then consume it.
>
> Here's the dmesg output. There are multiple calls to memory_failure()
> because the poison address is signalled both by the memory controller
> (CMCI with UCNA signature) and the DCU (#MC with SRAR signature).

Thanks for the details.

The following dmesg implies that the failure happened in einj_mem_uc
which has 13 sub-testcases (specified by "testname"). In which one
the "unknonw page" event was triggered?

>
> Note that the first message says: "recovery action for unknown page: Ignored"
>
> [   70.331253] EINJ: Error INJection is initialized.
> [   76.949490] process '/aegl/ras-tools/einj_mem_uc' started with executable stack
> [   77.481846] Disabling lock debugging due to kernel taint
> [   77.482004] mce: [Hardware Error]: Machine check events logged
> [   77.487176] mce: Uncorrected hardware memory error in user-access at 7e025e400
> [   77.493225] {1}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 0
> [   77.508704] {1}[Hardware Error]: event severity: recoverable
> [   77.514361] {1}[Hardware Error]:  Error 0, type: recoverable
> [   77.520011] {1}[Hardware Error]:  fru_text: Card01, ChnG, DIMM0
> [   77.525921] {1}[Hardware Error]:   section_type: memory error
> [   77.531659] {1}[Hardware Error]:   error_status: 0x0000000000000400

Accourding to https://www.kernel.org/doc/html/latest/firmware-guide/acpi/apei/einj.html
this error status means "Platform Uncorrectable non-fatal", so memory_failure()
is called with MF_ACTION_REQUIRED set?

> [   77.537914] {1}[Hardware Error]:   physical_address: 0x00000007e025e400
> [   77.544518] {1}[Hardware Error]:   node: 0 card: 6 module: 0 rank: 1 bank: 8 device: 0 row: 15105 column: 896
> [   77.554503] {1}[Hardware Error]:   error_type: 4, single-symbol chipkill ECC
> [   77.561548] {1}[Hardware Error]:   DIMM location: NODE 3 CPU0_DIMM_G1
> [   77.568135] Memory failure: 0x7e025e: recovery action for unknown page: Ignored

This "unknown page" should come from the following line in memory_failure():

   int memory_failure(unsigned long pfn, int flags)
   ...
           if (!(flags & MF_COUNT_INCREASED)) {
                   res = get_hwpoison_page(p, flags);
                   if (!res) {
                           ...  // code for HWPoisonHandlable pages.
                   } else if (res < 0) {
                           action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED); /// HERE
                           res = -EBUSY;
                           goto unlock_mutex;
                   }
           }

This path is chosen when HWPoisonHandlable() returns false in get_hwpoison_page()
and HWPoisonHandlable() is like this in the current version:

    static inline bool HWPoisonHandlable(struct page *page)
    {
            return PageLRU(page) || __PageMovable(page) ||
                    PageSlab(page) || PageTable(page) || PageReserved(page);
    }

So I wonder why HWPoisonHandlable() returned false in your case ("to inject
an error into a user data page and then consume it" sounds a basic testcase
so it's expected to be detected by PageLRU() or __PageMovable().)

I think that we are going to the direction of stopping taking refcount of
error pages blindly to reduce the risk of critical races.  In order to
improve HWPoisonHandlable() check instead of reverting it, I'd like to
understand error page's state in the failure case.  Could you try testing
again with inserting a dump_page() like below?  (I'll try to reproduce by
myself somehow, but it might be hard because I have no machine with firmware
supporting EINJ.)

    @@ -1703,6 +1703,7 @@ int memory_failure(unsigned long pfn, int flags)
                            goto unlock_mutex;
                    } else if (res < 0) {
                            action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
    +                       dump_page(p, "hwpoison unknown page");
                            res = -EBUSY;
                            goto unlock_mutex;
                    }

Thanks,
Naoya Horiguchi

> [   77.575445] Memory failure: 0x7e025e: already hardware poisoned
> [   77.627894] EDAC skx MC3: HANDLING MCE MEMORY ERROR
> [   77.633600] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 25: 0xac00000200a00090
> [   77.633601] EDAC skx MC3: TSC 0x18d57ce28f4f25
> [   77.633602] EDAC skx MC3: ADDR 0x7e025e400
> [   77.633603] EDAC skx MC3: MISC 0x9000201d809c086
> [   77.633603] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1628780833 SOCKET 0 APIC 0x0
> [   77.633608] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7e025e offset:0x400 grain:32 -  err_code:0x00a0:0x0090  SystemAddress:0x7e025e400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0xec04bc00 ChannelId:0x0 RankAddress:0x76025c00 PhysicalRankId:0x1 DimmSlotId:0x0 Row:0x3b01 Column:0x380 Bank:0x0 BankGroup:0x2 ChipSelect:0x1 ChipId:0x0)
> [   77.633611] Memory failure: 0x7e025e: Sending SIGBUS to einj_mem_uc:12283 due to hardware memory corruption
> [   77.668827] mce: [Hardware Error]: Machine check events logged
> [   77.678605] Memory failure: 0x7e025e: already hardware poisoned
> [   77.736685] EDAC skx MC3: HANDLING MCE MEMORY ERROR
> [   77.742392] EDAC skx MC3: CPU 0: Machine Check Event: 0x0 Bank 255: 0xb40000000000009f
> [   77.742394] EDAC skx MC3: TSC 0x0
> [   77.742394] EDAC skx MC3: ADDR 0x7e025e400
> [   77.742395] EDAC skx MC3: MISC 0x0
> [   77.742395] EDAC skx MC3: PROCESSOR 0:0x606a6 TIME 1628780833 SOCKET 0 APIC 0x0
> [   77.742397] EDAC MC3: 1 UE memory read error on CPU_SrcID#0_MC#3_Chan#0_DIMM#0 (channel:0 slot:0 page:0x7e025e offset:0x400 grain:32 -  err_code:0x0000:0x009f  SystemAddress:0x7e025e400 ProcessorSocketId:0x0 MemoryControllerId:0x3 ChannelAddress:0xec04bc00 ChannelId:0x0 RankAddress:0x76025c00 PhysicalRankId:0x1 DimmSlotId:0x0 Row:0x3b01 Column:0x380 Bank:0x0 BankGroup:0x2 ChipSelect:0x1 ChipId:0x0)
> [   77.777612] Memory failure: 0x7e025e: already hardware poisoned
>
> -Tony
>

^ permalink raw reply	[flat|nested] 12+ messages in thread

* RE: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-08-13  6:29         ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-08-13 15:07           ` Luck, Tony
  2021-08-16 17:12             ` Naoya Horiguchi
  0 siblings, 1 reply; 12+ messages in thread
From: Luck, Tony @ 2021-08-13 15:07 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, Oscar Salvador, Muchun Song, Mike Kravetz,
	linux-mm, Andrew Morton, Michal Hocko, linux-kernel

I'm running the default case from my einj_mem_uc test. That just:

1) allocates a page using:

	mmap(NULL, pagesize, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANON, -1, 0);

2) fills the page with random data (to make sure it has been allocated, and that the kernel can't
do KSM tricks to share this physical page with some other user).

3) injects the error at a 1KB offset within the page.

4) does a memory read of the poison address.


>                            action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
>   +                       dump_page(p, "hwpoison unknown page");
>                            res = -EBUSY;
>                            goto unlock_mutex;
>                    }

I added that patch against upstream (v5.14-rc5).  Here's the dump. The "pfn" matches the physical address where I injected,
and it has the hwpoison flag bit that was set early in memory_failure() --- so this is the right page.

[   79.368212] Memory failure: 0x623889: recovery action for unknown page: Ignored
[   79.375525] page:0000000065ad9479 refcount:3 mapcount:1 mapping:00000000a4ac843b index:0x0 pfn:0x623889
[   79.384909] memcg:ff40a569f2966000
[   79.388313] aops:shmem_aops ino:4c00 dentry name:"dev/zero"
[   79.393896] flags: 0x17ffffc088000c(uptodate|dirty|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0x1fffff)
[   79.403455] raw: 0017ffffc088000c 0000000000000000 dead000000000122 ff40a569f45a7160
[   79.411191] raw: 0000000000000000 0000000000000000 0000000300000000 ff40a569f2966000
[   79.418931] page dumped because: hwpoison unknown page


-Tony

^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-08-13 15:07           ` Luck, Tony
@ 2021-08-16 17:12             ` Naoya Horiguchi
  2021-08-16 17:56               ` Luck, Tony
  0 siblings, 1 reply; 12+ messages in thread
From: Naoya Horiguchi @ 2021-08-16 17:12 UTC (permalink / raw)
  To: Luck, Tony
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Naoya Horiguchi, Oscar Salvador, Muchun Song, Mike Kravetz,
	linux-mm, Andrew Morton, Michal Hocko, linux-kernel

On Fri, Aug 13, 2021 at 03:07:20PM +0000, Luck, Tony wrote:
> I'm running the default case from my einj_mem_uc test. That just:
> 
> 1) allocates a page using:
> 
> 	mmap(NULL, pagesize, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANON, -1, 0);
> 
> 2) fills the page with random data (to make sure it has been allocated, and that the kernel can't
> do KSM tricks to share this physical page with some other user).
> 
> 3) injects the error at a 1KB offset within the page.
> 
> 4) does a memory read of the poison address.
> 
> 
> >                            action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
> >   +                       dump_page(p, "hwpoison unknown page");
> >                            res = -EBUSY;
> >                            goto unlock_mutex;
> >                    }
> 
> I added that patch against upstream (v5.14-rc5).  Here's the dump. The "pfn" matches the physical address where I injected,
> and it has the hwpoison flag bit that was set early in memory_failure() --- so this is the right page.
> 
> [   79.368212] Memory failure: 0x623889: recovery action for unknown page: Ignored
> [   79.375525] page:0000000065ad9479 refcount:3 mapcount:1 mapping:00000000a4ac843b index:0x0 pfn:0x623889
> [   79.384909] memcg:ff40a569f2966000
> [   79.388313] aops:shmem_aops ino:4c00 dentry name:"dev/zero"
> [   79.393896] flags: 0x17ffffc088000c(uptodate|dirty|swapbacked|hwpoison|node=0|zone=2|lastcpupid=0x1fffff)
> [   79.403455] raw: 0017ffffc088000c 0000000000000000 dead000000000122 ff40a569f45a7160
> [   79.411191] raw: 0000000000000000 0000000000000000 0000000300000000 ff40a569f2966000
> [   79.418931] page dumped because: hwpoison unknown page

Thank you for your help.

This dump indicates that HWPoisonHandlable() returned false due to
the lack of PG_lru flag.  In older code before 5.13, get_any_page() does
retry with shake_page(), but does not since 5.13, which seems to me
the root cause of the issue. So my suggestion is to call shake_page()
when HWPoisonHandlable() is false.

Could you try checking that the following diff fixes the issue?
I could still have better fix (like inserting shake_page() to other
retry paths in get_any_page()), but the below is the minimum one.

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 76cc53b2999a..3e770e4f259e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1146,7 +1146,7 @@ static int __get_hwpoison_page(struct page *page)
         * unexpected races caused by taking a page refcount.
         */
        if (!HWPoisonHandlable(head))
-               return 0;
+               return -EBUSY;

        if (PageTransHuge(head)) {
                /*
@@ -1199,9 +1199,14 @@ static int get_any_page(struct page *p, unsigned long flags)
                        }
                        goto out;
                } else if (ret == -EBUSY) {
-                       /* We raced with freeing huge page to buddy, retry. */
-                       if (pass++ < 3)
+                       /*
+                        * We raced with (possibly temporary) unhandlable
+                        * page, retry.
+                        */
+                       if (pass++ < 3) {
+                               shake_page(p, 1);
                                goto try_again;
+                       }
                        goto out;
                }
        }


Thanks,
Naoya Horiguchi


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-08-16 17:12             ` Naoya Horiguchi
@ 2021-08-16 17:56               ` Luck, Tony
  2021-08-17  5:40                 ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 12+ messages in thread
From: Luck, Tony @ 2021-08-16 17:56 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Naoya Horiguchi, Oscar Salvador, Muchun Song, Mike Kravetz,
	linux-mm, Andrew Morton, Michal Hocko, linux-kernel

On Tue, Aug 17, 2021 at 02:12:07AM +0900, Naoya Horiguchi wrote:
> This dump indicates that HWPoisonHandlable() returned false due to
> the lack of PG_lru flag.  In older code before 5.13, get_any_page() does
> retry with shake_page(), but does not since 5.13, which seems to me
> the root cause of the issue. So my suggestion is to call shake_page()
> when HWPoisonHandlable() is false.
> 
> Could you try checking that the following diff fixes the issue?
> I could still have better fix (like inserting shake_page() to other
> retry paths in get_any_page()), but the below is the minimum one.

Tried it ... and it works! Injected and recovered from a thousand
errors without seeing any problems.

-Tony

P.S. Somewhere in the mail system your patch arrived with <TAB>s
changed to spaces. Here's what I applied to v5.14-rc6 (hopefully with
TABS preserved) ... just in case anyone else is following along with
this thread and wants to try some tests.

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index eefd823deb67..aa6592540f17 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1146,7 +1146,7 @@ static int __get_hwpoison_page(struct page *page)
 	 * unexpected races caused by taking a page refcount.
 	 */
 	if (!HWPoisonHandlable(head))
-		return 0;
+		return -EBUSY;
 
 	if (PageTransHuge(head)) {
 		/*
@@ -1199,9 +1199,14 @@ static int get_any_page(struct page *p, unsigned long flags)
 			}
 			goto out;
 		} else if (ret == -EBUSY) {
-			/* We raced with freeing huge page to buddy, retry. */
-			if (pass++ < 3)
+			/*
+			 * We raced with (possibly temporary) unhandlable
+			 * page, retry.
+			 */
+			if (pass++ < 3) {
+				shake_page(p, 1);
 				goto try_again;
+			}
 			goto out;
 		}
 	}


^ permalink raw reply	[flat|nested] 12+ messages in thread

* Re: [PATCH v6 1/2] mm,hwpoison: fix race with hugetlb page allocation
  2021-08-16 17:56               ` Luck, Tony
@ 2021-08-17  5:40                 ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 12+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-08-17  5:40 UTC (permalink / raw)
  To: Luck, Tony
  Cc: Naoya Horiguchi, Naoya Horiguchi, Oscar Salvador, Muchun Song,
	Mike Kravetz, linux-mm, Andrew Morton, Michal Hocko,
	linux-kernel

On Mon, Aug 16, 2021 at 10:56:46AM -0700, Luck, Tony wrote:
> On Tue, Aug 17, 2021 at 02:12:07AM +0900, Naoya Horiguchi wrote:
> > This dump indicates that HWPoisonHandlable() returned false due to
> > the lack of PG_lru flag.  In older code before 5.13, get_any_page() does
> > retry with shake_page(), but does not since 5.13, which seems to me
> > the root cause of the issue. So my suggestion is to call shake_page()
> > when HWPoisonHandlable() is false.
> > 
> > Could you try checking that the following diff fixes the issue?
> > I could still have better fix (like inserting shake_page() to other
> > retry paths in get_any_page()), but the below is the minimum one.
> 
> Tried it ... and it works! Injected and recovered from a thousand
> errors without seeing any problems.

Thank you for testing, I've submitted a patch just now.

- Naoya

^ permalink raw reply	[flat|nested] 12+ messages in thread

end of thread, other threads:[~2021-08-17  5:40 UTC | newest]

Thread overview: 12+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-03 23:36 [PATCH v6 0/2] hwpoison: fix race with hugetlb page allocation Naoya Horiguchi
2021-06-03 23:36 ` [PATCH v6 1/2] mm,hwpoison: " Naoya Horiguchi
2021-06-04 23:55   ` Mike Kravetz
2021-08-12  4:28   ` Luck, Tony
2021-08-12  9:03     ` HORIGUCHI NAOYA(堀口 直也)
2021-08-12 15:25       ` Luck, Tony
2021-08-13  6:29         ` HORIGUCHI NAOYA(堀口 直也)
2021-08-13 15:07           ` Luck, Tony
2021-08-16 17:12             ` Naoya Horiguchi
2021-08-16 17:56               ` Luck, Tony
2021-08-17  5:40                 ` HORIGUCHI NAOYA(堀口 直也)
2021-06-03 23:36 ` [PATCH v6 2/2] mm,hwpoison: make get_hwpoison_page() call get_any_page() Naoya Horiguchi

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).