linux-mm.kvack.org archive mirror
* [PATCH v1 0/5] mm, hwpoison: enable 1GB hugepage support
@ 2022-06-02  5:06 Naoya Horiguchi
  2022-06-02  5:06 ` [PATCH v1 1/5] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page Naoya Horiguchi
                   ` (4 more replies)
  0 siblings, 5 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2022-06-02  5:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

Hi,

This patchset enables memory error handling on 1GB hugepage.

"Save raw error page" patch (1/4 of patchset [1]) is necessary, so it's
included in this series (the remaining part of hotplug related things are
still in progress).  Patch 2/5 solves issues in a corner case of hugepage
handling, which might not be the main target of this patchset, but slightly
related.  It was posted separately [2] but depends on 1/5, so I group them
together.

Patch 3/5 to 5/5 are main part of this series and fix a small issue about
handling 1GB hugepage, which I hope will be workable.

[1]: https://lore.kernel.org/linux-mm/20220427042841.678351-1-naoya.horiguchi@linux.dev/T/#u

[2]: https://lore.kernel.org/linux-mm/20220511151955.3951352-1-naoya.horiguchi@linux.dev/T/

Please let me know if you have any suggestions or comments.

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (5):
      mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page
      mm,hwpoison: set PG_hwpoison for busy hugetlb pages
      mm, hwpoison: make __page_handle_poison returns int
      mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
      mm, hwpoison: enable memory error handling on 1GB hugepage

 include/linux/hugetlb.h | 24 ++++++++++++++++++++++++
 include/linux/mm.h      |  2 +-
 include/ras/ras_event.h |  1 -
 mm/hugetlb.c            |  9 +++++++++
 mm/memory-failure.c     | 48 +++++++++++++++++++++---------------------------
 5 files changed, 55 insertions(+), 29 deletions(-)


^ permalink raw reply	[flat|nested] 19+ messages in thread

* [PATCH v1 1/5] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page
  2022-06-02  5:06 [PATCH v1 0/5] mm, hwpoison: enable 1GB hugepage support Naoya Horiguchi
@ 2022-06-02  5:06 ` Naoya Horiguchi
  2022-06-07 12:45   ` Miaohe Lin
  2022-06-07 13:04   ` David Hildenbrand
  2022-06-02  5:06 ` [PATCH v1 2/5] mm,hwpoison: set PG_hwpoison for busy hugetlb pages Naoya Horiguchi
                   ` (3 subsequent siblings)
  4 siblings, 2 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2022-06-02  5:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

When handling a memory error on a hugetlb page, the error handler tries to
dissolve it and turn it into 4kB pages.  If it's successfully dissolved,
the PageHWPoison flag is moved to the raw error page, so that's all right.
However, dissolving sometimes fails, and then the error page is left as a
hwpoisoned hugepage.  It would be useful if we could retry dissolving it to
save the healthy pages, but that's not possible now because the information
about where the raw error page is located is lost.

Use the private field of a tail page to keep that information.  The code
path that shrinks the hugepage pool uses this info to retry the delayed
dissolve.  Only one hwpoison page per hugepage is kept for now, which might
be OK because it's simple and multiple hwpoison pages in one hugepage
should be rare.  But it can be extended in the future.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
ChangeLog since previous post on 4/27:
- fixed typo in patch description (by Miaohe)
- fixed config value in #ifdef statement (by Miaohe)
- added sentences about "multiple hwpoison pages" scenario in patch
  description
---
 include/linux/hugetlb.h | 24 ++++++++++++++++++++++++
 mm/hugetlb.c            |  9 +++++++++
 mm/memory-failure.c     |  2 ++
 3 files changed, 35 insertions(+)

diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
index ac2a1d758a80..a5341a3a0d4b 100644
--- a/include/linux/hugetlb.h
+++ b/include/linux/hugetlb.h
@@ -42,6 +42,9 @@ enum {
 	SUBPAGE_INDEX_CGROUP,		/* reuse page->private */
 	SUBPAGE_INDEX_CGROUP_RSVD,	/* reuse page->private */
 	__MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
+#endif
+#ifdef CONFIG_MEMORY_FAILURE
+	SUBPAGE_INDEX_HWPOISON,
 #endif
 	__NR_USED_SUBPAGE,
 };
@@ -784,6 +787,27 @@ extern int dissolve_free_huge_page(struct page *page);
 extern int dissolve_free_huge_pages(unsigned long start_pfn,
 				    unsigned long end_pfn);
 
+#ifdef CONFIG_MEMORY_FAILURE
+/*
+ * pointer to raw error page is located in hpage[SUBPAGE_INDEX_HWPOISON].private
+ */
+static inline struct page *hugetlb_page_hwpoison(struct page *hpage)
+{
+	return (void *)page_private(hpage + SUBPAGE_INDEX_HWPOISON);
+}
+
+static inline void hugetlb_set_page_hwpoison(struct page *hpage,
+					struct page *page)
+{
+	set_page_private(hpage + SUBPAGE_INDEX_HWPOISON, (unsigned long)page);
+}
+#else
+static inline struct page *hugetlb_page_hwpoison(struct page *hpage)
+{
+	return NULL;
+}
+#endif
+
 #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
 #ifndef arch_hugetlb_migration_supported
 static inline bool arch_hugetlb_migration_supported(struct hstate *h)
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index f8e048b939c7..6867ea8345d1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1547,6 +1547,15 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
 		return;
 	}
 
+	if (unlikely(PageHWPoison(page))) {
+		struct page *raw_error = hugetlb_page_hwpoison(page);
+
+		if (raw_error && raw_error != page) {
+			SetPageHWPoison(raw_error);
+			ClearPageHWPoison(page);
+		}
+	}
+
 	for (i = 0; i < pages_per_huge_page(h);
 	     i++, subpage = mem_map_next(subpage, page, i)) {
 		subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 66edaa7e5092..056dbb2050f8 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1534,6 +1534,8 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
 		goto out;
 	}
 
+	hugetlb_set_page_hwpoison(head, page);
+
 	return ret;
 out:
 	if (count_increased)
-- 
2.25.1




* [PATCH v1 2/5] mm,hwpoison: set PG_hwpoison for busy hugetlb pages
  2022-06-02  5:06 [PATCH v1 0/5] mm, hwpoison: enable 1GB hugepage support Naoya Horiguchi
  2022-06-02  5:06 ` [PATCH v1 1/5] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page Naoya Horiguchi
@ 2022-06-02  5:06 ` Naoya Horiguchi
  2022-06-07 12:50   ` Miaohe Lin
  2022-06-02  5:06 ` [PATCH v1 3/5] mm, hwpoison: make __page_handle_poison returns int Naoya Horiguchi
                   ` (2 subsequent siblings)
  4 siblings, 1 reply; 19+ messages in thread
From: Naoya Horiguchi @ 2022-06-02  5:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

If memory_failure() fails to grab the page refcount on a hugetlb page
because it's busy, it returns without setting PG_hwpoison on it.
This not only loses a chance of error containment, but also breaks the
rule that action_result() should be called only when memory_failure()
does some handling work (even if that's just setting PG_hwpoison).
This inconsistency could harm code maintainability.

So set PG_hwpoison and call hugetlb_set_page_hwpoison() in such a case.

Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/mm.h  | 1 +
 mm/memory-failure.c | 8 ++++----
 2 files changed, 5 insertions(+), 4 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index d446e834a3e5..04de0c3e4f9f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3187,6 +3187,7 @@ enum mf_flags {
 	MF_MUST_KILL = 1 << 2,
 	MF_SOFT_OFFLINE = 1 << 3,
 	MF_UNPOISON = 1 << 4,
+	MF_NO_RETRY = 1 << 5,
 };
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 056dbb2050f8..fe6a7961dc66 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1526,7 +1526,8 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
 			count_increased = true;
 	} else {
 		ret = -EBUSY;
-		goto out;
+		if (!(flags & MF_NO_RETRY))
+			goto out;
 	}
 
 	if (TestSetPageHWPoison(head)) {
@@ -1556,7 +1557,6 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 	struct page *p = pfn_to_page(pfn);
 	struct page *head;
 	unsigned long page_flags;
-	bool retry = true;
 
 	*hugetlb = 1;
 retry:
@@ -1572,8 +1572,8 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 		}
 		return res;
 	} else if (res == -EBUSY) {
-		if (retry) {
-			retry = false;
+		if (!(flags & MF_NO_RETRY)) {
+			flags |= MF_NO_RETRY;
 			goto retry;
 		}
 		action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
-- 
2.25.1




* [PATCH v1 3/5] mm, hwpoison: make __page_handle_poison returns int
  2022-06-02  5:06 [PATCH v1 0/5] mm, hwpoison: enable 1GB hugepage support Naoya Horiguchi
  2022-06-02  5:06 ` [PATCH v1 1/5] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page Naoya Horiguchi
  2022-06-02  5:06 ` [PATCH v1 2/5] mm,hwpoison: set PG_hwpoison for busy hugetlb pages Naoya Horiguchi
@ 2022-06-02  5:06 ` Naoya Horiguchi
  2022-06-07 12:54   ` Miaohe Lin
  2022-06-02  5:06 ` [PATCH v1 4/5] mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage Naoya Horiguchi
  2022-06-02  5:06 ` [PATCH v1 5/5] mm, hwpoison: enable memory error handling on " Naoya Horiguchi
  4 siblings, 1 reply; 19+ messages in thread
From: Naoya Horiguchi @ 2022-06-02  5:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

__page_handle_poison() currently returns a bool that indicates whether
take_page_off_buddy() has succeeded.  But we will want to distinguish
another case, "dissolve succeeded but taking off failed", by the return
value.  So change the type of the return value.
No functional change.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fe6a7961dc66..f149a7864c81 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -68,7 +68,13 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
-static bool __page_handle_poison(struct page *page)
+/*
+ * Return values:
+ *   1:   the page is dissolved (if needed) and taken off from buddy,
+ *   0:   the page is dissolved (if needed) and not taken off from buddy,
+ *   < 0: failed to dissolve.
+ */
+static int __page_handle_poison(struct page *page)
 {
 	int ret;
 
@@ -78,7 +84,7 @@ static bool __page_handle_poison(struct page *page)
 		ret = take_page_off_buddy(page);
 	zone_pcp_enable(page_zone(page));
 
-	return ret > 0;
+	return ret;
 }
 
 static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
@@ -88,7 +94,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
 		 * Doing this check for free pages is also fine since dissolve_free_huge_page
 		 * returns 0 for non-hugetlb pages as well.
 		 */
-		if (!__page_handle_poison(page))
+		if (__page_handle_poison(page) <= 0)
 			/*
 			 * We could fail to take off the target page from buddy
 			 * for example due to racy page allocation, but that's
@@ -1045,7 +1051,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 		 * save healthy subpages.
 		 */
 		put_page(hpage);
-		if (__page_handle_poison(p)) {
+		if (__page_handle_poison(p) > 0) {
 			page_ref_inc(p);
 			res = MF_RECOVERED;
 		}
@@ -1595,8 +1601,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 	 */
 	if (res == 0) {
 		unlock_page(head);
-		res = MF_FAILED;
-		if (__page_handle_poison(p)) {
+		if (__page_handle_poison(p) > 0) {
 			page_ref_inc(p);
 			res = MF_RECOVERED;
 		}
-- 
2.25.1




* [PATCH v1 4/5] mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
  2022-06-02  5:06 [PATCH v1 0/5] mm, hwpoison: enable 1GB hugepage support Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2022-06-02  5:06 ` [PATCH v1 3/5] mm, hwpoison: make __page_handle_poison returns int Naoya Horiguchi
@ 2022-06-02  5:06 ` Naoya Horiguchi
  2022-06-07 13:56   ` Miaohe Lin
  2022-06-02  5:06 ` [PATCH v1 5/5] mm, hwpoison: enable memory error handling on " Naoya Horiguchi
  4 siblings, 1 reply; 19+ messages in thread
From: Naoya Horiguchi @ 2022-06-02  5:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Currently, if memory_failure() (modified to remove the blocking code) is
called on a page in some 1GB hugepage, memory error handling returns
failure and the raw error page gets into an undesirable state.  The impact
is small in production systems (just a single leaked 4kB page), but it
limits testing efficiency because unpoison doesn't work for it.  So we can
no longer create a 1GB hugepage on the 1GB physical address range
containing such hwpoison pages, which could be an issue when testing on
small systems.

When a hwpoison page in a 1GB hugepage is handled, it's caught by the
PageHWPoison check in free_pages_prepare() because the hugepage has been
broken down into raw error pages and the order is 0:

        if (unlikely(PageHWPoison(page)) && !order) {
                ...
                return false;
        }

Then, the page is not sent to the buddy allocator and its refcount is left at 0.

Originally this check was supposed to trigger when the error page is freed
from page_handle_poison() (which is called from soft-offline), but now we
are opening another path that reaches it, so the callers of
__page_handle_poison() need to handle this case by treating the return
value 0 as success.  Then the page refcount for hwpoison is properly
incremented and unpoison now works.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 9 ++++++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f149a7864c81..babeb34f7477 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1043,7 +1043,6 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 		res = truncate_error_page(hpage, page_to_pfn(p), mapping);
 		unlock_page(hpage);
 	} else {
-		res = MF_FAILED;
 		unlock_page(hpage);
 		/*
 		 * migration entry prevents later access on error anonymous
@@ -1051,9 +1050,11 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 		 * save healthy subpages.
 		 */
 		put_page(hpage);
-		if (__page_handle_poison(p) > 0) {
+		if (__page_handle_poison(p) >= 0) {
 			page_ref_inc(p);
 			res = MF_RECOVERED;
+		} else {
+			res = MF_FAILED;
 		}
 	}
 
@@ -1601,9 +1602,11 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 	 */
 	if (res == 0) {
 		unlock_page(head);
-		if (__page_handle_poison(p) > 0) {
+		if (__page_handle_poison(p) >= 0) {
 			page_ref_inc(p);
 			res = MF_RECOVERED;
+		} else {
+			res = MF_FAILED;
 		}
 		action_result(pfn, MF_MSG_FREE_HUGE, res);
 		return res == MF_RECOVERED ? 0 : -EBUSY;
-- 
2.25.1




* [PATCH v1 5/5] mm, hwpoison: enable memory error handling on 1GB hugepage
  2022-06-02  5:06 [PATCH v1 0/5] mm, hwpoison: enable 1GB hugepage support Naoya Horiguchi
                   ` (3 preceding siblings ...)
  2022-06-02  5:06 ` [PATCH v1 4/5] mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage Naoya Horiguchi
@ 2022-06-02  5:06 ` Naoya Horiguchi
  2022-06-07 14:11   ` Miaohe Lin
  4 siblings, 1 reply; 19+ messages in thread
From: Naoya Horiguchi @ 2022-06-02  5:06 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Miaohe Lin,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	Naoya Horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Now that the error handling code is prepared, remove the blocking code and
enable memory error handling on 1GB hugepages.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/mm.h      |  1 -
 include/ras/ras_event.h |  1 -
 mm/memory-failure.c     | 16 ----------------
 3 files changed, 18 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 04de0c3e4f9f..58a6aa916e4f 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3238,7 +3238,6 @@ enum mf_action_page_type {
 	MF_MSG_DIFFERENT_COMPOUND,
 	MF_MSG_HUGE,
 	MF_MSG_FREE_HUGE,
-	MF_MSG_NON_PMD_HUGE,
 	MF_MSG_UNMAP_FAILED,
 	MF_MSG_DIRTY_SWAPCACHE,
 	MF_MSG_CLEAN_SWAPCACHE,
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index d0337a41141c..cbd3ddd7c33d 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -360,7 +360,6 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking" ) \
 	EM ( MF_MSG_HUGE, "huge page" )					\
 	EM ( MF_MSG_FREE_HUGE, "free huge page" )			\
-	EM ( MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page" )		\
 	EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" )		\
 	EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" )		\
 	EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" )		\
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index babeb34f7477..ced033a99e19 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -725,7 +725,6 @@ static const char * const action_page_types[] = {
 	[MF_MSG_DIFFERENT_COMPOUND]	= "different compound page after locking",
 	[MF_MSG_HUGE]			= "huge page",
 	[MF_MSG_FREE_HUGE]		= "free huge page",
-	[MF_MSG_NON_PMD_HUGE]		= "non-pmd-sized huge page",
 	[MF_MSG_UNMAP_FAILED]		= "unmapping failed page",
 	[MF_MSG_DIRTY_SWAPCACHE]	= "dirty swapcache page",
 	[MF_MSG_CLEAN_SWAPCACHE]	= "clean swapcache page",
@@ -1614,21 +1613,6 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
 
 	page_flags = head->flags;
 
-	/*
-	 * TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
-	 * simply disable it. In order to make it work properly, we need
-	 * make sure that:
-	 *  - conversion of a pud that maps an error hugetlb into hwpoison
-	 *    entry properly works, and
-	 *  - other mm code walking over page table is aware of pud-aligned
-	 *    hwpoison entries.
-	 */
-	if (huge_page_size(page_hstate(head)) > PMD_SIZE) {
-		action_result(pfn, MF_MSG_NON_PMD_HUGE, MF_IGNORED);
-		res = -EBUSY;
-		goto out;
-	}
-
 	if (!hwpoison_user_mappings(p, pfn, flags, head)) {
 		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
 		res = -EBUSY;
-- 
2.25.1




* Re: [PATCH v1 1/5] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page
  2022-06-02  5:06 ` [PATCH v1 1/5] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page Naoya Horiguchi
@ 2022-06-07 12:45   ` Miaohe Lin
  2022-06-07 13:04   ` David Hildenbrand
  1 sibling, 0 replies; 19+ messages in thread
From: Miaohe Lin @ 2022-06-07 12:45 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel, Linux-MM

On 2022/6/2 13:06, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> When handling a memory error on a hugetlb page, the error handler tries to
> dissolve it and turn it into 4kB pages.  If it's successfully dissolved,
> the PageHWPoison flag is moved to the raw error page, so that's all right.
> However, dissolving sometimes fails, and then the error page is left as a
> hwpoisoned hugepage.  It would be useful if we could retry dissolving it to
> save the healthy pages, but that's not possible now because the information
> about where the raw error page is located is lost.
> 
> Use the private field of a tail page to keep that information.  The code
> path that shrinks the hugepage pool uses this info to retry the delayed
> dissolve.  Only one hwpoison page per hugepage is kept for now, which might
> be OK because it's simple and multiple hwpoison pages in one hugepage should be rare.  But it can

Yeah, multiple hwpoison pages in a hugepage might indicate that there are other problems.

> be extended in the future.

Since 1GB hugepages are going to be supported, keeping only one hwpoison page
might not be enough for long. ;)

> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks!

> ---
> ChangeLog since previous post on 4/27:
> - fixed typo in patch description (by Miaohe)
> - fixed config value in #ifdef statement (by Miaohe)
> - added sentences about "multiple hwpoison pages" scenario in patch
>   description
> ---
>  include/linux/hugetlb.h | 24 ++++++++++++++++++++++++
>  mm/hugetlb.c            |  9 +++++++++
>  mm/memory-failure.c     |  2 ++
>  3 files changed, 35 insertions(+)
> 
> diff --git a/include/linux/hugetlb.h b/include/linux/hugetlb.h
> index ac2a1d758a80..a5341a3a0d4b 100644
> --- a/include/linux/hugetlb.h
> +++ b/include/linux/hugetlb.h
> @@ -42,6 +42,9 @@ enum {
>  	SUBPAGE_INDEX_CGROUP,		/* reuse page->private */
>  	SUBPAGE_INDEX_CGROUP_RSVD,	/* reuse page->private */
>  	__MAX_CGROUP_SUBPAGE_INDEX = SUBPAGE_INDEX_CGROUP_RSVD,
> +#endif
> +#ifdef CONFIG_MEMORY_FAILURE
> +	SUBPAGE_INDEX_HWPOISON,
>  #endif
>  	__NR_USED_SUBPAGE,
>  };
> @@ -784,6 +787,27 @@ extern int dissolve_free_huge_page(struct page *page);
>  extern int dissolve_free_huge_pages(unsigned long start_pfn,
>  				    unsigned long end_pfn);
>  
> +#ifdef CONFIG_MEMORY_FAILURE
> +/*
> + * pointer to raw error page is located in hpage[SUBPAGE_INDEX_HWPOISON].private
> + */
> +static inline struct page *hugetlb_page_hwpoison(struct page *hpage)
> +{
> +	return (void *)page_private(hpage + SUBPAGE_INDEX_HWPOISON);
> +}
> +
> +static inline void hugetlb_set_page_hwpoison(struct page *hpage,
> +					struct page *page)
> +{
> +	set_page_private(hpage + SUBPAGE_INDEX_HWPOISON, (unsigned long)page);
> +}
> +#else
> +static inline struct page *hugetlb_page_hwpoison(struct page *hpage)
> +{
> +	return NULL;
> +}
> +#endif
> +
>  #ifdef CONFIG_ARCH_ENABLE_HUGEPAGE_MIGRATION
>  #ifndef arch_hugetlb_migration_supported
>  static inline bool arch_hugetlb_migration_supported(struct hstate *h)
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index f8e048b939c7..6867ea8345d1 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -1547,6 +1547,15 @@ static void __update_and_free_page(struct hstate *h, struct page *page)
>  		return;
>  	}
>  
> +	if (unlikely(PageHWPoison(page))) {
> +		struct page *raw_error = hugetlb_page_hwpoison(page);
> +
> +		if (raw_error && raw_error != page) {
> +			SetPageHWPoison(raw_error);
> +			ClearPageHWPoison(page);
> +		}
> +	}
> +
>  	for (i = 0; i < pages_per_huge_page(h);
>  	     i++, subpage = mem_map_next(subpage, page, i)) {
>  		subpage->flags &= ~(1 << PG_locked | 1 << PG_error |
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 66edaa7e5092..056dbb2050f8 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1534,6 +1534,8 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
>  		goto out;
>  	}
>  
> +	hugetlb_set_page_hwpoison(head, page);
> +
>  	return ret;
>  out:
>  	if (count_increased)
> 




* Re: [PATCH v1 2/5] mm,hwpoison: set PG_hwpoison for busy hugetlb pages
  2022-06-02  5:06 ` [PATCH v1 2/5] mm,hwpoison: set PG_hwpoison for busy hugetlb pages Naoya Horiguchi
@ 2022-06-07 12:50   ` Miaohe Lin
  0 siblings, 0 replies; 19+ messages in thread
From: Miaohe Lin @ 2022-06-07 12:50 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel, Linux-MM

On 2022/6/2 13:06, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> If memory_failure() fails to grab the page refcount on a hugetlb page
> because it's busy, it returns without setting PG_hwpoison on it.
> This not only loses a chance of error containment, but also breaks the
> rule that action_result() should be called only when memory_failure()
> does some handling work (even if that's just setting PG_hwpoison).
> This inconsistency could harm code maintainability.

Yes, this patch will make the code more maintainable. But as discussed previously,
the page might be under migration, in which case this patch can't save more.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks!

> 
> So set PG_hwpoison and call hugetlb_set_page_hwpoison() in such a case.
> 
> Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  include/linux/mm.h  | 1 +
>  mm/memory-failure.c | 8 ++++----
>  2 files changed, 5 insertions(+), 4 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index d446e834a3e5..04de0c3e4f9f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3187,6 +3187,7 @@ enum mf_flags {
>  	MF_MUST_KILL = 1 << 2,
>  	MF_SOFT_OFFLINE = 1 << 3,
>  	MF_UNPOISON = 1 << 4,
> +	MF_NO_RETRY = 1 << 5,
>  };
>  extern int memory_failure(unsigned long pfn, int flags);
>  extern void memory_failure_queue(unsigned long pfn, int flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 056dbb2050f8..fe6a7961dc66 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1526,7 +1526,8 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
>  			count_increased = true;
>  	} else {
>  		ret = -EBUSY;
> -		goto out;
> +		if (!(flags & MF_NO_RETRY))
> +			goto out;
>  	}
>  
>  	if (TestSetPageHWPoison(head)) {
> @@ -1556,7 +1557,6 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>  	struct page *p = pfn_to_page(pfn);
>  	struct page *head;
>  	unsigned long page_flags;
> -	bool retry = true;
>  
>  	*hugetlb = 1;
>  retry:
> @@ -1572,8 +1572,8 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>  		}
>  		return res;
>  	} else if (res == -EBUSY) {
> -		if (retry) {
> -			retry = false;
> +		if (!(flags & MF_NO_RETRY)) {
> +			flags |= MF_NO_RETRY;
>  			goto retry;
>  		}
>  		action_result(pfn, MF_MSG_UNKNOWN, MF_IGNORED);
> 




* Re: [PATCH v1 3/5] mm, hwpoison: make __page_handle_poison returns int
  2022-06-02  5:06 ` [PATCH v1 3/5] mm, hwpoison: make __page_handle_poison returns int Naoya Horiguchi
@ 2022-06-07 12:54   ` Miaohe Lin
  2022-06-08  1:40     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2022-06-07 12:54 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel, Linux-MM

On 2022/6/2 13:06, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> __page_handle_poison() currently returns a bool that indicates whether
> take_page_off_buddy() has succeeded.  But we will want to distinguish
> another case, "dissolve succeeded but taking off failed", by the return
> value.  So change the type of the return value.
> No functional change.
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  mm/memory-failure.c | 17 +++++++++++------
>  1 file changed, 11 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fe6a7961dc66..f149a7864c81 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -68,7 +68,13 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
>  
>  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>  
> -static bool __page_handle_poison(struct page *page)
> +/*
> + * Return values:
> + *   1:   the page is dissolved (if needed) and taken off from buddy,
> + *   0:   the page is dissolved (if needed) and not taken off from buddy,
> + *   < 0: failed to dissolve.
> + */
> +static int __page_handle_poison(struct page *page)
>  {
>  	int ret;
>  
> @@ -78,7 +84,7 @@ static bool __page_handle_poison(struct page *page)
>  		ret = take_page_off_buddy(page);
>  	zone_pcp_enable(page_zone(page));
>  
> -	return ret > 0;
> +	return ret;
>  }
>  
>  static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
> @@ -88,7 +94,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
>  		 * Doing this check for free pages is also fine since dissolve_free_huge_page
>  		 * returns 0 for non-hugetlb pages as well.
>  		 */
> -		if (!__page_handle_poison(page))
> +		if (__page_handle_poison(page) <= 0)
>  			/*
>  			 * We could fail to take off the target page from buddy
>  			 * for example due to racy page allocation, but that's
> @@ -1045,7 +1051,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
>  		 * save healthy subpages.
>  		 */
>  		put_page(hpage);
> -		if (__page_handle_poison(p)) {
> +		if (__page_handle_poison(p) > 0) {
>  			page_ref_inc(p);
>  			res = MF_RECOVERED;
>  		}
> @@ -1595,8 +1601,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>  	 */
>  	if (res == 0) {
>  		unlock_page(head);
> -		res = MF_FAILED;

This looks like an unexpected change: res will be 0 instead of MF_FAILED if
__page_handle_poison() fails to dissolve the page or to take it off the buddy.
But this is fixed by a later patch in this series, so it should be fine.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thanks!

> -		if (__page_handle_poison(p)) {
> +		if (__page_handle_poison(p) > 0) {
>  			page_ref_inc(p);
>  			res = MF_RECOVERED;
>  		}
> 




* Re: [PATCH v1 1/5] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page
  2022-06-02  5:06 ` [PATCH v1 1/5] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page Naoya Horiguchi
  2022-06-07 12:45   ` Miaohe Lin
@ 2022-06-07 13:04   ` David Hildenbrand
  2022-06-08  1:31     ` HORIGUCHI NAOYA(堀口 直也)
  1 sibling, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2022-06-07 13:04 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Mike Kravetz, Miaohe Lin, Liu Shixin, Yang Shi,
	Oscar Salvador, Muchun Song, Naoya Horiguchi, linux-kernel

On 02.06.22 07:06, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> When handling a memory error on a hugetlb page, the error handler tries to
> dissolve it and turn it into 4kB pages.  If it's successfully dissolved,
> the PageHWPoison flag is moved to the raw error page, so that's all right.
> However, dissolving sometimes fails, and then the error page is left as a
> hwpoisoned hugepage.  It would be useful if we could retry dissolving it to
> save the healthy pages, but that's not possible now because the information
> about where the raw error page is located is lost.
> 
> Use the private field of a tail page to keep that information.  The code
> path of shrinking hugepage pool used this info to try delayed dissolve.
> This only keeps one hwpoison page for now, which might be OK because it's
> simple and multiple hwpoison pages in a hugepage can be rare. But it can
> be extended in the future.
> 
>

But what would happen if you have multiple successive MCE events on
such a page now?

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1 4/5] mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage
  2022-06-02  5:06 ` [PATCH v1 4/5] mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage Naoya Horiguchi
@ 2022-06-07 13:56   ` Miaohe Lin
  0 siblings, 0 replies; 19+ messages in thread
From: Miaohe Lin @ 2022-06-07 13:56 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel, Linux-MM

On 2022/6/2 13:06, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Currently if memory_failure() (modified to remove blocking code) is called
> on a page in some 1GB hugepage, memory error handling returns failure and
> the raw error page gets into an undesirable state.  The impact is small in
> production systems (just a single leaked 4kB page), but this limits the test
> efficiency because unpoison doesn't work for it.  So we can no longer

I think I get the point after reading the above commit log several times and referring to
the discussion in [1]. The impact is small because the 1GB hugepage is dissolved even though
memory error handling returns failure. So we just leak a single 4kB page, and unpoison doesn't
work for it because the page refcount is 0. Do I get the point?
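
The refcount situation described above can be modeled in a few lines of hypothetical userspace C (toy types, not the kernel structures): after dissolve, the raw error page is caught by the PageHWPoison check in free_pages_prepare(), so it never reaches buddy and its refcount stays 0; patch 4/5 treats that outcome as success and grabs a reference, which is what lets unpoison work later.

```c
#include <assert.h>

/* Toy model of a raw error page after its hugepage is dissolved. */
struct model_page {
    int refcount;
    int hwpoison;
};

/* Free path: a 0-order hwpoison page is held back and never sent to
 * the buddy allocator, so its refcount is left at 0. */
int model_free_pages_prepare(struct model_page *p, int order)
{
    if (p->hwpoison && order == 0)
        return 0;       /* not sent to buddy */
    return 1;
}

/* Error-handling path per patch 4/5: "dissolved but held back" counts
 * as success, and a reference is taken (the page_ref_inc() call). */
void model_handle_dissolved_raw_page(struct model_page *p)
{
    if (!model_free_pages_prepare(p, 0))
        p->refcount++;
}

int model_unpoison_possible(const struct model_page *p)
{
    return p->refcount > 0;     /* unpoison needs a reference to grab */
}

int model_demo(void)
{
    struct model_page p = { 0, 1 };

    model_handle_dissolved_raw_page(&p);
    return model_unpoison_possible(&p);
}
```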

[1] https://lore.kernel.org/all/20220519021757.GA520829@hori.linux.bs1.fc.nec.co.jp/

Although I wonder why __page_handle_poison() fails for a 1GB hugepage, the code itself looks
good to me. Thanks!

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

> create 1GB hugepage on the 1GB physical address range with such hwpoison
> pages, that could be an issue in testing on small systems.
> 
> When a hwpoison page in a 1GB hugepage is handled, it's caught by the
> PageHWPoison check in free_pages_prepare() because the hugepage is broken
> down into raw error page and order is 0:
> 
>         if (unlikely(PageHWPoison(page)) && !order) {
>                 ...
>                 return false;
>         }
> 
> Then, the page is not sent to buddy and the page refcount is left 0.
> 
> Originally this check is supposed to work when the error page is freed from
> page_handle_poison() (that is called from soft-offline), but now we are
> opening another path to call it, so the callers of __page_handle_poison()
> need to handle the case by considering the return value 0 as success. Then
> page refcount for hwpoison is properly incremented and now unpoison works.
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  mm/memory-failure.c | 9 ++++++---
>  1 file changed, 6 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index f149a7864c81..babeb34f7477 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1043,7 +1043,6 @@ static int me_huge_page(struct page_state *ps, struct page *p)
>  		res = truncate_error_page(hpage, page_to_pfn(p), mapping);
>  		unlock_page(hpage);
>  	} else {
> -		res = MF_FAILED;
>  		unlock_page(hpage);
>  		/*
>  		 * migration entry prevents later access on error anonymous
> @@ -1051,9 +1050,11 @@ static int me_huge_page(struct page_state *ps, struct page *p)
>  		 * save healthy subpages.
>  		 */
>  		put_page(hpage);
> -		if (__page_handle_poison(p) > 0) {
> +		if (__page_handle_poison(p) >= 0) {
>  			page_ref_inc(p);
>  			res = MF_RECOVERED;
> +		} else {
> +			res = MF_FAILED;
>  		}
>  	}
>  
> @@ -1601,9 +1602,11 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>  	 */
>  	if (res == 0) {
>  		unlock_page(head);
> -		if (__page_handle_poison(p) > 0) {
> +		if (__page_handle_poison(p) >= 0) {
>  			page_ref_inc(p);
>  			res = MF_RECOVERED;
> +		} else {
> +			res = MF_FAILED;
>  		}
>  		action_result(pfn, MF_MSG_FREE_HUGE, res);
>  		return res == MF_RECOVERED ? 0 : -EBUSY;
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1 5/5] mm, hwpoison: enable memory error handling on 1GB hugepage
  2022-06-02  5:06 ` [PATCH v1 5/5] mm, hwpoison: enable memory error handling on " Naoya Horiguchi
@ 2022-06-07 14:11   ` Miaohe Lin
  2022-06-08  6:16     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2022-06-07 14:11 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Hildenbrand, Mike Kravetz, Liu Shixin,
	Yang Shi, Oscar Salvador, Muchun Song, Naoya Horiguchi,
	linux-kernel, Linux-MM

On 2022/6/2 13:06, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Now error handling code is prepared, so remove the blocking code and
> enable memory error handling on 1GB hugepage.
> 

I'm nervous about this change. It seems there are many code paths not aware of pud swap entries.
I browsed some of them:
apply_to_pud_range called from apply_to_page_range:

apply_to_pud_range:
	next = pud_addr_end(addr, end);
	if (pud_none(*pud) && !create)
		continue;
	if (WARN_ON_ONCE(pud_leaf(*pud)))
		return -EINVAL;
	if (!pud_none(*pud) && WARN_ON_ONCE(pud_bad(*pud))) {
		if (!create)
			continue;
		pud_clear_bad(pud);
	}
	err = apply_to_pmd_range(mm, pud, addr, next,
				 fn, data, create, mask);

In the !pud_present case, it will mostly reach apply_to_pmd_range() and call pmd_offset() on it,
and an invalid pointer will be dereferenced.

Another example might be copy_pud_range() and so on. So I think it might not be safe to enable
1GB hugepage handling yet, or all of these places should be fixed?

Thanks!

> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  include/linux/mm.h      |  1 -
>  include/ras/ras_event.h |  1 -
>  mm/memory-failure.c     | 16 ----------------
>  3 files changed, 18 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 04de0c3e4f9f..58a6aa916e4f 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -3238,7 +3238,6 @@ enum mf_action_page_type {
>  	MF_MSG_DIFFERENT_COMPOUND,
>  	MF_MSG_HUGE,
>  	MF_MSG_FREE_HUGE,
> -	MF_MSG_NON_PMD_HUGE,
>  	MF_MSG_UNMAP_FAILED,
>  	MF_MSG_DIRTY_SWAPCACHE,
>  	MF_MSG_CLEAN_SWAPCACHE,
> diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
> index d0337a41141c..cbd3ddd7c33d 100644
> --- a/include/ras/ras_event.h
> +++ b/include/ras/ras_event.h
> @@ -360,7 +360,6 @@ TRACE_EVENT(aer_event,
>  	EM ( MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking" ) \
>  	EM ( MF_MSG_HUGE, "huge page" )					\
>  	EM ( MF_MSG_FREE_HUGE, "free huge page" )			\
> -	EM ( MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page" )		\
>  	EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" )		\
>  	EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" )		\
>  	EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" )		\
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index babeb34f7477..ced033a99e19 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -725,7 +725,6 @@ static const char * const action_page_types[] = {
>  	[MF_MSG_DIFFERENT_COMPOUND]	= "different compound page after locking",
>  	[MF_MSG_HUGE]			= "huge page",
>  	[MF_MSG_FREE_HUGE]		= "free huge page",
> -	[MF_MSG_NON_PMD_HUGE]		= "non-pmd-sized huge page",
>  	[MF_MSG_UNMAP_FAILED]		= "unmapping failed page",
>  	[MF_MSG_DIRTY_SWAPCACHE]	= "dirty swapcache page",
>  	[MF_MSG_CLEAN_SWAPCACHE]	= "clean swapcache page",
> @@ -1614,21 +1613,6 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
>  
>  	page_flags = head->flags;
>  
> -	/*
> -	 * TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
> -	 * simply disable it. In order to make it work properly, we need
> -	 * make sure that:
> -	 *  - conversion of a pud that maps an error hugetlb into hwpoison
> -	 *    entry properly works, and
> -	 *  - other mm code walking over page table is aware of pud-aligned
> -	 *    hwpoison entries.
> -	 */
> -	if (huge_page_size(page_hstate(head)) > PMD_SIZE) {
> -		action_result(pfn, MF_MSG_NON_PMD_HUGE, MF_IGNORED);
> -		res = -EBUSY;
> -		goto out;
> -	}
> -
>  	if (!hwpoison_user_mappings(p, pfn, flags, head)) {
>  		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
>  		res = -EBUSY;
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1 1/5] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page
  2022-06-07 13:04   ` David Hildenbrand
@ 2022-06-08  1:31     ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-08  9:19       ` David Hildenbrand
  0 siblings, 1 reply; 19+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-08  1:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Mike Kravetz,
	Miaohe Lin, Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	linux-kernel

On Tue, Jun 07, 2022 at 03:04:15PM +0200, David Hildenbrand wrote:
> On 02.06.22 07:06, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > When handling memory error on a hugetlb page, the error handler tries to
> > dissolve and turn it into 4kB pages.  If it's successfully dissolved,
> > PageHWPoison flag is moved to the raw error page, so that's all right.
> > However, dissolve sometimes fails, then the error page is left as
> > hwpoisoned hugepage. It's useful if we can retry to dissolve it to save
> > healthy pages, but that's not possible now because the information about
> > where the raw error page is lost.
> > 
> > Use the private field of a tail page to keep that information.  The code
> > path of shrinking hugepage pool used this info to try delayed dissolve.
> > This only keeps one hwpoison page for now, which might be OK because it's
> > simple and multiple hwpoison pages in a hugepage can be rare. But it can
> > be extended in the future.
> > 
> >
> 
> But what would happen now if you have multiple successive MCE events on
> such a page now?

The 2nd and later events are ignored because the hugepage is already marked hwpoisoned, which
might not be good when the hwpoisoned hugepage is freed/dissolved later.
So a temporary workaround is to remember that the hugepage has multiple hwpoison pages
and disable free/dissolve for such hugepages.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1 3/5] mm, hwpoison: make __page_handle_poison returns int
  2022-06-07 12:54   ` Miaohe Lin
@ 2022-06-08  1:40     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 19+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-08  1:40 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Naoya Horiguchi, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song, linux-kernel,
	Linux-MM

On Tue, Jun 07, 2022 at 08:54:24PM +0800, Miaohe Lin wrote:
> On 2022/6/2 13:06, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > __page_handle_poison() returns bool that shows whether
> > take_page_off_buddy() has passed or not now.  But we will want to
> > distinguish another case of "dissolve has passed but taking off failed"
> > by its return value. So change the type of the return value.
> > No functional change.
> > 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > ---
> >  mm/memory-failure.c | 17 +++++++++++------
> >  1 file changed, 11 insertions(+), 6 deletions(-)
> > 
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index fe6a7961dc66..f149a7864c81 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -68,7 +68,13 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
> >  
> >  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
> >  
> > -static bool __page_handle_poison(struct page *page)
> > +/*
> > + * Return values:
> > + *   1:   the page is dissolved (if needed) and taken off from buddy,
> > + *   0:   the page is dissolved (if needed) and not taken off from buddy,
> > + *   < 0: failed to dissolve.
> > + */
> > +static int __page_handle_poison(struct page *page)
> >  {
> >  	int ret;
> >  
> > @@ -78,7 +84,7 @@ static bool __page_handle_poison(struct page *page)
> >  		ret = take_page_off_buddy(page);
> >  	zone_pcp_enable(page_zone(page));
> >  
> > -	return ret > 0;
> > +	return ret;
> >  }
> >  
> >  static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
> > @@ -88,7 +94,7 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
> >  		 * Doing this check for free pages is also fine since dissolve_free_huge_page
> >  		 * returns 0 for non-hugetlb pages as well.
> >  		 */
> > -		if (!__page_handle_poison(page))
> > +		if (__page_handle_poison(page) <= 0)
> >  			/*
> >  			 * We could fail to take off the target page from buddy
> >  			 * for example due to racy page allocation, but that's
> > @@ -1045,7 +1051,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
> >  		 * save healthy subpages.
> >  		 */
> >  		put_page(hpage);
> > -		if (__page_handle_poison(p)) {
> > +		if (__page_handle_poison(p) > 0) {
> >  			page_ref_inc(p);
> >  			res = MF_RECOVERED;
> >  		}
> > @@ -1595,8 +1601,7 @@ static int try_memory_failure_hugetlb(unsigned long pfn, int flags, int *hugetlb
> >  	 */
> >  	if (res == 0) {
> >  		unlock_page(head);
> > -		res = MF_FAILED;
> 
> This looks like an unexpected change. res will be 0 instead of MF_FAILED if __page_handle_poison failed to
> dissolve or not taken off from buddy. But this is fixed in later patch in this series. So it should be fine.

Ah, you're right. This patch is stated as "no functional change" but that
is not true because of this.  So I'll move this line deletion to 4/5 in the next
version.

> 
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> 
> Thanks!

Thank you :)

- Naoya Horiguchi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1 5/5] mm, hwpoison: enable memory error handling on 1GB hugepage
  2022-06-07 14:11   ` Miaohe Lin
@ 2022-06-08  6:16     ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-08 12:57       ` Miaohe Lin
  0 siblings, 1 reply; 19+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-08  6:16 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Naoya Horiguchi, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song, linux-kernel,
	Linux-MM

On Tue, Jun 07, 2022 at 10:11:24PM +0800, Miaohe Lin wrote:
> On 2022/6/2 13:06, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > Now error handling code is prepared, so remove the blocking code and
> > enable memory error handling on 1GB hugepage.
> > 
> 
> I'm nervous about this change. It seems there are many code paths not awared of pud swap entry.
> I browsed some of them:
> apply_to_pud_range called from apply_to_page_range:
> 
> apply_to_pud_range:
> 	next = pud_addr_end(addr, end);
> 	if (pud_none(*pud) && !create)
> 		continue;
> 	if (WARN_ON_ONCE(pud_leaf(*pud)))
> 		return -EINVAL;
> 	if (!pud_none(*pud) && WARN_ON_ONCE(pud_bad(*pud))) {
> 		if (!create)
> 			continue;
> 		pud_clear_bad(pud);
> 	}
> 	err = apply_to_pmd_range(mm, pud, addr, next,
> 				 fn, data, create, mask);
> 
> For !pud_present case, it will mostly reach apply_to_pmd_range and call pmd_offset on it. And invalid
> pointer will be de-referenced.

apply_to_pmd_range() has BUG_ON(pud_huge(*pud)) and apply_to_pte_range() has
BUG_ON(pmd_huge(*pmd)), so this page table walking code does not seem to expect
to handle pmd/pud level mappings.

> 
> Another example might be copy_pud_range and so on. So I think it might not be prepared to enable the
> 1GB hugepage or all of these places should be fixed?

I think that most page table walkers for user address space should first
check is_vm_hugetlb_page() and call hugetlb-specific walking code for VMAs
with VM_HUGETLB.
copy_page_range() is a good example.  It calls copy_hugetlb_page_range()
for VMAs with VM_HUGETLB, and that function should support hwpoison entries.
But I feel that I need testing for confirmation.
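
The dispatch pattern described here can be sketched as a toy model (hypothetical names; the real check lives in copy_page_range() and friends): generic walkers hand hugetlb VMAs to a hugetlb-specific walker up front, so only that walker needs to understand pud-level hwpoison swap entries.

```c
#include <assert.h>

/* Hypothetical model of the VM_HUGETLB dispatch done by page table
 * walkers for user address space. */
#define MODEL_VM_HUGETLB 0x1UL

enum walker { WALK_GENERIC, WALK_HUGETLB };

struct model_vma { unsigned long vm_flags; };

int model_is_vm_hugetlb_page(const struct model_vma *vma)
{
    return !!(vma->vm_flags & MODEL_VM_HUGETLB);
}

/* Mirrors copy_page_range() calling copy_hugetlb_page_range() first,
 * before any generic pgd/pud/pmd descent happens. */
enum walker model_copy_page_range(const struct model_vma *vma)
{
    if (model_is_vm_hugetlb_page(vma))
        return WALK_HUGETLB;
    return WALK_GENERIC;
}

int model_demo_hugetlb(void)
{
    struct model_vma v = { MODEL_VM_HUGETLB };
    return model_copy_page_range(&v) == WALK_HUGETLB;
}

int model_demo_generic(void)
{
    struct model_vma v = { 0 };
    return model_copy_page_range(&v) == WALK_GENERIC;
}
```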

And I'm not sure that all the others are prepared for non-present pud mappings,
so I'll need some code inspection and testing for each.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1 1/5] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page
  2022-06-08  1:31     ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-06-08  9:19       ` David Hildenbrand
  0 siblings, 0 replies; 19+ messages in thread
From: David Hildenbrand @ 2022-06-08  9:19 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Mike Kravetz,
	Miaohe Lin, Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song,
	linux-kernel

On 08.06.22 03:31, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Jun 07, 2022 at 03:04:15PM +0200, David Hildenbrand wrote:
>> On 02.06.22 07:06, Naoya Horiguchi wrote:
>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>>
>>> When handling memory error on a hugetlb page, the error handler tries to
>>> dissolve and turn it into 4kB pages.  If it's successfully dissolved,
>>> PageHWPoison flag is moved to the raw error page, so that's all right.
>>> However, dissolve sometimes fails, then the error page is left as
>>> hwpoisoned hugepage. It's useful if we can retry to dissolve it to save
>>> healthy pages, but that's not possible now because the information about
>>> where the raw error page is lost.
>>>
>>> Use the private field of a tail page to keep that information.  The code
>>> path of shrinking hugepage pool used this info to try delayed dissolve.
>>> This only keeps one hwpoison page for now, which might be OK because it's
>>> simple and multiple hwpoison pages in a hugepage can be rare. But it can
>>> be extended in the future.
>>>
>>>
>>
>> But what would happen now if you have multiple successive MCE events on
>> such a page now?
> 
> The 2nd and later events are ignored due to "already hwpoisoned hugepage",
> this might not be good when the hwpoisoned hugepage is freed/dissolved later.
> So a temporal workaround is to remember "hugepage has multiple hwpoison pages"
> and disable free/dissolve for such hugepages.

Right. We might want to indicate exactly one vs. multiple, and then in
__update_and_free_page(), move the hwpoison flag to one subpage,
eventually exposing other corrupted subpages to the system.
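
The bookkeeping being proposed in this exchange might look roughly like the following toy model (all names and the toy page size are hypothetical; this is a design sketch, not the eventual kernel implementation): record the one known raw error subpage, plus a flag for "more than one"; on dissolve, move the hwpoison flag to the recorded subpage only when exactly one is known, and otherwise refuse to free/dissolve the hugepage.

```c
#include <assert.h>

#define SUBPAGES 8          /* toy hugepage size */
#define IDX_NONE (-1)

/* Hypothetical per-hugepage hwpoison bookkeeping. */
struct model_hpage {
    int hwpoison_idx;       /* known raw error subpage, IDX_NONE if none */
    int multiple_hwpoison;  /* a 2nd (or later) MCE hit another subpage */
    int subpage_hwpoison[SUBPAGES];
};

void model_record_error(struct model_hpage *h, int idx)
{
    if (h->hwpoison_idx == IDX_NONE)
        h->hwpoison_idx = idx;
    else if (idx != h->hwpoison_idx)
        h->multiple_hwpoison = 1;   /* location of extra errors unknown */
}

/* Returns 1 if dissolve proceeded (flag moved to the raw subpage),
 * 0 if it was refused because that would expose unknown bad subpages. */
int model_try_dissolve(struct model_hpage *h)
{
    if (h->multiple_hwpoison)
        return 0;
    if (h->hwpoison_idx != IDX_NONE)
        h->subpage_hwpoison[h->hwpoison_idx] = 1;
    return 1;
}

int model_demo_single(void)
{
    struct model_hpage h = { IDX_NONE, 0, { 0 } };

    model_record_error(&h, 3);
    return model_try_dissolve(&h) && h.subpage_hwpoison[3];
}

int model_demo_multi(void)
{
    struct model_hpage h = { IDX_NONE, 0, { 0 } };

    model_record_error(&h, 3);
    model_record_error(&h, 5);
    return model_try_dissolve(&h) == 0;
}
```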

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1 5/5] mm, hwpoison: enable memory error handling on 1GB hugepage
  2022-06-08  6:16     ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-06-08 12:57       ` Miaohe Lin
  2022-06-09  8:45         ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2022-06-08 12:57 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song, linux-kernel,
	Linux-MM

On 2022/6/8 14:16, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Jun 07, 2022 at 10:11:24PM +0800, Miaohe Lin wrote:
>> On 2022/6/2 13:06, Naoya Horiguchi wrote:
>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>>
>>> Now error handling code is prepared, so remove the blocking code and
>>> enable memory error handling on 1GB hugepage.
>>>
>>
>> I'm nervous about this change. It seems there are many code paths not awared of pud swap entry.
>> I browsed some of them:
>> apply_to_pud_range called from apply_to_page_range:
>>
>> apply_to_pud_range:
>> 	next = pud_addr_end(addr, end);
>> 	if (pud_none(*pud) && !create)
>> 		continue;
>> 	if (WARN_ON_ONCE(pud_leaf(*pud)))
>> 		return -EINVAL;
>> 	if (!pud_none(*pud) && WARN_ON_ONCE(pud_bad(*pud))) {
>> 		if (!create)
>> 			continue;
>> 		pud_clear_bad(pud);
>> 	}
>> 	err = apply_to_pmd_range(mm, pud, addr, next,
>> 				 fn, data, create, mask);
>>
>> For !pud_present case, it will mostly reach apply_to_pmd_range and call pmd_offset on it. And invalid
>> pointer will be de-referenced.
> 
> apply_to_pmd_range() has BUG_ON(pud_huge(*pud)) and apply_to_pte_range() has
> BUG_ON(pmd_huge(*pmd)), so this page table walking code seems to not expect
> to handle pmd/pud level mapping.

Yes, you're right. These functions are not intended to handle pmd/pud level mappings.

> 
>>
>> Another example might be copy_pud_range and so on. So I think it might not be prepared to enable the
>> 1GB hugepage or all of these places should be fixed?
> 
> I think that most of page table walker for user address space should first
> check is_vm_hugetlb_page() and call hugetlb specific walking code for vma
> with VM_HUGETLB.
> copy_page_range() is a good example.  It calls copy_hugetlb_page_range()
> for vma with VM_HUGETLB and the function should support hwpoison entry.
> But I feel that I need testing for confirmation.

Sorry, I missed that it should be called from the hugetlb variants.

> 
> And I'm not sure that all other are prepared for non-present pud-mapping,
> so I'll need somehow code inspection and testing for each.

I browsed the code again; there still might be some problematic code paths:

1. For follow_pud_mask(), the !pud_present case will mostly reach follow_pmd_mask(). This can be
called for a hugetlb page. (Note that gup_pud_range() was fixed in 15494520b776 ("mm: fix gup_pud_range").)

2. Even for huge_pte_alloc(), pud_offset() will be called in pud_alloc(), so pudp will be an invalid
pointer, and it will be dereferenced later.

I hope I haven't missed something again this time. ;)
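
The hazard behind both examples can be captured in a toy model (hypothetical types, not the kernel's pud_t): a pud entry can be none, a present pointer to a pmd table, or a non-present software entry (migration/hwpoison). Descending with pmd_offset() is only valid in the present-table case, so walkers that only distinguish "none vs not-none" would treat a swap entry as a table pointer and dereference garbage.

```c
#include <assert.h>

/* Hypothetical model of the three states a pud entry can be in once
 * pud-sized hwpoison swap entries exist. */
enum model_pud_kind { PUD_NONE, PUD_TABLE, PUD_SWAP_ENTRY };

struct model_pud { enum model_pud_kind kind; };

int model_pud_present(const struct model_pud *p)
{
    return p->kind == PUD_TABLE;
}

/* Returns 0 when skipping or descending is safe, -1 when a non-present
 * (swap) entry must be handled specially instead of descended into. */
int model_descend(const struct model_pud *p)
{
    if (p->kind == PUD_NONE)
        return 0;                   /* nothing mapped, skip */
    if (!model_pud_present(p))
        return -1;                  /* swap entry: not a pmd table */
    return 0;                       /* pmd_offset() would be safe here */
}

int model_demo_swap_entry(void)
{
    struct model_pud p = { PUD_SWAP_ENTRY };
    return model_descend(&p);
}

int model_demo_table(void)
{
    struct model_pud p = { PUD_TABLE };
    return model_descend(&p);
}
```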

> 
> Thanks,
> Naoya Horiguchi

Thanks!

> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1 5/5] mm, hwpoison: enable memory error handling on 1GB hugepage
  2022-06-08 12:57       ` Miaohe Lin
@ 2022-06-09  8:45         ` HORIGUCHI NAOYA(堀口 直也)
  2022-06-09 11:28           ` Miaohe Lin
  0 siblings, 1 reply; 19+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-06-09  8:45 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Naoya Horiguchi, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song, linux-kernel,
	Linux-MM

On Wed, Jun 08, 2022 at 08:57:24PM +0800, Miaohe Lin wrote:
...
> >
> > I think that most of page table walker for user address space should first
> > check is_vm_hugetlb_page() and call hugetlb specific walking code for vma
> > with VM_HUGETLB.
> > copy_page_range() is a good example.  It calls copy_hugetlb_page_range()
> > for vma with VM_HUGETLB and the function should support hwpoison entry.
> > But I feel that I need testing for confirmation.
>
> Sorry, I missed it should be called from hugetlb variants.
>
> >
> > And I'm not sure that all other are prepared for non-present pud-mapping,
> > so I'll need somehow code inspection and testing for each.
>
> I browsed the code again, there still might be some problematic code paths:
>
> 1.For follow_pud_mask, !pud_present will mostly reach follow_pmd_mask(). This can be
> called for hugetlb page. (Note gup_pud_range is fixed at 15494520b776 ("mm: fix gup_pud_range"))
>
> 2.Even for huge_pte_alloc, pud_offset will be called in pud_alloc. So pudp will be an invalid pointer.
> And it will be de-referenced later.

Yes, these paths need to support non-present pud entries, so I'll update/add
the patches.  It seems that I did similar work for pmd a few years ago
(cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage")).

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH v1 5/5] mm, hwpoison: enable memory error handling on 1GB hugepage
  2022-06-09  8:45         ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-06-09 11:28           ` Miaohe Lin
  0 siblings, 0 replies; 19+ messages in thread
From: Miaohe Lin @ 2022-06-09 11:28 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, Andrew Morton, David Hildenbrand, Mike Kravetz,
	Liu Shixin, Yang Shi, Oscar Salvador, Muchun Song, linux-kernel,
	Linux-MM

On 2022/6/9 16:45, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Jun 08, 2022 at 08:57:24PM +0800, Miaohe Lin wrote:
> ...
>>>
>>> I think that most of page table walker for user address space should first
>>> check is_vm_hugetlb_page() and call hugetlb specific walking code for vma
>>> with VM_HUGETLB.
>>> copy_page_range() is a good example.  It calls copy_hugetlb_page_range()
>>> for vma with VM_HUGETLB and the function should support hwpoison entry.
>>> But I feel that I need testing for confirmation.
>>
>> Sorry, I missed it should be called from hugetlb variants.
>>
>>>
>>> And I'm not sure that all other are prepared for non-present pud-mapping,
>>> so I'll need somehow code inspection and testing for each.
>>
>> I browsed the code again, there still might be some problematic code paths:
>>
>> 1.For follow_pud_mask, !pud_present will mostly reach follow_pmd_mask(). This can be
>> called for hugetlb page. (Note gup_pud_range is fixed at 15494520b776 ("mm: fix gup_pud_range"))
>>
>> 2.Even for huge_pte_alloc, pud_offset will be called in pud_alloc. So pudp will be an invalid pointer.
>> And it will be de-referenced later.
> 
> Yes, these paths need to support non-present pud entry, so I'll update/add
> the patches.  It seems that I did the similar work for pmd few years ago
> (cbef8478bee5 ("mm/hugetlb: pmd_huge() returns true for non-present hugepage").

Yes, this should be similar work. Thanks for your hard work. :)

> 
> Thanks,
> Naoya Horiguchi
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-06-09 11:28 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-06-02  5:06 [PATCH v1 0/5] mm, hwpoison: enable 1GB hugepage support Naoya Horiguchi
2022-06-02  5:06 ` [PATCH v1 1/5] mm, hwpoison, hugetlb: introduce SUBPAGE_INDEX_HWPOISON to save raw error page Naoya Horiguchi
2022-06-07 12:45   ` Miaohe Lin
2022-06-07 13:04   ` David Hildenbrand
2022-06-08  1:31     ` HORIGUCHI NAOYA(堀口 直也)
2022-06-08  9:19       ` David Hildenbrand
2022-06-02  5:06 ` [PATCH v1 2/5] mm,hwpoison: set PG_hwpoison for busy hugetlb pages Naoya Horiguchi
2022-06-07 12:50   ` Miaohe Lin
2022-06-02  5:06 ` [PATCH v1 3/5] mm, hwpoison: make __page_handle_poison returns int Naoya Horiguchi
2022-06-07 12:54   ` Miaohe Lin
2022-06-08  1:40     ` HORIGUCHI NAOYA(堀口 直也)
2022-06-02  5:06 ` [PATCH v1 4/5] mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage Naoya Horiguchi
2022-06-07 13:56   ` Miaohe Lin
2022-06-02  5:06 ` [PATCH v1 5/5] mm, hwpoison: enable memory error handling on " Naoya Horiguchi
2022-06-07 14:11   ` Miaohe Lin
2022-06-08  6:16     ` HORIGUCHI NAOYA(堀口 直也)
2022-06-08 12:57       ` Miaohe Lin
2022-06-09  8:45         ` HORIGUCHI NAOYA(堀口 直也)
2022-06-09 11:28           ` Miaohe Lin

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).