linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v1 0/6] mm/hwpoison: fix unpoison_memory()
@ 2021-06-14  2:12 Naoya Horiguchi
  2021-06-14  2:12 ` [PATCH v1 1/6] mm/hwpoison: mf_mutex for soft offline and unpoison Naoya Horiguchi
                   ` (5 more replies)
  0 siblings, 6 replies; 22+ messages in thread
From: Naoya Horiguchi @ 2021-06-14  2:12 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Michal Hocko,
	Ding Hui, Tony Luck, Aneesh Kumar K.V, Naoya Horiguchi,
	linux-kernel

Hi,

This patchset is suggesting to adjust unpoison_memory() to the new way of
isolation and refcounting of hwpoisoned pages.

As mentioned in [1], NR_FREE_PAGES is incremented via newly added function
undoing take_page_off_buddy(), so the counter is consistent with this series.

It depends on some other hwpoison-related patches (not mainlined but in linux-mm),
so I listed the dependencies here:

  - mm/hwpoison: do not lock page again when me_huge_page() successfully recovers
  - mm/memory-failure: make sure wait for page writeback in memory_failure
  - mm,hwpoison: make get_hwpoison_page() call get_any_page()
  - mm,hwpoison: fix race with hugetlb page allocation
  - mm,hwpoison: Send SIGBUS with error virutal address
  - mm,hwpoison: Return -EHWPOISON to denote that the page has already been poisoned
  - mm/memory-failure: Use a mutex to avoid memory_failure() races

Or GitHub branch I pushed for easy patch access might be helpful for reviewers.

  git fetch https://github.com/nhoriguchi/linux hwpoison

[1] https://lore.kernel.org/linux-mm/20210527003436.GA3543069@hori.linux.bs1.fc.nec.co.jp/

Thanks,
Naoya Horiguchi
---
Summary:

Naoya Horiguchi (6):
      mm/hwpoison: mf_mutex for soft offline and unpoison
      mm/hwpoison: remove race consideration
      mm/hwpoison: introduce MF_MSG_PAGETABLE
      mm/hwpoison: remove MF_MSG_BUDDY_2ND and MF_MSG_POISONED_HUGE
      mm/hwpoison: make some kernel pages handlable
      mm/hwpoison: fix unpoison_memory()

 include/linux/mm.h         |   4 +-
 include/linux/page-flags.h |   4 ++
 include/ras/ras_event.h    |   3 +-
 mm/memory-failure.c        | 176 +++++++++++++++++++++++++--------------------
 mm/page_alloc.c            |  19 +++++
 5 files changed, 126 insertions(+), 80 deletions(-)


^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v1 1/6] mm/hwpoison: mf_mutex for soft offline and unpoison
  2021-06-14  2:12 [PATCH v1 0/6] mm/hwpoison: fix unpoison_memory() Naoya Horiguchi
@ 2021-06-14  2:12 ` Naoya Horiguchi
  2021-06-15 11:41   ` Ding Hui
  2021-06-15 12:42   ` Miaohe Lin
  2021-06-14  2:12 ` [PATCH v1 2/6] mm/hwpoison: remove race consideration Naoya Horiguchi
                   ` (4 subsequent siblings)
  5 siblings, 2 replies; 22+ messages in thread
From: Naoya Horiguchi @ 2021-06-14  2:12 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Michal Hocko,
	Ding Hui, Tony Luck, Aneesh Kumar K.V, Naoya Horiguchi,
	linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Originally mf_mutex is introduced to serialize multiple MCE events, but
it's also helpful to exclude races among  soft_offline_page() and
unpoison_memory().  So apply mf_mutex to them.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 26 ++++++++++++++++++--------
 1 file changed, 18 insertions(+), 8 deletions(-)

diff --git v5.13-rc5/mm/memory-failure.c v5.13-rc5_patched/mm/memory-failure.c
index ae30fd6d575a..280eb6d6dd15 100644
--- v5.13-rc5/mm/memory-failure.c
+++ v5.13-rc5_patched/mm/memory-failure.c
@@ -1583,6 +1583,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	return rc;
 }
 
+static DEFINE_MUTEX(mf_mutex);
+
 /**
  * memory_failure - Handle memory failure of a page.
  * @pfn: Page Number of the corrupted page
@@ -1609,7 +1611,6 @@ int memory_failure(unsigned long pfn, int flags)
 	int res = 0;
 	unsigned long page_flags;
 	bool retry = true;
-	static DEFINE_MUTEX(mf_mutex);
 
 	if (!sysctl_memory_failure_recovery)
 		panic("Memory failure on page %lx", pfn);
@@ -1918,6 +1919,7 @@ int unpoison_memory(unsigned long pfn)
 	struct page *page;
 	struct page *p;
 	int freeit = 0;
+	int ret = 0;
 	unsigned long flags = 0;
 	static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
 					DEFAULT_RATELIMIT_BURST);
@@ -1928,28 +1930,30 @@ int unpoison_memory(unsigned long pfn)
 	p = pfn_to_page(pfn);
 	page = compound_head(p);
 
+	mutex_lock(&mf_mutex);
+
 	if (!PageHWPoison(p)) {
 		unpoison_pr_info("Unpoison: Page was already unpoisoned %#lx\n",
 				 pfn, &unpoison_rs);
-		return 0;
+		goto unlock_mutex;
 	}
 
 	if (page_count(page) > 1) {
 		unpoison_pr_info("Unpoison: Someone grabs the hwpoison page %#lx\n",
 				 pfn, &unpoison_rs);
-		return 0;
+		goto unlock_mutex;
 	}
 
 	if (page_mapped(page)) {
 		unpoison_pr_info("Unpoison: Someone maps the hwpoison page %#lx\n",
 				 pfn, &unpoison_rs);
-		return 0;
+		goto unlock_mutex;
 	}
 
 	if (page_mapping(page)) {
 		unpoison_pr_info("Unpoison: the hwpoison page has non-NULL mapping %#lx\n",
 				 pfn, &unpoison_rs);
-		return 0;
+		goto unlock_mutex;
 	}
 
 	/*
@@ -1960,7 +1964,7 @@ int unpoison_memory(unsigned long pfn)
 	if (!PageHuge(page) && PageTransHuge(page)) {
 		unpoison_pr_info("Unpoison: Memory failure is now running on %#lx\n",
 				 pfn, &unpoison_rs);
-		return 0;
+		goto unlock_mutex;
 	}
 
 	if (!get_hwpoison_page(p, flags)) {
@@ -1968,7 +1972,7 @@ int unpoison_memory(unsigned long pfn)
 			num_poisoned_pages_dec();
 		unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
 				 pfn, &unpoison_rs);
-		return 0;
+		goto unlock_mutex;
 	}
 
 	lock_page(page);
@@ -1990,7 +1994,9 @@ int unpoison_memory(unsigned long pfn)
 	if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1))
 		put_page(page);
 
-	return 0;
+unlock_mutex:
+	mutex_unlock(&mf_mutex);
+	return ret;
 }
 EXPORT_SYMBOL(unpoison_memory);
 
@@ -2171,6 +2177,8 @@ int soft_offline_page(unsigned long pfn, int flags)
 		return -EIO;
 	}
 
+	mutex_lock(&mf_mutex);
+
 	if (PageHWPoison(page)) {
 		pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
 		put_ref_page(ref_page);
@@ -2194,5 +2202,7 @@ int soft_offline_page(unsigned long pfn, int flags)
 			 __func__, pfn, page->flags, &page->flags);
 	}
 
+	mutex_unlock(&mf_mutex);
+
 	return ret;
 }
-- 
2.25.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v1 2/6] mm/hwpoison: remove race consideration
  2021-06-14  2:12 [PATCH v1 0/6] mm/hwpoison: fix unpoison_memory() Naoya Horiguchi
  2021-06-14  2:12 ` [PATCH v1 1/6] mm/hwpoison: mf_mutex for soft offline and unpoison Naoya Horiguchi
@ 2021-06-14  2:12 ` Naoya Horiguchi
  2021-06-15 12:57   ` Ding Hui
  2021-06-14  2:12 ` [PATCH v1 3/6] mm/hwpoison: introduce MF_MSG_PAGETABLE Naoya Horiguchi
                   ` (3 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Naoya Horiguchi @ 2021-06-14  2:12 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Michal Hocko,
	Ding Hui, Tony Luck, Aneesh Kumar K.V, Naoya Horiguchi,
	linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Now memory_failure() and unpoison_memory() are protected by mf_mutex,
so no need to explicitly check races between them.  So remove them.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 37 -------------------------------------
 1 file changed, 37 deletions(-)

diff --git v5.13-rc5/mm/memory-failure.c v5.13-rc5_patched/mm/memory-failure.c
index 280eb6d6dd15..e7910386fc9c 100644
--- v5.13-rc5/mm/memory-failure.c
+++ v5.13-rc5_patched/mm/memory-failure.c
@@ -1461,14 +1461,6 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	lock_page(head);
 	page_flags = head->flags;
 
-	if (!PageHWPoison(head)) {
-		pr_err("Memory failure: %#lx: just unpoisoned\n", pfn);
-		num_poisoned_pages_dec();
-		unlock_page(head);
-		put_page(head);
-		return 0;
-	}
-
 	/*
 	 * TODO: hwpoison for pud-sized hugetlb doesn't work right now, so
 	 * simply disable it. In order to make it work properly, we need
@@ -1730,16 +1722,6 @@ int memory_failure(unsigned long pfn, int flags)
 	 */
 	page_flags = p->flags;
 
-	/*
-	 * unpoison always clear PG_hwpoison inside page lock
-	 */
-	if (!PageHWPoison(p)) {
-		pr_err("Memory failure: %#lx: just unpoisoned\n", pfn);
-		num_poisoned_pages_dec();
-		unlock_page(p);
-		put_page(p);
-		goto unlock_mutex;
-	}
 	if (hwpoison_filter(p)) {
 		if (TestClearPageHWPoison(p))
 			num_poisoned_pages_dec();
@@ -1956,17 +1938,6 @@ int unpoison_memory(unsigned long pfn)
 		goto unlock_mutex;
 	}
 
-	/*
-	 * unpoison_memory() can encounter thp only when the thp is being
-	 * worked by memory_failure() and the page lock is not held yet.
-	 * In such case, we yield to memory_failure() and make unpoison fail.
-	 */
-	if (!PageHuge(page) && PageTransHuge(page)) {
-		unpoison_pr_info("Unpoison: Memory failure is now running on %#lx\n",
-				 pfn, &unpoison_rs);
-		goto unlock_mutex;
-	}
-
 	if (!get_hwpoison_page(p, flags)) {
 		if (TestClearPageHWPoison(p))
 			num_poisoned_pages_dec();
@@ -1975,20 +1946,12 @@ int unpoison_memory(unsigned long pfn)
 		goto unlock_mutex;
 	}
 
-	lock_page(page);
-	/*
-	 * This test is racy because PG_hwpoison is set outside of page lock.
-	 * That's acceptable because that won't trigger kernel panic. Instead,
-	 * the PG_hwpoison page will be caught and isolated on the entrance to
-	 * the free buddy page pool.
-	 */
 	if (TestClearPageHWPoison(page)) {
 		unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
 				 pfn, &unpoison_rs);
 		num_poisoned_pages_dec();
 		freeit = 1;
 	}
-	unlock_page(page);
 
 	put_page(page);
 	if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1))
-- 
2.25.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v1 3/6] mm/hwpoison: introduce MF_MSG_PAGETABLE
  2021-06-14  2:12 [PATCH v1 0/6] mm/hwpoison: fix unpoison_memory() Naoya Horiguchi
  2021-06-14  2:12 ` [PATCH v1 1/6] mm/hwpoison: mf_mutex for soft offline and unpoison Naoya Horiguchi
  2021-06-14  2:12 ` [PATCH v1 2/6] mm/hwpoison: remove race consideration Naoya Horiguchi
@ 2021-06-14  2:12 ` Naoya Horiguchi
  2021-06-14  3:06   ` Matthew Wilcox
  2021-06-14  2:12 ` [PATCH v1 4/6] mm/hwpoison: remove MF_MSG_BUDDY_2ND and MF_MSG_POISONED_HUGE Naoya Horiguchi
                   ` (2 subsequent siblings)
  5 siblings, 1 reply; 22+ messages in thread
From: Naoya Horiguchi @ 2021-06-14  2:12 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Michal Hocko,
	Ding Hui, Tony Luck, Aneesh Kumar K.V, Naoya Horiguchi,
	linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Page table pages could have some amount of share on system memory,
so define it for a separate page type message.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/mm.h      | 1 +
 include/ras/ras_event.h | 1 +
 mm/memory-failure.c     | 1 +
 3 files changed, 3 insertions(+)

diff --git v5.13-rc5/include/linux/mm.h v5.13-rc5_patched/include/linux/mm.h
index c274f75efcf9..45008654f695 100644
--- v5.13-rc5/include/linux/mm.h
+++ v5.13-rc5_patched/include/linux/mm.h
@@ -3103,6 +3103,7 @@ enum mf_action_page_type {
 	MF_MSG_KERNEL,
 	MF_MSG_KERNEL_HIGH_ORDER,
 	MF_MSG_SLAB,
+	MF_MSG_PAGETABLE,
 	MF_MSG_DIFFERENT_COMPOUND,
 	MF_MSG_POISONED_HUGE,
 	MF_MSG_HUGE,
diff --git v5.13-rc5/include/ras/ras_event.h v5.13-rc5_patched/include/ras/ras_event.h
index 0bdbc0d17d2f..2f459f6f87fb 100644
--- v5.13-rc5/include/ras/ras_event.h
+++ v5.13-rc5_patched/include/ras/ras_event.h
@@ -357,6 +357,7 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_KERNEL, "reserved kernel page" )			\
 	EM ( MF_MSG_KERNEL_HIGH_ORDER, "high-order kernel page" )	\
 	EM ( MF_MSG_SLAB, "kernel slab page" )				\
+	EM ( MF_MSG_PAGETABLE, "page table page page" )			\
 	EM ( MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking" ) \
 	EM ( MF_MSG_POISONED_HUGE, "huge page already hardware poisoned" )	\
 	EM ( MF_MSG_HUGE, "huge page" )					\
diff --git v5.13-rc5/mm/memory-failure.c v5.13-rc5_patched/mm/memory-failure.c
index e7910386fc9c..30d6519ce203 100644
--- v5.13-rc5/mm/memory-failure.c
+++ v5.13-rc5_patched/mm/memory-failure.c
@@ -708,6 +708,7 @@ static const char * const action_page_types[] = {
 	[MF_MSG_KERNEL]			= "reserved kernel page",
 	[MF_MSG_KERNEL_HIGH_ORDER]	= "high-order kernel page",
 	[MF_MSG_SLAB]			= "kernel slab page",
+	[MF_MSG_PAGETABLE]		= "page table page",
 	[MF_MSG_DIFFERENT_COMPOUND]	= "different compound page after locking",
 	[MF_MSG_POISONED_HUGE]		= "huge page already hardware poisoned",
 	[MF_MSG_HUGE]			= "huge page",
-- 
2.25.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v1 4/6] mm/hwpoison: remove MF_MSG_BUDDY_2ND and MF_MSG_POISONED_HUGE
  2021-06-14  2:12 [PATCH v1 0/6] mm/hwpoison: fix unpoison_memory() Naoya Horiguchi
                   ` (2 preceding siblings ...)
  2021-06-14  2:12 ` [PATCH v1 3/6] mm/hwpoison: introduce MF_MSG_PAGETABLE Naoya Horiguchi
@ 2021-06-14  2:12 ` Naoya Horiguchi
  2021-06-14  2:12 ` [PATCH v1 5/6] mm/hwpoison: make some kernel pages handlable Naoya Horiguchi
  2021-06-14  2:12 ` [PATCH v1 6/6] mm/hwpoison: fix unpoison_memory() Naoya Horiguchi
  5 siblings, 0 replies; 22+ messages in thread
From: Naoya Horiguchi @ 2021-06-14  2:12 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Michal Hocko,
	Ding Hui, Tony Luck, Aneesh Kumar K.V, Naoya Horiguchi,
	linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

These action_page_types are no longer used, so remove them.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/mm.h      | 2 --
 include/ras/ras_event.h | 2 --
 mm/memory-failure.c     | 2 --
 3 files changed, 6 deletions(-)

diff --git v5.13-rc5/include/linux/mm.h v5.13-rc5_patched/include/linux/mm.h
index 45008654f695..f1e3b82e1a93 100644
--- v5.13-rc5/include/linux/mm.h
+++ v5.13-rc5_patched/include/linux/mm.h
@@ -3105,7 +3105,6 @@ enum mf_action_page_type {
 	MF_MSG_SLAB,
 	MF_MSG_PAGETABLE,
 	MF_MSG_DIFFERENT_COMPOUND,
-	MF_MSG_POISONED_HUGE,
 	MF_MSG_HUGE,
 	MF_MSG_FREE_HUGE,
 	MF_MSG_NON_PMD_HUGE,
@@ -3120,7 +3119,6 @@ enum mf_action_page_type {
 	MF_MSG_CLEAN_LRU,
 	MF_MSG_TRUNCATED_LRU,
 	MF_MSG_BUDDY,
-	MF_MSG_BUDDY_2ND,
 	MF_MSG_DAX,
 	MF_MSG_UNSPLIT_THP,
 	MF_MSG_UNKNOWN,
diff --git v5.13-rc5/include/ras/ras_event.h v5.13-rc5_patched/include/ras/ras_event.h
index 2f459f6f87fb..23306428f5e6 100644
--- v5.13-rc5/include/ras/ras_event.h
+++ v5.13-rc5_patched/include/ras/ras_event.h
@@ -359,7 +359,6 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_SLAB, "kernel slab page" )				\
 	EM ( MF_MSG_PAGETABLE, "page table page page" )			\
 	EM ( MF_MSG_DIFFERENT_COMPOUND, "different compound page after locking" ) \
-	EM ( MF_MSG_POISONED_HUGE, "huge page already hardware poisoned" )	\
 	EM ( MF_MSG_HUGE, "huge page" )					\
 	EM ( MF_MSG_FREE_HUGE, "free huge page" )			\
 	EM ( MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page" )		\
@@ -374,7 +373,6 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_CLEAN_LRU, "clean LRU page" )			\
 	EM ( MF_MSG_TRUNCATED_LRU, "already truncated LRU page" )	\
 	EM ( MF_MSG_BUDDY, "free buddy page" )				\
-	EM ( MF_MSG_BUDDY_2ND, "free buddy page (2nd try)" )		\
 	EM ( MF_MSG_DAX, "dax page" )					\
 	EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )			\
 	EMe ( MF_MSG_UNKNOWN, "unknown page" )
diff --git v5.13-rc5/mm/memory-failure.c v5.13-rc5_patched/mm/memory-failure.c
index 30d6519ce203..b986936e50eb 100644
--- v5.13-rc5/mm/memory-failure.c
+++ v5.13-rc5_patched/mm/memory-failure.c
@@ -710,7 +710,6 @@ static const char * const action_page_types[] = {
 	[MF_MSG_SLAB]			= "kernel slab page",
 	[MF_MSG_PAGETABLE]		= "page table page",
 	[MF_MSG_DIFFERENT_COMPOUND]	= "different compound page after locking",
-	[MF_MSG_POISONED_HUGE]		= "huge page already hardware poisoned",
 	[MF_MSG_HUGE]			= "huge page",
 	[MF_MSG_FREE_HUGE]		= "free huge page",
 	[MF_MSG_NON_PMD_HUGE]		= "non-pmd-sized huge page",
@@ -725,7 +724,6 @@ static const char * const action_page_types[] = {
 	[MF_MSG_CLEAN_LRU]		= "clean LRU page",
 	[MF_MSG_TRUNCATED_LRU]		= "already truncated LRU page",
 	[MF_MSG_BUDDY]			= "free buddy page",
-	[MF_MSG_BUDDY_2ND]		= "free buddy page (2nd try)",
 	[MF_MSG_DAX]			= "dax page",
 	[MF_MSG_UNSPLIT_THP]		= "unsplit thp",
 	[MF_MSG_UNKNOWN]		= "unknown page",
-- 
2.25.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v1 5/6] mm/hwpoison: make some kernel pages handlable
  2021-06-14  2:12 [PATCH v1 0/6] mm/hwpoison: fix unpoison_memory() Naoya Horiguchi
                   ` (3 preceding siblings ...)
  2021-06-14  2:12 ` [PATCH v1 4/6] mm/hwpoison: remove MF_MSG_BUDDY_2ND and MF_MSG_POISONED_HUGE Naoya Horiguchi
@ 2021-06-14  2:12 ` Naoya Horiguchi
  2021-07-28 10:59   ` Ding Hui
  2021-06-14  2:12 ` [PATCH v1 6/6] mm/hwpoison: fix unpoison_memory() Naoya Horiguchi
  5 siblings, 1 reply; 22+ messages in thread
From: Naoya Horiguchi @ 2021-06-14  2:12 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Michal Hocko,
	Ding Hui, Tony Luck, Aneesh Kumar K.V, Naoya Horiguchi,
	linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

HWPoisonHandlable() introduced by patch "mm,hwpoison: fix race with hugetlb
page allocation" filters error events by page type, and only limited events
reach get_page_unless_zero() to avoid race.

Actually this is too restictive because get_hwpoison_page always fails
to take refcount for any types of kernel page, leading to
MF_MSG_KERNEL_HIGH_ORDER.  This is not critical (no panic), but less
informative than MF_MSG_SLAB or MF_MSG_PAGETABLE, so extend
HWPoisonHandlable() to some basic types of kernel pages (slab, pgtable,
and reserved pages).

The "handling" for these types are still primitive (just taking refcount
and setting PG_hwpoison) and some more aggressive actions for memory
error containment are possible and wanted.  But compared to the older code,
these cases never enter the code block of page locks (note that
page locks is not well-defined on these pages), so it's a little safer
for functions intended for user pages not to be called for kernel pages.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 28 ++++++++++++++++++++--------
 1 file changed, 20 insertions(+), 8 deletions(-)

diff --git v5.13-rc5/mm/memory-failure.c v5.13-rc5_patched/mm/memory-failure.c
index b986936e50eb..0d51067f0129 100644
--- v5.13-rc5/mm/memory-failure.c
+++ v5.13-rc5_patched/mm/memory-failure.c
@@ -1113,7 +1113,8 @@ static int page_action(struct page_state *ps, struct page *p,
  */
 static inline bool HWPoisonHandlable(struct page *page)
 {
-	return PageLRU(page) || __PageMovable(page);
+	return PageLRU(page) || __PageMovable(page) ||
+		PageSlab(page) || PageTable(page) || PageReserved(page);
 }
 
 static int __get_hwpoison_page(struct page *page)
@@ -1260,12 +1261,6 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	struct page *hpage = *hpagep;
 	bool mlocked = PageMlocked(hpage);
 
-	/*
-	 * Here we are interested only in user-mapped pages, so skip any
-	 * other types of pages.
-	 */
-	if (PageReserved(p) || PageSlab(p))
-		return true;
 	if (!(PageLRU(hpage) || PageHuge(p)))
 		return true;
 
@@ -1670,7 +1665,10 @@ int memory_failure(unsigned long pfn, int flags)
 				action_result(pfn, MF_MSG_BUDDY, res);
 				res = res == MF_RECOVERED ? 0 : -EBUSY;
 			} else {
-				action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
+				if (PageCompound(p))
+					action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
+				else
+					action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
 				res = -EBUSY;
 			}
 			goto unlock_mutex;
@@ -1681,6 +1679,20 @@ int memory_failure(unsigned long pfn, int flags)
 		}
 	}
 
+	if (PageSlab(p)) {
+		action_result(pfn, MF_MSG_SLAB, MF_IGNORED);
+		res = -EBUSY;
+		goto unlock_mutex;
+	} else if (PageTable(p)) {
+		action_result(pfn, MF_MSG_PAGETABLE, MF_IGNORED);
+		res = -EBUSY;
+		goto unlock_mutex;
+	} else if (PageReserved(p)) {
+		action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
+		res = -EBUSY;
+		goto unlock_mutex;
+	}
+
 	if (PageTransHuge(hpage)) {
 		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
 			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
-- 
2.25.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* [PATCH v1 6/6] mm/hwpoison: fix unpoison_memory()
  2021-06-14  2:12 [PATCH v1 0/6] mm/hwpoison: fix unpoison_memory() Naoya Horiguchi
                   ` (4 preceding siblings ...)
  2021-06-14  2:12 ` [PATCH v1 5/6] mm/hwpoison: make some kernel pages handlable Naoya Horiguchi
@ 2021-06-14  2:12 ` Naoya Horiguchi
  2021-06-17 10:00   ` Ding Hui
  5 siblings, 1 reply; 22+ messages in thread
From: Naoya Horiguchi @ 2021-06-14  2:12 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Michal Hocko,
	Ding Hui, Tony Luck, Aneesh Kumar K.V, Naoya Horiguchi,
	linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

After recent soft-offline rework, error pages can be taken off from
buddy allocator, but the existing unpoison_memory() does not properly
undo the operation.  Moreover, due to the recent change on
__get_hwpoison_page(), get_page_unless_zero() is hardly called for
hwpoisoned pages.  So __get_hwpoison_page() mostly returns zero (meaning
to fail to grab page refcount) and unpoison just clears PG_hwpoison
without releasing a refcount.  That does not lead to a critical issue
like kernel panic, but unpoisoned pages never get back to buddy (leaked
permanently), which is not good.

To fix this, we need to identify "taken off" pages from other types of
hwpoisoned pages.  We can't use refcount or page flags for this purpose,
so a pseudo flag is defined by hacking ->private field.

Sometimes hwpoisoned pages can be still in-use, where the refcount should
be more than 1, so we can't unpoison them immediately and need to wait
until the all users release their refcount.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/mm.h         |  1 +
 include/linux/page-flags.h |  4 ++
 mm/memory-failure.c        | 84 ++++++++++++++++++++++++++++----------
 mm/page_alloc.c            | 19 +++++++++
 4 files changed, 86 insertions(+), 22 deletions(-)

diff --git v5.13-rc5/include/linux/mm.h v5.13-rc5_patched/include/linux/mm.h
index f1e3b82e1a93..0baf3fc97415 100644
--- v5.13-rc5/include/linux/mm.h
+++ v5.13-rc5_patched/include/linux/mm.h
@@ -3077,6 +3077,7 @@ enum mf_flags {
 	MF_ACTION_REQUIRED = 1 << 1,
 	MF_MUST_KILL = 1 << 2,
 	MF_SOFT_OFFLINE = 1 << 3,
+	MF_UNPOISON = 1 << 4,
 };
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
diff --git v5.13-rc5/include/linux/page-flags.h v5.13-rc5_patched/include/linux/page-flags.h
index 04a34c08e0a6..00bddb3a058d 100644
--- v5.13-rc5/include/linux/page-flags.h
+++ v5.13-rc5_patched/include/linux/page-flags.h
@@ -430,7 +430,11 @@ PAGEFLAG_FALSE(Uncached)
 PAGEFLAG(HWPoison, hwpoison, PF_ANY)
 TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
+#define MAGIC_HWPOISON	0x4857504f49534f4e
+extern void SetPageHWPoisonTakenOff(struct page *page);
+extern void ClearPageHWPoisonTakenOff(struct page *page);
 extern bool take_page_off_buddy(struct page *page);
+extern void take_page_back_buddy(struct page *page);
 #else
 PAGEFLAG_FALSE(HWPoison)
 #define __PG_HWPOISON 0
diff --git v5.13-rc5/mm/memory-failure.c v5.13-rc5_patched/mm/memory-failure.c
index 0d51067f0129..41b0ef96e2aa 100644
--- v5.13-rc5/mm/memory-failure.c
+++ v5.13-rc5_patched/mm/memory-failure.c
@@ -1105,6 +1105,22 @@ static int page_action(struct page_state *ps, struct page *p,
 	return (result == MF_RECOVERED || result == MF_DELAYED) ? 0 : -EBUSY;
 }
 
+static inline bool PageHWPoisonTakenOff(struct page *page)
+{
+	return PageHWPoison(page) && page_private(page) == MAGIC_HWPOISON;
+}
+
+void SetPageHWPoisonTakenOff(struct page *page)
+{
+	set_page_private(page, MAGIC_HWPOISON);
+}
+
+void ClearPageHWPoisonTakenOff(struct page *page)
+{
+	if (PageHWPoison(page))
+		set_page_private(page, 0);
+}
+
 /*
  * Return true if a page type of a given page is supported by hwpoison
  * mechanism (while handling could fail), otherwise false.  This function
@@ -1117,7 +1133,7 @@ static inline bool HWPoisonHandlable(struct page *page)
 		PageSlab(page) || PageTable(page) || PageReserved(page);
 }
 
-static int __get_hwpoison_page(struct page *page)
+static int __get_hwpoison_page(struct page *page, unsigned long flags)
 {
 	struct page *head = compound_head(page);
 	int ret = 0;
@@ -1127,12 +1143,19 @@ static int __get_hwpoison_page(struct page *page)
 	if (hugetlb)
 		return ret;
 
+	/*
+	 * Finding taken-off pages, so it could be true only when called
+	 * from unpoison_memory().
+	 */
+	if (PageHWPoisonTakenOff(head))
+		return -EHWPOISON;
+
 	/*
 	 * This check prevents from calling get_hwpoison_unless_zero()
 	 * for any unsupported type of page in order to reduce the risk of
 	 * unexpected races caused by taking a page refcount.
 	 */
-	if (!HWPoisonHandlable(head))
+	if (!(flags & MF_UNPOISON) && !HWPoisonHandlable(head))
 		return 0;
 
 	if (PageTransHuge(head)) {
@@ -1171,7 +1194,7 @@ static int get_any_page(struct page *p, unsigned long flags)
 
 try_again:
 	if (!count_increased) {
-		ret = __get_hwpoison_page(p);
+		ret = __get_hwpoison_page(p, flags);
 		if (!ret) {
 			if (page_count(p)) {
 				/* We raced with an allocation, retry. */
@@ -1190,6 +1213,8 @@ static int get_any_page(struct page *p, unsigned long flags)
 			if (pass++ < 3)
 				goto try_again;
 			goto out;
+		} else if (ret == -EHWPOISON) {
+			goto out;
 		}
 	}
 
@@ -1233,14 +1258,18 @@ static int get_any_page(struct page *p, unsigned long flags)
  *         1 on success for in-use pages in a well-defined state,
  *         -EIO for pages on which we can not handle memory errors,
  *         -EBUSY when get_hwpoison_page() has raced with page lifecycle
- *         operations like allocation and free.
+ *         operations like allocation and free,
+ *         -EHWPOISON when the page is hwpoisoned and taken off from buddy.
  */
 static int get_hwpoison_page(struct page *p, unsigned long flags)
 {
 	int ret;
 
 	zone_pcp_disable(page_zone(p));
-	ret = get_any_page(p, flags);
+	if (flags & MF_UNPOISON)
+		ret = __get_hwpoison_page(p, flags);
+	else
+		ret = get_any_page(p, flags);
 	zone_pcp_enable(page_zone(p));
 
 	return ret;
@@ -1895,6 +1924,17 @@ core_initcall(memory_failure_init);
 		pr_info(fmt, pfn);			\
 })
 
+static inline int clear_page_hwpoison(struct ratelimit_state *rs, struct page *p)
+{
+	if (TestClearPageHWPoison(p)) {
+		unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
+				 page_to_pfn(p), rs);
+		num_poisoned_pages_dec();
+		return 1;
+	}
+	return 0;
+}
+
 /**
  * unpoison_memory - Unpoison a previously poisoned page
  * @pfn: Page number of the to be unpoisoned page
@@ -1911,9 +1951,7 @@ int unpoison_memory(unsigned long pfn)
 {
 	struct page *page;
 	struct page *p;
-	int freeit = 0;
 	int ret = 0;
-	unsigned long flags = 0;
 	static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
 					DEFAULT_RATELIMIT_BURST);
 
@@ -1949,24 +1987,26 @@ int unpoison_memory(unsigned long pfn)
 		goto unlock_mutex;
 	}
 
-	if (!get_hwpoison_page(p, flags)) {
-		if (TestClearPageHWPoison(p))
-			num_poisoned_pages_dec();
-		unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
-				 pfn, &unpoison_rs);
+	ret = get_hwpoison_page(p, MF_UNPOISON);
+	if (!ret) {
+		clear_page_hwpoison(&unpoison_rs, p);
 		goto unlock_mutex;
-	}
-
-	if (TestClearPageHWPoison(page)) {
-		unpoison_pr_info("Unpoison: Software-unpoisoned page %#lx\n",
-				 pfn, &unpoison_rs);
-		num_poisoned_pages_dec();
-		freeit = 1;
-	}
+	} else if (ret < 0) {
+		if (ret == -EHWPOISON) {
+			clear_page_hwpoison(&unpoison_rs, p);
+			take_page_back_buddy(p);
+			ret = 0;
+		} else
+			unpoison_pr_info("Unpoison: failed to grab page %#lx\n",
+					 pfn, &unpoison_rs);
+		goto unlock_mutex;
+	} else {
+		int freeit = clear_page_hwpoison(&unpoison_rs, p);
 
-	put_page(page);
-	if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1))
 		put_page(page);
+		if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1))
+			put_page(page);
+	}
 
 unlock_mutex:
 	mutex_unlock(&mf_mutex);
diff --git v5.13-rc5/mm/page_alloc.c v5.13-rc5_patched/mm/page_alloc.c
index d1f5de1c1283..325e91d92b7e 100644
--- v5.13-rc5/mm/page_alloc.c
+++ v5.13-rc5_patched/mm/page_alloc.c
@@ -9158,6 +9158,7 @@ bool take_page_off_buddy(struct page *page)
 			del_page_from_free_list(page_head, zone, page_order);
 			break_down_buddy_pages(zone, page_head, page, 0,
 						page_order, migratetype);
+			SetPageHWPoisonTakenOff(page);
 			if (!is_migrate_isolate(migratetype))
 				__mod_zone_freepage_state(zone, -1, migratetype);
 			ret = true;
@@ -9169,4 +9170,22 @@ bool take_page_off_buddy(struct page *page)
 	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
 }
+
+/*
+ * Cancel takeoff done by take_page_off_buddy().
+ */
+void take_page_back_buddy(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long flags;
+	int migratetype = get_pfnblock_migratetype(page, pfn);
+
+	spin_lock_irqsave(&zone->lock, flags);
+	if (put_page_testzero(page)) {
+		ClearPageHWPoisonTakenOff(page);
+		__free_one_page(page, pfn, zone, 0, migratetype, FPI_NONE);
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+}
 #endif
-- 
2.25.1



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 3/6] mm/hwpoison: introduce MF_MSG_PAGETABLE
  2021-06-14  2:12 ` [PATCH v1 3/6] mm/hwpoison: introduce MF_MSG_PAGETABLE Naoya Horiguchi
@ 2021-06-14  3:06   ` Matthew Wilcox
  2021-06-14  3:55     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 22+ messages in thread
From: Matthew Wilcox @ 2021-06-14  3:06 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, David Hildenbrand, Oscar Salvador,
	Michal Hocko, Ding Hui, Tony Luck, Aneesh Kumar K.V,
	Naoya Horiguchi, linux-kernel

On Mon, Jun 14, 2021 at 11:12:09AM +0900, Naoya Horiguchi wrote:
> +++ v5.13-rc5_patched/include/ras/ras_event.h
> @@ -357,6 +357,7 @@ TRACE_EVENT(aer_event,
>  	EM ( MF_MSG_KERNEL, "reserved kernel page" )			\
>  	EM ( MF_MSG_KERNEL_HIGH_ORDER, "high-order kernel page" )	\
>  	EM ( MF_MSG_SLAB, "kernel slab page" )				\
> +	EM ( MF_MSG_PAGETABLE, "page table page page" )			\

Did you intend to double the word "page"?

> +++ v5.13-rc5_patched/mm/memory-failure.c
> @@ -708,6 +708,7 @@ static const char * const action_page_types[] = {
>  	[MF_MSG_KERNEL]			= "reserved kernel page",
>  	[MF_MSG_KERNEL_HIGH_ORDER]	= "high-order kernel page",
>  	[MF_MSG_SLAB]			= "kernel slab page",
> +	[MF_MSG_PAGETABLE]		= "page table page",

... you didn't here.


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 3/6] mm/hwpoison: introduce MF_MSG_PAGETABLE
  2021-06-14  3:06   ` Matthew Wilcox
@ 2021-06-14  3:55     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 22+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-06-14  3:55 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Michal Hocko, Ding Hui, Tony Luck,
	Aneesh Kumar K.V, linux-kernel

On Mon, Jun 14, 2021 at 04:06:32AM +0100, Matthew Wilcox wrote:
> On Mon, Jun 14, 2021 at 11:12:09AM +0900, Naoya Horiguchi wrote:
> > +++ v5.13-rc5_patched/include/ras/ras_event.h
> > @@ -357,6 +357,7 @@ TRACE_EVENT(aer_event,
> >  	EM ( MF_MSG_KERNEL, "reserved kernel page" )			\
> >  	EM ( MF_MSG_KERNEL_HIGH_ORDER, "high-order kernel page" )	\
> >  	EM ( MF_MSG_SLAB, "kernel slab page" )				\
> > +	EM ( MF_MSG_PAGETABLE, "page table page page" )			\
> 
> Did you intend to double the word "page"?

Oh, this is a typo, I'll fix this.

> 
> > +++ v5.13-rc5_patched/mm/memory-failure.c
> > @@ -708,6 +708,7 @@ static const char * const action_page_types[] = {
> >  	[MF_MSG_KERNEL]			= "reserved kernel page",
> >  	[MF_MSG_KERNEL_HIGH_ORDER]	= "high-order kernel page",
> >  	[MF_MSG_SLAB]			= "kernel slab page",
> > +	[MF_MSG_PAGETABLE]		= "page table page",
> 
> ... you didn't here.

Thank you!

- Naoya Horiguchi

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 1/6] mm/hwpoison: mf_mutex for soft offline and unpoison
  2021-06-14  2:12 ` [PATCH v1 1/6] mm/hwpoison: mf_mutex for soft offline and unpoison Naoya Horiguchi
@ 2021-06-15 11:41   ` Ding Hui
  2021-06-15 11:55     ` HORIGUCHI NAOYA(堀口 直也)
  2021-06-15 12:42   ` Miaohe Lin
  1 sibling, 1 reply; 22+ messages in thread
From: Ding Hui @ 2021-06-15 11:41 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Michal Hocko,
	Tony Luck, Aneesh Kumar K.V, Naoya Horiguchi, linux-kernel

On 2021/6/14 10:12, Naoya Horiguchi wrote:

>   
> @@ -2171,6 +2177,8 @@ int soft_offline_page(unsigned long pfn, int flags)
>   		return -EIO;
>   	}
>   
> +	mutex_lock(&mf_mutex);
> +
>   	if (PageHWPoison(page)) {
>   		pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
>   		put_ref_page(ref_page);

Did you miss mutex_unlock() here when page already poisoned ?

> @@ -2194,5 +2202,7 @@ int soft_offline_page(unsigned long pfn, int flags)
>   			 __func__, pfn, page->flags, &page->flags);
>   	}
>   
> +	mutex_unlock(&mf_mutex);
> +
>   	return ret;
>   }
> 

--
Thanks,
- Ding Hui


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 1/6] mm/hwpoison: mf_mutex for soft offline and unpoison
  2021-06-15 11:41   ` Ding Hui
@ 2021-06-15 11:55     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 22+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-06-15 11:55 UTC (permalink / raw)
  To: Ding Hui
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Michal Hocko, Tony Luck, Aneesh Kumar K.V,
	linux-kernel

On Tue, Jun 15, 2021 at 07:41:34PM +0800, Ding Hui wrote:
> On 2021/6/14 10:12, Naoya Horiguchi wrote:
> 
> > @@ -2171,6 +2177,8 @@ int soft_offline_page(unsigned long pfn, int flags)
> >   		return -EIO;
> >   	}
> > +	mutex_lock(&mf_mutex);
> > +
> >   	if (PageHWPoison(page)) {
> >   		pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
> >   		put_ref_page(ref_page);
> 
> Did you miss mutex_unlock() here when page already poisoned ?

Thanks, you're right. I'll fix it.

- Naoya

> 
> > @@ -2194,5 +2202,7 @@ int soft_offline_page(unsigned long pfn, int flags)
> >   			 __func__, pfn, page->flags, &page->flags);
> >   	}
> > +	mutex_unlock(&mf_mutex);
> > +
> >   	return ret;
> >   }
> > 
> 
> --
> Thanks,
> - Ding Hui
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 1/6] mm/hwpoison: mf_mutex for soft offline and unpoison
  2021-06-14  2:12 ` [PATCH v1 1/6] mm/hwpoison: mf_mutex for soft offline and unpoison Naoya Horiguchi
  2021-06-15 11:41   ` Ding Hui
@ 2021-06-15 12:42   ` Miaohe Lin
  2021-06-16  0:41     ` HORIGUCHI NAOYA(堀口 直也)
  1 sibling, 1 reply; 22+ messages in thread
From: Miaohe Lin @ 2021-06-15 12:42 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Michal Hocko,
	Ding Hui, Tony Luck, Aneesh Kumar K.V, Naoya Horiguchi,
	linux-kernel, Linux-MM

Hi:
On 2021/6/14 10:12, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> Originally mf_mutex is introduced to serialize multiple MCE events, but
> it's also helpful to exclude races among  soft_offline_page() and
> unpoison_memory().  So apply mf_mutex to them.
> 

When I was investigating the memory-failure code, I realized these possible races too.
It's very kind of you to fix these races! Many thanks!

> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  mm/memory-failure.c | 26 ++++++++++++++++++--------
>  1 file changed, 18 insertions(+), 8 deletions(-)
> 
> diff --git v5.13-rc5/mm/memory-failure.c v5.13-rc5_patched/mm/memory-failure.c
> index ae30fd6d575a..280eb6d6dd15 100644
> --- v5.13-rc5/mm/memory-failure.c
> +++ v5.13-rc5_patched/mm/memory-failure.c
> @@ -1583,6 +1583,8 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	return rc;
>  }
>  
> +static DEFINE_MUTEX(mf_mutex);
> +
>  /**
>   * memory_failure - Handle memory failure of a page.
>   * @pfn: Page Number of the corrupted page
> @@ -1609,7 +1611,6 @@ int memory_failure(unsigned long pfn, int flags)
>  	int res = 0;
>  	unsigned long page_flags;
>  	bool retry = true;
> -	static DEFINE_MUTEX(mf_mutex);
>  
>  	if (!sysctl_memory_failure_recovery)
>  		panic("Memory failure on page %lx", pfn);
> @@ -1918,6 +1919,7 @@ int unpoison_memory(unsigned long pfn)
>  	struct page *page;
>  	struct page *p;
>  	int freeit = 0;
> +	int ret = 0;
>  	unsigned long flags = 0;
>  	static DEFINE_RATELIMIT_STATE(unpoison_rs, DEFAULT_RATELIMIT_INTERVAL,
>  					DEFAULT_RATELIMIT_BURST);
> @@ -1928,28 +1930,30 @@ int unpoison_memory(unsigned long pfn)
>  	p = pfn_to_page(pfn);
>  	page = compound_head(p);
>  
> +	mutex_lock(&mf_mutex);
> +
>  	if (!PageHWPoison(p)) {
>  		unpoison_pr_info("Unpoison: Page was already unpoisoned %#lx\n",
>  				 pfn, &unpoison_rs);
> -		return 0;
> +		goto unlock_mutex;
>  	}
>  
>  	if (page_count(page) > 1) {
>  		unpoison_pr_info("Unpoison: Someone grabs the hwpoison page %#lx\n",
>  				 pfn, &unpoison_rs);
> -		return 0;
> +		goto unlock_mutex;
>  	}
>  
>  	if (page_mapped(page)) {
>  		unpoison_pr_info("Unpoison: Someone maps the hwpoison page %#lx\n",
>  				 pfn, &unpoison_rs);
> -		return 0;
> +		goto unlock_mutex;
>  	}
>  
>  	if (page_mapping(page)) {
>  		unpoison_pr_info("Unpoison: the hwpoison page has non-NULL mapping %#lx\n",
>  				 pfn, &unpoison_rs);
> -		return 0;
> +		goto unlock_mutex;
>  	}
>  
>  	/*
> @@ -1960,7 +1964,7 @@ int unpoison_memory(unsigned long pfn)
>  	if (!PageHuge(page) && PageTransHuge(page)) {
>  		unpoison_pr_info("Unpoison: Memory failure is now running on %#lx\n",
>  				 pfn, &unpoison_rs);
> -		return 0;
> +		goto unlock_mutex;
>  	}
>  

Maybe it's more appropriate to start mutex_lock(&mf_mutex) here? I think these races start here.

>  	if (!get_hwpoison_page(p, flags)) {
> @@ -1968,7 +1972,7 @@ int unpoison_memory(unsigned long pfn)
>  			num_poisoned_pages_dec();
>  		unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
>  				 pfn, &unpoison_rs);
> -		return 0;
> +		goto unlock_mutex;
>  	}
>  
>  	lock_page(page);
> @@ -1990,7 +1994,9 @@ int unpoison_memory(unsigned long pfn)
>  	if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1))
>  		put_page(page);
>  
> -	return 0;
> +unlock_mutex:
> +	mutex_unlock(&mf_mutex);
> +	return ret;
>  }
>  EXPORT_SYMBOL(unpoison_memory);
>  
> @@ -2171,6 +2177,8 @@ int soft_offline_page(unsigned long pfn, int flags)
>  		return -EIO;
>  	}
>  
> +	mutex_lock(&mf_mutex);
> +
>  	if (PageHWPoison(page)) {
>  		pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
>  		put_ref_page(ref_page);
> @@ -2194,5 +2202,7 @@ int soft_offline_page(unsigned long pfn, int flags)
>  			 __func__, pfn, page->flags, &page->flags);
>  	}
>  
> +	mutex_unlock(&mf_mutex);
> +
>  	return ret;
>  }
>


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 2/6] mm/hwpoison: remove race consideration
  2021-06-14  2:12 ` [PATCH v1 2/6] mm/hwpoison: remove race consideration Naoya Horiguchi
@ 2021-06-15 12:57   ` Ding Hui
  2021-06-16  0:11     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 22+ messages in thread
From: Ding Hui @ 2021-06-15 12:57 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Michal Hocko,
	Tony Luck, Aneesh Kumar K.V, Naoya Horiguchi, linux-kernel

On 2021/6/14 10:12, Naoya Horiguchi wrote:
> @@ -1956,17 +1938,6 @@ int unpoison_memory(unsigned long pfn)
>   		goto unlock_mutex;
>   	}
>   
> -	/*
> -	 * unpoison_memory() can encounter thp only when the thp is being
> -	 * worked by memory_failure() and the page lock is not held yet.
> -	 * In such case, we yield to memory_failure() and make unpoison fail.
> -	 */
> -	if (!PageHuge(page) && PageTransHuge(page)) {
> -		unpoison_pr_info("Unpoison: Memory failure is now running on %#lx\n",
> -				 pfn, &unpoison_rs);
> -		goto unlock_mutex;
> -	}
> -

if a huge page is in process of alloc or free, HUGETLB_PAGE_DTOR can be 
set after __SetPageHead() or be cleared before __ClearPageHead(), so 
this condition may be true in racy.

Do we need the racy test for this situation?

>   	if (!get_hwpoison_page(p, flags)) {
>   		if (TestClearPageHWPoison(p))
>   			num_poisoned_pages_dec();
-- 
Thanks,
- Ding Hui


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 2/6] mm/hwpoison: remove race consideration
  2021-06-15 12:57   ` Ding Hui
@ 2021-06-16  0:11     ` HORIGUCHI NAOYA(堀口 直也)
  2021-06-16  0:40       ` Ding Hui
  0 siblings, 1 reply; 22+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-06-16  0:11 UTC (permalink / raw)
  To: Ding Hui
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Michal Hocko, Tony Luck, Aneesh Kumar K.V,
	linux-kernel

On Tue, Jun 15, 2021 at 08:57:06PM +0800, Ding Hui wrote:
> On 2021/6/14 10:12, Naoya Horiguchi wrote:
> > @@ -1956,17 +1938,6 @@ int unpoison_memory(unsigned long pfn)
> >   		goto unlock_mutex;
> >   	}
> > -	/*
> > -	 * unpoison_memory() can encounter thp only when the thp is being
> > -	 * worked by memory_failure() and the page lock is not held yet.
> > -	 * In such case, we yield to memory_failure() and make unpoison fail.
> > -	 */
> > -	if (!PageHuge(page) && PageTransHuge(page)) {
> > -		unpoison_pr_info("Unpoison: Memory failure is now running on %#lx\n",
> > -				 pfn, &unpoison_rs);
> > -		goto unlock_mutex;
> > -	}
> > -
> 
> if a huge page is in process of alloc or free, HUGETLB_PAGE_DTOR can be set
> after __SetPageHead() or be cleared before __ClearPageHead(), so this
> condition may be true in racy.

Hi Ding,

We confirm PageHWPoison() before reaching this if-block and hwpoisoned pages
are prohibited from allocation, so it seems to me that this check never
races with hugetlb allocation.

And according to the original patch introduced this if-block (0cea3fdc416d:
"mm/hwpoison: fix race against poison thp"), this if-block intended to close
the race between memory_failure() and unpoison_memory(), so that's no longer
necessary due to mf_mutex.

> Do we need the racy test for this situation?

I'm not sure, but I think that we need more stress/fuzz testing focusing on
this subsystem, and "unpoison vs allocation" race can be covered in the topic.

Thank you,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 2/6] mm/hwpoison: remove race consideration
  2021-06-16  0:11     ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-06-16  0:40       ` Ding Hui
  0 siblings, 0 replies; 22+ messages in thread
From: Ding Hui @ 2021-06-16  0:40 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Michal Hocko, Tony Luck, Aneesh Kumar K.V,
	linux-kernel

On 2021/6/16 8:11, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Jun 15, 2021 at 08:57:06PM +0800, Ding Hui wrote:
>> On 2021/6/14 10:12, Naoya Horiguchi wrote:
>>> @@ -1956,17 +1938,6 @@ int unpoison_memory(unsigned long pfn)
>>>    		goto unlock_mutex;
>>>    	}
>>> -	/*
>>> -	 * unpoison_memory() can encounter thp only when the thp is being
>>> -	 * worked by memory_failure() and the page lock is not held yet.
>>> -	 * In such case, we yield to memory_failure() and make unpoison fail.
>>> -	 */
>>> -	if (!PageHuge(page) && PageTransHuge(page)) {
>>> -		unpoison_pr_info("Unpoison: Memory failure is now running on %#lx\n",
>>> -				 pfn, &unpoison_rs);
>>> -		goto unlock_mutex;
>>> -	}
>>> -
>>
>> if a huge page is in process of alloc or free, HUGETLB_PAGE_DTOR can be set
>> after __SetPageHead() or be cleared before __ClearPageHead(), so this
>> condition may be true in racy.
> 
> Hi Ding,
> 
> We confirm PageHWPoison() before reaching this if-block and hwpoisoned pages
> are prohibited from allocation, so it seems to me that this check never
> races with hugetlb allocation.
> 
> And according to the original patch introduced this if-block (0cea3fdc416d:
> "mm/hwpoison: fix race against poison thp"), this if-block intended to close
> the race between memory_failure() and unpoison_memory(), so that's no longer
> necessary due to mf_mutex.
> 

I got it and thanks for your explanation.

>> Do we need the racy test for this situation?
> 
> I'm not sure, but I think that we need more stress/fuzz testing focusing on
> this subsystem, and "unpoison vs allocation" race can be covered in the topic.
> 
> Thank you,
> Naoya Horiguchi
> 


-- 
Thanks,
- Ding Hui


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 1/6] mm/hwpoison: mf_mutex for soft offline and unpoison
  2021-06-15 12:42   ` Miaohe Lin
@ 2021-06-16  0:41     ` HORIGUCHI NAOYA(堀口 直也)
  2021-06-16  3:14       ` Miaohe Lin
  0 siblings, 1 reply; 22+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-06-16  0:41 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Naoya Horiguchi, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Michal Hocko, Ding Hui, Tony Luck,
	Aneesh Kumar K.V, linux-kernel, Linux-MM

On Tue, Jun 15, 2021 at 08:42:23PM +0800, Miaohe Lin wrote:
...
> > @@ -1960,7 +1964,7 @@ int unpoison_memory(unsigned long pfn)
> >  	if (!PageHuge(page) && PageTransHuge(page)) {
> >  		unpoison_pr_info("Unpoison: Memory failure is now running on %#lx\n",
> >  				 pfn, &unpoison_rs);
> > -		return 0;
> > +		goto unlock_mutex;
> >  	}
> >  
> 
> Maybe it's more appropriate to start mutex_lock(&mf_mutex) here? I think these races start here.

Hi Miaohe,

Thank your for the review.

Consider that we put mutex_lock() here, and let's think about two concurrent
calls of unpoison_memory(), then these events could be processed like below:

    CPU 0                             CPU 1
    unpoison_memory
    check PageHWPoison // true
                                      unpoison_memory
                                      check PageHWPoison  // true
    mutex_lock
    get_hwpoison_page
    TestClearPageHWPoison
    put_page
    put_page // freed
    mutex_unlock

    // the unpoisoned page can be used for allocation

                                      mutex_lock
                                      get_hwpoison_page // succeeds
                                      ... // unpoison the !PageHWPoison page !?


So I thought that we had better do the prechecks in mf_mutex.  Maybe the 2nd
unpoison_memory() just get and put the page refcount by 1 even in this race,
so the impact is not so big, but I feel like avoiding "unpoison the
!PageHWPoison page" situation.

Does it make sense for you?

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 1/6] mm/hwpoison: mf_mutex for soft offline and unpoison
  2021-06-16  0:41     ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-06-16  3:14       ` Miaohe Lin
  0 siblings, 0 replies; 22+ messages in thread
From: Miaohe Lin @ 2021-06-16  3:14 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Michal Hocko, Ding Hui, Tony Luck,
	Aneesh Kumar K.V, linux-kernel, Linux-MM

On 2021/6/16 8:41, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Tue, Jun 15, 2021 at 08:42:23PM +0800, Miaohe Lin wrote:
> ...
>>> @@ -1960,7 +1964,7 @@ int unpoison_memory(unsigned long pfn)
>>>  	if (!PageHuge(page) && PageTransHuge(page)) {
>>>  		unpoison_pr_info("Unpoison: Memory failure is now running on %#lx\n",
>>>  				 pfn, &unpoison_rs);
>>> -		return 0;
>>> +		goto unlock_mutex;
>>>  	}
>>>  
>>
>> Maybe it's more appropriate to start mutex_lock(&mf_mutex) here? I think these races start here.
> 
> Hi Miaohe,
> 
> Thank your for the review.
> > Consider that we put mutex_lock() here, and let's think about two concurrent
> calls of unpoison_memory(), then these events could be processed like below:
> 
>     CPU 0                             CPU 1
>     unpoison_memory
>     check PageHWPoison // true
>                                       unpoison_memory
>                                       check PageHWPoison  // true
>     mutex_lock
>     get_hwpoison_page
>     TestClearPageHWPoison
>     put_page
>     put_page // freed
>     mutex_unlock
> 
>     // the unpoisoned page can be used for allocation
> 
>                                       mutex_lock
>                                       get_hwpoison_page // succeeds
>                                       ... // unpoison the !PageHWPoison page !?
> 
> 
> So I thought that we had better do the prechecks in mf_mutex.  Maybe the 2nd
> unpoison_memory() just get and put the page refcount by 1 even in this race,
> so the impact is not so big, but I feel like avoiding "unpoison the
> !PageHWPoison page" situation.
> 
> Does it make sense for you?

Yes, I missed the races inside unpoison_memory() itself.
Many thanks for your detailed explanation!

> 
> Thanks,
> Naoya Horiguchi
> 



^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 6/6] mm/hwpoison: fix unpoison_memory()
  2021-06-14  2:12 ` [PATCH v1 6/6] mm/hwpoison: fix unpoison_memory() Naoya Horiguchi
@ 2021-06-17 10:00   ` Ding Hui
  2021-06-18  8:36     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 22+ messages in thread
From: Ding Hui @ 2021-06-17 10:00 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Michal Hocko,
	Tony Luck, Aneesh Kumar K.V, Naoya Horiguchi, linux-kernel

On 2021/6/14 10:12, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> After recent soft-offline rework, error pages can be taken off from
> buddy allocator, but the existing unpoison_memory() does not properly
> undo the operation.  Moreover, due to the recent change on
> __get_hwpoison_page(), get_page_unless_zero() is hardly called for
> hwpoisoned pages.  So __get_hwpoison_page() mostly returns zero (meaning
> to fail to grab page refcount) and unpoison just clears PG_hwpoison
> without releasing a refcount.  That does not lead to a critical issue
> like kernel panic, but unpoisoned pages never get back to buddy (leaked
> permanently), which is not good.

As I mention in [1], I'm not sure about the exactly meaning of "broken" 
in unpoison_memory().

Maybe the misunderstanding is:

I think __get_hwpoison_page() mostly returns one for hwpoisoned page.
In 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages"), 
page_handle_poison() is introduced, it will add refcount for all 
soft-offlineed hwpoison page.
In memory_failure() for hard-offline,page_ref_inc() called on free page 
too, and for used page, we do not call put_page() after 
get_hwpoison_page() != 0.
So all hwpoisoned page refcount must be great than zero when 
unpoison_memory() if regardless of racy.

Recently I tested loop soft-offline random pages and unpoison them for 
days, it works fine to me. (with bac9c6fa1f92 patched)

[1]: 
https://lore.kernel.org/lkml/6af291a0-41fa-8112-5297-6a4cdf2337b6@sangfor.com.cn/

> 
> To fix this, we need to identify "taken off" pages from other types of
> hwpoisoned pages.  We can't use refcount or page flags for this purpose,
> so a pseudo flag is defined by hacking ->private field.
> 
> Sometimes hwpoisoned pages can be still in-use, where the refcount should
> be more than 1, so we can't unpoison them immediately and need to wait
> until the all users release their refcount.
> 
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---


-- 
Thanks,
- Ding Hui


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 6/6] mm/hwpoison: fix unpoison_memory()
  2021-06-17 10:00   ` Ding Hui
@ 2021-06-18  8:36     ` HORIGUCHI NAOYA(堀口 直也)
  2021-06-19 12:22       ` Ding Hui
  0 siblings, 1 reply; 22+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-06-18  8:36 UTC (permalink / raw)
  To: Ding Hui
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Michal Hocko, Tony Luck, Aneesh Kumar K.V,
	linux-kernel

On Thu, Jun 17, 2021 at 06:00:21PM +0800, Ding Hui wrote:
> On 2021/6/14 10:12, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > After recent soft-offline rework, error pages can be taken off from
> > buddy allocator, but the existing unpoison_memory() does not properly
> > undo the operation.  Moreover, due to the recent change on
> > __get_hwpoison_page(), get_page_unless_zero() is hardly called for
> > hwpoisoned pages.  So __get_hwpoison_page() mostly returns zero (meaning
> > to fail to grab page refcount) and unpoison just clears PG_hwpoison
> > without releasing a refcount.  That does not lead to a critical issue
> > like kernel panic, but unpoisoned pages never get back to buddy (leaked
> > permanently), which is not good.
> 
> As I mention in [1], I'm not sure about the exactly meaning of "broken" in
> unpoison_memory().
> 
> Maybe the misunderstanding is:
> 
> I think __get_hwpoison_page() mostly returns one for hwpoisoned page.
> In 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages"),
> page_handle_poison() is introduced, it will add refcount for all
> soft-offlineed hwpoison page.
> In memory_failure() for hard-offline,page_ref_inc() called on free page
> too, and for used page, we do not call put_page() after get_hwpoison_page()
> != 0.
> So all hwpoisoned page refcount must be great than zero when
> unpoison_memory() if regardless of racy.

Hi, Ding,

Thanks for the comment.  I feel that I failed to define the exact issue in
unpoison.  Maybe I saw and misinterpreted some random error as unpoison's
issue during developing other hwpoison patches, so please don't take serious
my previous wrong word "broken", sorry about that.

Anyway I reconsider how to handle this 6/6, maybe it will be a clear
description of the problem, and will be simplified.

> 
> Recently I tested loop soft-offline random pages and unpoison them for days,
> it works fine to me. (with bac9c6fa1f92 patched)

Thank you for testing,

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 6/6] mm/hwpoison: fix unpoison_memory()
  2021-06-18  8:36     ` HORIGUCHI NAOYA(堀口 直也)
@ 2021-06-19 12:22       ` Ding Hui
  0 siblings, 0 replies; 22+ messages in thread
From: Ding Hui @ 2021-06-19 12:22 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Michal Hocko, Tony Luck, Aneesh Kumar K.V,
	linux-kernel

On 2021/6/18 4:36 下午, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, Jun 17, 2021 at 06:00:21PM +0800, Ding Hui wrote:
>> On 2021/6/14 10:12, Naoya Horiguchi wrote:
>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>>
>>> After recent soft-offline rework, error pages can be taken off from
>>> buddy allocator, but the existing unpoison_memory() does not properly
>>> undo the operation.  Moreover, due to the recent change on
>>> __get_hwpoison_page(), get_page_unless_zero() is hardly called for
>>> hwpoisoned pages.  So __get_hwpoison_page() mostly returns zero (meaning
>>> to fail to grab page refcount) and unpoison just clears PG_hwpoison
>>> without releasing a refcount.  That does not lead to a critical issue
>>> like kernel panic, but unpoisoned pages never get back to buddy (leaked
>>> permanently), which is not good.
>>
>> As I mention in [1], I'm not sure about the exactly meaning of "broken" in
>> unpoison_memory().
>>
>> Maybe the misunderstanding is:
>>
>> I think __get_hwpoison_page() mostly returns one for hwpoisoned page.
>> In 06be6ff3d2ec ("mm,hwpoison: rework soft offline for free pages"),
>> page_handle_poison() is introduced, it will add refcount for all
>> soft-offlineed hwpoison page.
>> In memory_failure() for hard-offline,page_ref_inc() called on free page
>> too, and for used page, we do not call put_page() after get_hwpoison_page()
>> != 0.
>> So all hwpoisoned page refcount must be great than zero when
>> unpoison_memory() if regardless of racy.
> 
> Hi, Ding,
> 
> Thanks for the comment.  I feel that I failed to define the exact issue in
> unpoison.  Maybe I saw and misinterpreted some random error as unpoison's
> issue during developing other hwpoison patches, so please don't take serious
> my previous wrong word "broken", sorry about that.
> 
> Anyway I reconsider how to handle this 6/6, maybe it will be a clear
> description of the problem, and will be simplified.
> 
>>
>> Recently I tested loop soft-offline random pages and unpoison them for days,
>> it works fine to me. (with bac9c6fa1f92 patched)
> 
> Thank you for testing,
> 

Hi Naoya,

I'm afraid of my description about testing is ambiguous for others, let 
me clarify that I ran stress soft-offline test case from mce-test 
project (https://git.kernel.org/pub/scm/utils/cpu/mce/mce-test.git) for 
days to verify my modify about NR_FREE_PAGES (bac9c6fa1f92), without 
your current patchset, the case is loop soft-offline random pages and 
unpoison them, and it works basic fine to me.

-- 
Thanks,
-dinghui


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 5/6] mm/hwpoison: make some kernel pages handlable
  2021-06-14  2:12 ` [PATCH v1 5/6] mm/hwpoison: make some kernel pages handlable Naoya Horiguchi
@ 2021-07-28 10:59   ` Ding Hui
  2021-07-29  6:54     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 22+ messages in thread
From: Ding Hui @ 2021-07-28 10:59 UTC (permalink / raw)
  To: Naoya Horiguchi, mike.kravetz
  Cc: Andrew Morton, David Hildenbrand, Oscar Salvador, Michal Hocko,
	Tony Luck, Aneesh Kumar K.V, Naoya Horiguchi, linux-kernel,
	linux-mm, huangcun

On 2021/6/14 10:12, Naoya Horiguchi wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> HWPoisonHandlable() introduced by patch "mm,hwpoison: fix race with hugetlb
> page allocation" filters error events by page type, and only limited events
> reach get_page_unless_zero() to avoid race >

I want to report a bug which has relationship with "mm,hwpoison: fix 
race with hugetlb page allocation", hugetlb pmd shared and also this patch.

Recently, when test hugetlb and soft offline, I encountered a crash like 
this:
[449901.638605] huge_test[16596]: segfault at 8 ip 00007f5f64c39a12 sp 
00007fff2105c020 error 4 in ld-2.23.so[7f5f64c2a000+26000]
[449901.638612] Code: 48 8d 35 2c 03 01 00 48 8d 3d 31 03 01 00 ba b5 00 
00 00 e8 f0 a5 00 00 53 49 89 fa 89 f6 48 8d 14 76 48 83 ec 10 48 8b 47 
68 <48> 8b 78 08 49 8b 82 f8 00 00 00 48 8b 40 08 4c 8d 04 d0 49 8b 42
[449901.638885] BUG: Bad rss-counter state mm:00000000a1ce68ac idx:0 val:358
[449901.638894] ------------[ cut here ]------------
[449901.638962] BUG: Bad rss-counter state mm:00000000a1ce68ac idx:1 val:26
[449901.638966] BUG: non-zero pgtables_bytes on freeing mm: 28672
[449901.639045] kernel BUG at fs/hugetlbfs/inode.c:443!
[449901.639193] invalid opcode: 0000 [#1] SMP NOPTI

After a few days of digging and reproduce, it turns out that there is a 
mechanism conflict between the get_hwpoison_page() and hugetlb pmd share:

In huge_pmd_unshare(), the page_count is used to determine whether the 
page is shared, it is not safe.

My case is the same page's refcount was increaseed by 
get_hwpoison_page() little before if (page_count(virt_to_page(ptep)) == 
1) in huge_pmd_unshare(), so huge_pmd_unshare() went to wrong branch.

> Actually this is too restictive because get_hwpoison_page always fails
> to take refcount for any types of kernel page, leading to
> MF_MSG_KERNEL_HIGH_ORDER.  This is not critical (no panic), but less
> informative than MF_MSG_SLAB or MF_MSG_PAGETABLE, so extend
> HWPoisonHandlable() to some basic types of kernel pages (slab, pgtable,
> and reserved pages).
> 

After "mm,hwpoison: fix race with hugetlb page allocation",the 
PageTable(page) is blocked to get_page_unless_zero() due to 
"restictive", this bug is just fixed by side effect.

> The "handling" for these types are still primitive (just taking refcount
> and setting PG_hwpoison) and some more aggressive actions for memory
> error containment are possible and wanted.  But compared to the older code,
> these cases never enter the code block of page locks (note that
> page locks is not well-defined on these pages), so it's a little safer
> for functions intended for user pages not to be called for kernel pages.
> 

But the root cause is still existed, the bug may come back at any time 
by unconsciously, like this patch, if the PageTable(page) is allowed to 
get_page_unless_zero(), the risk is come back.

I'm not sure is there any other way to determine whether the pmd page is 
shared, so I add Mike Kravetz here, and report the risk to you.

> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>   mm/memory-failure.c | 28 ++++++++++++++++++++--------
>   1 file changed, 20 insertions(+), 8 deletions(-)
> 
> diff --git v5.13-rc5/mm/memory-failure.c v5.13-rc5_patched/mm/memory-failure.c
> index b986936e50eb..0d51067f0129 100644
> --- v5.13-rc5/mm/memory-failure.c
> +++ v5.13-rc5_patched/mm/memory-failure.c
> @@ -1113,7 +1113,8 @@ static int page_action(struct page_state *ps, struct page *p,
>    */
>   static inline bool HWPoisonHandlable(struct page *page)
>   {
> -	return PageLRU(page) || __PageMovable(page);
> +	return PageLRU(page) || __PageMovable(page) ||
> +		PageSlab(page) || PageTable(page) || PageReserved(page);
>   }
>    >   static int __get_hwpoison_page(struct page *page)
> @@ -1260,12 +1261,6 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
>   	struct page *hpage = *hpagep;
>   	bool mlocked = PageMlocked(hpage);
>   
> -	/*
> -	 * Here we are interested only in user-mapped pages, so skip any
> -	 * other types of pages.
> -	 */
> -	if (PageReserved(p) || PageSlab(p))
> -		return true;
>   	if (!(PageLRU(hpage) || PageHuge(p)))
>   		return true;
>   
> @@ -1670,7 +1665,10 @@ int memory_failure(unsigned long pfn, int flags)
>   				action_result(pfn, MF_MSG_BUDDY, res);
>   				res = res == MF_RECOVERED ? 0 : -EBUSY;
>   			} else {
> -				action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
> +				if (PageCompound(p))
> +					action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
> +				else
> +					action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
>   				res = -EBUSY;
>   			}
>   			goto unlock_mutex;
> @@ -1681,6 +1679,20 @@ int memory_failure(unsigned long pfn, int flags)
>   		}
>   	}
>   
> +	if (PageSlab(p)) {
> +		action_result(pfn, MF_MSG_SLAB, MF_IGNORED);
> +		res = -EBUSY;
> +		goto unlock_mutex;
> +	} else if (PageTable(p)) {
> +		action_result(pfn, MF_MSG_PAGETABLE, MF_IGNORED);
> +		res = -EBUSY;
> +		goto unlock_mutex;
> +	} else if (PageReserved(p)) {
> +		action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> +		res = -EBUSY;
> +		goto unlock_mutex;
> +	}
> +
>   	if (PageTransHuge(hpage)) {
>   		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
>   			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> 


-- 
Thanks,
- Ding Hui


^ permalink raw reply	[flat|nested] 22+ messages in thread

* Re: [PATCH v1 5/6] mm/hwpoison: make some kernel pages handlable
  2021-07-28 10:59   ` Ding Hui
@ 2021-07-29  6:54     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 22+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-07-29  6:54 UTC (permalink / raw)
  To: Ding Hui
  Cc: Naoya Horiguchi, mike.kravetz, Andrew Morton, David Hildenbrand,
	Oscar Salvador, Michal Hocko, Tony Luck, Aneesh Kumar K.V,
	linux-kernel, linux-mm, huangcun

On Wed, Jul 28, 2021 at 06:59:37PM +0800, Ding Hui wrote:
> On 2021/6/14 10:12, Naoya Horiguchi wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > HWPoisonHandlable() introduced by patch "mm,hwpoison: fix race with hugetlb
> > page allocation" filters error events by page type, and only limited events
> > reach get_page_unless_zero() to avoid race >
> 
> I want to report a bug which has relationship with "mm,hwpoison: fix race
> with hugetlb page allocation", hugetlb pmd shared and also this patch.
> 
> Recently, when test hugetlb and soft offline, I encountered a crash like
> this:
> [449901.638605] huge_test[16596]: segfault at 8 ip 00007f5f64c39a12 sp
> 00007fff2105c020 error 4 in ld-2.23.so[7f5f64c2a000+26000]
> [449901.638612] Code: 48 8d 35 2c 03 01 00 48 8d 3d 31 03 01 00 ba b5 00 00
> 00 e8 f0 a5 00 00 53 49 89 fa 89 f6 48 8d 14 76 48 83 ec 10 48 8b 47 68 <48>
> 8b 78 08 49 8b 82 f8 00 00 00 48 8b 40 08 4c 8d 04 d0 49 8b 42
> [449901.638885] BUG: Bad rss-counter state mm:00000000a1ce68ac idx:0 val:358
> [449901.638894] ------------[ cut here ]------------
> [449901.638962] BUG: Bad rss-counter state mm:00000000a1ce68ac idx:1 val:26
> [449901.638966] BUG: non-zero pgtables_bytes on freeing mm: 28672
> [449901.639045] kernel BUG at fs/hugetlbfs/inode.c:443!
> [449901.639193] invalid opcode: 0000 [#1] SMP NOPTI
> 
> After a few days of digging and reproduce, it turns out that there is a
> mechanism conflict between the get_hwpoison_page() and hugetlb pmd share:

Thank you for testing and reporting.

> 
> In huge_pmd_unshare(), the page_count is used to determine whether the page
> is shared, it is not safe.
> 
> My case is the same page's refcount was increaseed by get_hwpoison_page()
> little before if (page_count(virt_to_page(ptep)) == 1) in
> huge_pmd_unshare(), so huge_pmd_unshare() went to wrong branch.
> 
> 
> > Actually this is too restictive because get_hwpoison_page always fails
> > to take refcount for any types of kernel page, leading to
> > MF_MSG_KERNEL_HIGH_ORDER.  This is not critical (no panic), but less
> > informative than MF_MSG_SLAB or MF_MSG_PAGETABLE, so extend
> > HWPoisonHandlable() to some basic types of kernel pages (slab, pgtable,
> > and reserved pages).
> > 
> 
> After "mm,hwpoison: fix race with hugetlb page allocation",the
> PageTable(page) is blocked to get_page_unless_zero() due to "restictive",
> this bug is just fixed by side effect.

So to keep this fixed, this patch shouldn't be merged until the
root cause is solved.

> 
> > The "handling" for these types are still primitive (just taking refcount
> > and setting PG_hwpoison) and some more aggressive actions for memory
> > error containment are possible and wanted.  But compared to the older code,
> > these cases never enter the code block of page locks (note that
> > page locks is not well-defined on these pages), so it's a little safer
> > for functions intended for user pages not to be called for kernel pages.
> > 
> 
> But the root cause is still existed, the bug may come back at any time by
> unconsciously, like this patch, if the PageTable(page) is allowed to
> get_page_unless_zero(), the risk is come back.
> 
> I'm not sure is there any other way to determine whether the pmd page is
> shared, so I add Mike Kravetz here, and report the risk to you.

I think that the sharedess can be maintained in another counter in struct
page using _mapcount field or private field, but the benefit might not large
enough for the effort for now.  I think that if PageTable page can be
migrated (I don't think it can be now), soft-offline saves us from errors
on PageTable pages, so the effort will get more worth doing.

Anyway I'll separate out "extending to support PageTable pages" part
from this patch.

Thanks,
Naoya Horiguchi

> 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > ---
> >   mm/memory-failure.c | 28 ++++++++++++++++++++--------
> >   1 file changed, 20 insertions(+), 8 deletions(-)
> > 
> > diff --git v5.13-rc5/mm/memory-failure.c v5.13-rc5_patched/mm/memory-failure.c
> > index b986936e50eb..0d51067f0129 100644
> > --- v5.13-rc5/mm/memory-failure.c
> > +++ v5.13-rc5_patched/mm/memory-failure.c
> > @@ -1113,7 +1113,8 @@ static int page_action(struct page_state *ps, struct page *p,
> >    */
> >   static inline bool HWPoisonHandlable(struct page *page)
> >   {
> > -	return PageLRU(page) || __PageMovable(page);
> > +	return PageLRU(page) || __PageMovable(page) ||
> > +		PageSlab(page) || PageTable(page) || PageReserved(page);
> >   }
> >    >   static int __get_hwpoison_page(struct page *page)
> > @@ -1260,12 +1261,6 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
> >   	struct page *hpage = *hpagep;
> >   	bool mlocked = PageMlocked(hpage);
> > -	/*
> > -	 * Here we are interested only in user-mapped pages, so skip any
> > -	 * other types of pages.
> > -	 */
> > -	if (PageReserved(p) || PageSlab(p))
> > -		return true;
> >   	if (!(PageLRU(hpage) || PageHuge(p)))
> >   		return true;
> > @@ -1670,7 +1665,10 @@ int memory_failure(unsigned long pfn, int flags)
> >   				action_result(pfn, MF_MSG_BUDDY, res);
> >   				res = res == MF_RECOVERED ? 0 : -EBUSY;
> >   			} else {
> > -				action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
> > +				if (PageCompound(p))
> > +					action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
> > +				else
> > +					action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> >   				res = -EBUSY;
> >   			}
> >   			goto unlock_mutex;
> > @@ -1681,6 +1679,20 @@ int memory_failure(unsigned long pfn, int flags)
> >   		}
> >   	}
> > +	if (PageSlab(p)) {
> > +		action_result(pfn, MF_MSG_SLAB, MF_IGNORED);
> > +		res = -EBUSY;
> > +		goto unlock_mutex;
> > +	} else if (PageTable(p)) {
> > +		action_result(pfn, MF_MSG_PAGETABLE, MF_IGNORED);
> > +		res = -EBUSY;
> > +		goto unlock_mutex;
> > +	} else if (PageReserved(p)) {
> > +		action_result(pfn, MF_MSG_KERNEL, MF_IGNORED);
> > +		res = -EBUSY;
> > +		goto unlock_mutex;
> > +	}
> > +
> >   	if (PageTransHuge(hpage)) {
> >   		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
> >   			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> > 
> 
> 
> -- 
> Thanks,
> - Ding Hui
> 

^ permalink raw reply	[flat|nested] 22+ messages in thread

end of thread, other threads:[~2021-07-29  6:54 UTC | newest]

Thread overview: 22+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-06-14  2:12 [PATCH v1 0/6] mm/hwpoison: fix unpoison_memory() Naoya Horiguchi
2021-06-14  2:12 ` [PATCH v1 1/6] mm/hwpoison: mf_mutex for soft offline and unpoison Naoya Horiguchi
2021-06-15 11:41   ` Ding Hui
2021-06-15 11:55     ` HORIGUCHI NAOYA(堀口 直也)
2021-06-15 12:42   ` Miaohe Lin
2021-06-16  0:41     ` HORIGUCHI NAOYA(堀口 直也)
2021-06-16  3:14       ` Miaohe Lin
2021-06-14  2:12 ` [PATCH v1 2/6] mm/hwpoison: remove race consideration Naoya Horiguchi
2021-06-15 12:57   ` Ding Hui
2021-06-16  0:11     ` HORIGUCHI NAOYA(堀口 直也)
2021-06-16  0:40       ` Ding Hui
2021-06-14  2:12 ` [PATCH v1 3/6] mm/hwpoison: introduce MF_MSG_PAGETABLE Naoya Horiguchi
2021-06-14  3:06   ` Matthew Wilcox
2021-06-14  3:55     ` HORIGUCHI NAOYA(堀口 直也)
2021-06-14  2:12 ` [PATCH v1 4/6] mm/hwpoison: remove MF_MSG_BUDDY_2ND and MF_MSG_POISONED_HUGE Naoya Horiguchi
2021-06-14  2:12 ` [PATCH v1 5/6] mm/hwpoison: make some kernel pages handlable Naoya Horiguchi
2021-07-28 10:59   ` Ding Hui
2021-07-29  6:54     ` HORIGUCHI NAOYA(堀口 直也)
2021-06-14  2:12 ` [PATCH v1 6/6] mm/hwpoison: fix unpoison_memory() Naoya Horiguchi
2021-06-17 10:00   ` Ding Hui
2021-06-18  8:36     ` HORIGUCHI NAOYA(堀口 直也)
2021-06-19 12:22       ` Ding Hui

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).