linux-kernel.vger.kernel.org archive mirror
* [PATCH v5 00/16] HWPOISON: soft offline rework
@ 2020-07-31 12:20 nao.horiguchi
  2020-07-31 12:20 ` [PATCH v5 01/16] mm,hwpoison: cleanup unused PageHuge() check nao.horiguchi
                   ` (17 more replies)
  0 siblings, 18 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:20 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

This patchset is the latest version of the soft offline rework patchset,
targeted for v5.9.

The main focus of this series is to stabilize soft offline.  Historically, soft
offlined pages have suffered from race conditions because PageHWPoison is
used a little too aggressively, which (directly or indirectly) invades
other mm code that cares little about hwpoison.  This results in unexpected
behavior or kernel panics, which is very far from soft offline's "do not
disturb userspace or other kernel components" policy.

The main point of this change set is to contain the target page "via the buddy
allocator": we first free the target page as we do for normal pages, and remove
it from buddy only when we confirm that it has reached the free list. There is
still a race window with page allocation, but that's fine: if someone really
wants that page and the page is still working, soft offline can happily give up.
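
The two-step containment described above can be modelled in a few lines of
userspace C. This is only a toy sketch under stated assumptions: the types and
function names (toy_page, toy_free_page, toy_take_off_free_list, and so on)
are made up for illustration and are not the kernel's data structures or the
functions this series adds.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy page states; illustrative names only, not kernel API. */
enum toy_state { TOY_IN_USE, TOY_FREE, TOY_OFFLINED };

struct toy_page {
	enum toy_state state;
	bool hwpoison;
};

/* Step 1: release the page the same way a healthy page is released. */
static void toy_free_page(struct toy_page *p)
{
	assert(p->state == TOY_IN_USE);
	p->state = TOY_FREE;
}

/* Models a concurrent allocation that wins the race for the page. */
static bool toy_alloc_page(struct toy_page *p)
{
	if (p->state != TOY_FREE)
		return false;
	p->state = TOY_IN_USE;
	return true;
}

/*
 * Step 2: take the page off the free list and mark it poisoned, but only
 * if it is still free. Returns true when the page was contained.
 */
static bool toy_take_off_free_list(struct toy_page *p)
{
	if (p->state != TOY_FREE)
		return false;	/* someone reallocated it: give up */
	p->state = TOY_OFFLINED;
	p->hwpoison = true;
	return true;
}
```

If toy_alloc_page() runs between the two steps, toy_take_off_free_list()
returns false and the page is never marked poisoned, which corresponds to the
"give up" outcome described above.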

v4 from Oscar tries to handle the race around reallocation, but that part
still seems to be work in progress, so I decided to split it off so that the
rest of the changes can go into v5.9.  Thank you for your contribution, Oscar.

The issue reported by Qian Cai is fixed by patch 16/16.

This patchset is based on v5.8-rc7-mmotm-2020-07-27-18-18, but I applied
this series after reverting the previous version.
https://github.com/Naoya-Horiguchi/linux/commits/soft-offline-rework.v5
probably shows what I did more precisely.

Any other comments, suggestions, or help would be appreciated.

Thanks,
Naoya Horiguchi
---
Previous versions:
  v1: https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
  v2: https://lore.kernel.org/linux-mm/20191017142123.24245-1-osalvador@suse.de/
  v3: https://lore.kernel.org/linux-mm/20200624150137.7052-1-nao.horiguchi@gmail.com/
  v4: https://lore.kernel.org/linux-mm/20200716123810.25292-1-osalvador@suse.de/
---
Summary:

Naoya Horiguchi (8):
      mm,hwpoison: cleanup unused PageHuge() check
      mm, hwpoison: remove recalculating hpage
      mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
      mm,hwpoison-inject: don't pin for hwpoison_filter
      mm,hwpoison: remove MF_COUNT_INCREASED
      mm,hwpoison: remove flag argument from soft offline functions
      mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
      mm,hwpoison: double-check page count in __get_any_page()

Oscar Salvador (8):
      mm,madvise: Refactor madvise_inject_error
      mm,hwpoison: Un-export get_hwpoison_page and make it static
      mm,hwpoison: Kill put_hwpoison_page
      mm,hwpoison: Unify THP handling for hard and soft offline
      mm,hwpoison: Rework soft offline for free pages
      mm,hwpoison: Rework soft offline for in-use pages
      mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
      mm,hwpoison: Return 0 if the page is already poisoned in soft-offline

 drivers/base/memory.c      |   2 +-
 include/linux/mm.h         |  12 +-
 include/linux/page-flags.h |   6 +-
 include/ras/ras_event.h    |   3 +
 mm/hwpoison-inject.c       |  18 +--
 mm/madvise.c               |  39 +++---
 mm/memory-failure.c        | 334 ++++++++++++++++++++-------------------------
 mm/migrate.c               |  11 +-
 mm/page_alloc.c            |  60 ++++++--
 9 files changed, 233 insertions(+), 252 deletions(-)

^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH v5 01/16] mm,hwpoison: cleanup unused PageHuge() check
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
@ 2020-07-31 12:20 ` nao.horiguchi
  2020-07-31 12:20 ` [PATCH v5 02/16] mm, hwpoison: remove recalculating hpage nao.horiguchi
                   ` (16 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:20 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Drop the PageHuge check, which is dead code since memory_failure() forks
into memory_failure_hugetlb() for hugetlb pages.

memory_failure() and memory_failure_hugetlb() share some functions like
hwpoison_user_mappings() and identify_page_state(), so they should properly
handle 4kB pages, thp, and hugetlb.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/memory-failure.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index fe53768e0793..3d2d61f1c6e9 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -1382,10 +1382,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * page_remove_rmap() in try_to_unmap_one(). So to determine page status
 	 * correctly, we save a copy of the page flags at this time.
 	 */
-	if (PageHuge(p))
-		page_flags = hpage->flags;
-	else
-		page_flags = p->flags;
+	page_flags = p->flags;
 
 	/*
 	 * unpoison always clear PG_hwpoison inside page lock
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 02/16] mm, hwpoison: remove recalculating hpage
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
  2020-07-31 12:20 ` [PATCH v5 01/16] mm,hwpoison: cleanup unused PageHuge() check nao.horiguchi
@ 2020-07-31 12:20 ` nao.horiguchi
  2020-07-31 12:20 ` [PATCH v5 03/16] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED nao.horiguchi
                   ` (15 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:20 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

hpage is never used after try_to_split_thp_page() in memory_failure(),
so we don't have to update hpage.  So let's not recalculate/use hpage.

Suggested-by: "Aneesh Kumar K.V" <aneesh.kumar@linux.ibm.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/memory-failure.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index 3d2d61f1c6e9..f8d200417e0f 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -1342,7 +1342,6 @@ int memory_failure(unsigned long pfn, int flags)
 		}
 		unlock_page(p);
 		VM_BUG_ON_PAGE(!page_count(p), p);
-		hpage = compound_head(p);
 	}
 
 	/*
@@ -1414,11 +1413,8 @@ int memory_failure(unsigned long pfn, int flags)
 	/*
 	 * Now take care of user space mappings.
 	 * Abort on fail: __delete_from_page_cache() assumes unmapped page.
-	 *
-	 * When the raw error page is thp tail page, hpage points to the raw
-	 * page after thp split.
 	 */
-	if (!hwpoison_user_mappings(p, pfn, flags, &hpage)) {
+	if (!hwpoison_user_mappings(p, pfn, flags, &p)) {
 		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
 		res = -EBUSY;
 		goto out;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 03/16] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
  2020-07-31 12:20 ` [PATCH v5 01/16] mm,hwpoison: cleanup unused PageHuge() check nao.horiguchi
  2020-07-31 12:20 ` [PATCH v5 02/16] mm, hwpoison: remove recalculating hpage nao.horiguchi
@ 2020-07-31 12:20 ` nao.horiguchi
  2020-07-31 12:21 ` [PATCH v5 04/16] mm,madvise: Refactor madvise_inject_error nao.horiguchi
                   ` (14 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:20 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

The call to get_user_pages_fast() only serves to get the pointer to the
struct page for a given address.  Pinning the page is the memory-poisoning
handler's job, so drop the refcount grabbed by get_user_pages_fast().

Note that the target page is still pinned after this put_page() because
the current process should hold a refcount from its mapping.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
---
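As a rough illustration of the refcount argument in the changelog (not part
of the patch), here is a toy userspace model. The names toy_page,
toy_get_page, toy_put_page and toy_lookup_and_release are hypothetical, and
the model assumes the mapping contributes exactly one reference.

```c
#include <assert.h>

/* Toy stand-in for struct page; only the reference count is modelled. */
struct toy_page {
	int refcount;
};

static void toy_get_page(struct toy_page *p)
{
	p->refcount++;
}

/* Returns 1 if this put dropped the last reference (page would be freed). */
static int toy_put_page(struct toy_page *p)
{
	assert(p->refcount > 0);
	return --p->refcount == 0;
}

/* Mirrors the lookup-then-release ordering in madvise_inject_error(). */
static int toy_lookup_and_release(struct toy_page *p)
{
	toy_get_page(p);	/* stands in for get_user_pages_fast() */
	return toy_put_page(p);	/* stands in for the early put_page() */
}
```

Starting from a refcount of 1 (the mapping), toy_lookup_and_release() returns
0 and leaves the count at 1, so in this model the early put never frees the
page.
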
 mm/madvise.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/madvise.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/madvise.c
index a16dba21cdf6..1fe89a5b8d33 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/madvise.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/madvise.c
@@ -910,16 +910,24 @@ static int madvise_inject_error(int behavior,
 		 */
 		size = page_size(compound_head(page));
 
-		if (PageHWPoison(page)) {
-			put_page(page);
+		/*
+		 * The get_user_pages_fast() is just to get the pfn of the
+		 * given address, and the refcount has nothing to do with
+		 * what we try to test, so it should be released immediately.
+		 * This is racy but it's intended because the real hardware
+		 * errors could happen at any moment and memory error handlers
+		 * must properly handle the race.
+		 */
+		put_page(page);
+
+		if (PageHWPoison(page))
 			continue;
-		}
 
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
 					pfn, start);
 
-			ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
+			ret = soft_offline_page(pfn, 0);
 			if (ret)
 				return ret;
 			continue;
@@ -927,14 +935,6 @@ static int madvise_inject_error(int behavior,
 
 		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				pfn, start);
-
-		/*
-		 * Drop the page reference taken by get_user_pages_fast(). In
-		 * the absence of MF_COUNT_INCREASED the memory_failure()
-		 * routine is responsible for pinning the page to prevent it
-		 * from being released back to the page allocator.
-		 */
-		put_page(page);
 		ret = memory_failure(pfn, 0);
 		if (ret)
 			return ret;
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 04/16] mm,madvise: Refactor madvise_inject_error
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (2 preceding siblings ...)
  2020-07-31 12:20 ` [PATCH v5 03/16] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED nao.horiguchi
@ 2020-07-31 12:21 ` nao.horiguchi
  2020-07-31 12:21 ` [PATCH v5 05/16] mm,hwpoison-inject: don't pin for hwpoison_filter nao.horiguchi
                   ` (13 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:21 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Oscar Salvador <osalvador@suse.de>

Make a proper if-else condition for {hard,soft}-offline.

Signed-off-by: Oscar Salvador <osalvador@suse.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/madvise.c | 16 ++++++----------
 1 file changed, 6 insertions(+), 10 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/madvise.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/madvise.c
index 1fe89a5b8d33..2c50c2c5673b 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/madvise.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/madvise.c
@@ -886,16 +886,15 @@ static long madvise_remove(struct vm_area_struct *vma,
 static int madvise_inject_error(int behavior,
 		unsigned long start, unsigned long end)
 {
-	struct page *page;
 	struct zone *zone;
 	unsigned long size;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
-
 	for (; start < end; start += size) {
 		unsigned long pfn;
+		struct page *page;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
@@ -925,17 +924,14 @@ static int madvise_inject_error(int behavior,
 
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-					pfn, start);
-
+				pfn, start);
 			ret = soft_offline_page(pfn, 0);
-			if (ret)
-				return ret;
-			continue;
+		} else {
+			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
+				pfn, start);
+			ret = memory_failure(pfn, 0);
 		}
 
-		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-				pfn, start);
-		ret = memory_failure(pfn, 0);
 		if (ret)
 			return ret;
 	}
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 05/16] mm,hwpoison-inject: don't pin for hwpoison_filter
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (3 preceding siblings ...)
  2020-07-31 12:21 ` [PATCH v5 04/16] mm,madvise: Refactor madvise_inject_error nao.horiguchi
@ 2020-07-31 12:21 ` nao.horiguchi
  2020-07-31 12:21 ` [PATCH v5 06/16] mm,hwpoison: Un-export get_hwpoison_page and make it static nao.horiguchi
                   ` (12 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:21 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Another memory error injection interface, debugfs:hwpoison/corrupt-pfn,
also takes a bogus refcount for hwpoison_filter(). Dropping it is justified
because the filter here is only a coarse check, and memory_failure()
reliably redoes the check.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.com>
---
 mm/hwpoison-inject.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/hwpoison-inject.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/hwpoison-inject.c
index e488876b168a..1ae1ebc2b9b1 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/hwpoison-inject.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/hwpoison-inject.c
@@ -26,11 +26,6 @@ static int hwpoison_inject(void *data, u64 val)
 
 	p = pfn_to_page(pfn);
 	hpage = compound_head(p);
-	/*
-	 * This implies unable to support free buddy pages.
-	 */
-	if (!get_hwpoison_page(p))
-		return 0;
 
 	if (!hwpoison_filter_enable)
 		goto inject;
@@ -40,23 +35,20 @@ static int hwpoison_inject(void *data, u64 val)
 	 * This implies unable to support non-LRU pages.
 	 */
 	if (!PageLRU(hpage) && !PageHuge(p))
-		goto put_out;
+		return 0;
 
 	/*
-	 * do a racy check with elevated page count, to make sure PG_hwpoison
-	 * will only be set for the targeted owner (or on a free page).
+	 * do a racy check to make sure PG_hwpoison will only be set for
+	 * the targeted owner (or on a free page).
 	 * memory_failure() will redo the check reliably inside page lock.
 	 */
 	err = hwpoison_filter(hpage);
 	if (err)
-		goto put_out;
+		return 0;
 
 inject:
 	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
-	return memory_failure(pfn, MF_COUNT_INCREASED);
-put_out:
-	put_hwpoison_page(p);
-	return 0;
+	return memory_failure(pfn, 0);
 }
 
 static int hwpoison_unpoison(void *data, u64 val)
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 06/16] mm,hwpoison: Un-export get_hwpoison_page and make it static
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (4 preceding siblings ...)
  2020-07-31 12:21 ` [PATCH v5 05/16] mm,hwpoison-inject: don't pin for hwpoison_filter nao.horiguchi
@ 2020-07-31 12:21 ` nao.horiguchi
  2020-07-31 12:21 ` [PATCH v5 07/16] mm,hwpoison: Kill put_hwpoison_page nao.horiguchi
                   ` (11 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:21 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Oscar Salvador <osalvador@suse.de>

Since get_hwpoison_page is only used in memory-failure code now,
let us un-export it and make it private to that code.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/mm.h  | 1 -
 mm/memory-failure.c | 3 +--
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/mm.h v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/mm.h
index 5e76bb4291e6..8f742373a46a 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/mm.h
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/mm.h
@@ -2985,7 +2985,6 @@ extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
-extern int get_hwpoison_page(struct page *page);
 #define put_hwpoison_page(page)	put_page(page)
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index f8d200417e0f..405c9bef6ffb 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -925,7 +925,7 @@ static int page_action(struct page_state *ps, struct page *p,
  * Return: return 0 if failed to grab the refcount, otherwise true (some
  * non-zero value.)
  */
-int get_hwpoison_page(struct page *page)
+static int get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
 
@@ -954,7 +954,6 @@ int get_hwpoison_page(struct page *page)
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(get_hwpoison_page);
 
 /*
  * Do all that is necessary to remove user space mappings. Unmap
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 07/16] mm,hwpoison: Kill put_hwpoison_page
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (5 preceding siblings ...)
  2020-07-31 12:21 ` [PATCH v5 06/16] mm,hwpoison: Un-export get_hwpoison_page and make it static nao.horiguchi
@ 2020-07-31 12:21 ` nao.horiguchi
  2020-07-31 12:21 ` [PATCH v5 08/16] mm,hwpoison: remove MF_COUNT_INCREASED nao.horiguchi
                   ` (10 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:21 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Oscar Salvador <osalvador@suse.de>

After commit 4e41a30c6d50 ("mm: hwpoison: adjust for new thp refcounting"),
put_hwpoison_page got reduced to a put_page.
Let us just use put_page instead.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/mm.h  |  1 -
 mm/memory-failure.c | 30 +++++++++++++++---------------
 2 files changed, 15 insertions(+), 16 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/mm.h v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/mm.h
index 8f742373a46a..371970dfffc4 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/mm.h
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/mm.h
@@ -2985,7 +2985,6 @@ extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
-#define put_hwpoison_page(page)	put_page(page)
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index 405c9bef6ffb..6853bf3a253d 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -1144,7 +1144,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 		pr_err("Memory failure: %#lx: just unpoisoned\n", pfn);
 		num_poisoned_pages_dec();
 		unlock_page(head);
-		put_hwpoison_page(head);
+		put_page(head);
 		return 0;
 	}
 
@@ -1336,7 +1336,7 @@ int memory_failure(unsigned long pfn, int flags)
 					pfn);
 			if (TestClearPageHWPoison(p))
 				num_poisoned_pages_dec();
-			put_hwpoison_page(p);
+			put_page(p);
 			return -EBUSY;
 		}
 		unlock_page(p);
@@ -1389,14 +1389,14 @@ int memory_failure(unsigned long pfn, int flags)
 		pr_err("Memory failure: %#lx: just unpoisoned\n", pfn);
 		num_poisoned_pages_dec();
 		unlock_page(p);
-		put_hwpoison_page(p);
+		put_page(p);
 		return 0;
 	}
 	if (hwpoison_filter(p)) {
 		if (TestClearPageHWPoison(p))
 			num_poisoned_pages_dec();
 		unlock_page(p);
-		put_hwpoison_page(p);
+		put_page(p);
 		return 0;
 	}
 
@@ -1630,9 +1630,9 @@ int unpoison_memory(unsigned long pfn)
 	}
 	unlock_page(page);
 
-	put_hwpoison_page(page);
+	put_page(page);
 	if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1))
-		put_hwpoison_page(page);
+		put_page(page);
 
 	return 0;
 }
@@ -1683,7 +1683,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
 		/*
 		 * Try to free it.
 		 */
-		put_hwpoison_page(page);
+		put_page(page);
 		shake_page(page, 1);
 
 		/*
@@ -1692,7 +1692,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
 		ret = __get_any_page(page, pfn, 0);
 		if (ret == 1 && !PageLRU(page)) {
 			/* Drop page reference which is from __get_any_page() */
-			put_hwpoison_page(page);
+			put_page(page);
 			pr_info("soft_offline: %#lx: unknown non LRU page type %lx (%pGp)\n",
 				pfn, page->flags, &page->flags);
 			return -EIO;
@@ -1715,7 +1715,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	lock_page(hpage);
 	if (PageHWPoison(hpage)) {
 		unlock_page(hpage);
-		put_hwpoison_page(hpage);
+		put_page(hpage);
 		pr_info("soft offline: %#lx hugepage already poisoned\n", pfn);
 		return -EBUSY;
 	}
@@ -1726,7 +1726,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	 * get_any_page() and isolate_huge_page() takes a refcount each,
 	 * so need to drop one here.
 	 */
-	put_hwpoison_page(hpage);
+	put_page(hpage);
 	if (!ret) {
 		pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
 		return -EBUSY;
@@ -1779,7 +1779,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	wait_on_page_writeback(page);
 	if (PageHWPoison(page)) {
 		unlock_page(page);
-		put_hwpoison_page(page);
+		put_page(page);
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
 		return -EBUSY;
 	}
@@ -1794,7 +1794,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	 * would need to fix isolation locking first.
 	 */
 	if (ret == 1) {
-		put_hwpoison_page(page);
+		put_page(page);
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
 		SetPageHWPoison(page);
 		num_poisoned_pages_inc();
@@ -1814,7 +1814,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	 * Drop page reference which is came from get_any_page()
 	 * successful isolate_lru_page() already took another one.
 	 */
-	put_hwpoison_page(page);
+	put_page(page);
 	if (!ret) {
 		LIST_HEAD(pagelist);
 		/*
@@ -1858,7 +1858,7 @@ static int soft_offline_in_use_page(struct page *page, int flags)
 				pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
 			else
 				pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
-			put_hwpoison_page(page);
+			put_page(page);
 			return -EBUSY;
 		}
 		unlock_page(page);
@@ -1931,7 +1931,7 @@ int soft_offline_page(unsigned long pfn, int flags)
 	if (PageHWPoison(page)) {
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
 		if (flags & MF_COUNT_INCREASED)
-			put_hwpoison_page(page);
+			put_page(page);
 		return -EBUSY;
 	}
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 08/16] mm,hwpoison: remove MF_COUNT_INCREASED
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (6 preceding siblings ...)
  2020-07-31 12:21 ` [PATCH v5 07/16] mm,hwpoison: Kill put_hwpoison_page nao.horiguchi
@ 2020-07-31 12:21 ` nao.horiguchi
  2020-07-31 12:21 ` [PATCH v5 09/16] mm,hwpoison: remove flag argument from soft offline functions nao.horiguchi
                   ` (9 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:21 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Now there's no user of MF_COUNT_INCREASED, so we can safely remove
it from all calling points.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
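A side note on the renumbering in the hunk below (illustration only, not part
of the patch): the remaining flags shift down by one bit, so code that uses
the symbolic names is unaffected, while a hard-coded numeric mask would change
meaning. The sketch copies both sets of values from the hunk; the BEFORE_*
names and the helper mf_is_soft_offline() are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

/* Values before this patch, copied from the removed lines of the hunk. */
enum mf_flags_before {
	BEFORE_MF_COUNT_INCREASED = 1 << 0,
	BEFORE_MF_ACTION_REQUIRED = 1 << 1,
	BEFORE_MF_MUST_KILL = 1 << 2,
	BEFORE_MF_SOFT_OFFLINE = 1 << 3,
};

/* Values after this patch, copied from the added lines of the hunk. */
enum mf_flags {
	MF_ACTION_REQUIRED = 1 << 0,
	MF_MUST_KILL = 1 << 1,
	MF_SOFT_OFFLINE = 1 << 2,
};

/* Compile-time check of the new numeric value. */
static_assert(MF_SOFT_OFFLINE == 4, "MF_SOFT_OFFLINE is bit 2 after the patch");

/* Hypothetical helper: test the soft-offline bit using the new layout. */
static bool mf_is_soft_offline(int flags)
{
	return (flags & MF_SOFT_OFFLINE) != 0;
}
```

With these definitions, a mask built from the old numeric value of
MF_SOFT_OFFLINE (8) is no longer recognised, and the old value of MF_MUST_KILL
(4) would be read as MF_SOFT_OFFLINE.
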
 include/linux/mm.h  |  7 +++----
 mm/memory-failure.c | 14 +++-----------
 2 files changed, 6 insertions(+), 15 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/mm.h v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/mm.h
index 371970dfffc4..c09111e8eac8 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/mm.h
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/mm.h
@@ -2976,10 +2976,9 @@ void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long nr_pages);
 
 enum mf_flags {
-	MF_COUNT_INCREASED = 1 << 0,
-	MF_ACTION_REQUIRED = 1 << 1,
-	MF_MUST_KILL = 1 << 2,
-	MF_SOFT_OFFLINE = 1 << 3,
+	MF_ACTION_REQUIRED = 1 << 0,
+	MF_MUST_KILL = 1 << 1,
+	MF_SOFT_OFFLINE = 1 << 2,
 };
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index 6853bf3a253d..9768ab5f51ef 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -1118,7 +1118,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 
 	num_poisoned_pages_inc();
 
-	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
+	if (!get_hwpoison_page(p)) {
 		/*
 		 * Check "filter hit" and "race with other subpage."
 		 */
@@ -1314,7 +1314,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
-	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
+	if (!get_hwpoison_page(p)) {
 		if (is_free_buddy_page(p)) {
 			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
 			return 0;
@@ -1354,10 +1354,7 @@ int memory_failure(unsigned long pfn, int flags)
 	shake_page(p, 0);
 	/* shake_page could have turned it free. */
 	if (!PageLRU(p) && is_free_buddy_page(p)) {
-		if (flags & MF_COUNT_INCREASED)
-			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
-		else
-			action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
+		action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
 		return 0;
 	}
 
@@ -1648,9 +1645,6 @@ static int __get_any_page(struct page *p, unsigned long pfn, int flags)
 {
 	int ret;
 
-	if (flags & MF_COUNT_INCREASED)
-		return 1;
-
 	/*
 	 * When the target page is a free hugepage, just remove it
 	 * from free hugepage list.
@@ -1930,8 +1924,6 @@ int soft_offline_page(unsigned long pfn, int flags)
 
 	if (PageHWPoison(page)) {
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
-		if (flags & MF_COUNT_INCREASED)
-			put_page(page);
 		return -EBUSY;
 	}
 
-- 
2.17.1


^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH v5 09/16] mm,hwpoison: remove flag argument from soft offline functions
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (7 preceding siblings ...)
  2020-07-31 12:21 ` [PATCH v5 08/16] mm,hwpoison: remove MF_COUNT_INCREASED nao.horiguchi
@ 2020-07-31 12:21 ` nao.horiguchi
  2020-07-31 12:21 ` [PATCH v5 10/16] mm,hwpoison: Unify THP handling for hard and soft offline nao.horiguchi
                   ` (8 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:21 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

The argument @flags no longer affects the behavior of soft_offline_page()
and its variants, so let's remove it.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 drivers/base/memory.c |  2 +-
 include/linux/mm.h    |  2 +-
 mm/madvise.c          |  2 +-
 mm/memory-failure.c   | 27 +++++++++++++--------------
 4 files changed, 16 insertions(+), 17 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/drivers/base/memory.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/drivers/base/memory.c
index 4db3c660de83..3e6d27c9dff6 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/drivers/base/memory.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/drivers/base/memory.c
@@ -463,7 +463,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
 	if (kstrtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = soft_offline_page(pfn, 0);
+	ret = soft_offline_page(pfn);
 	return ret == 0 ? count : ret;
 }
 
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/mm.h v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/mm.h
index c09111e8eac8..ecb3c7191fb7 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/mm.h
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/mm.h
@@ -2988,7 +2988,7 @@ extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t num_poisoned_pages __read_mostly;
-extern int soft_offline_page(unsigned long pfn, int flags);
+extern int soft_offline_page(unsigned long pfn);
 
 
 /*
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/madvise.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/madvise.c
index 2c50c2c5673b..3eee78abdbec 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/madvise.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/madvise.c
@@ -925,7 +925,7 @@ static int madvise_inject_error(int behavior,
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
 				pfn, start);
-			ret = soft_offline_page(pfn, 0);
+			ret = soft_offline_page(pfn);
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				pfn, start);
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index 9768ab5f51ef..7c0a2f8cfe0c 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -1502,7 +1502,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		if (!gotten)
 			break;
 		if (entry.flags & MF_SOFT_OFFLINE)
-			soft_offline_page(entry.pfn, entry.flags);
+			soft_offline_page(entry.pfn);
 		else
 			memory_failure(entry.pfn, entry.flags);
 	}
@@ -1641,7 +1641,7 @@ EXPORT_SYMBOL(unpoison_memory);
  * that is not free, and 1 for any other page type.
  * For 1 the page is returned with increased page count, otherwise not.
  */
-static int __get_any_page(struct page *p, unsigned long pfn, int flags)
+static int __get_any_page(struct page *p, unsigned long pfn)
 {
 	int ret;
 
@@ -1668,9 +1668,9 @@ static int __get_any_page(struct page *p, unsigned long pfn, int flags)
 	return ret;
 }
 
-static int get_any_page(struct page *page, unsigned long pfn, int flags)
+static int get_any_page(struct page *page, unsigned long pfn)
 {
-	int ret = __get_any_page(page, pfn, flags);
+	int ret = __get_any_page(page, pfn);
 
 	if (ret == 1 && !PageHuge(page) &&
 	    !PageLRU(page) && !__PageMovable(page)) {
@@ -1683,7 +1683,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
 		/*
 		 * Did it turn free?
 		 */
-		ret = __get_any_page(page, pfn, 0);
+		ret = __get_any_page(page, pfn);
 		if (ret == 1 && !PageLRU(page)) {
 			/* Drop page reference which is from __get_any_page() */
 			put_page(page);
@@ -1695,7 +1695,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
 	return ret;
 }
 
-static int soft_offline_huge_page(struct page *page, int flags)
+static int soft_offline_huge_page(struct page *page)
 {
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
@@ -1754,7 +1754,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	return ret;
 }
 
-static int __soft_offline_page(struct page *page, int flags)
+static int __soft_offline_page(struct page *page)
 {
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
@@ -1838,7 +1838,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	return ret;
 }
 
-static int soft_offline_in_use_page(struct page *page, int flags)
+static int soft_offline_in_use_page(struct page *page)
 {
 	int ret;
 	int mt;
@@ -1868,9 +1868,9 @@ static int soft_offline_in_use_page(struct page *page, int flags)
 	mt = get_pageblock_migratetype(page);
 	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 	if (PageHuge(page))
-		ret = soft_offline_huge_page(page, flags);
+		ret = soft_offline_huge_page(page);
 	else
-		ret = __soft_offline_page(page, flags);
+		ret = __soft_offline_page(page);
 	set_pageblock_migratetype(page, mt);
 	return ret;
 }
@@ -1891,7 +1891,6 @@ static int soft_offline_free_page(struct page *page)
 /**
  * soft_offline_page - Soft offline a page.
  * @pfn: pfn to soft-offline
- * @flags: flags. Same as memory_failure().
  *
  * Returns 0 on success, otherwise negated errno.
  *
@@ -1910,7 +1909,7 @@ static int soft_offline_free_page(struct page *page)
  * This is not a 100% solution for all memory, but tries to be
  * ``good enough'' for the majority of memory.
  */
-int soft_offline_page(unsigned long pfn, int flags)
+int soft_offline_page(unsigned long pfn)
 {
 	int ret;
 	struct page *page;
@@ -1928,11 +1927,11 @@ int soft_offline_page(unsigned long pfn, int flags)
 	}
 
 	get_online_mems();
-	ret = get_any_page(page, pfn, flags);
+	ret = get_any_page(page, pfn);
 	put_online_mems();
 
 	if (ret > 0)
-		ret = soft_offline_in_use_page(page, flags);
+		ret = soft_offline_in_use_page(page);
 	else if (ret == 0)
 		ret = soft_offline_free_page(page);
 
-- 
2.17.1



* [PATCH v5 10/16] mm,hwpoison: Unify THP handling for hard and soft offline
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (8 preceding siblings ...)
  2020-07-31 12:21 ` [PATCH v5 09/16] mm,hwpoison: remove flag argument from soft offline functions nao.horiguchi
@ 2020-07-31 12:21 ` nao.horiguchi
  2020-07-31 12:21 ` [PATCH v5 11/16] mm,hwpoison: Rework soft offline for free pages nao.horiguchi
                   ` (7 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:21 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Oscar Salvador <osalvador@suse.de>

Place the THP's page handling in a helper and use it
from both hard and soft-offline machinery, so we get rid
of some duplicated code.

Signed-off-by: Oscar Salvador <osalvador@suse.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 48 +++++++++++++++++++++------------------------
 1 file changed, 22 insertions(+), 26 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index 7c0a2f8cfe0c..803f4b2ac510 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -1103,6 +1103,25 @@ static int identify_page_state(unsigned long pfn, struct page *p,
 	return page_action(ps, p, pfn);
 }
 
+static int try_to_split_thp_page(struct page *page, const char *msg)
+{
+	lock_page(page);
+	if (!PageAnon(page) || unlikely(split_huge_page(page))) {
+		unsigned long pfn = page_to_pfn(page);
+
+		unlock_page(page);
+		if (!PageAnon(page))
+			pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
+		else
+			pr_info("%s: %#lx: thp split failed\n", msg, pfn);
+		put_page(page);
+		return -EBUSY;
+	}
+	unlock_page(page);
+
+	return 0;
+}
+
 static int memory_failure_hugetlb(unsigned long pfn, int flags)
 {
 	struct page *p = pfn_to_page(pfn);
@@ -1325,21 +1344,8 @@ int memory_failure(unsigned long pfn, int flags)
 	}
 
 	if (PageTransHuge(hpage)) {
-		lock_page(p);
-		if (!PageAnon(p) || unlikely(split_huge_page(p))) {
-			unlock_page(p);
-			if (!PageAnon(p))
-				pr_err("Memory failure: %#lx: non anonymous thp\n",
-					pfn);
-			else
-				pr_err("Memory failure: %#lx: thp split failed\n",
-					pfn);
-			if (TestClearPageHWPoison(p))
-				num_poisoned_pages_dec();
-			put_page(p);
+		if (try_to_split_thp_page(p, "Memory Failure") < 0)
 			return -EBUSY;
-		}
-		unlock_page(p);
 		VM_BUG_ON_PAGE(!page_count(p), p);
 	}
 
@@ -1844,19 +1850,9 @@ static int soft_offline_in_use_page(struct page *page)
 	int mt;
 	struct page *hpage = compound_head(page);
 
-	if (!PageHuge(page) && PageTransHuge(hpage)) {
-		lock_page(page);
-		if (!PageAnon(page) || unlikely(split_huge_page(page))) {
-			unlock_page(page);
-			if (!PageAnon(page))
-				pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
-			else
-				pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
-			put_page(page);
+	if (!PageHuge(page) && PageTransHuge(hpage))
+		if (try_to_split_thp_page(page, "soft offline") < 0)
 			return -EBUSY;
-		}
-		unlock_page(page);
-	}
 
 	/*
 	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
-- 
2.17.1



* [PATCH v5 11/16] mm,hwpoison: Rework soft offline for free pages
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (9 preceding siblings ...)
  2020-07-31 12:21 ` [PATCH v5 10/16] mm,hwpoison: Unify THP handling for hard and soft offline nao.horiguchi
@ 2020-07-31 12:21 ` nao.horiguchi
  2020-07-31 12:21 ` [PATCH v5 12/16] mm,hwpoison: Rework soft offline for in-use pages nao.horiguchi
                   ` (6 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:21 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Oscar Salvador <osalvador@suse.de>

When trying to soft-offline a free page, we need to first take it off
the buddy allocator.
Once we know it is out of reach, we can safely flag it as poisoned.

take_page_off_buddy will be used to take a page meant to be poisoned
off the buddy allocator.
take_page_off_buddy calls break_down_buddy_pages, which splits a
higher-order page in case our page belongs to one.

Once the page is under our control, we call page_handle_poison to set it
as poisoned and grab a refcount on it.
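
The splitting step described above can be sketched in user space as
follows.  This is only a simplified model under stated assumptions, not
the kernel code: it works on plain pfn numbers, ignores guard pages,
migratetypes and the zone lock, and the names candidate_head(),
split_around_target() and struct free_block are hypothetical.

```c
#include <assert.h>

/* Hypothetical record of a block that would be put back on a free list. */
struct free_block {
	unsigned long pfn;
	int order;
};

/* First pfn of the order-aligned block that would contain @pfn. */
static unsigned long candidate_head(unsigned long pfn, int order)
{
	return pfn & ~((1UL << order) - 1);
}

/*
 * Halve the free block of order @high starting at @head until @target is
 * left alone as a single page.  Each half that does not contain @target
 * is recorded in @out, modelling its return to the free list.
 * Returns the number of blocks recorded (always @high).
 */
static int split_around_target(unsigned long head, unsigned long target,
			       int high, struct free_block *out)
{
	unsigned long size = 1UL << high;
	int n = 0;

	assert(target >= head && target < head + size);

	while (high > 0) {
		high--;
		size >>= 1;

		if (target >= head + size) {
			/* Target is in the upper half: give back the lower half. */
			out[n].pfn = head;
			head += size;
		} else {
			/* Target is in the lower half: give back the upper half. */
			out[n].pfn = head + size;
		}
		out[n].order = high;
		n++;
	}
	return n;
}
```

For a target at pfn 0x1235 inside an order-3 block starting at 0x1230,
this model gives back blocks at 0x1230 (order 2), 0x1236 (order 1) and
0x1234 (order 0), so only the target page stays out of the free list.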

Signed-off-by: Oscar Salvador <osalvador@suse.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
ChangeLog v4 -> v5:
- fix compile error

ChangeLog v2 -> v3:
- use add_to_free_list() instead of add_to_free_area()
- use del_page_from_free_list() instead of del_page_from_free_area()
- add fast return
- move extern definition to header file as warned by checkpatch.pl
---
 include/linux/page-flags.h |  1 +
 mm/memory-failure.c        | 18 ++++++----
 mm/page_alloc.c            | 68 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 81 insertions(+), 6 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/page-flags.h v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/page-flags.h
index 6be1aa559b1e..9fa5d4e2d69a 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/page-flags.h
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/page-flags.h
@@ -423,6 +423,7 @@ PAGEFLAG(HWPoison, hwpoison, PF_ANY)
 TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
 extern bool set_hwpoison_free_buddy_page(struct page *page);
+extern bool take_page_off_buddy(struct page *page);
 #else
 PAGEFLAG_FALSE(HWPoison)
 static inline bool set_hwpoison_free_buddy_page(struct page *page)
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index 803f4b2ac510..8b6a98929b54 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -65,6 +65,13 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
+static void page_handle_poison(struct page *page)
+{
+	SetPageHWPoison(page);
+	page_ref_inc(page);
+	num_poisoned_pages_inc();
+}
+
 #if defined(CONFIG_HWPOISON_INJECT) || defined(CONFIG_HWPOISON_INJECT_MODULE)
 
 u32 hwpoison_filter_enable = 0;
@@ -1873,14 +1880,13 @@ static int soft_offline_in_use_page(struct page *page)
 
 static int soft_offline_free_page(struct page *page)
 {
-	int rc = dissolve_free_huge_page(page);
+	int rc = -EBUSY;
 
-	if (!rc) {
-		if (set_hwpoison_free_buddy_page(page))
-			num_poisoned_pages_inc();
-		else
-			rc = -EBUSY;
+	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
+		page_handle_poison(page);
+		rc = 0;
 	}
+
 	return rc;
 }
 
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/page_alloc.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/page_alloc.c
index efe2e94c45f5..aab89f7db4ac 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/page_alloc.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/page_alloc.c
@@ -8776,6 +8776,74 @@ bool is_free_buddy_page(struct page *page)
 }
 
 #ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Break down a higher-order page in sub-pages, and keep our target out of
+ * buddy allocator.
+ */
+static void break_down_buddy_pages(struct zone *zone, struct page *page,
+				   struct page *target, int low, int high,
+				   int migratetype)
+{
+	unsigned long size = 1 << high;
+	struct page *current_buddy, *next_page;
+
+	while (high > low) {
+		high--;
+		size >>= 1;
+
+		if (target >= &page[size]) {
+			next_page = page + size;
+			current_buddy = page;
+		} else {
+			next_page = page;
+			current_buddy = page + size;
+		}
+
+		if (set_page_guard(zone, current_buddy, high, migratetype))
+			continue;
+
+		if (current_buddy != target) {
+			add_to_free_list(current_buddy, zone, high, migratetype);
+			set_page_order(current_buddy, high);
+			page = next_page;
+		}
+	}
+}
+
+/*
+ * Take a page that will be marked as poisoned off the buddy allocator.
+ */
+bool take_page_off_buddy(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long flags;
+	unsigned int order;
+	bool ret = false;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	for (order = 0; order < MAX_ORDER; order++) {
+		struct page *page_head = page - (pfn & ((1 << order) - 1));
+		int buddy_order = page_order(page_head);
+
+		if (PageBuddy(page_head) && buddy_order >= order) {
+			unsigned long pfn_head = page_to_pfn(page_head);
+			int migratetype = get_pfnblock_migratetype(page_head,
+								   pfn_head);
+
+			del_page_from_free_list(page_head, zone, buddy_order);
+			break_down_buddy_pages(zone, page_head, page, 0,
+						buddy_order, migratetype);
+			ret = true;
+			break;
+		}
+		if (page_count(page_head) > 0)
+			break;
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return ret;
+}
+
 /*
  * Set PG_hwpoison flag if a given page is confirmed to be a free page.  This
  * test is performed under the zone lock to prevent a race against page
-- 
2.17.1



* [PATCH v5 12/16] mm,hwpoison: Rework soft offline for in-use pages
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (10 preceding siblings ...)
  2020-07-31 12:21 ` [PATCH v5 11/16] mm,hwpoison: Rework soft offline for free pages nao.horiguchi
@ 2020-07-31 12:21 ` nao.horiguchi
  2020-07-31 12:21 ` [PATCH v5 13/16] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page nao.horiguchi
                   ` (5 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:21 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Oscar Salvador <osalvador@suse.de>

This patch changes the way we set and handle in-use poisoned pages.
Until now, poisoned pages were released to the buddy allocator, trusting
that the checks that take place prior to handing out the page would act as
a safety net and would skip that page.

This has proved to be wrong, as there are some pfn walkers out there, like
compaction, that only care that the page is PageBuddy and is on a
freelist.
Although this might not be the only user, having poisoned pages
in the buddy allocator seems a bad idea as we should only have
free pages that are ready and meant to be used as such.

Before explaining the taken approach, let us break down the kind
of pages we can soft offline.

- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalidated)

* Normal pages (order-0 and anon-THP)

  - If they are clean and unmapped page cache pages, we invalidate
    them by means of invalidate_inode_page().
  - If they are mapped/dirty, we do the isolate-and-migrate dance.

  Either way, do not call put_page directly from those paths.
  Instead, we keep the page and send it to page_handle_poison to perform
  the right handling.

  page_handle_poison sets the HWPoison flag and does the last put_page.
  This call to put_page is mainly to be able to call __page_cache_release,
  since this function is not exported.

  Down the chain, we placed a check for HWPoison page in
  free_pages_prepare, that just skips any poisoned page, so those pages
  do not end up in any pcplist/freelist.

  After that, we set the refcount on the page to 1 and we increment
  the poisoned pages counter.

  We could do as we do for free pages:
  1) wait until the page hits buddy's freelists
  2) take it off
  3) flag it

  The problem is that we could race with an allocation, so by the time we
  want to take the page off the buddy, the page is already allocated, so we
  cannot soft-offline it.
  This is not fatal of course, but it is better if we can close the race,
  as doing so does not require a lot of code.

* Hugetlb pages

  - We isolate-and-migrate them

  After the migration has been successful, we call dissolve_free_huge_page,
  and we set HWPoison on the page if we succeed.
  Hugetlb has a slightly different handling though.

  While for non-hugetlb pages we cared about closing the race with an
  allocation, doing so for hugetlb pages requires quite some additional
  code (we would need to hook in free_huge_page and some other places).
  So I decided to not make the code overly complicated and just fail
  normally if the page was allocated in the meantime.

Because of the way we now handle in-use pages, we no longer need the
put-as-isolation-migratetype dance, which was guarding against poisoned
pages ending up in pcplists.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/page-flags.h |  5 -----
 mm/memory-failure.c        | 45 ++++++++++++++------------------------
 mm/migrate.c               | 11 +++-------
 mm/page_alloc.c            | 28 ------------------------
 4 files changed, 19 insertions(+), 70 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/page-flags.h v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/page-flags.h
index 9fa5d4e2d69a..d1df51ed6eeb 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/page-flags.h
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/page-flags.h
@@ -422,14 +422,9 @@ PAGEFLAG_FALSE(Uncached)
 PAGEFLAG(HWPoison, hwpoison, PF_ANY)
 TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
-extern bool set_hwpoison_free_buddy_page(struct page *page);
 extern bool take_page_off_buddy(struct page *page);
 #else
 PAGEFLAG_FALSE(HWPoison)
-static inline bool set_hwpoison_free_buddy_page(struct page *page)
-{
-	return 0;
-}
 #define __PG_HWPOISON 0
 #endif
 
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index 8b6a98929b54..291084e27ead 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -65,8 +65,12 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
-static void page_handle_poison(struct page *page)
+static void page_handle_poison(struct page *page, bool release)
 {
+	if (release) {
+		put_page(page);
+		drain_all_pages(page_zone(page));
+	}
 	SetPageHWPoison(page);
 	page_ref_inc(page);
 	num_poisoned_pages_inc();
@@ -1750,19 +1754,13 @@ static int soft_offline_huge_page(struct page *page)
 			ret = -EIO;
 	} else {
 		/*
-		 * We set PG_hwpoison only when the migration source hugepage
-		 * was successfully dissolved, because otherwise hwpoisoned
-		 * hugepage remains on free hugepage list, then userspace will
-		 * find it as SIGBUS by allocation failure. That's not expected
-		 * in soft-offlining.
+		 * We set PG_hwpoison only when we were able to take the page
+		 * off the buddy.
 		 */
-		ret = dissolve_free_huge_page(page);
-		if (!ret) {
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-			else
-				ret = -EBUSY;
-		}
+		if (!dissolve_free_huge_page(page) && take_page_off_buddy(page))
+			page_handle_poison(page, false);
+		else
+			ret = -EBUSY;
 	}
 	return ret;
 }
@@ -1801,10 +1799,8 @@ static int __soft_offline_page(struct page *page)
 	 * would need to fix isolation locking first.
 	 */
 	if (ret == 1) {
-		put_page(page);
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
-		SetPageHWPoison(page);
-		num_poisoned_pages_inc();
+		page_handle_poison(page, true);
 		return 0;
 	}
 
@@ -1835,7 +1831,9 @@ static int __soft_offline_page(struct page *page)
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
 			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE);
-		if (ret) {
+		if (!ret) {
+			page_handle_poison(page, true);
+		} else {
 			if (!list_empty(&pagelist))
 				putback_movable_pages(&pagelist);
 
@@ -1854,27 +1852,16 @@ static int __soft_offline_page(struct page *page)
 static int soft_offline_in_use_page(struct page *page)
 {
 	int ret;
-	int mt;
 	struct page *hpage = compound_head(page);
 
 	if (!PageHuge(page) && PageTransHuge(hpage))
 		if (try_to_split_thp_page(page, "soft offline") < 0)
 			return -EBUSY;
 
-	/*
-	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
-	 * to free list immediately (not via pcplist) when released after
-	 * successful page migration. Otherwise we can't guarantee that the
-	 * page is really free after put_page() returns, so
-	 * set_hwpoison_free_buddy_page() highly likely fails.
-	 */
-	mt = get_pageblock_migratetype(page);
-	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 	if (PageHuge(page))
 		ret = soft_offline_huge_page(page);
 	else
 		ret = __soft_offline_page(page);
-	set_pageblock_migratetype(page, mt);
 	return ret;
 }
 
@@ -1883,7 +1870,7 @@ static int soft_offline_free_page(struct page *page)
 	int rc = -EBUSY;
 
 	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
-		page_handle_poison(page);
+		page_handle_poison(page, false);
 		rc = 0;
 	}
 
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/migrate.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/migrate.c
index 2c809ffcf0e1..d7a9379c343b 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/migrate.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/migrate.c
@@ -1222,16 +1222,11 @@ static int unmap_and_move(new_page_t get_new_page,
 	 * we want to retry.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		put_page(page);
-		if (reason == MR_MEMORY_FAILURE) {
+		if (reason != MR_MEMORY_FAILURE)
 			/*
-			 * Set PG_HWPoison on just freed page
-			 * intentionally. Although it's rather weird,
-			 * it's how HWPoison flag works at the moment.
+			 * We release the page in page_handle_poison.
 			 */
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-		}
+			put_page(page);
 	} else {
 		if (rc != -EAGAIN) {
 			if (likely(!__PageMovable(page))) {
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/page_alloc.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/page_alloc.c
index aab89f7db4ac..e4896e674594 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/page_alloc.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/page_alloc.c
@@ -8843,32 +8843,4 @@ bool take_page_off_buddy(struct page *page)
 	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
 }
-
-/*
- * Set PG_hwpoison flag if a given page is confirmed to be a free page.  This
- * test is performed under the zone lock to prevent a race against page
- * allocation.
- */
-bool set_hwpoison_free_buddy_page(struct page *page)
-{
-	struct zone *zone = page_zone(page);
-	unsigned long pfn = page_to_pfn(page);
-	unsigned long flags;
-	unsigned int order;
-	bool hwpoisoned = false;
-
-	spin_lock_irqsave(&zone->lock, flags);
-	for (order = 0; order < MAX_ORDER; order++) {
-		struct page *page_head = page - (pfn & ((1 << order) - 1));
-
-		if (PageBuddy(page_head) && page_order(page_head) >= order) {
-			if (!TestSetPageHWPoison(page))
-				hwpoisoned = true;
-			break;
-		}
-	}
-	spin_unlock_irqrestore(&zone->lock, flags);
-
-	return hwpoisoned;
-}
 #endif
-- 
2.17.1



* [PATCH v5 13/16] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (11 preceding siblings ...)
  2020-07-31 12:21 ` [PATCH v5 12/16] mm,hwpoison: Rework soft offline for in-use pages nao.horiguchi
@ 2020-07-31 12:21 ` nao.horiguchi
  2020-07-31 12:21 ` [PATCH v5 14/16] mm,hwpoison: Return 0 if the page is already poisoned in soft-offline nao.horiguchi
                   ` (4 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:21 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Oscar Salvador <osalvador@suse.de>

Merging soft_offline_huge_page and __soft_offline_page lets us get rid of
quite a bit of duplicated code, and makes the code much easier to follow.

Now, __soft_offline_page will handle both normal and hugetlb pages.

Note that the put_page() block is moved to the beginning of
page_handle_poison(), together with drain_all_pages(), to make sure that
the target page is freed and sent to the free list so that
take_page_off_buddy() works properly.
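
The ordering described above can be illustrated with a small user-space
model.  This is only a sketch under stated assumptions: struct toy_page
and the toy_*() helpers are hypothetical stand-ins, dissolving of
hugetlb pages is left out, and "free" is reduced to a single flag
instead of real pcplists and buddy free lists.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page; only what the sketch needs. */
struct toy_page {
	int refcount;
	bool hwpoison;
	bool on_free_list;	/* true once the page sits on a free list */
};

static long toy_num_poisoned;

/* Models put_page() + drain_all_pages(): the last put frees the page. */
static void toy_release(struct toy_page *p)
{
	assert(p->refcount > 0);
	if (--p->refcount == 0)
		p->on_free_list = true;
}

/* Models take_page_off_buddy(): works only if the page is still free. */
static bool toy_take_off_buddy(struct toy_page *p)
{
	if (!p->on_free_list)
		return false;
	p->on_free_list = false;
	return true;
}

/* Models the three-argument page_handle_poison() from this patch. */
static bool toy_handle_poison(struct toy_page *p, bool hugepage_or_freepage,
			      bool release)
{
	if (release)
		toy_release(p);

	/* Someone else got the page first: give up without poisoning. */
	if (hugepage_or_freepage && !toy_take_off_buddy(p))
		return false;

	p->hwpoison = true;
	p->refcount++;
	toy_num_poisoned++;
	return true;
}
```

In this model a migrated page that still holds our last reference is
released first, then taken off the free list, then flagged, and ends
with a refcount of 1.  If another user still holds a reference, the
release does not free the page, the take-off step fails, and the page
is left unpoisoned, which corresponds to the -EBUSY return in
__soft_offline_page().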

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
ChangeLog v4 -> v5:
- use "char const *msg_page[]" instead of "const char *msg_page[]"
- move adding drain_all_pages() to 12/16

ChangeLog v2 -> v3:
- use page_is_file_lru() instead of page_is_file_cache(),
- add description about put_page() and drain_all_pages().
- fix coding style warnings by checkpatch.pl
---
 mm/memory-failure.c | 177 ++++++++++++++++++++------------------------
 1 file changed, 80 insertions(+), 97 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index 291084e27ead..904dec64da6b 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -65,15 +65,33 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
-static void page_handle_poison(struct page *page, bool release)
+static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
 {
 	if (release) {
 		put_page(page);
 		drain_all_pages(page_zone(page));
 	}
+
+	if (hugepage_or_freepage) {
+		/*
+		 * Doing this check for free pages is also fine since dissolve_free_huge_page
+		 * returns 0 for non-hugetlb pages as well.
+		 */
+		if (dissolve_free_huge_page(page) || !take_page_off_buddy(page))
+			/*
+			 * We could fail to take off the target page from buddy
+			 * for example due to racy page allocation, but that's
+			 * acceptable because soft-offlined page is not broken
+			 * and if someone really wants to use it, they should
+			 * take it.
+			 */
+			return false;
+	}
+
 	SetPageHWPoison(page);
 	page_ref_inc(page);
 	num_poisoned_pages_inc();
+	return true;
 }
 
 #if defined(CONFIG_HWPOISON_INJECT) || defined(CONFIG_HWPOISON_INJECT_MODULE)
@@ -1712,63 +1730,52 @@ static int get_any_page(struct page *page, unsigned long pfn)
 	return ret;
 }
 
-static int soft_offline_huge_page(struct page *page)
+static bool isolate_page(struct page *page, struct list_head *pagelist)
 {
-	int ret;
-	unsigned long pfn = page_to_pfn(page);
-	struct page *hpage = compound_head(page);
-	LIST_HEAD(pagelist);
+	bool isolated = false;
+	bool lru = PageLRU(page);
+
+	if (PageHuge(page)) {
+		isolated = isolate_huge_page(page, pagelist);
+	} else {
+		if (lru)
+			isolated = !isolate_lru_page(page);
+		else
+			isolated = !isolate_movable_page(page, ISOLATE_UNEVICTABLE);
+
+		if (isolated)
+			list_add(&page->lru, pagelist);
 
-	/*
-	 * This double-check of PageHWPoison is to avoid the race with
-	 * memory_failure(). See also comment in __soft_offline_page().
-	 */
-	lock_page(hpage);
-	if (PageHWPoison(hpage)) {
-		unlock_page(hpage);
-		put_page(hpage);
-		pr_info("soft offline: %#lx hugepage already poisoned\n", pfn);
-		return -EBUSY;
 	}
-	unlock_page(hpage);
 
-	ret = isolate_huge_page(hpage, &pagelist);
+	if (isolated && lru)
+		inc_node_page_state(page, NR_ISOLATED_ANON +
+				    page_is_file_lru(page));
+
 	/*
-	 * get_any_page() and isolate_huge_page() takes a refcount each,
-	 * so need to drop one here.
+	 * If we succeed in isolating the page, we grabbed another refcount on
+	 * the page, so we can safely drop the one we got from get_any_page().
+	 * If we failed to isolate the page, it means that we cannot go further
+	 * and we will return an error, so drop the reference we got from
+	 * get_any_page() as well.
 	 */
-	put_page(hpage);
-	if (!ret) {
-		pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
-		return -EBUSY;
-	}
-
-	ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
-				MIGRATE_SYNC, MR_MEMORY_FAILURE);
-	if (ret) {
-		pr_info("soft offline: %#lx: hugepage migration failed %d, type %lx (%pGp)\n",
-			pfn, ret, page->flags, &page->flags);
-		if (!list_empty(&pagelist))
-			putback_movable_pages(&pagelist);
-		if (ret > 0)
-			ret = -EIO;
-	} else {
-		/*
-		 * We set PG_hwpoison only when we were able to take the page
-		 * off the buddy.
-		 */
-		if (!dissolve_free_huge_page(page) && take_page_off_buddy(page))
-			page_handle_poison(page, false);
-		else
-			ret = -EBUSY;
-	}
-	return ret;
+	put_page(page);
+	return isolated;
 }
 
+/*
+ * __soft_offline_page handles hugetlb-pages and non-hugetlb pages.
+ * If the page is a non-dirty unmapped page-cache page, it simply invalidates.
+ * If the page is mapped, it migrates the contents over.
+ */
 static int __soft_offline_page(struct page *page)
 {
-	int ret;
+	int ret = 0;
 	unsigned long pfn = page_to_pfn(page);
+	struct page *hpage = compound_head(page);
+	char const *msg_page[] = {"page", "hugepage"};
+	bool huge = PageHuge(page);
+	LIST_HEAD(pagelist);
 	struct migration_target_control mtc = {
 		.nid = NUMA_NO_NODE,
 		.gfp_mask = GFP_USER | __GFP_MOVABLE | __GFP_RETRY_MAYFAIL,
@@ -1781,98 +1788,74 @@ static int __soft_offline_page(struct page *page)
 	 * so there's no race between soft_offline_page() and memory_failure().
 	 */
 	lock_page(page);
-	wait_on_page_writeback(page);
+	if (!PageHuge(page))
+		wait_on_page_writeback(page);
 	if (PageHWPoison(page)) {
 		unlock_page(page);
 		put_page(page);
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
 		return -EBUSY;
 	}
-	/*
-	 * Try to invalidate first. This should work for
-	 * non dirty unmapped page cache pages.
-	 */
-	ret = invalidate_inode_page(page);
+
+	if (!PageHuge(page))
+		/*
+		 * Try to invalidate first. This should work for
+		 * non dirty unmapped page cache pages.
+		 */
+		ret = invalidate_inode_page(page);
 	unlock_page(page);
+
 	/*
 	 * RED-PEN would be better to keep it isolated here, but we
 	 * would need to fix isolation locking first.
 	 */
-	if (ret == 1) {
+	if (ret) {
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
-		page_handle_poison(page, true);
+		page_handle_poison(page, false, true);
 		return 0;
 	}
 
-	/*
-	 * Simple invalidation didn't work.
-	 * Try to migrate to a new page instead. migrate.c
-	 * handles a large number of cases for us.
-	 */
-	if (PageLRU(page))
-		ret = isolate_lru_page(page);
-	else
-		ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
-	/*
-	 * Drop page reference which is came from get_any_page()
-	 * successful isolate_lru_page() already took another one.
-	 */
-	put_page(page);
-	if (!ret) {
-		LIST_HEAD(pagelist);
-		/*
-		 * After isolated lru page, the PageLRU will be cleared,
-		 * so use !__PageMovable instead for LRU page's mapping
-		 * cannot have PAGE_MAPPING_MOVABLE.
-		 */
-		if (!__PageMovable(page))
-			inc_node_page_state(page, NR_ISOLATED_ANON +
-						page_is_file_lru(page));
-		list_add(&page->lru, &pagelist);
+	if (isolate_page(hpage, &pagelist)) {
 		ret = migrate_pages(&pagelist, alloc_migration_target, NULL,
 			(unsigned long)&mtc, MIGRATE_SYNC, MR_MEMORY_FAILURE);
 		if (!ret) {
-			page_handle_poison(page, true);
+			bool release = !huge;
+
+			if (!page_handle_poison(page, true, release))
+				ret = -EBUSY;
 		} else {
 			if (!list_empty(&pagelist))
 				putback_movable_pages(&pagelist);
 
-			pr_info("soft offline: %#lx: migration failed %d, type %lx (%pGp)\n",
-				pfn, ret, page->flags, &page->flags);
+
+			pr_info("soft offline: %#lx: %s migration failed %d, type %lx (%pGp)\n",
+				pfn, msg_page[huge], ret, page->flags, &page->flags);
 			if (ret > 0)
 				ret = -EIO;
 		}
 	} else {
-		pr_info("soft offline: %#lx: isolation failed: %d, page count %d, type %lx (%pGp)\n",
-			pfn, ret, page_count(page), page->flags, &page->flags);
+		pr_info("soft offline: %#lx: %s isolation failed: %d, page count %d, type %lx (%pGp)\n",
+			pfn, msg_page[huge], ret, page_count(page), page->flags, &page->flags);
 	}
 	return ret;
 }
 
 static int soft_offline_in_use_page(struct page *page)
 {
-	int ret;
 	struct page *hpage = compound_head(page);
 
 	if (!PageHuge(page) && PageTransHuge(hpage))
 		if (try_to_split_thp_page(page, "soft offline") < 0)
 			return -EBUSY;
-
-	if (PageHuge(page))
-		ret = soft_offline_huge_page(page);
-	else
-		ret = __soft_offline_page(page);
-	return ret;
+	return __soft_offline_page(page);
 }
 
 static int soft_offline_free_page(struct page *page)
 {
-	int rc = -EBUSY;
+	int rc = 0;
 
-	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
-		page_handle_poison(page, false);
-		rc = 0;
-	}
+	if (!page_handle_poison(page, true, false))
+		rc = -EBUSY;
 
 	return rc;
 }
-- 
2.17.1



* [PATCH v5 14/16] mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (12 preceding siblings ...)
  2020-07-31 12:21 ` [PATCH v5 13/16] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page nao.horiguchi
@ 2020-07-31 12:21 ` nao.horiguchi
  2020-07-31 12:21 ` [PATCH v5 15/16] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP nao.horiguchi
                   ` (3 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:21 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Oscar Salvador <osalvador@suse.de>

Currently, there is an inconsistency when calling soft-offline from
different paths on a page that is already poisoned.

1) madvise:

        madvise_inject_error skips any poisoned page and continues
        the loop.
        If that was the only page to madvise, it returns 0.

2) /sys/devices/system/memory/:

        When calling soft_offline_page_store()->soft_offline_page(),
        we return -EBUSY in case the page is already poisoned.
        This is inconsistent with a) the above example and b)
        memory_failure, where we return 0 if the page was poisoned.

Fix this by dropping the PageHWPoison() check in madvise_inject_error,
and let soft_offline_page return 0 if it finds the page already poisoned.

Please note that this is a user-visible API change:
soft_offline_page_store()->soft_offline_page() now returns 0 instead
of -EBUSY when the page is already poisoned.
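
As a rough illustration of the return-value change only, here is a
userspace sketch. The struct and function names are hypothetical
stand-ins, not kernel code; the real paths also take locks and page
references that are not modeled here.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical stand-in for struct page: only the poison flag is modeled. */
struct poison_model_page {
	bool hwpoison;
};

/* Before: the sysfs path reported an already-poisoned page as -EBUSY. */
int model_soft_offline_old(struct poison_model_page *page)
{
	if (page->hwpoison)
		return -EBUSY;
	page->hwpoison = true;	/* pretend the offline itself succeeded */
	return 0;
}

/* After: an already-poisoned page is reported as success. */
int model_soft_offline_new(struct poison_model_page *page)
{
	if (page->hwpoison)
		return 0;
	page->hwpoison = true;	/* pretend the offline itself succeeded */
	return 0;
}

/* Offlining the same page twice: only the second call differs. */
void model_check_return_values(void)
{
	struct poison_model_page a = { .hwpoison = false };
	struct poison_model_page b = { .hwpoison = false };

	assert(model_soft_offline_old(&a) == 0);
	assert(model_soft_offline_old(&a) == -EBUSY);
	assert(model_soft_offline_new(&b) == 0);
	assert(model_soft_offline_new(&b) == 0);
}
```

A caller that treated -EBUSY from the sysfs path as "already offlined"
would need to accept 0 for that case after this change.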

Signed-off-by: Oscar Salvador <osalvador@suse.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/madvise.c        | 3 ---
 mm/memory-failure.c | 4 ++--
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/madvise.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/madvise.c
index 3eee78abdbec..843f6fad3b89 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/madvise.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/madvise.c
@@ -919,9 +919,6 @@ static int madvise_inject_error(int behavior,
 		 */
 		put_page(page);
 
-		if (PageHWPoison(page))
-			continue;
-
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
 				pfn, start);
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index 904dec64da6b..bd63f1f2e44e 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -1794,7 +1794,7 @@ static int __soft_offline_page(struct page *page)
 		unlock_page(page);
 		put_page(page);
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
-		return -EBUSY;
+		return 0;
 	}
 
 	if (!PageHuge(page))
@@ -1895,7 +1895,7 @@ int soft_offline_page(unsigned long pfn)
 
 	if (PageHWPoison(page)) {
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
-		return -EBUSY;
+		return 0;
 	}
 
 	get_online_mems();
-- 
2.17.1



* [PATCH v5 15/16] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (13 preceding siblings ...)
  2020-07-31 12:21 ` [PATCH v5 14/16] mm,hwpoison: Return 0 if the page is already poisoned in soft-offline nao.horiguchi
@ 2020-07-31 12:21 ` nao.horiguchi
  2020-07-31 12:21 ` [PATCH v5 16/16] mm,hwpoison: double-check page count in __get_any_page() nao.horiguchi
                   ` (2 subsequent siblings)
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:21 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

memory_failure() is supposed to call action_result() whenever it handles
a memory error event, but one case is missing: a THP that fails to split.
Add it.

include/ras/ras_event.h also lacks entries for some other MF_MSG_* values,
so this patch adds them as well.
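
The enum and its string tables have to be updated together, which is
easy to miss. A minimal userspace sketch of that pattern follows; the
MODEL_* names and the helper are hypothetical and are not the kernel's
definitions.

```c
#include <assert.h>

/* Hypothetical subset of message types; MODEL_MSG_COUNT must stay last. */
enum model_msg {
	MODEL_MSG_BUDDY,
	MODEL_MSG_DAX,
	MODEL_MSG_UNSPLIT_THP,
	MODEL_MSG_UNKNOWN,
	MODEL_MSG_COUNT
};

/* Designated initializers keep each string next to its enum value. */
static const char *const model_msg_names[] = {
	[MODEL_MSG_BUDDY]	= "free buddy page",
	[MODEL_MSG_DAX]		= "dax page",
	[MODEL_MSG_UNSPLIT_THP]	= "unsplit thp",
	[MODEL_MSG_UNKNOWN]	= "unknown page",
};

/*
 * This only catches a missing last entry. A gap in the middle of the
 * table becomes a NULL slot, which the lookup below has to handle.
 */
static_assert(sizeof(model_msg_names) / sizeof(model_msg_names[0]) ==
	      MODEL_MSG_COUNT, "name table out of sync with enum");

/* Map a type to its name, falling back to the "unknown" entry. */
const char *model_msg_name(enum model_msg type)
{
	if ((unsigned int)type >= MODEL_MSG_COUNT || !model_msg_names[type])
		return model_msg_names[MODEL_MSG_UNKNOWN];
	return model_msg_names[type];
}
```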

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.com>
---
 include/linux/mm.h      | 1 +
 include/ras/ras_event.h | 3 +++
 mm/memory-failure.c     | 5 ++++-
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/mm.h v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/mm.h
index ecb3c7191fb7..4f12b2465e80 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/include/linux/mm.h
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/linux/mm.h
@@ -3023,6 +3023,7 @@ enum mf_action_page_type {
 	MF_MSG_BUDDY,
 	MF_MSG_BUDDY_2ND,
 	MF_MSG_DAX,
+	MF_MSG_UNSPLIT_THP,
 	MF_MSG_UNKNOWN,
 };
 
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/include/ras/ras_event.h v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/ras/ras_event.h
index 36c5c5e38c1d..0bdbc0d17d2f 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/include/ras/ras_event.h
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/include/ras/ras_event.h
@@ -361,6 +361,7 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_POISONED_HUGE, "huge page already hardware poisoned" )	\
 	EM ( MF_MSG_HUGE, "huge page" )					\
 	EM ( MF_MSG_FREE_HUGE, "free huge page" )			\
+	EM ( MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page" )		\
 	EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" )		\
 	EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" )		\
 	EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" )		\
@@ -373,6 +374,8 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_TRUNCATED_LRU, "already truncated LRU page" )	\
 	EM ( MF_MSG_BUDDY, "free buddy page" )				\
 	EM ( MF_MSG_BUDDY_2ND, "free buddy page (2nd try)" )		\
+	EM ( MF_MSG_DAX, "dax page" )					\
+	EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )			\
 	EMe ( MF_MSG_UNKNOWN, "unknown page" )
 
 /*
diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index bd63f1f2e44e..6f242a194c64 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -583,6 +583,7 @@ static const char * const action_page_types[] = {
 	[MF_MSG_BUDDY]			= "free buddy page",
 	[MF_MSG_BUDDY_2ND]		= "free buddy page (2nd try)",
 	[MF_MSG_DAX]			= "dax page",
+	[MF_MSG_UNSPLIT_THP]		= "unsplit thp",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };
 
@@ -1373,8 +1374,10 @@ int memory_failure(unsigned long pfn, int flags)
 	}
 
 	if (PageTransHuge(hpage)) {
-		if (try_to_split_thp_page(p, "Memory Failure") < 0)
+		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
+			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
 			return -EBUSY;
+		}
 		VM_BUG_ON_PAGE(!page_count(p), p);
 	}
 
-- 
2.17.1



* [PATCH v5 16/16] mm,hwpoison: double-check page count in __get_any_page()
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (14 preceding siblings ...)
  2020-07-31 12:21 ` [PATCH v5 15/16] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP nao.horiguchi
@ 2020-07-31 12:21 ` nao.horiguchi
  2020-08-03 12:39 ` [PATCH v5 00/16] HWPOISON: soft offline rework Qian Cai
  2020-08-03 19:07 ` Qian Cai
  17 siblings, 0 replies; 26+ messages in thread
From: nao.horiguchi @ 2020-07-31 12:21 UTC (permalink / raw)
  To: linux-mm
  Cc: mhocko, akpm, mike.kravetz, osalvador, tony.luck, david,
	aneesh.kumar, zeil, cai, naoya.horiguchi, linux-kernel

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Soft offlining could fail with EIO due to a race condition with
hugepage migration. This issue became visible due to the change in the
previous patch that makes the soft offline handler take the page refcount
on its own.  We have no way to directly pin a zero-refcount page, and
a page considered to have zero refcount could be allocated just
after the first check.

This patch adds a second check to detect the race and gives us a
chance to handle it more reliably.
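
The idea can be sketched in userspace as below. Everything here is a
simplified model under stated assumptions: the struct, the
racing_allocs knob that simulates a concurrent allocation, and the
function names are all hypothetical.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical page model; racing_allocs simulates concurrent allocations. */
struct race_model_page {
	int refcount;
	bool in_buddy;
	int racing_allocs;
};

/* A page can only be pinned while its refcount is non-zero. */
static int model_try_pin(struct race_model_page *p)
{
	if (p->refcount == 0)
		return 0;
	p->refcount++;
	return 1;
}

/* One attempt: 1 = pinned, 0 = free page, -EBUSY = raced, -EIO = unknown. */
int model_get_page_once(struct race_model_page *p)
{
	if (model_try_pin(p))
		return 1;

	/* Simulated race: someone allocates the page after the failed pin. */
	if (p->racing_allocs > 0) {
		p->racing_allocs--;
		p->in_buddy = false;
		p->refcount = 1;
	}

	if (p->in_buddy)
		return 0;
	if (p->refcount)
		return -EBUSY;	/* second check: raced with allocation */
	return -EIO;
}

/* Retry exactly once when the first attempt detected the race. */
int model_get_page(struct race_model_page *p)
{
	int ret = model_get_page_once(p);

	if (ret == -EBUSY)
		ret = model_get_page_once(p);
	return ret;
}
```

In this model a page that is allocated between the two checks ends up
pinned on the retry instead of being reported as an unknown
zero-refcount page.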

Reported-by: Qian Cai <cai@lca.pw>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
index 6f242a194c64..b2753ce2b85b 100644
--- v5.8-rc7-mmotm-2020-07-27-18-18/mm/memory-failure.c
+++ v5.8-rc7-mmotm-2020-07-27-18-18_patched/mm/memory-failure.c
@@ -1694,6 +1694,9 @@ static int __get_any_page(struct page *p, unsigned long pfn)
 		} else if (is_free_buddy_page(p)) {
 			pr_info("%s: %#lx free buddy page\n", __func__, pfn);
 			ret = 0;
+		} else if (page_count(p)) {
+			/* raced with allocation */
+			ret = -EBUSY;
 		} else {
 			pr_info("%s: %#lx: unknown zero refcount page type %lx\n",
 				__func__, pfn, p->flags);
@@ -1710,6 +1713,9 @@ static int get_any_page(struct page *page, unsigned long pfn)
 {
 	int ret = __get_any_page(page, pfn);
 
+	if (ret == -EBUSY)
+		ret = __get_any_page(page, pfn);
+
 	if (ret == 1 && !PageHuge(page) &&
 	    !PageLRU(page) && !__PageMovable(page)) {
 		/*
-- 
2.17.1



* Re: [PATCH v5 00/16] HWPOISON: soft offline rework
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (15 preceding siblings ...)
  2020-07-31 12:21 ` [PATCH v5 16/16] mm,hwpoison: double-check page count in __get_any_page() nao.horiguchi
@ 2020-08-03 12:39 ` Qian Cai
  2020-08-03 13:36   ` HORIGUCHI NAOYA(堀口 直也)
  2020-08-03 19:07 ` Qian Cai
  17 siblings, 1 reply; 26+ messages in thread
From: Qian Cai @ 2020-08-03 12:39 UTC (permalink / raw)
  To: nao.horiguchi
  Cc: linux-mm, mhocko, akpm, mike.kravetz, osalvador, tony.luck,
	david, aneesh.kumar, zeil, naoya.horiguchi, linux-kernel

On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@gmail.com wrote:
> This patchset is the latest version of soft offline rework patchset
> targetted for v5.9.
> 
> Main focus of this series is to stabilize soft offline.  Historically soft
> offlined pages have suffered from racy conditions because PageHWPoison is
> used to a little too aggressively, which (directly or indirectly) invades
> other mm code which cares little about hwpoison.  This results in unexpected
> behavior or kernel panic, which is very far from soft offline's "do not
> disturb userspace or other kernel component" policy.
> 
> Main point of this change set is to contain target page "via buddy allocator",
> where we first free the target page as we do for normal pages, and remove
> from buddy only when we confirm that it reaches free list. There is surely
> race window of page allocation, but that's fine because someone really want
> that page and the page is still working, so soft offline can happily give up.
> 
> v4 from Oscar tries to handle the race around reallocation, but that part
> seems still work in progress, so I decide to separate it for changes into
> v5.9.  Thank you for your contribution, Oscar.
> 
> The issue reported by Qian Cai is fixed by patch 16/16.

I am still getting EIO everywhere on next-20200803 (which includes this v5).

# ./random 1
- start: migrate_huge_offline
- use NUMA nodes 0,8.
- mmap and free 8388608 bytes hugepages on node 0
- mmap and free 8388608 bytes hugepages on node 8
madvise: Input/output error

From the serial console,

[  637.164222][ T8357] soft offline: 0x118ee0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
[  637.164890][ T8357] Soft offlining pfn 0x20001380 at process virtual address 0x7fff9f000000
[  637.165422][ T8357] Soft offlining pfn 0x3ba00 at process virtual address 0x7fff9f200000
[  637.166409][ T8357] Soft offlining pfn 0x201914a0 at process virtual address 0x7fff9f000000
[  637.166833][ T8357] Soft offlining pfn 0x12b9a0 at process virtual address 0x7fff9f200000
[  637.168044][ T8357] Soft offlining pfn 0x1abb60 at process virtual address 0x7fff9f000000
[  637.168557][ T8357] Soft offlining pfn 0x20014820 at process virtual address 0x7fff9f200000
[  637.169493][ T8357] Soft offlining pfn 0x119720 at process virtual address 0x7fff9f000000
[  637.169603][ T8357] soft offline: 0x119720: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
[  637.169756][ T8357] Soft offlining pfn 0x118ee0 at process virtual address 0x7fff9f200000
[  637.170653][ T8357] Soft offlining pfn 0x200e81e0 at process virtual address 0x7fff9f000000
[  637.171067][ T8357] Soft offlining pfn 0x201c5f60 at process virtual address 0x7fff9f200000
[  637.172101][ T8357] Soft offlining pfn 0x201c8f00 at process virtual address 0x7fff9f000000
[  637.172241][ T8357] __get_any_page: 0x201c8f00: unknown zero refcount page type 87fff8000000000

> 
> This patchset is based on v5.8-rc7-mmotm-2020-07-27-18-18, but I applied
> this series after reverting previous version.
> Maybe https://github.com/Naoya-Horiguchi/linux/commits/soft-offline-rework.v5
> shows what I did more precisely.
> 
> Any other comment/suggestion/help would be appreciated.
> 
> Thanks,
> Naoya Horiguchi
> ---
> Previous versions:
>   v1: https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
>   v2: https://lore.kernel.org/linux-mm/20191017142123.24245-1-osalvador@suse.de/
>   v3: https://lore.kernel.org/linux-mm/20200624150137.7052-1-nao.horiguchi@gmail.com/
>   v4: https://lore.kernel.org/linux-mm/20200716123810.25292-1-osalvador@suse.de/
> ---
> Summary:
> 
> Naoya Horiguchi (8):
>       mm,hwpoison: cleanup unused PageHuge() check
>       mm, hwpoison: remove recalculating hpage
>       mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
>       mm,hwpoison-inject: don't pin for hwpoison_filter
>       mm,hwpoison: remove MF_COUNT_INCREASED
>       mm,hwpoison: remove flag argument from soft offline functions
>       mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
>       mm,hwpoison: double-check page count in __get_any_page()
> 
> Oscar Salvador (8):
>       mm,madvise: Refactor madvise_inject_error
>       mm,hwpoison: Un-export get_hwpoison_page and make it static
>       mm,hwpoison: Kill put_hwpoison_page
>       mm,hwpoison: Unify THP handling for hard and soft offline
>       mm,hwpoison: Rework soft offline for free pages
>       mm,hwpoison: Rework soft offline for in-use pages
>       mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
>       mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
> 
>  drivers/base/memory.c      |   2 +-
>  include/linux/mm.h         |  12 +-
>  include/linux/page-flags.h |   6 +-
>  include/ras/ras_event.h    |   3 +
>  mm/hwpoison-inject.c       |  18 +--
>  mm/madvise.c               |  39 +++---
>  mm/memory-failure.c        | 334 ++++++++++++++++++++-------------------------
>  mm/migrate.c               |  11 +-
>  mm/page_alloc.c            |  60 ++++++--
>  9 files changed, 233 insertions(+), 252 deletions(-)


* Re: [PATCH v5 00/16] HWPOISON: soft offline rework
  2020-08-03 12:39 ` [PATCH v5 00/16] HWPOISON: soft offline rework Qian Cai
@ 2020-08-03 13:36   ` HORIGUCHI NAOYA(堀口 直也)
  2020-08-03 15:19     ` Qian Cai
  0 siblings, 1 reply; 26+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-08-03 13:36 UTC (permalink / raw)
  To: Qian Cai
  Cc: nao.horiguchi, linux-mm, mhocko, akpm, mike.kravetz, osalvador,
	tony.luck, david, aneesh.kumar, zeil, linux-kernel

Hello,

On Mon, Aug 03, 2020 at 08:39:55AM -0400, Qian Cai wrote:
> On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@gmail.com wrote:
> > This patchset is the latest version of soft offline rework patchset
> > targetted for v5.9.
> > 
> > Main focus of this series is to stabilize soft offline.  Historically soft
> > offlined pages have suffered from racy conditions because PageHWPoison is
> > used to a little too aggressively, which (directly or indirectly) invades
> > other mm code which cares little about hwpoison.  This results in unexpected
> > behavior or kernel panic, which is very far from soft offline's "do not
> > disturb userspace or other kernel component" policy.
> > 
> > Main point of this change set is to contain target page "via buddy allocator",
> > where we first free the target page as we do for normal pages, and remove
> > from buddy only when we confirm that it reaches free list. There is surely
> > race window of page allocation, but that's fine because someone really want
> > that page and the page is still working, so soft offline can happily give up.
> > 
> > v4 from Oscar tries to handle the race around reallocation, but that part
> > seems still work in progress, so I decide to separate it for changes into
> > v5.9.  Thank you for your contribution, Oscar.
> > 
> > The issue reported by Qian Cai is fixed by patch 16/16.
> 
> I am still getting EIO everywhere on next-20200803 (which includes this v5).
> 
> # ./random 1
> - start: migrate_huge_offline
> - use NUMA nodes 0,8.
> - mmap and free 8388608 bytes hugepages on node 0
> - mmap and free 8388608 bytes hugepages on node 8
> madvise: Input/output error
> 
> From the serial console,
> 
> [  637.164222][ T8357] soft offline: 0x118ee0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [  637.164890][ T8357] Soft offlining pfn 0x20001380 at process virtual address 0x7fff9f000000
> [  637.165422][ T8357] Soft offlining pfn 0x3ba00 at process virtual address 0x7fff9f200000
> [  637.166409][ T8357] Soft offlining pfn 0x201914a0 at process virtual address 0x7fff9f000000
> [  637.166833][ T8357] Soft offlining pfn 0x12b9a0 at process virtual address 0x7fff9f200000
> [  637.168044][ T8357] Soft offlining pfn 0x1abb60 at process virtual address 0x7fff9f000000
> [  637.168557][ T8357] Soft offlining pfn 0x20014820 at process virtual address 0x7fff9f200000
> [  637.169493][ T8357] Soft offlining pfn 0x119720 at process virtual address 0x7fff9f000000
> [  637.169603][ T8357] soft offline: 0x119720: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [  637.169756][ T8357] Soft offlining pfn 0x118ee0 at process virtual address 0x7fff9f200000
> [  637.170653][ T8357] Soft offlining pfn 0x200e81e0 at process virtual address 0x7fff9f000000
> [  637.171067][ T8357] Soft offlining pfn 0x201c5f60 at process virtual address 0x7fff9f200000
> [  637.172101][ T8357] Soft offlining pfn 0x201c8f00 at process virtual address 0x7fff9f000000
> [  637.172241][ T8357] __get_any_page: 0x201c8f00: unknown zero refcount page type 87fff8000000000

I may have misjudged in skipping the following patch, sorry about that.
Could you try with it applied?

---
From eafe6fde94cd15e67631540f1b2b000b6e33a650 Mon Sep 17 00:00:00 2001
From: Oscar Salvador <osalvador@suse.de>
Date: Mon, 3 Aug 2020 22:25:10 +0900
Subject: [PATCH] mm,hwpoison: Drain pcplists before bailing out for non-buddy
 zero-refcount page

A page with 0-refcount and !PageBuddy could perfectly well be a pcppage.
Currently, we bail out with an error if we encounter such a page,
meaning that we handle pcppages from neither the hard-offline
nor the soft-offline path.

Fix this by draining pcplists whenever we find this kind of page
and retrying the check.
The pcplists may then have been spilled into the buddy allocator,
in which case we can handle the page.
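
A simplified userspace model of the drain-once-and-retry flow is shown
below. The struct, the drains counter and the function names are
hypothetical; the real code works on struct page and drains the
per-cpu lists of the page's zone.

```c
#include <stdbool.h>

/* Hypothetical page model; drains counts how often we drained. */
struct pcp_model_page {
	int refcount;
	bool in_buddy;
	bool on_pcplist;
	int drains;
};

/* Draining moves a per-cpu-list page into the buddy free lists. */
static void model_drain_pcplists(struct pcp_model_page *p)
{
	p->drains++;
	if (p->on_pcplist) {
		p->on_pcplist = false;
		p->in_buddy = true;
	}
}

/* Pinning fails (returns 0) for a zero-refcount page. */
static int model_pin_once(struct pcp_model_page *p)
{
	if (p->refcount == 0)
		return 0;
	p->refcount++;
	return 1;
}

/* Pin the page; on a non-buddy zero-refcount page, drain once and retry. */
int model_pin_with_drain(struct pcp_model_page *p)
{
	bool drained = false;
	int ret;

retry:
	ret = model_pin_once(p);
	if (!ret && !p->in_buddy && !drained) {
		/* The page might sit on a pcplist, so drain and look again. */
		model_drain_pcplists(p);
		drained = true;
		goto retry;
	}
	return ret;
}
```

The drained flag bounds the work to a single drain, so a page that is
neither on a pcplist nor in buddy still fails after one retry.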

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 30 ++++++++++++++++++++++++------
 1 file changed, 24 insertions(+), 6 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index b2753ce2b85b..02be529445c0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -949,13 +949,13 @@ static int page_action(struct page_state *ps, struct page *p,
 }
 
 /**
- * get_hwpoison_page() - Get refcount for memory error handling:
+ * __get_hwpoison_page() - Get refcount for memory error handling:
  * @page:	raw error page (hit by memory error)
  *
  * Return: return 0 if failed to grab the refcount, otherwise true (some
  * non-zero value.)
  */
-static int get_hwpoison_page(struct page *page)
+static int __get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
 
@@ -985,6 +985,28 @@ static int get_hwpoison_page(struct page *page)
 	return 0;
 }
 
+static int get_hwpoison_page(struct page *p)
+{
+	int ret;
+	bool drained = false;
+
+retry:
+	ret = __get_hwpoison_page(p);
+	if (!ret) {
+		if (!is_free_buddy_page(p) && !page_count(p) && !drained) {
+			/*
+			 * The page might be in a pcplist, so try to drain
+			 * those and see if we are lucky.
+			 */
+			drain_all_pages(page_zone(p));
+			drained = true;
+			goto retry;
+		}
+	}
+
+	return ret;
+}
+
 /*
  * Do all that is necessary to remove user space mappings. Unmap
  * the pages and send SIGBUS to the processes if the data was dirty.
@@ -1683,10 +1705,6 @@ static int __get_any_page(struct page *p, unsigned long pfn)
 {
 	int ret;
 
-	/*
-	 * When the target page is a free hugepage, just remove it
-	 * from free hugepage list.
-	 */
 	if (!get_hwpoison_page(p)) {
 		if (PageHuge(p)) {
 			pr_info("%s: %#lx free huge page\n", __func__, pfn);
-- 
2.25.1


* Re: [PATCH v5 00/16] HWPOISON: soft offline rework
  2020-08-03 13:36   ` HORIGUCHI NAOYA(堀口 直也)
@ 2020-08-03 15:19     ` Qian Cai
  2020-08-05 20:43       ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 26+ messages in thread
From: Qian Cai @ 2020-08-03 15:19 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: nao.horiguchi, linux-mm, mhocko, akpm, mike.kravetz, osalvador,
	tony.luck, david, aneesh.kumar, zeil, linux-kernel

On Mon, Aug 03, 2020 at 01:36:58PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> Hello,
> 
> On Mon, Aug 03, 2020 at 08:39:55AM -0400, Qian Cai wrote:
> > On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@gmail.com wrote:
> > > This patchset is the latest version of soft offline rework patchset
> > > targetted for v5.9.
> > > 
> > > Main focus of this series is to stabilize soft offline.  Historically soft
> > > offlined pages have suffered from racy conditions because PageHWPoison is
> > > used to a little too aggressively, which (directly or indirectly) invades
> > > other mm code which cares little about hwpoison.  This results in unexpected
> > > behavior or kernel panic, which is very far from soft offline's "do not
> > > disturb userspace or other kernel component" policy.
> > > 
> > > Main point of this change set is to contain target page "via buddy allocator",
> > > where we first free the target page as we do for normal pages, and remove
> > > from buddy only when we confirm that it reaches free list. There is surely
> > > race window of page allocation, but that's fine because someone really want
> > > that page and the page is still working, so soft offline can happily give up.
> > > 
> > > v4 from Oscar tries to handle the race around reallocation, but that part
> > > seems still work in progress, so I decide to separate it for changes into
> > > v5.9.  Thank you for your contribution, Oscar.
> > > 
> > > The issue reported by Qian Cai is fixed by patch 16/16.
> > 
> > I am still getting EIO everywhere on next-20200803 (which includes this v5).
> > 
> > # ./random 1
> > - start: migrate_huge_offline
> > - use NUMA nodes 0,8.
> > - mmap and free 8388608 bytes hugepages on node 0
> > - mmap and free 8388608 bytes hugepages on node 8
> > madvise: Input/output error
> > 
> > From the serial console,
> > 
> > [  637.164222][ T8357] soft offline: 0x118ee0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> > [  637.164890][ T8357] Soft offlining pfn 0x20001380 at process virtual address 0x7fff9f000000
> > [  637.165422][ T8357] Soft offlining pfn 0x3ba00 at process virtual address 0x7fff9f200000
> > [  637.166409][ T8357] Soft offlining pfn 0x201914a0 at process virtual address 0x7fff9f000000
> > [  637.166833][ T8357] Soft offlining pfn 0x12b9a0 at process virtual address 0x7fff9f200000
> > [  637.168044][ T8357] Soft offlining pfn 0x1abb60 at process virtual address 0x7fff9f000000
> > [  637.168557][ T8357] Soft offlining pfn 0x20014820 at process virtual address 0x7fff9f200000
> > [  637.169493][ T8357] Soft offlining pfn 0x119720 at process virtual address 0x7fff9f000000
> > [  637.169603][ T8357] soft offline: 0x119720: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> > [  637.169756][ T8357] Soft offlining pfn 0x118ee0 at process virtual address 0x7fff9f200000
> > [  637.170653][ T8357] Soft offlining pfn 0x200e81e0 at process virtual address 0x7fff9f000000
> > [  637.171067][ T8357] Soft offlining pfn 0x201c5f60 at process virtual address 0x7fff9f200000
> > [  637.172101][ T8357] Soft offlining pfn 0x201c8f00 at process virtual address 0x7fff9f000000
> > [  637.172241][ T8357] __get_any_page: 0x201c8f00: unknown zero refcount page type 87fff8000000000
> 
> I might misjudge to skip the following patch, sorry about that.
> Could you try with it?

Still getting EIO after applying this patch.

[ 1215.499030][T88982] soft offline: 0x201bdc20: hugepage isolation failed: 0, page count 2, type 87fff800001000e (referenced|uptodate|dirty|head)
[ 1215.499775][T88982] Soft offlining pfn 0x201bdc20 at process virtual address 0x7fff91a00000
[ 1215.500189][T88982] Soft offlining pfn 0x201c19c0 at process virtual address 0x7fff91c00000
[ 1215.500297][T88982] soft offline: 0x201c19c0: hugepage isolation failed: 0, page count 2, type 87fff800001000e (referenced|uptodate|dirty|head)
[ 1215.500982][T88982] Soft offlining pfn 0x1f1fa0 at process virtual address 0x7fff91a00000
[ 1215.501086][T88982] soft offline: 0x1f1fa0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
[ 1215.501237][T88982] Soft offlining pfn 0x1f4520 at process virtual address 0x7fff91c00000
[ 1215.501355][T88982] soft offline: 0x1f4520: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
[ 1215.502196][T88982] Soft offlining pfn 0x1f4520 at process virtual address 0x7fff91a00000
[ 1215.502573][T88982] Soft offlining pfn 0x1f1fa0 at process virtual address 0x7fff91c00000
[ 1215.502687][T88982] soft offline: 0x1f1fa0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
[ 1215.503245][T88982] Soft offlining pfn 0x201c3cc0 at process virtual address 0x7fff91a00000
[ 1215.503594][T88982] Soft offlining pfn 0x201c3ce0 at process virtual address 0x7fff91c00000
[ 1215.503755][T88982] __get_any_page: 0x201c3ce0: unknown zero refcount page type 87fff8000000000

> 
> ---
> From eafe6fde94cd15e67631540f1b2b000b6e33a650 Mon Sep 17 00:00:00 2001
> From: Oscar Salvador <osalvador@suse.de>
> Date: Mon, 3 Aug 2020 22:25:10 +0900
> Subject: [PATCH] mm,hwpoison: Drain pcplists before bailing out for non-buddy
>  zero-refcount page


* Re: [PATCH v5 00/16] HWPOISON: soft offline rework
  2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
                   ` (16 preceding siblings ...)
  2020-08-03 12:39 ` [PATCH v5 00/16] HWPOISON: soft offline rework Qian Cai
@ 2020-08-03 19:07 ` Qian Cai
  2020-08-04  1:16   ` HORIGUCHI NAOYA(堀口 直也)
  17 siblings, 1 reply; 26+ messages in thread
From: Qian Cai @ 2020-08-03 19:07 UTC (permalink / raw)
  To: nao.horiguchi
  Cc: linux-mm, mhocko, akpm, mike.kravetz, osalvador, tony.luck,
	david, aneesh.kumar, zeil, naoya.horiguchi, linux-kernel

On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@gmail.com wrote:
> This patchset is the latest version of soft offline rework patchset
> targetted for v5.9.
> 
> Main focus of this series is to stabilize soft offline.  Historically soft
> offlined pages have suffered from racy conditions because PageHWPoison is
> used to a little too aggressively, which (directly or indirectly) invades
> other mm code which cares little about hwpoison.  This results in unexpected
> behavior or kernel panic, which is very far from soft offline's "do not
> disturb userspace or other kernel component" policy.
> 
> Main point of this change set is to contain target page "via buddy allocator",
> where we first free the target page as we do for normal pages, and remove
> from buddy only when we confirm that it reaches free list. There is surely
> race window of page allocation, but that's fine because someone really want
> that page and the page is still working, so soft offline can happily give up.
> 
> v4 from Oscar tries to handle the race around reallocation, but that part
> seems still work in progress, so I decide to separate it for changes into
> v5.9.  Thank you for your contribution, Oscar.
> 
> The issue reported by Qian Cai is fixed by patch 16/16.
> 
> This patchset is based on v5.8-rc7-mmotm-2020-07-27-18-18, but I applied
> this series after reverting previous version.
> Maybe https://github.com/Naoya-Horiguchi/linux/commits/soft-offline-rework.v5
> shows what I did more precisely.
> 
> Any other comment/suggestion/help would be appreciated.

There is another issue with this patchset (with and without the patch [1]).

[1] https://lore.kernel.org/lkml/20200803133657.GA13307@hori.linux.bs1.fc.nec.co.jp/

Arm64 using 512M-size hugepages starts to fail allocations prematurely.

# ./random 1
- start: migrate_huge_offline
- use NUMA nodes 0,1.
- mmap and free 2147483648 bytes hugepages on node 0
- mmap and free 2147483648 bytes hugepages on node 1
madvise: Cannot allocate memory

[  284.388061][ T3706] soft offline: 0x956000: hugepage isolation failed: 0, page count 2, type 17ffff80001000e (referenced|uptodate|dirty|head)
[  284.400777][ T3706] Soft offlining pfn 0x8e000 at process virtual address 0xffff80000000
[  284.893412][ T3706] Soft offlining pfn 0x8a000 at process virtual address 0xffff60000000
[  284.901539][ T3706] soft offline: 0x8a000: hugepage isolation failed: 0, page count 2, type 7ffff80001000e (referenced|uptodate|dirty|head)
[  284.914129][ T3706] Soft offlining pfn 0x8c000 at process virtual address 0xffff80000000
[  285.433497][ T3706] Soft offlining pfn 0x88000 at process virtual address 0xffff60000000
[  285.720377][ T3706] Soft offlining pfn 0x8a000 at process virtual address 0xffff80000000
[  286.281620][ T3706] Soft offlining pfn 0xa000 at process virtual address 0xffff60000000
[  286.290065][ T3706] soft offline: 0xa000: hugepage migration failed -12, type 7ffff80001000e (referenced|uptodate|dirty|head)

Reverting this patchset and its dependency patchset [2] (reverting the
dependency alone did not help) fixed it,

# ./random 1
- start: migrate_huge_offline
- use NUMA nodes 0,1.
- mmap and free 2147483648 bytes hugepages on node 0
- mmap and free 2147483648 bytes hugepages on node 1
- pass: mmap_offline_node_huge

[2] https://lore.kernel.org/linux-mm/1594622517-20681-1-git-send-email-iamjoonsoo.kim@lge.com/ 

> 
> Thanks,
> Naoya Horiguchi
> ---
> Previous versions:
>   v1: https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
>   v2: https://lore.kernel.org/linux-mm/20191017142123.24245-1-osalvador@suse.de/
>   v3: https://lore.kernel.org/linux-mm/20200624150137.7052-1-nao.horiguchi@gmail.com/
>   v4: https://lore.kernel.org/linux-mm/20200716123810.25292-1-osalvador@suse.de/
> ---
> Summary:
> 
> Naoya Horiguchi (8):
>       mm,hwpoison: cleanup unused PageHuge() check
>       mm, hwpoison: remove recalculating hpage
>       mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
>       mm,hwpoison-inject: don't pin for hwpoison_filter
>       mm,hwpoison: remove MF_COUNT_INCREASED
>       mm,hwpoison: remove flag argument from soft offline functions
>       mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
>       mm,hwpoison: double-check page count in __get_any_page()
> 
> Oscar Salvador (8):
>       mm,madvise: Refactor madvise_inject_error
>       mm,hwpoison: Un-export get_hwpoison_page and make it static
>       mm,hwpoison: Kill put_hwpoison_page
>       mm,hwpoison: Unify THP handling for hard and soft offline
>       mm,hwpoison: Rework soft offline for free pages
>       mm,hwpoison: Rework soft offline for in-use pages
>       mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
>       mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
> 
>  drivers/base/memory.c      |   2 +-
>  include/linux/mm.h         |  12 +-
>  include/linux/page-flags.h |   6 +-
>  include/ras/ras_event.h    |   3 +
>  mm/hwpoison-inject.c       |  18 +--
>  mm/madvise.c               |  39 +++---
>  mm/memory-failure.c        | 334 ++++++++++++++++++++-------------------------
>  mm/migrate.c               |  11 +-
>  mm/page_alloc.c            |  60 ++++++--
>  9 files changed, 233 insertions(+), 252 deletions(-)

* Re: [PATCH v5 00/16] HWPOISON: soft offline rework
  2020-08-03 19:07 ` Qian Cai
@ 2020-08-04  1:16   ` HORIGUCHI NAOYA(堀口 直也)
  2020-08-04  1:49     ` Qian Cai
  0 siblings, 1 reply; 26+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-08-04  1:16 UTC (permalink / raw)
  To: Qian Cai
  Cc: nao.horiguchi, linux-mm, mhocko, akpm, mike.kravetz, osalvador,
	tony.luck, david, aneesh.kumar, zeil, linux-kernel

On Mon, Aug 03, 2020 at 03:07:09PM -0400, Qian Cai wrote:
> On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@gmail.com wrote:
> > This patchset is the latest version of soft offline rework patchset
> > targeted for v5.9.
> > 
> > Main focus of this series is to stabilize soft offline.  Historically soft
> > offlined pages have suffered from racy conditions because PageHWPoison is
> > used a little too aggressively, which (directly or indirectly) invades
> > other mm code which cares little about hwpoison.  This results in unexpected
> > behavior or kernel panic, which is very far from soft offline's "do not
> > disturb userspace or other kernel component" policy.
> > 
> > Main point of this change set is to contain target page "via buddy allocator",
> > where we first free the target page as we do for normal pages, and remove
> > from buddy only when we confirm that it reaches free list. There is surely
> > race window of page allocation, but that's fine because someone really wants
> > that page and the page is still working, so soft offline can happily give up.
> > 
> > v4 from Oscar tries to handle the race around reallocation, but that part
> > seems still work in progress, so I decided to separate it for changes into
> > v5.9.  Thank you for your contribution, Oscar.
> > 
> > The issue reported by Qian Cai is fixed by patch 16/16.
> > 
> > This patchset is based on v5.8-rc7-mmotm-2020-07-27-18-18, but I applied
> > this series after reverting previous version.
> > Maybe https://github.com/Naoya-Horiguchi/linux/commits/soft-offline-rework.v5
> > shows what I did more precisely.
> > 
> > Any other comment/suggestion/help would be appreciated.
> 
> There is another issue with this patchset (with and without the patch [1]).
> 
> [1] https://lore.kernel.org/lkml/20200803133657.GA13307@hori.linux.bs1.fc.nec.co.jp/
> 
> Arm64 using 512M-size hugepages starts to fail allocations prematurely.
> 
> # ./random 1
> - start: migrate_huge_offline
> - use NUMA nodes 0,1.
> - mmap and free 2147483648 bytes hugepages on node 0
> - mmap and free 2147483648 bytes hugepages on node 1
> madvise: Cannot allocate memory
> 
> [  284.388061][ T3706] soft offline: 0x956000: hugepage isolation failed: 0, page count 2, type 17ffff80001000e (referenced|uptodate|dirty|head)
> [  284.400777][ T3706] Soft offlining pfn 0x8e000 at process virtual address 0xffff80000000
> [  284.893412][ T3706] Soft offlining pfn 0x8a000 at process virtual address 0xffff60000000
> [  284.901539][ T3706] soft offline: 0x8a000: hugepage isolation failed: 0, page count 2, type 7ffff80001000e (referenced|uptodate|dirty|head)
> [  284.914129][ T3706] Soft offlining pfn 0x8c000 at process virtual address 0xffff80000000
> [  285.433497][ T3706] Soft offlining pfn 0x88000 at process virtual address 0xffff60000000
> [  285.720377][ T3706] Soft offlining pfn 0x8a000 at process virtual address 0xffff80000000
> [  286.281620][ T3706] Soft offlining pfn 0xa000 at process virtual address 0xffff60000000
> [  286.290065][ T3706] soft offline: 0xa000: hugepage migration failed -12, type 7ffff80001000e (referenced|uptodate|dirty|head)

I think this is due to a lack of contiguous memory.
This test program soft-offlines hugepages many times, so eventually one
page in every 512MB range is removed from buddy, and then we can't
allocate a hugepage any more even if we have enough free pages.
This is not good for heavy hugepage users, but it should be the intended
behavior.

It seems that random.c calls madvise(MADV_SOFT_OFFLINE) for 2 hugepages,
and repeats this 1000 (== NR_LOOP) times, so if the system doesn't have
enough memory to cover the range of 2000 hugepages (1000GB on this Arm64
system), this ENOMEM is expected to reproduce.
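
The estimate above can be checked with a few lines of C. This is only a
back-of-the-envelope sketch: it assumes the worst case where every
soft-offlined hugepage spoils a different 512MB range, and the helper name
hugepage_span_bytes() is made up for illustration (it is not part of
random.c or the kernel).

```c
#include <stdint.h>

/*
 * Physical span (in bytes) needed if every soft-offlined hugepage lands
 * in a different hugepage-sized range. Hypothetical helper for checking
 * the arithmetic only.
 */
static uint64_t hugepage_span_bytes(uint64_t nr_loop, uint64_t pages_per_loop,
				    uint64_t hugepage_bytes)
{
	return nr_loop * pages_per_loop * hugepage_bytes;
}
```

With nr_loop = 1000, pages_per_loop = 2 and hugepage_bytes = 512MB this is
2000 * 512MB = 1000GB, which is where the figure above comes from.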

> 
> Reverting this patchset and its dependency patchset [2] (reverting the
> dependency alone did not help) fixed it,

But it's still not clear to me why this was not visible before this
patchset, so I need to look into it further.

Thanks,
Naoya Horiguchi

> 
> # ./random 1
> - start: migrate_huge_offline
> - use NUMA nodes 0,1.
> - mmap and free 2147483648 bytes hugepages on node 0
> - mmap and free 2147483648 bytes hugepages on node 1
> - pass: mmap_offline_node_huge
> 
> [2] https://lore.kernel.org/linux-mm/1594622517-20681-1-git-send-email-iamjoonsoo.kim@lge.com/ 
> 
> > 
> > Thanks,
> > Naoya Horiguchi
> > ---
> > Previous versions:
> >   v1: https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
> >   v2: https://lore.kernel.org/linux-mm/20191017142123.24245-1-osalvador@suse.de/
> >   v3: https://lore.kernel.org/linux-mm/20200624150137.7052-1-nao.horiguchi@gmail.com/
> >   v4: https://lore.kernel.org/linux-mm/20200716123810.25292-1-osalvador@suse.de/
> > ---
> > Summary:
> > 
> > Naoya Horiguchi (8):
> >       mm,hwpoison: cleanup unused PageHuge() check
> >       mm, hwpoison: remove recalculating hpage
> >       mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
> >       mm,hwpoison-inject: don't pin for hwpoison_filter
> >       mm,hwpoison: remove MF_COUNT_INCREASED
> >       mm,hwpoison: remove flag argument from soft offline functions
> >       mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
> >       mm,hwpoison: double-check page count in __get_any_page()
> > 
> > Oscar Salvador (8):
> >       mm,madvise: Refactor madvise_inject_error
> >       mm,hwpoison: Un-export get_hwpoison_page and make it static
> >       mm,hwpoison: Kill put_hwpoison_page
> >       mm,hwpoison: Unify THP handling for hard and soft offline
> >       mm,hwpoison: Rework soft offline for free pages
> >       mm,hwpoison: Rework soft offline for in-use pages
> >       mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
> >       mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
> > 
> >  drivers/base/memory.c      |   2 +-
> >  include/linux/mm.h         |  12 +-
> >  include/linux/page-flags.h |   6 +-
> >  include/ras/ras_event.h    |   3 +
> >  mm/hwpoison-inject.c       |  18 +--
> >  mm/madvise.c               |  39 +++---
> >  mm/memory-failure.c        | 334 ++++++++++++++++++++-------------------------
> >  mm/migrate.c               |  11 +-
> >  mm/page_alloc.c            |  60 ++++++--
> >  9 files changed, 233 insertions(+), 252 deletions(-)
> 

* Re: [PATCH v5 00/16] HWPOISON: soft offline rework
  2020-08-04  1:16   ` HORIGUCHI NAOYA(堀口 直也)
@ 2020-08-04  1:49     ` Qian Cai
  2020-08-04  8:13       ` osalvador
  2020-08-05 20:44       ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 2 replies; 26+ messages in thread
From: Qian Cai @ 2020-08-04  1:49 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: nao.horiguchi, linux-mm, mhocko, akpm, mike.kravetz, osalvador,
	tony.luck, david, aneesh.kumar, zeil, linux-kernel

On Tue, Aug 04, 2020 at 01:16:45AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Aug 03, 2020 at 03:07:09PM -0400, Qian Cai wrote:
> > On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@gmail.com wrote:
> > > This patchset is the latest version of soft offline rework patchset
> > > targeted for v5.9.
> > > 
> > > Main focus of this series is to stabilize soft offline.  Historically soft
> > > offlined pages have suffered from racy conditions because PageHWPoison is
> > > used a little too aggressively, which (directly or indirectly) invades
> > > other mm code which cares little about hwpoison.  This results in unexpected
> > > behavior or kernel panic, which is very far from soft offline's "do not
> > > disturb userspace or other kernel component" policy.
> > > 
> > > Main point of this change set is to contain target page "via buddy allocator",
> > > where we first free the target page as we do for normal pages, and remove
> > > from buddy only when we confirm that it reaches free list. There is surely
> > > race window of page allocation, but that's fine because someone really wants
> > > that page and the page is still working, so soft offline can happily give up.
> > > 
> > > v4 from Oscar tries to handle the race around reallocation, but that part
> > > seems still work in progress, so I decided to separate it for changes into
> > > v5.9.  Thank you for your contribution, Oscar.
> > > 
> > > The issue reported by Qian Cai is fixed by patch 16/16.
> > > 
> > > This patchset is based on v5.8-rc7-mmotm-2020-07-27-18-18, but I applied
> > > this series after reverting previous version.
> > > Maybe https://github.com/Naoya-Horiguchi/linux/commits/soft-offline-rework.v5
> > > shows what I did more precisely.
> > > 
> > > Any other comment/suggestion/help would be appreciated.
> > 
> > There is another issue with this patchset (with and without the patch [1]).
> > 
> > [1] https://lore.kernel.org/lkml/20200803133657.GA13307@hori.linux.bs1.fc.nec.co.jp/
> > 
> > Arm64 using 512M-size hugepages starts to fail allocations prematurely.
> > 
> > # ./random 1
> > - start: migrate_huge_offline
> > - use NUMA nodes 0,1.
> > - mmap and free 2147483648 bytes hugepages on node 0
> > - mmap and free 2147483648 bytes hugepages on node 1
> > madvise: Cannot allocate memory
> > 
> > [  284.388061][ T3706] soft offline: 0x956000: hugepage isolation failed: 0, page count 2, type 17ffff80001000e (referenced|uptodate|dirty|head)
> > [  284.400777][ T3706] Soft offlining pfn 0x8e000 at process virtual address 0xffff80000000
> > [  284.893412][ T3706] Soft offlining pfn 0x8a000 at process virtual address 0xffff60000000
> > [  284.901539][ T3706] soft offline: 0x8a000: hugepage isolation failed: 0, page count 2, type 7ffff80001000e (referenced|uptodate|dirty|head)
> > [  284.914129][ T3706] Soft offlining pfn 0x8c000 at process virtual address 0xffff80000000
> > [  285.433497][ T3706] Soft offlining pfn 0x88000 at process virtual address 0xffff60000000
> > [  285.720377][ T3706] Soft offlining pfn 0x8a000 at process virtual address 0xffff80000000
> > [  286.281620][ T3706] Soft offlining pfn 0xa000 at process virtual address 0xffff60000000
> > [  286.290065][ T3706] soft offline: 0xa000: hugepage migration failed -12, type 7ffff80001000e (referenced|uptodate|dirty|head)
> 
> I think this is due to a lack of contiguous memory.
> This test program soft-offlines hugepages many times, so eventually one
> page in every 512MB range is removed from buddy, and then we can't
> allocate a hugepage any more even if we have enough free pages.
> This is not good for heavy hugepage users, but it should be the intended
> behavior.
> 
> It seems that random.c calls madvise(MADV_SOFT_OFFLINE) for 2 hugepages,
> and repeats this 1000 (== NR_LOOP) times, so if the system doesn't have
> enough memory to cover the range of 2000 hugepages (1000GB on this Arm64
> system), this ENOMEM is expected to reproduce.

Well, each iteration will mmap/munmap, so there should be no leaking. 

https://gitlab.com/cailca/linux-mm/-/blob/master/random.c#L376
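
To make the "mmap/munmap per iteration" point concrete, one iteration has
roughly the following shape. This is a simplified sketch and not copied
from random.c: the helper name map_advise_unmap() and its parameters are
assumptions for illustration, and the advice is passed in so the same
helper can be tried with a harmless value such as MADV_NORMAL.

```c
#define _GNU_SOURCE		/* for MAP_ANONYMOUS on strict compilers */
#include <errno.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Map an anonymous region, touch it so that pages are really allocated,
 * apply one madvise() advice to it, and unmap it again.
 * Returns 0 on success or a negative errno. Hypothetical helper.
 */
static int map_advise_unmap(size_t len, int extra_mmap_flags, int advice)
{
	int ret = 0;
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		       MAP_PRIVATE | MAP_ANONYMOUS | extra_mmap_flags, -1, 0);

	if (p == MAP_FAILED)
		return -errno;

	memset(p, 0xa5, len);		/* fault the pages in */
	if (madvise(p, len, advice) < 0)
		ret = -errno;

	/* The mapping is released whether or not madvise() succeeded. */
	if (munmap(p, len) < 0 && ret == 0)
		ret = -errno;
	return ret;
}
```

The hugepage case would be something like
map_advise_unmap(len, MAP_HUGETLB, MADV_SOFT_OFFLINE), which needs
CAP_SYS_ADMIN, a kernel with CONFIG_MEMORY_FAILURE, and headers that
define MADV_SOFT_OFFLINE.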

It also seems to me that madvise(MADV_SOFT_OFFLINE) does start to fragment
memory somehow, because after this "madvise: Cannot allocate memory" happened,
I immediately checked /proc/meminfo and found no hugepage usage at all.
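
A check like this could be scripted roughly as below. It is a minimal
sketch: meminfo_value() is a hypothetical helper, and it only relies on
the "Key:   value" line layout of /proc/meminfo. Note that the HugePages_*
counters there cover only the default hugepage size; other sizes are
reported under /sys/kernel/mm/hugepages/.

```c
#include <stdio.h>
#include <string.h>

/*
 * Return the numeric value that follows "key:" in a meminfo-style stream,
 * or -1 if the key is not present. Hypothetical helper.
 */
static long meminfo_value(FILE *fp, const char *key)
{
	char line[256];
	size_t klen = strlen(key);

	rewind(fp);
	while (fgets(line, sizeof(line), fp)) {
		long val;

		/* Require an exact key match followed by ':' */
		if (strncmp(line, key, klen) != 0 || line[klen] != ':')
			continue;
		if (sscanf(line + klen + 1, "%ld", &val) == 1)
			return val;
	}
	return -1;
}
```

Opening /proc/meminfo with fopen() and comparing
meminfo_value(fp, "HugePages_Total") against
meminfo_value(fp, "HugePages_Free") shows whether any hugepages of the
default size are still in use.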

> 
> > 
> > Reverting this patchset and its dependency patchset [2] (reverting the
> > dependency alone did not help) fixed it,
> 
> But it's still not clear to me why this was not visible before this
> patchset, so I need to look into it further.
> 
> Thanks,
> Naoya Horiguchi
> 
> > 
> > # ./random 1
> > - start: migrate_huge_offline
> > - use NUMA nodes 0,1.
> > - mmap and free 2147483648 bytes hugepages on node 0
> > - mmap and free 2147483648 bytes hugepages on node 1
> > - pass: mmap_offline_node_huge
> > 
> > [2] https://lore.kernel.org/linux-mm/1594622517-20681-1-git-send-email-iamjoonsoo.kim@lge.com/ 
> > 
> > > 
> > > Thanks,
> > > Naoya Horiguchi
> > > ---
> > > Previous versions:
> > >   v1: https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
> > >   v2: https://lore.kernel.org/linux-mm/20191017142123.24245-1-osalvador@suse.de/
> > >   v3: https://lore.kernel.org/linux-mm/20200624150137.7052-1-nao.horiguchi@gmail.com/
> > >   v4: https://lore.kernel.org/linux-mm/20200716123810.25292-1-osalvador@suse.de/
> > > ---
> > > Summary:
> > > 
> > > Naoya Horiguchi (8):
> > >       mm,hwpoison: cleanup unused PageHuge() check
> > >       mm, hwpoison: remove recalculating hpage
> > >       mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
> > >       mm,hwpoison-inject: don't pin for hwpoison_filter
> > >       mm,hwpoison: remove MF_COUNT_INCREASED
> > >       mm,hwpoison: remove flag argument from soft offline functions
> > >       mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
> > >       mm,hwpoison: double-check page count in __get_any_page()
> > > 
> > > Oscar Salvador (8):
> > >       mm,madvise: Refactor madvise_inject_error
> > >       mm,hwpoison: Un-export get_hwpoison_page and make it static
> > >       mm,hwpoison: Kill put_hwpoison_page
> > >       mm,hwpoison: Unify THP handling for hard and soft offline
> > >       mm,hwpoison: Rework soft offline for free pages
> > >       mm,hwpoison: Rework soft offline for in-use pages
> > >       mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
> > >       mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
> > > 
> > >  drivers/base/memory.c      |   2 +-
> > >  include/linux/mm.h         |  12 +-
> > >  include/linux/page-flags.h |   6 +-
> > >  include/ras/ras_event.h    |   3 +
> > >  mm/hwpoison-inject.c       |  18 +--
> > >  mm/madvise.c               |  39 +++---
> > >  mm/memory-failure.c        | 334 ++++++++++++++++++++-------------------------
> > >  mm/migrate.c               |  11 +-
> > >  mm/page_alloc.c            |  60 ++++++--
> > >  9 files changed, 233 insertions(+), 252 deletions(-)
> > 

* Re: [PATCH v5 00/16] HWPOISON: soft offline rework
  2020-08-04  1:49     ` Qian Cai
@ 2020-08-04  8:13       ` osalvador
  2020-08-05 20:44       ` HORIGUCHI NAOYA(堀口 直也)
  1 sibling, 0 replies; 26+ messages in thread
From: osalvador @ 2020-08-04  8:13 UTC (permalink / raw)
  To: Qian Cai
  Cc: HORIGUCHI NAOYA(堀口 直也),
	nao.horiguchi, linux-mm, mhocko, akpm, mike.kravetz, tony.luck,
	david, aneesh.kumar, zeil, linux-kernel

On 2020-08-04 03:49, Qian Cai wrote:
> 
> Well, each iteration will mmap/munmap, so there should be no leaking.
> 
> https://gitlab.com/cailca/linux-mm/-/blob/master/random.c#L376
> 
> It also seems to me that madvise(MADV_SOFT_OFFLINE) does start to fragment
> memory somehow, because after this "madvise: Cannot allocate memory" happened,
> I immediately checked /proc/meminfo and found no hugepage usage at all.

Unfortunately I will be off for a week, but out of curiosity, could you
try the tree below [1] and see if you still see those issues?

Thanks for your time

[1] https://github.com/leberus/linux-mm-1/tree/hwpoison




* Re: [PATCH v5 00/16] HWPOISON: soft offline rework
  2020-08-03 15:19     ` Qian Cai
@ 2020-08-05 20:43       ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 26+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-08-05 20:43 UTC (permalink / raw)
  To: Qian Cai
  Cc: nao.horiguchi, linux-mm, mhocko, akpm, mike.kravetz, osalvador,
	tony.luck, david, aneesh.kumar, zeil, linux-kernel

On Mon, Aug 03, 2020 at 11:19:09AM -0400, Qian Cai wrote:
> On Mon, Aug 03, 2020 at 01:36:58PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > Hello,
> > 
> > On Mon, Aug 03, 2020 at 08:39:55AM -0400, Qian Cai wrote:
> > > On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@gmail.com wrote:
> > > > This patchset is the latest version of soft offline rework patchset
> > > > targeted for v5.9.
> > > > 
> > > > Main focus of this series is to stabilize soft offline.  Historically soft
> > > > offlined pages have suffered from racy conditions because PageHWPoison is
> > > > used a little too aggressively, which (directly or indirectly) invades
> > > > other mm code which cares little about hwpoison.  This results in unexpected
> > > > behavior or kernel panic, which is very far from soft offline's "do not
> > > > disturb userspace or other kernel component" policy.
> > > > 
> > > > Main point of this change set is to contain target page "via buddy allocator",
> > > > where we first free the target page as we do for normal pages, and remove
> > > > from buddy only when we confirm that it reaches free list. There is surely
> > > > race window of page allocation, but that's fine because someone really wants
> > > > that page and the page is still working, so soft offline can happily give up.
> > > > 
> > > > v4 from Oscar tries to handle the race around reallocation, but that part
> > > > seems still work in progress, so I decided to separate it for changes into
> > > > v5.9.  Thank you for your contribution, Oscar.
> > > > 
> > > > The issue reported by Qian Cai is fixed by patch 16/16.
> > > 
> > > I am still getting EIO everywhere on next-20200803 (which includes this v5).
> > > 
> > > # ./random 1
> > > - start: migrate_huge_offline
> > > - use NUMA nodes 0,8.
> > > - mmap and free 8388608 bytes hugepages on node 0
> > > - mmap and free 8388608 bytes hugepages on node 8
> > > madvise: Input/output error
> > > 
> > > From the serial console,
> > > 
> > > [  637.164222][ T8357] soft offline: 0x118ee0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> > > [  637.164890][ T8357] Soft offlining pfn 0x20001380 at process virtual address 0x7fff9f000000
> > > [  637.165422][ T8357] Soft offlining pfn 0x3ba00 at process virtual address 0x7fff9f200000
> > > [  637.166409][ T8357] Soft offlining pfn 0x201914a0 at process virtual address 0x7fff9f000000
> > > [  637.166833][ T8357] Soft offlining pfn 0x12b9a0 at process virtual address 0x7fff9f200000
> > > [  637.168044][ T8357] Soft offlining pfn 0x1abb60 at process virtual address 0x7fff9f000000
> > > [  637.168557][ T8357] Soft offlining pfn 0x20014820 at process virtual address 0x7fff9f200000
> > > [  637.169493][ T8357] Soft offlining pfn 0x119720 at process virtual address 0x7fff9f000000
> > > [  637.169603][ T8357] soft offline: 0x119720: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> > > [  637.169756][ T8357] Soft offlining pfn 0x118ee0 at process virtual address 0x7fff9f200000
> > > [  637.170653][ T8357] Soft offlining pfn 0x200e81e0 at process virtual address 0x7fff9f000000
> > > [  637.171067][ T8357] Soft offlining pfn 0x201c5f60 at process virtual address 0x7fff9f200000
> > > [  637.172101][ T8357] Soft offlining pfn 0x201c8f00 at process virtual address 0x7fff9f000000
> > > [  637.172241][ T8357] __get_any_page: 0x201c8f00: unknown zero refcount page type 87fff8000000000
> > 
> > I might misjudge to skip the following patch, sorry about that.
> > Could you try with it?
> 
> Still getting EIO after applied this patch.
> 
> [ 1215.499030][T88982] soft offline: 0x201bdc20: hugepage isolation failed: 0, page count 2, type 87fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.499775][T88982] Soft offlining pfn 0x201bdc20 at process virtual address 0x7fff91a00000
> [ 1215.500189][T88982] Soft offlining pfn 0x201c19c0 at process virtual address 0x7fff91c00000
> [ 1215.500297][T88982] soft offline: 0x201c19c0: hugepage isolation failed: 0, page count 2, type 87fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.500982][T88982] Soft offlining pfn 0x1f1fa0 at process virtual address 0x7fff91a00000
> [ 1215.501086][T88982] soft offline: 0x1f1fa0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.501237][T88982] Soft offlining pfn 0x1f4520 at process virtual address 0x7fff91c00000
> [ 1215.501355][T88982] soft offline: 0x1f4520: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.502196][T88982] Soft offlining pfn 0x1f4520 at process virtual address 0x7fff91a00000
> [ 1215.502573][T88982] Soft offlining pfn 0x1f1fa0 at process virtual address 0x7fff91c00000
> [ 1215.502687][T88982] soft offline: 0x1f1fa0: hugepage isolation failed: 0, page count 2, type 7fff800001000e (referenced|uptodate|dirty|head)
> [ 1215.503245][T88982] Soft offlining pfn 0x201c3cc0 at process virtual address 0x7fff91a00000
> [ 1215.503594][T88982] Soft offlining pfn 0x201c3ce0 at process virtual address 0x7fff91c00000
> [ 1215.503755][T88982] __get_any_page: 0x201c3ce0: unknown zero refcount page type 87fff8000000000

So I'm leaning toward dropping patch 3/16, where madvise_inject_error()
releases the page refcount before calling soft_offline_page(). That should
solve the issue because the refcount of the target page then never becomes
zero. Patches 4/16, 8/16 and 9/16 depend on it, so they should be removed too.
I'll resubmit the series later, but I've prepared the delta below (based on
next-20200805), so I would appreciate it if you could run random.c with it
to confirm the fix.

Thanks,
Naoya Horiguchi
---
From 533090c0869aeca88d8ff14d27c3cef8fc060ccd Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Thu, 6 Aug 2020 01:29:08 +0900
Subject: [PATCH] (for testing) revert change around removing
 MF_COUNT_INCREASED

Revert the following patches:
- mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
- mm,madvise: Refactor madvise_inject_error
- mm,hwpoison: remove MF_COUNT_INCREASED
- mm,hwpoison: remove flag argument from soft offline functions

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 drivers/base/memory.c |  2 +-
 include/linux/mm.h    |  9 +++++----
 mm/madvise.c          | 36 +++++++++++++++++++-----------------
 mm/memory-failure.c   | 31 ++++++++++++++++++++-----------
 4 files changed, 45 insertions(+), 33 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 3e6d27c9dff6..4db3c660de83 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -463,7 +463,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
 	if (kstrtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = soft_offline_page(pfn);
+	ret = soft_offline_page(pfn, 0);
 	return ret == 0 ? count : ret;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 4f12b2465e80..442921a004a2 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2976,9 +2976,10 @@ void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long nr_pages);
 
 enum mf_flags {
-	MF_ACTION_REQUIRED = 1 << 0,
-	MF_MUST_KILL = 1 << 1,
-	MF_SOFT_OFFLINE = 1 << 2,
+	MF_COUNT_INCREASED = 1 << 0,
+	MF_ACTION_REQUIRED = 1 << 1,
+	MF_MUST_KILL = 1 << 2,
+	MF_SOFT_OFFLINE = 1 << 3,
 };
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
@@ -2988,7 +2989,7 @@ extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t num_poisoned_pages __read_mostly;
-extern int soft_offline_page(unsigned long pfn);
+extern int soft_offline_page(unsigned long pfn, int flags);
 
 
 /*
diff --git a/mm/madvise.c b/mm/madvise.c
index 843f6fad3b89..5fa5f66468b3 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -886,15 +886,16 @@ static long madvise_remove(struct vm_area_struct *vma,
 static int madvise_inject_error(int behavior,
 		unsigned long start, unsigned long end)
 {
+	struct page *page;
 	struct zone *zone;
 	unsigned long size;
 
 	if (!capable(CAP_SYS_ADMIN))
 		return -EPERM;
 
+
 	for (; start < end; start += size) {
 		unsigned long pfn;
-		struct page *page;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
@@ -909,26 +910,27 @@ static int madvise_inject_error(int behavior,
 		 */
 		size = page_size(compound_head(page));
 
-		/*
-		 * The get_user_pages_fast() is just to get the pfn of the
-		 * given address, and the refcount has nothing to do with
-		 * what we try to test, so it should be released immediately.
-		 * This is racy but it's intended because the real hardware
-		 * errors could happen at any moment and memory error handlers
-		 * must properly handle the race.
-		 */
-		put_page(page);
-
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-				pfn, start);
-			ret = soft_offline_page(pfn);
-		} else {
-			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-				pfn, start);
-			ret = memory_failure(pfn, 0);
+					pfn, start);
+
+			ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
+			if (ret)
+				return ret;
+			continue;
 		}
 
+		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
+				pfn, start);
+
+		/*
+		 * Drop the page reference taken by get_user_pages_fast(). In
+		 * the absence of MF_COUNT_INCREASED the memory_failure()
+		 * routine is responsible for pinning the page to prevent it
+		 * from being released back to the page allocator.
+		 */
+		put_page(page);
+		ret = memory_failure(pfn, 0);
 		if (ret)
 			return ret;
 	}
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a229d4694954..fd50a6f9a60d 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1167,7 +1167,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 
 	num_poisoned_pages_inc();
 
-	if (!get_hwpoison_page(p)) {
+	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
 		/*
 		 * Check "filter hit" and "race with other subpage."
 		 */
@@ -1363,7 +1363,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
-	if (!get_hwpoison_page(p)) {
+	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
 		if (is_free_buddy_page(p)) {
 			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
 			return 0;
@@ -1392,7 +1392,10 @@ int memory_failure(unsigned long pfn, int flags)
 	shake_page(p, 0);
 	/* shake_page could have turned it free. */
 	if (!PageLRU(p) && is_free_buddy_page(p)) {
-		action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
+		if (flags & MF_COUNT_INCREASED)
+			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
+		else
+			action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
 		return 0;
 	}
 
@@ -1540,7 +1543,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		if (!gotten)
 			break;
 		if (entry.flags & MF_SOFT_OFFLINE)
-			soft_offline_page(entry.pfn);
+			soft_offline_page(entry.pfn, entry.flags);
 		else
 			memory_failure(entry.pfn, entry.flags);
 	}
@@ -1679,10 +1682,13 @@ EXPORT_SYMBOL(unpoison_memory);
  * that is not free, and 1 for any other page type.
  * For 1 the page is returned with increased page count, otherwise not.
  */
-static int __get_any_page(struct page *p, unsigned long pfn)
+static int __get_any_page(struct page *p, unsigned long pfn, int flags)
 {
 	int ret;
 
+	if (flags & MF_COUNT_INCREASED)
+		return 1;
+
 	/*
 	 * When the target page is a free hugepage, just remove it
 	 * from free hugepage list.
@@ -1709,12 +1715,12 @@ static int __get_any_page(struct page *p, unsigned long pfn)
 	return ret;
 }
 
-static int get_any_page(struct page *page, unsigned long pfn)
+static int get_any_page(struct page *page, unsigned long pfn, int flags)
 {
-	int ret = __get_any_page(page, pfn);
+	int ret = __get_any_page(page, pfn, flags);
 
 	if (ret == -EBUSY)
-		ret = __get_any_page(page, pfn);
+		ret = __get_any_page(page, pfn, flags);
 
 	if (ret == 1 && !PageHuge(page) &&
 	    !PageLRU(page) && !__PageMovable(page)) {
@@ -1727,7 +1733,7 @@ static int get_any_page(struct page *page, unsigned long pfn)
 		/*
 		 * Did it turn free?
 		 */
-		ret = __get_any_page(page, pfn);
+		ret = __get_any_page(page, pfn, 0);
 		if (ret == 1 && !PageLRU(page)) {
 			/* Drop page reference which is from __get_any_page() */
 			put_page(page);
@@ -1869,6 +1875,7 @@ static int soft_offline_free_page(struct page *page)
 /**
  * soft_offline_page - Soft offline a page.
  * @pfn: pfn to soft-offline
+ * @flags: flags. Same as memory_failure().
  *
  * Returns 0 on success, otherwise negated errno.
  *
@@ -1887,7 +1894,7 @@ static int soft_offline_free_page(struct page *page)
  * This is not a 100% solution for all memory, but tries to be
  * ``good enough'' for the majority of memory.
  */
-int soft_offline_page(unsigned long pfn)
+int soft_offline_page(unsigned long pfn, int flags)
 {
 	int ret;
 	struct page *page;
@@ -1901,11 +1908,13 @@ int soft_offline_page(unsigned long pfn)
 
 	if (PageHWPoison(page)) {
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
+		if (flags & MF_COUNT_INCREASED)
+			put_page(page);
 		return 0;
 	}
 
 	get_online_mems();
-	ret = get_any_page(page, pfn);
+	ret = get_any_page(page, pfn, flags);
 	put_online_mems();
 
 	if (ret > 0)
-- 
2.25.1

* Re: [PATCH v5 00/16] HWPOISON: soft offline rework
  2020-08-04  1:49     ` Qian Cai
  2020-08-04  8:13       ` osalvador
@ 2020-08-05 20:44       ` HORIGUCHI NAOYA(堀口 直也)
  1 sibling, 0 replies; 26+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-08-05 20:44 UTC (permalink / raw)
  To: Qian Cai
  Cc: nao.horiguchi, linux-mm, mhocko, akpm, mike.kravetz, osalvador,
	tony.luck, david, aneesh.kumar, zeil, linux-kernel

On Mon, Aug 03, 2020 at 09:49:42PM -0400, Qian Cai wrote:
> On Tue, Aug 04, 2020 at 01:16:45AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Mon, Aug 03, 2020 at 03:07:09PM -0400, Qian Cai wrote:
> > > On Fri, Jul 31, 2020 at 12:20:56PM +0000, nao.horiguchi@gmail.com wrote:
> > > > This patchset is the latest version of the soft offline rework patchset,
> > > > targeted for v5.9.
> > > > 
> > > > The main focus of this series is to stabilize soft offline.  Historically,
> > > > soft offlined pages have suffered from race conditions because PageHWPoison
> > > > is used a little too aggressively, which (directly or indirectly) invades
> > > > other mm code which cares little about hwpoison.  This results in unexpected
> > > > behavior or kernel panic, which is very far from soft offline's "do not
> > > > disturb userspace or other kernel components" policy.
> > > > 
> > > > The main point of this change set is to contain the target page "via the
> > > > buddy allocator", where we first free the target page as we do for normal
> > > > pages, and remove it from buddy only when we confirm that it has reached the
> > > > free list. There is surely a race window with page allocation, but that's
> > > > fine because someone really wants that page and the page is still working,
> > > > so soft offline can happily give up.
> > > > 
> > > > v4 from Oscar tries to handle the race around reallocation, but that part
> > > > still seems to be work in progress, so I decided to separate it from the
> > > > changes going into v5.9.  Thank you for your contribution, Oscar.
> > > > 
> > > > The issue reported by Qian Cai is fixed by patch 16/16.
> > > > 
> > > > This patchset is based on v5.8-rc7-mmotm-2020-07-27-18-18, but I applied
> > > > this series after reverting the previous version.
> > > > Maybe https://github.com/Naoya-Horiguchi/linux/commits/soft-offline-rework.v5
> > > > shows what I did more precisely.
> > > > 
> > > > Any other comments/suggestions/help would be appreciated.
> > > 
> > > There is another issue with this patchset (with and without the patch [1]).
> > > 
> > > [1] https://lore.kernel.org/lkml/20200803133657.GA13307@hori.linux.bs1.fc.nec.co.jp/
> > > 
> > > Arm64 using 512M-size hugepages starts to fail allocations prematurely.
> > > 
> > > # ./random 1
> > > - start: migrate_huge_offline
> > > - use NUMA nodes 0,1.
> > > - mmap and free 2147483648 bytes hugepages on node 0
> > > - mmap and free 2147483648 bytes hugepages on node 1
> > > madvise: Cannot allocate memory
> > > 
> > > [  284.388061][ T3706] soft offline: 0x956000: hugepage isolation failed: 0, page count 2, type 17ffff80001000e (referenced|uptodate|dirty|head)
> > > [  284.400777][ T3706] Soft offlining pfn 0x8e000 at process virtual address 0xffff80000000
> > > [  284.893412][ T3706] Soft offlining pfn 0x8a000 at process virtual address 0xffff60000000
> > > [  284.901539][ T3706] soft offline: 0x8a000: hugepage isolation failed: 0, page count 2, type 7ffff80001000e (referenced|uptodate|dirty|head)
> > > [  284.914129][ T3706] Soft offlining pfn 0x8c000 at process virtual address 0xffff80000000
> > > [  285.433497][ T3706] Soft offlining pfn 0x88000 at process virtual address 0xffff60000000
> > > [  285.720377][ T3706] Soft offlining pfn 0x8a000 at process virtual address 0xffff80000000
> > > [  286.281620][ T3706] Soft offlining pfn 0xa000 at process virtual address 0xffff60000000
> > > [  286.290065][ T3706] soft offline: 0xa000: hugepage migration failed -12, type 7ffff80001000e (referenced|uptodate|dirty|head)
> > 
> > I think that this is due to the lack of contiguous memory.
> > This test program soft-offlines hugepages many times, so eventually one
> > page in every 512MB will have been removed from buddy, and then we can't
> > allocate hugepages any more even if we have enough free pages.
> > This is not good for heavy hugepage users, but it should be the intended
> > behavior.
> > 
> > It seems that random.c calls madvise(MADV_SOFT_OFFLINE) for 2 hugepages
> > and repeats that 1000 (==NR_LOOP) times, so if the system doesn't have
> > enough memory to cover the range of 2000 hugepages (1000GB on the Arm64
> > system), this ENOMEM should reproduce as expected.
> 
> Well, each iteration will mmap/munmap, so there should be no leak.
> 
> https://gitlab.com/cailca/linux-mm/-/blob/master/random.c#L376
> 
> It also seems to me that madvise(MADV_SOFT_OFFLINE) does start to fragment
> memory somehow, because right after this "madvise: Cannot allocate memory"
> happened, I checked /proc/meminfo and found no hugepage usage at all.
> 
> > 
> > > 
> > > Reverting this patchset and its dependency patchset [2] (reverting the
> > > dependency alone did not help) fixed it,
> > 
> > But it's still not clear to me why this was not visible before this
> > patchset, so I need to check it further.

I've reproduced ENOMEM with v5.8 (without this patchset) simply by using
a VM with small memory (4GB), so this specific error does not seem to be
caused by this series.

Thanks,
Naoya Horiguchi



Thread overview: 26+ messages
2020-07-31 12:20 [PATCH v5 00/16] HWPOISON: soft offline rework nao.horiguchi
2020-07-31 12:20 ` [PATCH v5 01/16] mm,hwpoison: cleanup unused PageHuge() check nao.horiguchi
2020-07-31 12:20 ` [PATCH v5 02/16] mm, hwpoison: remove recalculating hpage nao.horiguchi
2020-07-31 12:20 ` [PATCH v5 03/16] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED nao.horiguchi
2020-07-31 12:21 ` [PATCH v5 04/16] mm,madvise: Refactor madvise_inject_error nao.horiguchi
2020-07-31 12:21 ` [PATCH v5 05/16] mm,hwpoison-inject: don't pin for hwpoison_filter nao.horiguchi
2020-07-31 12:21 ` [PATCH v5 06/16] mm,hwpoison: Un-export get_hwpoison_page and make it static nao.horiguchi
2020-07-31 12:21 ` [PATCH v5 07/16] mm,hwpoison: Kill put_hwpoison_page nao.horiguchi
2020-07-31 12:21 ` [PATCH v5 08/16] mm,hwpoison: remove MF_COUNT_INCREASED nao.horiguchi
2020-07-31 12:21 ` [PATCH v5 09/16] mm,hwpoison: remove flag argument from soft offline functions nao.horiguchi
2020-07-31 12:21 ` [PATCH v5 10/16] mm,hwpoison: Unify THP handling for hard and soft offline nao.horiguchi
2020-07-31 12:21 ` [PATCH v5 11/16] mm,hwpoison: Rework soft offline for free pages nao.horiguchi
2020-07-31 12:21 ` [PATCH v5 12/16] mm,hwpoison: Rework soft offline for in-use pages nao.horiguchi
2020-07-31 12:21 ` [PATCH v5 13/16] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page nao.horiguchi
2020-07-31 12:21 ` [PATCH v5 14/16] mm,hwpoison: Return 0 if the page is already poisoned in soft-offline nao.horiguchi
2020-07-31 12:21 ` [PATCH v5 15/16] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP nao.horiguchi
2020-07-31 12:21 ` [PATCH v5 16/16] mm,hwpoison: double-check page count in __get_any_page() nao.horiguchi
2020-08-03 12:39 ` [PATCH v5 00/16] HWPOISON: soft offline rework Qian Cai
2020-08-03 13:36   ` HORIGUCHI NAOYA(堀口 直也)
2020-08-03 15:19     ` Qian Cai
2020-08-05 20:43       ` HORIGUCHI NAOYA(堀口 直也)
2020-08-03 19:07 ` Qian Cai
2020-08-04  1:16   ` HORIGUCHI NAOYA(堀口 直也)
2020-08-04  1:49     ` Qian Cai
2020-08-04  8:13       ` osalvador
2020-08-05 20:44       ` HORIGUCHI NAOYA(堀口 直也)
