* [PATCH v4 00/15] Hwpoison soft-offline rework
@ 2020-07-16 12:37 Oscar Salvador
  2020-07-16 12:37 ` [PATCH v4 01/15] mm,hwpoison: cleanup unused PageHuge() check Oscar Salvador
                   ` (16 more replies)
  0 siblings, 17 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:37 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Oscar Salvador

Hi all,

this is a follow-up to [1].
That version had some flaws in its handling of hugetlb pages, which this
version fixes.
I checked that the case reported by Qian seems to work fine now.

Cover letter:

This patchset was initially based on Naoya's hwpoison rework [1], so
thanks to him for the initial work.
I would also like to thank Naoya for testing the patchset off-line
and reporting any issues he found; that was quite helpful.

This patchset aims to fix some issues lying in soft-offline handling,
but it also takes the chance to perform some cleanups and
refactoring.


 - Motivation:

   A customer and I were facing an issue where processes were killed
   after having soft-offlined some of their pages.
   This should not happen when soft-offlining, as it is meant to be non-disruptive.
   I was able to reproduce the issue by stressing the memory while
   soft-offlining pages at the same time.

   After debugging the issue, I saw that the problem was that pages were handed
   back to user-space after having been properly offlined.
   So, when those pages were faulted in, the fault handler returned VM_FAULT_HWPOISON
   all the way down to the arch handler, and it simply killed the process.

   After further analysis, it became clear that the problem was that when
   kcompactd kicked in to migrate pages over, the compaction_alloc callback
   was handing poisoned pages to the migrate routine.

   All this could happen because isolate_freepages_block and
   fast_isolate_freepages only check whether the page is PageBuddy,
   and since 1) poisoned pages can be part of a higher-order page
   and 2) poisoned pages are also PageBuddy, they can sneak in easily.
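
   The leak described above can be illustrated with a small userspace
   model. This is only a sketch: struct toy_page and every toy_* function
   are hypothetical stand-ins, not kernel code, and the "checked" predicate
   exists purely for contrast (the series itself takes a different route
   and keeps poisoned pages off the freelists altogether).

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for struct page; NOT the kernel's definition. */
struct toy_page {
	bool buddy;	/* models PageBuddy(): page sits on a buddy freelist */
	bool hwpoison;	/* models PageHWPoison() */
};

/* Models a free-page isolation check that only looks at the buddy state. */
static bool toy_isolate_buddy_only(const struct toy_page *p)
{
	return p->buddy;
}

/* Hypothetical stricter check, shown only for contrast. */
static bool toy_isolate_checked(const struct toy_page *p)
{
	return p->buddy && !p->hwpoison;
}

/* Count how many poisoned pages a given isolation predicate would hand out. */
static size_t toy_count_poisoned_handed_out(const struct toy_page *pages,
					    size_t n,
					    bool (*isolate)(const struct toy_page *))
{
	size_t i, bad = 0;

	for (i = 0; i < n; i++)
		if (isolate(&pages[i]) && pages[i].hwpoison)
			bad++;
	return bad;
}
```

   With one clean free page, one poisoned free page, and one poisoned page
   that is not free, the buddy-only predicate hands out one poisoned page,
   while the stricter predicate hands out none.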

   I also saw some other problems with swap pages, but I suspected them
   to be the same sort of problem, so I did not follow that trail.

   The above refers to soft-offline.
   But I also saw problems with hard-offline, especially hugetlb corruption,
   and some other weird stuff. (I could paste the logs)

   The full explanation referring to the soft-offline case can be found at [2].

 - Approach:

   The approach taken is to contain those pages and never let them hit
   either pcplists or buddy freelists.
   Only when they are completely out of reach do we flag them as poisoned.

   A full explanation of this can be found in patch#11 and patch#12.
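
   The ordering described here (contain first, flag second) might look
   roughly like the following userspace sketch. struct toy_zone and all
   toy_* helpers are hypothetical names invented for illustration; the
   real implementation lives in the patches referenced above.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_NR_PAGES 8

/* Toy zone: tracks which pfns the allocator can reach and which are poisoned. */
struct toy_zone {
	bool on_freelist[TOY_NR_PAGES];
	bool hwpoison[TOY_NR_PAGES];
};

/* Step 1: pull the page out of the allocator's reach. Fails if it was not free. */
static bool toy_take_off_freelist(struct toy_zone *z, size_t pfn)
{
	if (pfn >= TOY_NR_PAGES || !z->on_freelist[pfn])
		return false;
	z->on_freelist[pfn] = false;
	return true;
}

/* Step 2: flag the page as poisoned only once it has been contained. */
static bool toy_soft_offline_free_page(struct toy_zone *z, size_t pfn)
{
	if (!toy_take_off_freelist(z, pfn))
		return false;
	z->hwpoison[pfn] = true;
	return true;
}

/* Toy allocator: hands out the lowest free pfn, or -1 if none is left. */
static long toy_alloc(struct toy_zone *z)
{
	size_t i;

	for (i = 0; i < TOY_NR_PAGES; i++) {
		if (z->on_freelist[i]) {
			z->on_freelist[i] = false;
			return (long)i;
		}
	}
	return -1;
}
```

   Because the poison flag is only set after the page has left the
   freelist, the toy allocator can never return a flagged pfn.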

 - Outcome:

   With this patchset, I no longer see the issues with soft-offline.

[1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
[2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u

Naoya Horiguchi (6):
  mm,hwpoison: cleanup unused PageHuge() check
  mm, hwpoison: remove recalculating hpage
  mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  mm,hwpoison-inject: don't pin for hwpoison_filter
  mm,hwpoison: remove MF_COUNT_INCREASED
  mm,hwpoison: remove flag argument from soft offline functions

Oscar Salvador (9):
  mm,madvise: Refactor madvise_inject_error
  mm,hwpoison: Un-export get_hwpoison_page and make it static
  mm,hwpoison: Kill put_hwpoison_page
  mm,hwpoison: Unify THP handling for hard and soft offline
  mm,hwpoison: Rework soft offline for free pages
  mm,hwpoison: Rework soft offline for in-use pages
  mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
  mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
  mm,hwpoison: introduce MF_MSG_UNSPLIT_THP

 drivers/base/memory.c      |   2 +-
 include/linux/mm.h         |  12 +-
 include/linux/page-flags.h |   6 +-
 include/ras/ras_event.h    |   3 +
 mm/hugetlb.c               |  60 +++++++-
 mm/hwpoison-inject.c       |  18 +--
 mm/madvise.c               |  37 ++---
 mm/memory-failure.c        | 307 +++++++++++++++----------------------
 mm/migrate.c               |  11 +-
 mm/page_alloc.c            |  70 +++++++--
 10 files changed, 270 insertions(+), 256 deletions(-)

-- 
2.26.2



* [PATCH v4 01/15] mm,hwpoison: cleanup unused PageHuge() check
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
@ 2020-07-16 12:37 ` Oscar Salvador
  2020-07-16 12:37 ` [PATCH v4 02/15] mm, hwpoison: remove recalculating hpage Oscar Salvador
                   ` (15 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:37 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Naoya Horiguchi, Oscar Salvador

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Drop the PageHuge check since memory_failure forks into memory_failure_hugetlb()
for hugetlb pages.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/memory-failure.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 47b8ccb1fb9b..e5d0c14c2332 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1382,10 +1382,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * page_remove_rmap() in try_to_unmap_one(). So to determine page status
 	 * correctly, we save a copy of the page flags at this time.
 	 */
-	if (PageHuge(p))
-		page_flags = hpage->flags;
-	else
-		page_flags = p->flags;
+	page_flags = p->flags;
 
 	/*
 	 * unpoison always clear PG_hwpoison inside page lock
-- 
2.26.2



* [PATCH v4 02/15] mm, hwpoison: remove recalculating hpage
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
  2020-07-16 12:37 ` [PATCH v4 01/15] mm,hwpoison: cleanup unused PageHuge() check Oscar Salvador
@ 2020-07-16 12:37 ` Oscar Salvador
  2020-07-16 12:37 ` [PATCH v4 03/15] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
                   ` (14 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:37 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Oscar Salvador

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

hpage is never used after try_to_split_thp_page() in memory_failure(),
so we don't have to update hpage.  So let's not recalculate/use hpage.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/memory-failure.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e5d0c14c2332..d2d6010764e7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1342,7 +1342,6 @@ int memory_failure(unsigned long pfn, int flags)
 		}
 		unlock_page(p);
 		VM_BUG_ON_PAGE(!page_count(p), p);
-		hpage = compound_head(p);
 	}
 
 	/*
@@ -1414,11 +1413,8 @@ int memory_failure(unsigned long pfn, int flags)
 	/*
 	 * Now take care of user space mappings.
 	 * Abort on fail: __delete_from_page_cache() assumes unmapped page.
-	 *
-	 * When the raw error page is thp tail page, hpage points to the raw
-	 * page after thp split.
 	 */
-	if (!hwpoison_user_mappings(p, pfn, flags, &hpage)) {
+	if (!hwpoison_user_mappings(p, pfn, flags, &p)) {
 		action_result(pfn, MF_MSG_UNMAP_FAILED, MF_IGNORED);
 		res = -EBUSY;
 		goto out;
-- 
2.26.2



* [PATCH v4 03/15] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
  2020-07-16 12:37 ` [PATCH v4 01/15] mm,hwpoison: cleanup unused PageHuge() check Oscar Salvador
  2020-07-16 12:37 ` [PATCH v4 02/15] mm, hwpoison: remove recalculating hpage Oscar Salvador
@ 2020-07-16 12:37 ` Oscar Salvador
  2020-07-16 23:15   ` Mike Kravetz
  2020-07-16 12:37 ` [PATCH v4 04/15] mm,madvise: Refactor madvise_inject_error Oscar Salvador
                   ` (13 subsequent siblings)
  16 siblings, 1 reply; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:37 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Naoya Horiguchi, Oscar Salvador

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

The call to get_user_pages_fast is only to get the pointer to the struct
page of a given address. Pinning it is the memory-poisoning handler's job,
so drop the refcount grabbed by get_user_pages_fast.
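
The reference flow described above can be modelled in userspace as
follows. This is a sketch under stated assumptions: struct toy_ref_page
and the toy_* functions are hypothetical and only mimic the lookup,
put_page and handler steps; they are not the kernel functions.

```c
#include <stdbool.h>

/* Toy page with a reference count; NOT the kernel's struct page. */
struct toy_ref_page {
	int refcount;
	bool hwpoison;
};

/* Models the lookup: finding the page takes a reference as a side effect. */
static void toy_lookup(struct toy_ref_page *p)
{
	p->refcount++;
}

/* Models put_page(): drop one reference. */
static void toy_put(struct toy_ref_page *p)
{
	p->refcount--;
}

/* Models the handler, which pins the page itself while it works on it. */
static int toy_handler(struct toy_ref_page *p)
{
	p->refcount++;		/* handler's own pin */
	p->hwpoison = true;
	p->refcount--;		/* handler drops its pin when done */
	return 0;
}

/* Models the injection path: look up, drop the lookup reference, then handle. */
static int toy_inject(struct toy_ref_page *p)
{
	toy_lookup(p);
	toy_put(p);		/* lookup reference is released immediately */
	if (p->hwpoison)
		return 0;	/* already poisoned: nothing to do */
	return toy_handler(p);
}
```

In this model the caller never holds a reference across the handler
call, so the refcount after toy_inject() equals the refcount before it.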

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.com>
---
 mm/madvise.c | 24 ++++++++++++------------
 1 file changed, 12 insertions(+), 12 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index a16dba21cdf6..1fe89a5b8d33 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -910,16 +910,24 @@ static int madvise_inject_error(int behavior,
 		 */
 		size = page_size(compound_head(page));
 
-		if (PageHWPoison(page)) {
-			put_page(page);
+		/*
+		 * The get_user_pages_fast() is just to get the pfn of the
+		 * given address, and the refcount has nothing to do with
+		 * what we try to test, so it should be released immediately.
+		 * This is racy but it's intended because the real hardware
+		 * errors could happen at any moment and memory error handlers
+		 * must properly handle the race.
+		 */
+		put_page(page);
+
+		if (PageHWPoison(page))
 			continue;
-		}
 
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
 					pfn, start);
 
-			ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
+			ret = soft_offline_page(pfn, 0);
 			if (ret)
 				return ret;
 			continue;
@@ -927,14 +935,6 @@ static int madvise_inject_error(int behavior,
 
 		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				pfn, start);
-
-		/*
-		 * Drop the page reference taken by get_user_pages_fast(). In
-		 * the absence of MF_COUNT_INCREASED the memory_failure()
-		 * routine is responsible for pinning the page to prevent it
-		 * from being released back to the page allocator.
-		 */
-		put_page(page);
 		ret = memory_failure(pfn, 0);
 		if (ret)
 			return ret;
-- 
2.26.2



* [PATCH v4 04/15] mm,madvise: Refactor madvise_inject_error
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (2 preceding siblings ...)
  2020-07-16 12:37 ` [PATCH v4 03/15] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
@ 2020-07-16 12:37 ` Oscar Salvador
  2020-07-16 12:37 ` [PATCH v4 05/15] mm,hwpoison-inject: don't pin for hwpoison_filter Oscar Salvador
                   ` (12 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:37 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Oscar Salvador, Oscar Salvador

Make a proper if-else condition for {hard,soft}-offline.

Signed-off-by: Oscar Salvador <osalvador@suse.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/madvise.c | 14 ++++++--------
 1 file changed, 6 insertions(+), 8 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 1fe89a5b8d33..dd2173d8f4e5 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -886,7 +886,6 @@ static long madvise_remove(struct vm_area_struct *vma,
 static int madvise_inject_error(int behavior,
 		unsigned long start, unsigned long end)
 {
-	struct page *page;
 	struct zone *zone;
 	unsigned long size;
 
@@ -896,6 +895,7 @@ static int madvise_inject_error(int behavior,
 
 	for (; start < end; start += size) {
 		unsigned long pfn;
+		struct page *page;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
@@ -925,17 +925,15 @@ static int madvise_inject_error(int behavior,
 
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-					pfn, start);
+				 pfn, start);
 
 			ret = soft_offline_page(pfn, 0);
-			if (ret)
-				return ret;
-			continue;
+		} else {
+			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
+				 pfn, start);
+			ret = memory_failure(pfn, 0);
 		}
 
-		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-				pfn, start);
-		ret = memory_failure(pfn, 0);
 		if (ret)
 			return ret;
 	}
-- 
2.26.2



* [PATCH v4 05/15] mm,hwpoison-inject: don't pin for hwpoison_filter
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (3 preceding siblings ...)
  2020-07-16 12:37 ` [PATCH v4 04/15] mm,madvise: Refactor madvise_inject_error Oscar Salvador
@ 2020-07-16 12:37 ` Oscar Salvador
  2020-07-16 12:38 ` [PATCH v4 06/15] mm,hwpoison: Un-export get_hwpoison_page and make it static Oscar Salvador
                   ` (11 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:37 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Naoya Horiguchi, Oscar Salvador

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Another memory error injection interface, debugfs:hwpoison/corrupt-pfn,
also takes a bogus refcount for hwpoison_filter(). Dropping it is justified
because this is only a coarse filter, and memory_failure() is expected to
redo the check reliably.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.com>
---
 mm/hwpoison-inject.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index e488876b168a..1ae1ebc2b9b1 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -26,11 +26,6 @@ static int hwpoison_inject(void *data, u64 val)
 
 	p = pfn_to_page(pfn);
 	hpage = compound_head(p);
-	/*
-	 * This implies unable to support free buddy pages.
-	 */
-	if (!get_hwpoison_page(p))
-		return 0;
 
 	if (!hwpoison_filter_enable)
 		goto inject;
@@ -40,23 +35,20 @@ static int hwpoison_inject(void *data, u64 val)
 	 * This implies unable to support non-LRU pages.
 	 */
 	if (!PageLRU(hpage) && !PageHuge(p))
-		goto put_out;
+		return 0;
 
 	/*
-	 * do a racy check with elevated page count, to make sure PG_hwpoison
-	 * will only be set for the targeted owner (or on a free page).
+	 * do a racy check to make sure PG_hwpoison will only be set for
+	 * the targeted owner (or on a free page).
 	 * memory_failure() will redo the check reliably inside page lock.
 	 */
 	err = hwpoison_filter(hpage);
 	if (err)
-		goto put_out;
+		return 0;
 
 inject:
 	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
-	return memory_failure(pfn, MF_COUNT_INCREASED);
-put_out:
-	put_hwpoison_page(p);
-	return 0;
+	return memory_failure(pfn, 0);
 }
 
 static int hwpoison_unpoison(void *data, u64 val)
-- 
2.26.2



* [PATCH v4 06/15] mm,hwpoison: Un-export get_hwpoison_page and make it static
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (4 preceding siblings ...)
  2020-07-16 12:37 ` [PATCH v4 05/15] mm,hwpoison-inject: don't pin for hwpoison_filter Oscar Salvador
@ 2020-07-16 12:38 ` Oscar Salvador
  2020-07-16 12:38 ` [PATCH v4 07/15] mm,hwpoison: Kill put_hwpoison_page Oscar Salvador
                   ` (10 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:38 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Oscar Salvador, Oscar Salvador

Since get_hwpoison_page is only used in memory-failure code now,
let us un-export it and make it private to that code.

Signed-off-by: Oscar Salvador <osalvador@suse.com>
---
 include/linux/mm.h  | 1 -
 mm/memory-failure.c | 3 +--
 2 files changed, 1 insertion(+), 3 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index b3c8fd204ec4..919bcab54c26 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3008,7 +3008,6 @@ extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
-extern int get_hwpoison_page(struct page *page);
 #define put_hwpoison_page(page)	put_page(page)
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d2d6010764e7..48feb45047f7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -925,7 +925,7 @@ static int page_action(struct page_state *ps, struct page *p,
  * Return: return 0 if failed to grab the refcount, otherwise true (some
  * non-zero value.)
  */
-int get_hwpoison_page(struct page *page)
+static int get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
 
@@ -954,7 +954,6 @@ int get_hwpoison_page(struct page *page)
 
 	return 0;
 }
-EXPORT_SYMBOL_GPL(get_hwpoison_page);
 
 /*
  * Do all that is necessary to remove user space mappings. Unmap
-- 
2.26.2



* [PATCH v4 07/15] mm,hwpoison: Kill put_hwpoison_page
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (5 preceding siblings ...)
  2020-07-16 12:38 ` [PATCH v4 06/15] mm,hwpoison: Un-export get_hwpoison_page and make it static Oscar Salvador
@ 2020-07-16 12:38 ` Oscar Salvador
  2020-07-16 12:38 ` [PATCH v4 08/15] mm,hwpoison: remove MF_COUNT_INCREASED Oscar Salvador
                   ` (9 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:38 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Oscar Salvador, Oscar Salvador

After commit 4e41a30c6d50 ("mm: hwpoison: adjust for new thp refcounting"),
put_hwpoison_page got reduced to a put_page.
Let us just use put_page instead.

Signed-off-by: Oscar Salvador <osalvador@suse.com>
---
 include/linux/mm.h  |  1 -
 mm/memory-failure.c | 30 +++++++++++++++---------------
 2 files changed, 15 insertions(+), 16 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 919bcab54c26..9d1c8540fdaf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3008,7 +3008,6 @@ extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
 extern void memory_failure_queue_kick(int cpu);
 extern int unpoison_memory(unsigned long pfn);
-#define put_hwpoison_page(page)	put_page(page)
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 48feb45047f7..0b7d9769cf29 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1144,7 +1144,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 		pr_err("Memory failure: %#lx: just unpoisoned\n", pfn);
 		num_poisoned_pages_dec();
 		unlock_page(head);
-		put_hwpoison_page(head);
+		put_page(head);
 		return 0;
 	}
 
@@ -1336,7 +1336,7 @@ int memory_failure(unsigned long pfn, int flags)
 					pfn);
 			if (TestClearPageHWPoison(p))
 				num_poisoned_pages_dec();
-			put_hwpoison_page(p);
+			put_page(p);
 			return -EBUSY;
 		}
 		unlock_page(p);
@@ -1389,14 +1389,14 @@ int memory_failure(unsigned long pfn, int flags)
 		pr_err("Memory failure: %#lx: just unpoisoned\n", pfn);
 		num_poisoned_pages_dec();
 		unlock_page(p);
-		put_hwpoison_page(p);
+		put_page(p);
 		return 0;
 	}
 	if (hwpoison_filter(p)) {
 		if (TestClearPageHWPoison(p))
 			num_poisoned_pages_dec();
 		unlock_page(p);
-		put_hwpoison_page(p);
+		put_page(p);
 		return 0;
 	}
 
@@ -1630,9 +1630,9 @@ int unpoison_memory(unsigned long pfn)
 	}
 	unlock_page(page);
 
-	put_hwpoison_page(page);
+	put_page(page);
 	if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1))
-		put_hwpoison_page(page);
+		put_page(page);
 
 	return 0;
 }
@@ -1690,7 +1690,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
 		/*
 		 * Try to free it.
 		 */
-		put_hwpoison_page(page);
+		put_page(page);
 		shake_page(page, 1);
 
 		/*
@@ -1699,7 +1699,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
 		ret = __get_any_page(page, pfn, 0);
 		if (ret == 1 && !PageLRU(page)) {
 			/* Drop page reference which is from __get_any_page() */
-			put_hwpoison_page(page);
+			put_page(page);
 			pr_info("soft_offline: %#lx: unknown non LRU page type %lx (%pGp)\n",
 				pfn, page->flags, &page->flags);
 			return -EIO;
@@ -1722,7 +1722,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	lock_page(hpage);
 	if (PageHWPoison(hpage)) {
 		unlock_page(hpage);
-		put_hwpoison_page(hpage);
+		put_page(hpage);
 		pr_info("soft offline: %#lx hugepage already poisoned\n", pfn);
 		return -EBUSY;
 	}
@@ -1733,7 +1733,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	 * get_any_page() and isolate_huge_page() takes a refcount each,
 	 * so need to drop one here.
 	 */
-	put_hwpoison_page(hpage);
+	put_page(hpage);
 	if (!ret) {
 		pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
 		return -EBUSY;
@@ -1782,7 +1782,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	wait_on_page_writeback(page);
 	if (PageHWPoison(page)) {
 		unlock_page(page);
-		put_hwpoison_page(page);
+		put_page(page);
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
 		return -EBUSY;
 	}
@@ -1797,7 +1797,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	 * would need to fix isolation locking first.
 	 */
 	if (ret == 1) {
-		put_hwpoison_page(page);
+		put_page(page);
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
 		SetPageHWPoison(page);
 		num_poisoned_pages_inc();
@@ -1817,7 +1817,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	 * Drop page reference which is came from get_any_page()
 	 * successful isolate_lru_page() already took another one.
 	 */
-	put_hwpoison_page(page);
+	put_page(page);
 	if (!ret) {
 		LIST_HEAD(pagelist);
 		/*
@@ -1861,7 +1861,7 @@ static int soft_offline_in_use_page(struct page *page, int flags)
 				pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
 			else
 				pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
-			put_hwpoison_page(page);
+			put_page(page);
 			return -EBUSY;
 		}
 		unlock_page(page);
@@ -1934,7 +1934,7 @@ int soft_offline_page(unsigned long pfn, int flags)
 	if (PageHWPoison(page)) {
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
 		if (flags & MF_COUNT_INCREASED)
-			put_hwpoison_page(page);
+			put_page(page);
 		return -EBUSY;
 	}
 
-- 
2.26.2



* [PATCH v4 08/15] mm,hwpoison: remove MF_COUNT_INCREASED
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (6 preceding siblings ...)
  2020-07-16 12:38 ` [PATCH v4 07/15] mm,hwpoison: Kill put_hwpoison_page Oscar Salvador
@ 2020-07-16 12:38 ` Oscar Salvador
  2020-07-16 12:38 ` [PATCH v4 09/15] mm,hwpoison: remove flag argument from soft offline functions Oscar Salvador
                   ` (8 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:38 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Naoya Horiguchi, Oscar Salvador

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Now there's no user of MF_COUNT_INCREASED, so we can safely remove
it from all calling points.
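
As a rough illustration of why renumbering the remaining flags is safe
as long as callers only use the symbolic names, here is a userspace
sketch. The TOY_* names and toy_dispatch() are hypothetical; the bit
values mirror the post-patch enum, and the dispatch loosely follows the
soft-offline versus hard-offline split used by the work function.

```c
/* Toy copy of the post-patch flag layout; names are prefixed to mark them as stand-ins. */
enum toy_mf_flags {
	TOY_MF_ACTION_REQUIRED = 1 << 0,
	TOY_MF_MUST_KILL       = 1 << 1,
	TOY_MF_SOFT_OFFLINE    = 1 << 2,
};

enum toy_handler {
	TOY_HARD_OFFLINE,
	TOY_SOFT_OFFLINE,
};

/* Pick a handler by testing the symbolic bit, never a literal number. */
static enum toy_handler toy_dispatch(int flags)
{
	return (flags & TOY_MF_SOFT_OFFLINE) ? TOY_SOFT_OFFLINE
					     : TOY_HARD_OFFLINE;
}
```

Since the test is against the named bit, shifting the values down by
one position does not change which handler is chosen.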

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.com>
---
 include/linux/mm.h  |  7 +++----
 mm/memory-failure.c | 14 +++-----------
 2 files changed, 6 insertions(+), 15 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 9d1c8540fdaf..e2ce2f05fa49 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2999,10 +2999,9 @@ void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long nr_pages);
 
 enum mf_flags {
-	MF_COUNT_INCREASED = 1 << 0,
-	MF_ACTION_REQUIRED = 1 << 1,
-	MF_MUST_KILL = 1 << 2,
-	MF_SOFT_OFFLINE = 1 << 3,
+	MF_ACTION_REQUIRED = 1 << 0,
+	MF_MUST_KILL = 1 << 1,
+	MF_SOFT_OFFLINE = 1 << 2,
 };
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0b7d9769cf29..15b8e7fd94ed 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1118,7 +1118,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 
 	num_poisoned_pages_inc();
 
-	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
+	if (!get_hwpoison_page(p)) {
 		/*
 		 * Check "filter hit" and "race with other subpage."
 		 */
@@ -1314,7 +1314,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
-	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
+	if (!get_hwpoison_page(p)) {
 		if (is_free_buddy_page(p)) {
 			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
 			return 0;
@@ -1354,10 +1354,7 @@ int memory_failure(unsigned long pfn, int flags)
 	shake_page(p, 0);
 	/* shake_page could have turned it free. */
 	if (!PageLRU(p) && is_free_buddy_page(p)) {
-		if (flags & MF_COUNT_INCREASED)
-			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
-		else
-			action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
+		action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
 		return 0;
 	}
 
@@ -1655,9 +1652,6 @@ static int __get_any_page(struct page *p, unsigned long pfn, int flags)
 {
 	int ret;
 
-	if (flags & MF_COUNT_INCREASED)
-		return 1;
-
 	/*
 	 * When the target page is a free hugepage, just remove it
 	 * from free hugepage list.
@@ -1933,8 +1927,6 @@ int soft_offline_page(unsigned long pfn, int flags)
 
 	if (PageHWPoison(page)) {
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
-		if (flags & MF_COUNT_INCREASED)
-			put_page(page);
 		return -EBUSY;
 	}
 
-- 
2.26.2



* [PATCH v4 09/15] mm,hwpoison: remove flag argument from soft offline functions
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (7 preceding siblings ...)
  2020-07-16 12:38 ` [PATCH v4 08/15] mm,hwpoison: remove MF_COUNT_INCREASED Oscar Salvador
@ 2020-07-16 12:38 ` Oscar Salvador
  2020-07-16 12:38 ` [PATCH v4 10/15] mm,hwpoison: Unify THP handling for hard and soft offline Oscar Salvador
                   ` (7 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:38 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Naoya Horiguchi, Oscar Salvador

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

The argument @flags no longer affects the behavior of soft_offline_page()
and its variants, so let's remove it.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.com>
---
 drivers/base/memory.c |  2 +-
 include/linux/mm.h    |  2 +-
 mm/madvise.c          |  2 +-
 mm/memory-failure.c   | 27 +++++++++++++--------------
 4 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 2b09b68b9f78..20664f279c99 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -463,7 +463,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
 	if (kstrtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = soft_offline_page(pfn, 0);
+	ret = soft_offline_page(pfn);
 	return ret == 0 ? count : ret;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e2ce2f05fa49..8f6a45165ec8 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3011,7 +3011,7 @@ extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t num_poisoned_pages __read_mostly;
-extern int soft_offline_page(unsigned long pfn, int flags);
+extern int soft_offline_page(unsigned long pfn);
 
 
 /*
diff --git a/mm/madvise.c b/mm/madvise.c
index dd2173d8f4e5..226f0fcf0828 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -927,7 +927,7 @@ static int madvise_inject_error(int behavior,
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
 
-			ret = soft_offline_page(pfn, 0);
+			ret = soft_offline_page(pfn);
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 15b8e7fd94ed..03d3aae77f89 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1502,7 +1502,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		if (!gotten)
 			break;
 		if (entry.flags & MF_SOFT_OFFLINE)
-			soft_offline_page(entry.pfn, entry.flags);
+			soft_offline_page(entry.pfn);
 		else
 			memory_failure(entry.pfn, entry.flags);
 	}
@@ -1648,7 +1648,7 @@ static struct page *new_page(struct page *p, unsigned long private)
  * that is not free, and 1 for any other page type.
  * For 1 the page is returned with increased page count, otherwise not.
  */
-static int __get_any_page(struct page *p, unsigned long pfn, int flags)
+static int __get_any_page(struct page *p, unsigned long pfn)
 {
 	int ret;
 
@@ -1675,9 +1675,9 @@ static int __get_any_page(struct page *p, unsigned long pfn, int flags)
 	return ret;
 }
 
-static int get_any_page(struct page *page, unsigned long pfn, int flags)
+static int get_any_page(struct page *page, unsigned long pfn)
 {
-	int ret = __get_any_page(page, pfn, flags);
+	int ret = __get_any_page(page, pfn);
 
 	if (ret == 1 && !PageHuge(page) &&
 	    !PageLRU(page) && !__PageMovable(page)) {
@@ -1690,7 +1690,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
 		/*
 		 * Did it turn free?
 		 */
-		ret = __get_any_page(page, pfn, 0);
+		ret = __get_any_page(page, pfn);
 		if (ret == 1 && !PageLRU(page)) {
 			/* Drop page reference which is from __get_any_page() */
 			put_page(page);
@@ -1702,7 +1702,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
 	return ret;
 }
 
-static int soft_offline_huge_page(struct page *page, int flags)
+static int soft_offline_huge_page(struct page *page)
 {
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
@@ -1761,7 +1761,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	return ret;
 }
 
-static int __soft_offline_page(struct page *page, int flags)
+static int __soft_offline_page(struct page *page)
 {
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
@@ -1841,7 +1841,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	return ret;
 }
 
-static int soft_offline_in_use_page(struct page *page, int flags)
+static int soft_offline_in_use_page(struct page *page)
 {
 	int ret;
 	int mt;
@@ -1871,9 +1871,9 @@ static int soft_offline_in_use_page(struct page *page, int flags)
 	mt = get_pageblock_migratetype(page);
 	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 	if (PageHuge(page))
-		ret = soft_offline_huge_page(page, flags);
+		ret = soft_offline_huge_page(page);
 	else
-		ret = __soft_offline_page(page, flags);
+		ret = __soft_offline_page(page);
 	set_pageblock_migratetype(page, mt);
 	return ret;
 }
@@ -1894,7 +1894,6 @@ static int soft_offline_free_page(struct page *page)
 /**
  * soft_offline_page - Soft offline a page.
  * @pfn: pfn to soft-offline
- * @flags: flags. Same as memory_failure().
  *
  * Returns 0 on success, otherwise negated errno.
  *
@@ -1913,7 +1912,7 @@ static int soft_offline_free_page(struct page *page)
  * This is not a 100% solution for all memory, but tries to be
  * ``good enough'' for the majority of memory.
  */
-int soft_offline_page(unsigned long pfn, int flags)
+int soft_offline_page(unsigned long pfn)
 {
 	int ret;
 	struct page *page;
@@ -1931,11 +1930,11 @@ int soft_offline_page(unsigned long pfn, int flags)
 	}
 
 	get_online_mems();
-	ret = get_any_page(page, pfn, flags);
+	ret = get_any_page(page, pfn);
 	put_online_mems();
 
 	if (ret > 0)
-		ret = soft_offline_in_use_page(page, flags);
+		ret = soft_offline_in_use_page(page);
 	else if (ret == 0)
 		ret = soft_offline_free_page(page);
 
-- 
2.26.2



* [PATCH v4 10/15] mm,hwpoison: Unify THP handling for hard and soft offline
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (8 preceding siblings ...)
  2020-07-16 12:38 ` [PATCH v4 09/15] mm,hwpoison: remove flag argument from soft offline functions Oscar Salvador
@ 2020-07-16 12:38 ` Oscar Salvador
  2020-07-16 12:38 ` [PATCH v4 11/15] mm,hwpoison: Rework soft offline for free pages Oscar Salvador
                   ` (6 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:38 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Oscar Salvador, Oscar Salvador

Place the THP's page handling in a helper and use it
from both hard and soft-offline machinery, so we get rid
of some duplicated code.

Signed-off-by: Oscar Salvador <osalvador@suse.com>
Acked-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/memory-failure.c | 48 +++++++++++++++++++++------------------------
 1 file changed, 22 insertions(+), 26 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 03d3aae77f89..2e244d5b83e0 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1103,6 +1103,25 @@ static int identify_page_state(unsigned long pfn, struct page *p,
 	return page_action(ps, p, pfn);
 }
 
+static int try_to_split_thp_page(struct page *page, const char *msg)
+{
+	lock_page(page);
+	if (!PageAnon(page) || unlikely(split_huge_page(page))) {
+		unsigned long pfn = page_to_pfn(page);
+
+		unlock_page(page);
+		if (!PageAnon(page))
+			pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
+		else
+			pr_info("%s: %#lx: thp split failed\n", msg, pfn);
+		put_page(page);
+		return -EBUSY;
+	}
+	unlock_page(page);
+
+	return 0;
+}
+
 static int memory_failure_hugetlb(unsigned long pfn, int flags)
 {
 	struct page *p = pfn_to_page(pfn);
@@ -1325,21 +1344,8 @@ int memory_failure(unsigned long pfn, int flags)
 	}
 
 	if (PageTransHuge(hpage)) {
-		lock_page(p);
-		if (!PageAnon(p) || unlikely(split_huge_page(p))) {
-			unlock_page(p);
-			if (!PageAnon(p))
-				pr_err("Memory failure: %#lx: non anonymous thp\n",
-					pfn);
-			else
-				pr_err("Memory failure: %#lx: thp split failed\n",
-					pfn);
-			if (TestClearPageHWPoison(p))
-				num_poisoned_pages_dec();
-			put_page(p);
+		if (try_to_split_thp_page(p, "Memory Failure") < 0)
 			return -EBUSY;
-		}
-		unlock_page(p);
 		VM_BUG_ON_PAGE(!page_count(p), p);
 	}
 
@@ -1847,19 +1853,9 @@ static int soft_offline_in_use_page(struct page *page)
 	int mt;
 	struct page *hpage = compound_head(page);
 
-	if (!PageHuge(page) && PageTransHuge(hpage)) {
-		lock_page(page);
-		if (!PageAnon(page) || unlikely(split_huge_page(page))) {
-			unlock_page(page);
-			if (!PageAnon(page))
-				pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
-			else
-				pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
-			put_page(page);
+	if (!PageHuge(page) && PageTransHuge(hpage))
+		if (try_to_split_thp_page(page, "soft offline") < 0)
 			return -EBUSY;
-		}
-		unlock_page(page);
-	}
 
 	/*
 	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
-- 
2.26.2



* [PATCH v4 11/15] mm,hwpoison: Rework soft offline for free pages
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (9 preceding siblings ...)
  2020-07-16 12:38 ` [PATCH v4 10/15] mm,hwpoison: Unify THP handling for hard and soft offline Oscar Salvador
@ 2020-07-16 12:38 ` Oscar Salvador
  2020-07-16 12:38 ` [PATCH v4 12/15] mm,hwpoison: Rework soft offline for in-use pages Oscar Salvador
                   ` (5 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:38 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Oscar Salvador, Oscar Salvador

When trying to soft-offline a free page, we need to first take it off
the buddy allocator.
Once we know it is out of reach, we can safely flag it as poisoned.

take_page_off_buddy will be used to take a page meant to be poisoned
off the buddy allocator.
take_page_off_buddy calls break_down_buddy_pages, which splits a
higher-order page in case our page belongs to one.

Once the page is under our control, we call page_handle_poison to set it
as poisoned and grab a refcount on it.
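
As a rough illustration of the splitting step described above, here is a
userspace toy model, not the kernel code. The names toy_block and
toy_break_down are made up for this sketch, pages are plain indices, and
details such as guard pages and migratetypes are left out on purpose.
It halves the free block that contains the target page, hands back the
half that does not contain the target, and repeats until the target
stands alone.

```c
#include <assert.h>
#include <stddef.h>

/* One free block handed back to the (toy) free lists. */
struct toy_block {
	unsigned long start;	/* index of the first page in the block */
	unsigned int order;	/* the block covers 1 << order pages */
};

/*
 * Split the free block [head, head + (1 << high)) until the target page
 * is left alone as an order-0 block. Every half that does not contain
 * the target is recorded in out[] as "returned to the free lists".
 * Returns the number of entries written to out[], which equals the
 * initial order of the block.
 */
static size_t toy_break_down(unsigned long head, unsigned long target,
			     unsigned int high, struct toy_block *out)
{
	unsigned long size = 1UL << high;
	size_t n = 0;

	assert(target >= head && target < head + size);

	while (high > 0) {
		unsigned long buddy;

		high--;
		size >>= 1;

		if (target >= head + size) {
			/* Target is in the upper half: give back the lower half. */
			buddy = head;
			head += size;
		} else {
			/* Target is in the lower half: give back the upper half. */
			buddy = head + size;
		}

		out[n].start = buddy;
		out[n].order = high;
		n++;
	}

	/* The remaining order-0 block is the target; it is never given back. */
	assert(head == target);
	return n;
}
```

For an order-3 block starting at page 8 with page 13 as the target, the
sketch gives back the blocks (8, order 2), (14, order 1) and (12, order 0),
i.e. seven of the eight pages, and keeps page 13 out of the free lists.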

Signed-off-by: Oscar Salvador <osalvador@suse.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/page-flags.h |  1 +
 mm/memory-failure.c        | 17 +++++++---
 mm/page_alloc.c            | 68 ++++++++++++++++++++++++++++++++++++++
 3 files changed, 81 insertions(+), 5 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 276140c94f4a..01baf6d426ff 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -425,6 +425,7 @@ PAGEFLAG_FALSE(Uncached)
 PAGEFLAG(HWPoison, hwpoison, PF_ANY)
 TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
+extern bool take_page_off_buddy(struct page *page);
 extern bool set_hwpoison_free_buddy_page(struct page *page);
 #else
 PAGEFLAG_FALSE(HWPoison)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2e244d5b83e0..caf012d34607 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -65,6 +65,13 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
+static void page_handle_poison(struct page *page)
+{
+	SetPageHWPoison(page);
+	page_ref_inc(page);
+	num_poisoned_pages_inc();
+}
+
 #if defined(CONFIG_HWPOISON_INJECT) || defined(CONFIG_HWPOISON_INJECT_MODULE)
 
 u32 hwpoison_filter_enable = 0;
@@ -1876,14 +1883,14 @@ static int soft_offline_in_use_page(struct page *page)
 
 static int soft_offline_free_page(struct page *page)
 {
+	int rc = -EBUSY;
 	int rc = dissolve_free_huge_page(page);
 
-	if (!rc) {
-		if (set_hwpoison_free_buddy_page(page))
-			num_poisoned_pages_inc();
-		else
-			rc = -EBUSY;
+	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
+		page_handle_poison(page);
+		rc = 0;
 	}
+
 	return rc;
 }
 
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 9e07a5d2d30d..4fa0e0887c07 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8777,6 +8777,74 @@ bool is_free_buddy_page(struct page *page)
 }
 
 #ifdef CONFIG_MEMORY_FAILURE
+/*
+ * Break down a higher-order page into sub-pages, and keep our target out of
+ * the buddy allocator.
+ */
+static void break_down_buddy_pages(struct zone *zone, struct page *page,
+				   struct page *target, int low, int high,
+				   int migratetype)
+{
+	unsigned long size = 1 << high;
+	struct page *current_buddy, *next_page;
+
+	while (high > low) {
+		high--;
+		size >>= 1;
+
+		if (target >= &page[size]) {
+			next_page = page + size;
+			current_buddy = page;
+		} else {
+			next_page = page;
+			current_buddy = page + size;
+		}
+
+		if (set_page_guard(zone, current_buddy, high, migratetype))
+			continue;
+
+		if (current_buddy != target) {
+			add_to_free_list(current_buddy, zone, high, migratetype);
+			set_page_order(current_buddy, high);
+			page = next_page;
+		}
+	}
+}
+
+/*
+ * Take a page that will be marked as poisoned off the buddy allocator.
+ */
+bool take_page_off_buddy(struct page *page)
+{
+	struct zone *zone = page_zone(page);
+	unsigned long pfn = page_to_pfn(page);
+	unsigned long flags;
+	unsigned int order;
+	bool ret = false;
+
+	spin_lock_irqsave(&zone->lock, flags);
+	for (order = 0; order < MAX_ORDER; order++) {
+		struct page *page_head = page - (pfn & ((1 << order) - 1));
+		int buddy_order = page_order(page_head);
+
+		if (PageBuddy(page_head) && buddy_order >= order) {
+			unsigned long pfn_head = page_to_pfn(page_head);
+			int migratetype = get_pfnblock_migratetype(page_head,
+								   pfn_head);
+
+			del_page_from_free_list(page_head, zone, buddy_order);
+			break_down_buddy_pages(zone, page_head, page, 0,
+						buddy_order, migratetype);
+			ret = true;
+			break;
+		}
+		if (page_count(page_head) > 0)
+			break;
+	}
+	spin_unlock_irqrestore(&zone->lock, flags);
+	return ret;
+}
+
 /*
  * Set PG_hwpoison flag if a given page is confirmed to be a free page.  This
  * test is performed under the zone lock to prevent a race against page
-- 
2.26.2



* [PATCH v4 12/15] mm,hwpoison: Rework soft offline for in-use pages
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (10 preceding siblings ...)
  2020-07-16 12:38 ` [PATCH v4 11/15] mm,hwpoison: Rework soft offline for free pages Oscar Salvador
@ 2020-07-16 12:38 ` Oscar Salvador
  2020-07-17  6:55   ` HORIGUCHI NAOYA(堀口 直也)
       [not found]   ` <f7387d64d0024d15a1bc821a8e19b8f0@DB7PR04MB5180.eurprd04.prod.outlook.com>
  2020-07-16 12:38 ` [PATCH v4 13/15] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page Oscar Salvador
                   ` (4 subsequent siblings)
  16 siblings, 2 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:38 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Oscar Salvador, Oscar Salvador

This patch changes the way we set and handle in-use poisoned pages.
Until now, poisoned pages were released to the buddy allocator, trusting
that the checks that take place prior to delivering the page to its end
user would act as a safety net and would skip that page.

This has proved to be wrong, as there are some pfn walkers out there, like
compaction, that only care that the page is PageBuddy and sits in a
freelist.
Although this might not be the only user, having poisoned pages
in the buddy allocator seems a bad idea, as we should only have
free pages that are ready and meant to be used as such.

Before explaining the taken approach, let us break down the kind
of pages we can soft offline.

- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalidated)

* Normal pages (order-0 and anon-THP)

  - If they are clean and unmapped page cache pages, we invalidate
    them by means of invalidate_inode_page().
  - If they are mapped/dirty, we do the isolate-and-migrate dance.

  Either way, we do not call put_page directly from those paths.
  Instead, we keep the page and send it to page_handle_poison to perform the
  right handling.

  Among other things, page_handle_poison() sets the HWPoison flag and does
  the last put_page.
  This call to put_page is mainly to be able to call __page_cache_release,
  since this function is not exported.

  Down the chain, we placed a check for HWPoison pages in
  free_pages_prepare that just skips any poisoned page, so those pages
  do not end up in either a pcplist or a buddy freelist.

  After that, we set the refcount on the page to 1 and we increment
  the poisoned pages counter.

  We could do as we do for free pages:
  1) wait until the page hits buddy's freelists
  2) take it off
  3) flag it

  The problem is that we could race with an allocation: by the time we
  want to take the page off the buddy, the page may already have been
  allocated, so we cannot soft-offline it.
  This is not fatal of course, but it is better to close the race,
  as doing so does not require a lot of code.
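
  The handling of normal pages described above can be pictured with a
  small userspace toy model. This is only a sketch under simplifying
  assumptions: toy_page, toy_free_prepare, toy_put_page and
  toy_handle_poison are names invented here, and a single boolean stands
  in for both the pcplists and the buddy freelists.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	bool hwpoison;		/* stands in for the HWPoison page flag */
	bool on_freelist;	/* true once the page is back in the allocator */
	int refcount;
};

static long toy_num_poisoned;

/* Same idea as the check in the free path: refuse poisoned pages. */
static bool toy_free_prepare(const struct toy_page *page)
{
	return !page->hwpoison;
}

/* Drop one reference; the last one sends the page through the free path. */
static void toy_put_page(struct toy_page *page)
{
	assert(page->refcount > 0);
	if (--page->refcount == 0 && toy_free_prepare(page))
		page->on_freelist = true;
}

/*
 * Flag the page first, then drop the last reference. Because the flag is
 * already set, the free path skips the page and nobody can allocate it.
 * Finally pin the page again and account for it.
 */
static void toy_handle_poison(struct toy_page *page, bool release)
{
	page->hwpoison = true;
	if (release)
		toy_put_page(page);
	page->refcount++;
	toy_num_poisoned++;
}
```

  In this model a page that goes through toy_handle_poison() with
  release set ends up flagged, with a refcount of 1, and never on the
  free list, so there is no window in which an allocation could grab it.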

* Hugetlb pages

  - We isolate-and-migrate them

  There is no magic in here, we just isolate and migrate them.
  A new set of internal functions has been added to flag a hugetlb page as
  poisoned (SetPageHugePoisoned(), PageHugePoisoned(), ClearPageHugePoisoned()).
  This allows us to flag the page when we migrate it, back in
  move_hugetlb_state().
  Later on we check whether the page is poisoned in __free_huge_page,
  and we bail out in that case before sending the page to e.g. the active
  free list.
  This gives us full control of the page, and we can handle it in
  page_handle_poison().

  In other words, we do not allow migrated hugepages to get back to the
  freelists.

  Since the page now has no user and has been migrated, we can call
  dissolve_free_huge_page, which will end up calling update_and_free_page.
  In update_and_free_page(), we check whether the page is poisoned.
  If it is, we handle it as we handle gigantic pages, i.e. we break down
  the page into order-0 pages and free them one by one.
  Doing so allows free_pages_prepare to skip poisoned pages.
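
  To illustrate why the huge page is freed as order-0 pages, here is a
  minimal userspace sketch. It is a toy, not the hugetlb code:
  toy_subpage and toy_free_as_order0 are hypothetical names, and the
  per-page poison check is folded into the loop for brevity.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct toy_subpage {
	bool hwpoison;		/* this sub-page is the poisoned one */
	bool on_freelist;	/* the sub-page went back to the allocator */
};

/*
 * Free every sub-page of a broken-down huge page individually.
 * Poisoned sub-pages are skipped, the healthy ones are given back.
 * Returns the number of sub-pages that reached the free list.
 */
static size_t toy_free_as_order0(struct toy_subpage *pages, size_t nr)
{
	size_t freed = 0;

	assert(pages != NULL);

	for (size_t i = 0; i < nr; i++) {
		if (pages[i].hwpoison)
			continue;
		pages[i].on_freelist = true;
		freed++;
	}
	return freed;
}
```

  Freeing the huge page as one high-order block would give back the bad
  sub-page together with the good ones. Going page by page loses only
  the poisoned page and returns the rest.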

Because of the way we now handle in-use pages, we no longer need the
put-as-isolation-migratetype dance, which was guarding against poisoned
pages ending up in pcplists.

Signed-off-by: Oscar Salvador <osalvador@suse.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 include/linux/page-flags.h |  5 ----
 mm/hugetlb.c               | 60 +++++++++++++++++++++++++++++++++-----
 mm/memory-failure.c        | 53 +++++++++++++--------------------
 mm/migrate.c               | 11 ++-----
 mm/page_alloc.c            | 38 +++++++-----------------
 5 files changed, 86 insertions(+), 81 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index 01baf6d426ff..2ac8bfa0cf20 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -426,13 +426,8 @@ PAGEFLAG(HWPoison, hwpoison, PF_ANY)
 TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
 extern bool take_page_off_buddy(struct page *page);
-extern bool set_hwpoison_free_buddy_page(struct page *page);
 #else
 PAGEFLAG_FALSE(HWPoison)
-static inline bool set_hwpoison_free_buddy_page(struct page *page)
-{
-	return 0;
-}
 #define __PG_HWPOISON 0
 #endif
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 7badb01d15e3..1c6397936512 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -29,6 +29,7 @@
 #include <linux/numa.h>
 #include <linux/llist.h>
 #include <linux/cma.h>
+#include <linux/migrate.h>
 
 #include <asm/page.h>
 #include <asm/pgalloc.h>
@@ -1209,9 +1210,26 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
 		nr_nodes--)
 
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_page(struct page *page,
-					unsigned int order)
+static inline bool PageHugePoisoned(struct page *page)
+{
+	if (!PageHuge(page))
+		return false;
+
+	return (unsigned long)page[3].mapping == -1U;
+}
+
+static inline void SetPageHugePoisoned(struct page *page)
+{
+	page[3].mapping = (void *)-1U;
+}
+
+static inline void ClearPageHugePoisoned(struct page *page)
+{
+	page[3].mapping = NULL;
+}
+
+static void destroy_compound_gigantic_page(struct hstate *h, struct page *page,
+					   unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
@@ -1222,14 +1240,19 @@ static void destroy_compound_gigantic_page(struct page *page,
 		atomic_set(compound_pincount_ptr(page), 0);
 
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		if (!hstate_is_gigantic(h))
+			p->mapping = NULL;
 		clear_compound_head(p);
 		set_page_refcounted(p);
 	}
 
+	if (PageHugePoisoned(page))
+		ClearPageHugePoisoned(page);
 	set_compound_order(page, 0);
 	__ClearPageHead(page);
 }
 
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
 	/*
@@ -1284,13 +1307,16 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
 	return NULL;
 }
 static inline void free_gigantic_page(struct page *page, unsigned int order) { }
-static inline void destroy_compound_gigantic_page(struct page *page,
-						unsigned int order) { }
+static inline void destroy_compound_gigantic_page(struct hstate *h,
+						  struct page *page,
+						  unsigned int order) { }
 #endif
 
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
+	bool poisoned = PageHugePoisoned(page);
+	unsigned int order = huge_page_order(h);
 
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
@@ -1313,11 +1339,21 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 		 * we might block in free_gigantic_page().
 		 */
 		spin_unlock(&hugetlb_lock);
-		destroy_compound_gigantic_page(page, huge_page_order(h));
-		free_gigantic_page(page, huge_page_order(h));
+		destroy_compound_gigantic_page(h, page, order);
+		free_gigantic_page(page, order);
 		spin_lock(&hugetlb_lock);
 	} else {
-		__free_pages(page, huge_page_order(h));
+		if (unlikely(poisoned)) {
+			/*
+			 * If the hugepage is poisoned, do as we do for
+			 * gigantic pages and free the pages as order-0.
+			 * free_pages_prepare will skip over the poisoned ones.
+			 */
+			destroy_compound_gigantic_page(h, page, order);
+			free_contig_range(page_to_pfn(page), 1 << order);
+		} else {
+			__free_pages(page, huge_page_order(h));
+		}
 	}
 }
 
@@ -1427,6 +1463,11 @@ static void __free_huge_page(struct page *page)
 	if (restore_reserve)
 		h->resv_huge_pages++;
 
+	if (PageHugePoisoned(page)) {
+		spin_unlock(&hugetlb_lock);
+		return;
+	}
+
 	if (PageHugeTemporary(page)) {
 		list_del(&page->lru);
 		ClearPageHugeTemporary(page);
@@ -5642,6 +5683,9 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
 	hugetlb_cgroup_migrate(oldpage, newpage);
 	set_page_owner_migrate_reason(newpage, reason);
 
+	if (reason == MR_MEMORY_FAILURE)
+		SetPageHugePoisoned(oldpage);
+
 	/*
 	 * transfer temporary state of the new huge page. This is
 	 * reverse to other transitions because the newpage is going to
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index caf012d34607..c0ebab4eed4c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -65,9 +65,17 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
 
 atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
-static void page_handle_poison(struct page *page)
+static void page_handle_poison(struct page *page, bool release, bool set_flag,
+			       bool huge_flag)
 {
-	SetPageHWPoison(page);
+	if (set_flag)
+		SetPageHWPoison(page);
+
+	if (huge_flag)
+		dissolve_free_huge_page(page);
+	else if (release)
+		put_page(page);
+
 	page_ref_inc(page);
 	num_poisoned_pages_inc();
 }
@@ -1717,7 +1725,7 @@ static int get_any_page(struct page *page, unsigned long pfn)
 
 static int soft_offline_huge_page(struct page *page)
 {
-	int ret;
+	int ret = -EBUSY;
 	unsigned long pfn = page_to_pfn(page);
 	struct page *hpage = compound_head(page);
 	LIST_HEAD(pagelist);
@@ -1757,19 +1765,12 @@ static int soft_offline_huge_page(struct page *page)
 			ret = -EIO;
 	} else {
 		/*
-		 * We set PG_hwpoison only when the migration source hugepage
-		 * was successfully dissolved, because otherwise hwpoisoned
-		 * hugepage remains on free hugepage list, then userspace will
-		 * find it as SIGBUS by allocation failure. That's not expected
-		 * in soft-offlining.
+		 * At this point the page cannot be in-use since we do not
+		 * let the page to go back to hugetlb freelists.
+		 * In that case we just need to dissolve it.
+		 * page_handle_poison will take care of it.
 		 */
-		ret = dissolve_free_huge_page(page);
-		if (!ret) {
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-			else
-				ret = -EBUSY;
-		}
+		page_handle_poison(page, true, true, true);
 	}
 	return ret;
 }
@@ -1804,10 +1805,8 @@ static int __soft_offline_page(struct page *page)
 	 * would need to fix isolation locking first.
 	 */
 	if (ret == 1) {
-		put_page(page);
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
-		SetPageHWPoison(page);
-		num_poisoned_pages_inc();
+		page_handle_poison(page, true, true, false);
 		return 0;
 	}
 
@@ -1838,7 +1837,9 @@ static int __soft_offline_page(struct page *page)
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
 					MIGRATE_SYNC, MR_MEMORY_FAILURE);
-		if (ret) {
+		if (!ret) {
+			page_handle_poison(page, true, true, false);
+		} else {
 			if (!list_empty(&pagelist))
 				putback_movable_pages(&pagelist);
 
@@ -1857,37 +1858,25 @@ static int __soft_offline_page(struct page *page)
 static int soft_offline_in_use_page(struct page *page)
 {
 	int ret;
-	int mt;
 	struct page *hpage = compound_head(page);
 
 	if (!PageHuge(page) && PageTransHuge(hpage))
 		if (try_to_split_thp_page(page, "soft offline") < 0)
 			return -EBUSY;
 
-	/*
-	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
-	 * to free list immediately (not via pcplist) when released after
-	 * successful page migration. Otherwise we can't guarantee that the
-	 * page is really free after put_page() returns, so
-	 * set_hwpoison_free_buddy_page() highly likely fails.
-	 */
-	mt = get_pageblock_migratetype(page);
-	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 	if (PageHuge(page))
 		ret = soft_offline_huge_page(page);
 	else
 		ret = __soft_offline_page(page);
-	set_pageblock_migratetype(page, mt);
 	return ret;
 }
 
 static int soft_offline_free_page(struct page *page)
 {
 	int rc = -EBUSY;
-	int rc = dissolve_free_huge_page(page);
 
 	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
-		page_handle_poison(page);
+		page_handle_poison(page, false, true, false);
 		rc = 0;
 	}
 
diff --git a/mm/migrate.c b/mm/migrate.c
index 75c10d81e833..a68d81d0ae6e 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1222,16 +1222,11 @@ static int unmap_and_move(new_page_t get_new_page,
 	 * we want to retry.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		put_page(page);
-		if (reason == MR_MEMORY_FAILURE) {
+		if (reason != MR_MEMORY_FAILURE)
 			/*
-			 * Set PG_HWPoison on just freed page
-			 * intentionally. Although it's rather weird,
-			 * it's how HWPoison flag works at the moment.
+			 * We handle poisoned pages in page_handle_poison.
 			 */
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-		}
+			put_page(page);
 	} else {
 		if (rc != -EAGAIN) {
 			if (likely(!__PageMovable(page))) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 4fa0e0887c07..11df51fc2718 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1175,6 +1175,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
 
 	trace_mm_page_free(page, order);
 
+	if (unlikely(PageHWPoison(page)) && !order) {
+		/*
+		 * Untie memcg state and reset page's owner
+		 */
+		if (memcg_kmem_enabled() && PageKmemcg(page))
+			__memcg_kmem_uncharge_page(page, order);
+		reset_page_owner(page, order);
+		return false;
+	}
+
 	/*
 	 * Check tail pages before head page information is cleared to
 	 * avoid checking PageCompound for order-0 pages.
@@ -8844,32 +8854,4 @@ bool take_page_off_buddy(struct page *page)
 	spin_unlock_irqrestore(&zone->lock, flags);
 	return ret;
 }
-
-/*
- * Set PG_hwpoison flag if a given page is confirmed to be a free page.  This
- * test is performed under the zone lock to prevent a race against page
- * allocation.
- */
-bool set_hwpoison_free_buddy_page(struct page *page)
-{
-	struct zone *zone = page_zone(page);
-	unsigned long pfn = page_to_pfn(page);
-	unsigned long flags;
-	unsigned int order;
-	bool hwpoisoned = false;
-
-	spin_lock_irqsave(&zone->lock, flags);
-	for (order = 0; order < MAX_ORDER; order++) {
-		struct page *page_head = page - (pfn & ((1 << order) - 1));
-
-		if (PageBuddy(page_head) && page_order(page_head) >= order) {
-			if (!TestSetPageHWPoison(page))
-				hwpoisoned = true;
-			break;
-		}
-	}
-	spin_unlock_irqrestore(&zone->lock, flags);
-
-	return hwpoisoned;
-}
 #endif
-- 
2.26.2



* [PATCH v4 13/15] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (11 preceding siblings ...)
  2020-07-16 12:38 ` [PATCH v4 12/15] mm,hwpoison: Rework soft offline for in-use pages Oscar Salvador
@ 2020-07-16 12:38 ` Oscar Salvador
  2020-07-16 12:38 ` [PATCH v4 14/15] mm,hwpoison: Return 0 if the page is already poisoned in soft-offline Oscar Salvador
                   ` (3 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:38 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Oscar Salvador, Oscar Salvador

Merging soft_offline_huge_page and __soft_offline_page lets us get rid of
quite a bit of duplicated code, and makes the code much easier to follow.

Now, __soft_offline_page will handle both normal and hugetlb pages.

Note that this moves the put_page() block to the beginning of
page_handle_poison(), together with drain_all_pages(), in order to make
sure that the target page is freed and sent to the free list, so that
take_page_off_buddy() works properly.
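
The reference counting rule that the new isolate_page() helper follows
can be sketched in userspace as below. This is a toy model under stated
assumptions: toy_page and toy_isolate are invented names, and the
isolatable field simply stands in for whether the real isolation
functions would succeed.

```c
#include <assert.h>
#include <stdbool.h>

struct toy_page {
	int refcount;
	bool isolatable;	/* would the isolation attempt succeed? */
};

/*
 * The caller arrives holding one reference on the page.
 * On success the isolation itself pins the page, so the caller's
 * reference is no longer needed. On failure we give up on the page,
 * so the caller's reference has to go as well. Either way, exactly
 * one reference is dropped here.
 */
static bool toy_isolate(struct toy_page *page)
{
	bool isolated = false;

	assert(page->refcount > 0);

	if (page->isolatable) {
		page->refcount++;	/* reference owned by the isolation */
		isolated = true;
	}

	page->refcount--;		/* reference the caller came in with */
	return isolated;
}
```

Keeping the drop in one place means the caller does not need separate
put_page() calls on the success and error paths.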

Signed-off-by: Oscar Salvador <osalvador@suse.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/memory-failure.c | 141 ++++++++++++++++----------------------------
 1 file changed, 52 insertions(+), 89 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c0ebab4eed4c..c6c83337708a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1723,62 +1723,50 @@ static int get_any_page(struct page *page, unsigned long pfn)
 	return ret;
 }
 
-static int soft_offline_huge_page(struct page *page)
+static bool isolate_page(struct page *page, struct list_head *pagelist)
 {
-	int ret = -EBUSY;
-	unsigned long pfn = page_to_pfn(page);
-	struct page *hpage = compound_head(page);
-	LIST_HEAD(pagelist);
+	bool isolated = false;
+	bool lru = PageLRU(page);
 
-	/*
-	 * This double-check of PageHWPoison is to avoid the race with
-	 * memory_failure(). See also comment in __soft_offline_page().
-	 */
-	lock_page(hpage);
-	if (PageHWPoison(hpage)) {
-		unlock_page(hpage);
-		put_page(hpage);
-		pr_info("soft offline: %#lx hugepage already poisoned\n", pfn);
-		return -EBUSY;
+	if (PageHuge(page)) {
+		isolated = isolate_huge_page(page, pagelist);
+	} else {
+		if (lru)
+			isolated = !isolate_lru_page(page);
+		else
+			isolated = !isolate_movable_page(page, ISOLATE_UNEVICTABLE);
+
+		if (isolated)
+			list_add(&page->lru, pagelist);
 	}
-	unlock_page(hpage);
 
-	ret = isolate_huge_page(hpage, &pagelist);
+	if (isolated && lru)
+		inc_node_page_state(page, NR_ISOLATED_ANON +
+				    page_is_file_lru(page));
+
 	/*
-	 * get_any_page() and isolate_huge_page() takes a refcount each,
-	 * so need to drop one here.
+	 * If we succeed in isolating the page, we grabbed another refcount on
+	 * the page, so we can safely drop the one we got from get_any_page().
+	 * If we failed to isolate the page, it means that we cannot go further
+	 * and we will return an error, so drop the reference we got from
+	 * get_any_page() as well.
 	 */
-	put_page(hpage);
-	if (!ret) {
-		pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
-		return -EBUSY;
-	}
-
-	ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
-				MIGRATE_SYNC, MR_MEMORY_FAILURE);
-	if (ret) {
-		pr_info("soft offline: %#lx: hugepage migration failed %d, type %lx (%pGp)\n",
-			pfn, ret, page->flags, &page->flags);
-		if (!list_empty(&pagelist))
-			putback_movable_pages(&pagelist);
-		if (ret > 0)
-			ret = -EIO;
-	} else {
-		/*
-		 * At this point the page cannot be in-use since we do not
-		 * let the page to go back to hugetlb freelists.
-		 * In that case we just need to dissolve it.
-		 * page_handle_poison will take care of it.
-		 */
-		page_handle_poison(page, true, true, true);
-	}
-	return ret;
+	put_page(page);
+	return isolated;
 }
 
+/*
+ * __soft_offline_page handles hugetlb-pages and non-hugetlb pages.
+ * If the page is a non-dirty unmapped page-cache page, it simply invalidates.
+ */
 static int __soft_offline_page(struct page *page)
 {
-	int ret;
+	int ret = 0;
 	unsigned long pfn = page_to_pfn(page);
+	struct page *hpage = compound_head(page);
+	const char *msg_page[] = {"page", "hugepage"};
+	bool huge = PageHuge(page);
+	LIST_HEAD(pagelist);
 
 	/*
 	 * Check PageHWPoison again inside page lock because PageHWPoison
@@ -1787,88 +1775,63 @@ static int __soft_offline_page(struct page *page)
 	 * so there's no race between soft_offline_page() and memory_failure().
 	 */
 	lock_page(page);
-	wait_on_page_writeback(page);
+	if (!PageHuge(page))
+		wait_on_page_writeback(page);
 	if (PageHWPoison(page)) {
 		unlock_page(page);
 		put_page(page);
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
 		return -EBUSY;
 	}
-	/*
-	 * Try to invalidate first. This should work for
-	 * non dirty unmapped page cache pages.
-	 */
-	ret = invalidate_inode_page(page);
+
+	if (!PageHuge(page))
+		/*
+		 * Try to invalidate first. This should work for
+		 * non dirty unmapped page cache pages.
+		 */
+		ret = invalidate_inode_page(page);
 	unlock_page(page);
+
 	/*
 	 * RED-PEN would be better to keep it isolated here, but we
 	 * would need to fix isolation locking first.
 	 */
 	if (ret == 1) {
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
-		page_handle_poison(page, true, true, false);
+		page_handle_poison(page, false, true, false);
 		return 0;
 	}
 
-	/*
-	 * Simple invalidation didn't work.
-	 * Try to migrate to a new page instead. migrate.c
-	 * handles a large number of cases for us.
-	 */
-	if (PageLRU(page))
-		ret = isolate_lru_page(page);
-	else
-		ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
-	/*
-	 * Drop page reference which is came from get_any_page()
-	 * successful isolate_lru_page() already took another one.
-	 */
-	put_page(page);
-	if (!ret) {
-		LIST_HEAD(pagelist);
-		/*
-		 * After isolated lru page, the PageLRU will be cleared,
-		 * so use !__PageMovable instead for LRU page's mapping
-		 * cannot have PAGE_MAPPING_MOVABLE.
-		 */
-		if (!__PageMovable(page))
-			inc_node_page_state(page, NR_ISOLATED_ANON +
-						page_is_file_lru(page));
-		list_add(&page->lru, &pagelist);
-		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
+	if (isolate_page(hpage, &pagelist)) {
+		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
 					MIGRATE_SYNC, MR_MEMORY_FAILURE);
 		if (!ret) {
-			page_handle_poison(page, true, true, false);
+			page_handle_poison(page, true, true, huge);
 		} else {
 			if (!list_empty(&pagelist))
 				putback_movable_pages(&pagelist);
 
-			pr_info("soft offline: %#lx: migration failed %d, type %lx (%pGp)\n",
-				pfn, ret, page->flags, &page->flags);
+			pr_info("soft offline: %#lx: %s migration failed %d, type %lx (%pGp)\n",
+				 pfn, msg_page[huge], ret, page->flags, &page->flags);
 			if (ret > 0)
 				ret = -EIO;
 		}
 	} else {
-		pr_info("soft offline: %#lx: isolation failed: %d, page count %d, type %lx (%pGp)\n",
-			pfn, ret, page_count(page), page->flags, &page->flags);
+		pr_info("soft offline: %#lx: %s isolation failed: %d, page count %d, type %lx (%pGp)\n",
+			 pfn, msg_page[huge], ret, page_count(page), page->flags, &page->flags);
 	}
 	return ret;
 }
 
 static int soft_offline_in_use_page(struct page *page)
 {
-	int ret;
 	struct page *hpage = compound_head(page);
 
 	if (!PageHuge(page) && PageTransHuge(hpage))
 		if (try_to_split_thp_page(page, "soft offline") < 0)
 			return -EBUSY;
 
-	if (PageHuge(page))
-		ret = soft_offline_huge_page(page);
-	else
-		ret = __soft_offline_page(page);
-	return ret;
+	return __soft_offline_page(page);
 }
 
 static int soft_offline_free_page(struct page *page)
-- 
2.26.2


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v4 14/15] mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (12 preceding siblings ...)
  2020-07-16 12:38 ` [PATCH v4 13/15] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page Oscar Salvador
@ 2020-07-16 12:38 ` Oscar Salvador
  2020-07-16 12:38 ` [PATCH v4 15/15] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP Oscar Salvador
                   ` (2 subsequent siblings)
  16 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:38 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Oscar Salvador, Oscar Salvador

Currently, there is an inconsistency when calling soft-offline from
different paths on a page that is already poisoned.

1) madvise:

        madvise_inject_error skips any poisoned page and continues
        the loop.
        If that was the only page to madvise, it returns 0.

2) /sys/devices/system/memory/:

        When calling soft_offline_page_store()->soft_offline_page(),
        we return -EBUSY in case the page is already poisoned.
        This is inconsistent with a) the above example and b)
        memory_failure, where we return 0 if the page was poisoned.

Fix this by dropping the PageHWPoison() check in madvise_inject_error,
and let soft_offline_page return 0 if it finds the page already poisoned.

Please note that this is a user-API change, since
soft_offline_page_store()->soft_offline_page() now returns 0 instead of
-EBUSY for a page that is already poisoned.
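
The convention described above can be sketched in user space. This is only
a toy model, not kernel code: struct toy_page, toy_soft_offline_page() and
toy_madvise_soft_offline() are hypothetical names made up for illustration,
and the "legacy" switch merely mimics the old -EBUSY return.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct page; only the poison bit is modelled. */
struct toy_page {
	bool hwpoison;
};

/*
 * Toy soft-offline. With legacy == true an already poisoned page yields
 * -EBUSY (the old sysfs behaviour); otherwise it yields 0.
 */
static int toy_soft_offline_page(struct toy_page *page, bool legacy)
{
	assert(page != NULL);

	if (page->hwpoison)
		return legacy ? -EBUSY : 0;

	page->hwpoison = true;	/* pretend isolation and migration succeeded */
	return 0;
}

/*
 * Toy madvise loop without a poison check of its own: it relies on
 * toy_soft_offline_page() and stops at the first error.
 */
static int toy_madvise_soft_offline(struct toy_page *pages, size_t n,
				    bool legacy)
{
	for (size_t i = 0; i < n; i++) {
		int ret = toy_soft_offline_page(&pages[i], legacy);

		if (ret)
			return ret;
	}
	return 0;
}
```

In this model, a range whose first page is already poisoned is still
processed to the end when legacy is false, whereas with legacy set the loop
would bail out with -EBUSY. That is why the check in the madvise loop can
only be dropped together with the change of return value.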

Signed-off-by: Oscar Salvador <osalvador@suse.com>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
---
 mm/madvise.c        | 3 ---
 mm/memory-failure.c | 4 ++--
 2 files changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 226f0fcf0828..7b5ca96108cd 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -920,9 +920,6 @@ static int madvise_inject_error(int behavior,
 		 */
 		put_page(page);
 
-		if (PageHWPoison(page))
-			continue;
-
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index c6c83337708a..2b2aa5a76b9b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1781,7 +1781,7 @@ static int __soft_offline_page(struct page *page)
 		unlock_page(page);
 		put_page(page);
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
-		return -EBUSY;
+		return 0;
 	}
 
 	if (!PageHuge(page))
@@ -1881,7 +1881,7 @@ int soft_offline_page(unsigned long pfn)
 
 	if (PageHWPoison(page)) {
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
-		return -EBUSY;
+		return 0;
 	}
 
 	get_online_mems();
-- 
2.26.2


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH v4 15/15] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (13 preceding siblings ...)
  2020-07-16 12:38 ` [PATCH v4 14/15] mm,hwpoison: Return 0 if the page is already poisoned in soft-offline Oscar Salvador
@ 2020-07-16 12:38 ` Oscar Salvador
  2020-07-16 12:38 ` [PATCH] x86/speculation: Add basic IBPB (Indirect Branch Prediction Barrier) support Oscar Salvador
  2020-07-17 14:49 ` [PATCH v4 00/15] Hwpoison soft-offline rework Qian Cai
  16 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:38 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, Oscar Salvador, Oscar Salvador

memory_failure() is supposed to call action_result() when it handles
a memory error event, but there's one missing case. So let's add it.

include/ras/ras_event.h is also missing some other MF_MSG_* entries,
so this patch adds them as well.
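
As a rough illustration of why the enum and its string table need to be
kept in step, here is a small user-space sketch. It is not the kernel's
code: the toy_* names and the TOY_MSG_COUNT sentinel are assumptions made
for this example (the real enum has no such sentinel).

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, shortened version of the message-type enum. */
enum toy_mf_msg {
	TOY_MSG_BUDDY,
	TOY_MSG_DAX,
	TOY_MSG_UNSPLIT_THP,
	TOY_MSG_UNKNOWN,
	TOY_MSG_COUNT,		/* number of entries, not a real type */
};

static const char * const toy_action_page_types[] = {
	[TOY_MSG_BUDDY]		= "free buddy page",
	[TOY_MSG_DAX]		= "dax page",
	[TOY_MSG_UNSPLIT_THP]	= "unsplit thp",
	[TOY_MSG_UNKNOWN]	= "unknown page",
};

/*
 * Catches a table whose trailing entries are missing. A hole in the
 * middle is left NULL instead, which toy_msg_name() handles below.
 */
static_assert(sizeof(toy_action_page_types) /
	      sizeof(toy_action_page_types[0]) == TOY_MSG_COUNT,
	      "string table and enum are out of sync");

/* Map a type to its string, falling back to "unknown page". */
static const char *toy_msg_name(enum toy_mf_msg type)
{
	if ((unsigned int)type >= TOY_MSG_COUNT || !toy_action_page_types[type])
		return toy_action_page_types[TOY_MSG_UNKNOWN];
	return toy_action_page_types[type];
}
```

Adding a new enumerator before TOY_MSG_UNKNOWN without a matching string
leaves a NULL slot in the table, so in this sketch the lookup falls back to
"unknown page" rather than handing a NULL pointer to the caller.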

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.com>
---
 include/linux/mm.h      | 1 +
 include/ras/ras_event.h | 3 +++
 mm/memory-failure.c     | 5 ++++-
 3 files changed, 8 insertions(+), 1 deletion(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 8f6a45165ec8..678ea25625d7 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3046,6 +3046,7 @@ enum mf_action_page_type {
 	MF_MSG_BUDDY,
 	MF_MSG_BUDDY_2ND,
 	MF_MSG_DAX,
+	MF_MSG_UNSPLIT_THP,
 	MF_MSG_UNKNOWN,
 };
 
diff --git a/include/ras/ras_event.h b/include/ras/ras_event.h
index 36c5c5e38c1d..0bdbc0d17d2f 100644
--- a/include/ras/ras_event.h
+++ b/include/ras/ras_event.h
@@ -361,6 +361,7 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_POISONED_HUGE, "huge page already hardware poisoned" )	\
 	EM ( MF_MSG_HUGE, "huge page" )					\
 	EM ( MF_MSG_FREE_HUGE, "free huge page" )			\
+	EM ( MF_MSG_NON_PMD_HUGE, "non-pmd-sized huge page" )		\
 	EM ( MF_MSG_UNMAP_FAILED, "unmapping failed page" )		\
 	EM ( MF_MSG_DIRTY_SWAPCACHE, "dirty swapcache page" )		\
 	EM ( MF_MSG_CLEAN_SWAPCACHE, "clean swapcache page" )		\
@@ -373,6 +374,8 @@ TRACE_EVENT(aer_event,
 	EM ( MF_MSG_TRUNCATED_LRU, "already truncated LRU page" )	\
 	EM ( MF_MSG_BUDDY, "free buddy page" )				\
 	EM ( MF_MSG_BUDDY_2ND, "free buddy page (2nd try)" )		\
+	EM ( MF_MSG_DAX, "dax page" )					\
+	EM ( MF_MSG_UNSPLIT_THP, "unsplit thp" )			\
 	EMe ( MF_MSG_UNKNOWN, "unknown page" )
 
 /*
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2b2aa5a76b9b..7359164c3fe9 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -569,6 +569,7 @@ static const char * const action_page_types[] = {
 	[MF_MSG_BUDDY]			= "free buddy page",
 	[MF_MSG_BUDDY_2ND]		= "free buddy page (2nd try)",
 	[MF_MSG_DAX]			= "dax page",
+	[MF_MSG_UNSPLIT_THP]		= "unsplit thp",
 	[MF_MSG_UNKNOWN]		= "unknown page",
 };
 
@@ -1359,8 +1360,10 @@ int memory_failure(unsigned long pfn, int flags)
 	}
 
 	if (PageTransHuge(hpage)) {
-		if (try_to_split_thp_page(p, "Memory Failure") < 0)
+		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
+			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
 			return -EBUSY;
+		}
 		VM_BUG_ON_PAGE(!page_count(p), p);
 	}
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 24+ messages in thread

* [PATCH] x86/speculation: Add basic IBPB (Indirect Branch Prediction Barrier) support
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (14 preceding siblings ...)
  2020-07-16 12:38 ` [PATCH v4 15/15] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP Oscar Salvador
@ 2020-07-16 12:38 ` Oscar Salvador
  2020-07-16 12:43   ` osalvador
  2020-07-17 14:49 ` [PATCH v4 00/15] Hwpoison soft-offline rework Qian Cai
  16 siblings, 1 reply; 24+ messages in thread
From: Oscar Salvador @ 2020-07-16 12:38 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, David Woodhouse, KarimAllah Ahmed,
	gnomes, ak, ashok.raj, dave.hansen, arjan, torvalds, peterz, bp,
	pbonzini, tim.c.chen, gregkh, Greg Kroah-Hartman,
	Srivatsa S . Bhat, Jiri Slaby

From: David Woodhouse <dwmw@amazon.co.uk>

(cherry picked from commit 20ffa1caecca4db8f79fe665acdeaa5af815a24d)

Expose indirect_branch_prediction_barrier() for use in subsequent patches.

[ tglx: Add IBPB status to spectre_v2 sysfs file ]

Co-developed-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de>
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Cc: gnomes@lxorguk.ukuu.org.uk
Cc: ak@linux.intel.com
Cc: ashok.raj@intel.com
Cc: dave.hansen@intel.com
Cc: arjan@linux.intel.com
Cc: torvalds@linux-foundation.org
Cc: peterz@infradead.org
Cc: bp@alien8.de
Cc: pbonzini@redhat.com
Cc: tim.c.chen@linux.intel.com
Cc: gregkh@linux-foundation.org
Link: https://lkml.kernel.org/r/1516896855-7642-8-git-send-email-dwmw@amazon.co.uk
Signed-off-by: David Woodhouse <dwmw@amazon.co.uk>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Srivatsa S. Bhat <srivatsa@csail.mit.edu>
Reviewed-by: Matt Helsley (VMware) <matt.helsley@gmail.com>
Reviewed-by: Alexey Makhalov <amakhalov@vmware.com>
Reviewed-by: Bo Gan <ganb@vmware.com>
Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Jiri Slaby <jslaby@suse.cz>
---
 arch/x86/include/asm/cpufeatures.h   |  2 ++
 arch/x86/include/asm/nospec-branch.h | 13 +++++++++++++
 arch/x86/kernel/cpu/bugs.c           | 10 +++++++++-
 3 files changed, 24 insertions(+), 1 deletion(-)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index a5671b849837..b4e370b5b761 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -201,6 +201,8 @@
 /* Because the ALTERNATIVE scheme is for members of the X86_FEATURE club... */
 #define X86_FEATURE_KAISER	( 7*32+31) /* CONFIG_PAGE_TABLE_ISOLATION w/o nokaiser */
 
+#define X86_FEATURE_IBPB		( 7*32+21) /* Indirect Branch Prediction Barrier enabled*/
+
 /* Virtualization flags: Linux defined, word 8 */
 #define X86_FEATURE_TPR_SHADOW  ( 8*32+ 0) /* Intel TPR Shadow */
 #define X86_FEATURE_VNMI        ( 8*32+ 1) /* Intel Virtual NMI */
diff --git a/arch/x86/include/asm/nospec-branch.h b/arch/x86/include/asm/nospec-branch.h
index 8b910416243c..41851afd44af 100644
--- a/arch/x86/include/asm/nospec-branch.h
+++ b/arch/x86/include/asm/nospec-branch.h
@@ -194,6 +194,19 @@ static inline void vmexit_fill_RSB(void)
 #endif
 }
 
+static inline void indirect_branch_prediction_barrier(void)
+{
+	asm volatile(ALTERNATIVE("",
+				 "movl %[msr], %%ecx\n\t"
+				 "movl %[val], %%eax\n\t"
+				 "movl $0, %%edx\n\t"
+				 "wrmsr",
+				 X86_FEATURE_IBPB)
+		     : : [msr] "i" (MSR_IA32_PRED_CMD),
+			 [val] "i" (PRED_CMD_IBPB)
+		     : "eax", "ecx", "edx", "memory");
+}
+
 #endif /* __ASSEMBLY__ */
 
 /*
diff --git a/arch/x86/kernel/cpu/bugs.c b/arch/x86/kernel/cpu/bugs.c
index 2bbc74f8a4a8..7def33ada730 100644
--- a/arch/x86/kernel/cpu/bugs.c
+++ b/arch/x86/kernel/cpu/bugs.c
@@ -296,6 +296,13 @@ static void __init spectre_v2_select_mitigation(void)
 		setup_force_cpu_cap(X86_FEATURE_RSB_CTXSW);
 		pr_info("Filling RSB on context switch\n");
 	}
+
+	/* Initialize Indirect Branch Prediction Barrier if supported */
+	if (boot_cpu_has(X86_FEATURE_SPEC_CTRL) ||
+	    boot_cpu_has(X86_FEATURE_AMD_PRED_CMD)) {
+		setup_force_cpu_cap(X86_FEATURE_IBPB);
+		pr_info("Enabling Indirect Branch Prediction Barrier\n");
+	}
 }
 
 #undef pr_fmt
@@ -325,7 +332,8 @@ ssize_t cpu_show_spectre_v2(struct device *dev,
 	if (!boot_cpu_has_bug(X86_BUG_SPECTRE_V2))
 		return sprintf(buf, "Not affected\n");
 
-	return sprintf(buf, "%s%s\n", spectre_v2_strings[spectre_v2_enabled],
+	return sprintf(buf, "%s%s%s\n", spectre_v2_strings[spectre_v2_enabled],
+		       boot_cpu_has(X86_FEATURE_IBPB) ? ", IBPB" : "",
 		       spectre_v2_module_string());
 }
 #endif
-- 
2.18.0


^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH] x86/speculation: Add basic IBPB (Indirect Branch Prediction Barrier) support
  2020-07-16 12:38 ` [PATCH] x86/speculation: Add basic IBPB (Indirect Branch Prediction Barrier) support Oscar Salvador
@ 2020-07-16 12:43   ` osalvador
  0 siblings, 0 replies; 24+ messages in thread
From: osalvador @ 2020-07-16 12:43 UTC (permalink / raw)
  To: akpm
  Cc: mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel, David Woodhouse, KarimAllah Ahmed,
	gnomes, ak, ashok.raj, dave.hansen, arjan, torvalds, peterz, bp,
	pbonzini, tim.c.chen, gregkh, Greg Kroah-Hartman,
	Srivatsa S . Bhat, Jiri Slaby

On 2020-07-16 14:38, Oscar Salvador wrote:
> From: David Woodhouse <dwmw@amazon.co.uk>

Sorry for the noise.
This should not be here.

I dunno how this patch sneaked in.

Please ignore it.




^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 03/15] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2020-07-16 12:37 ` [PATCH v4 03/15] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
@ 2020-07-16 23:15   ` Mike Kravetz
  0 siblings, 0 replies; 24+ messages in thread
From: Mike Kravetz @ 2020-07-16 23:15 UTC (permalink / raw)
  To: Oscar Salvador, akpm
  Cc: mhocko, linux-mm, david, aneesh.kumar, naoya.horiguchi,
	linux-kernel, Naoya Horiguchi, Oscar Salvador

On 7/16/20 5:37 AM, Oscar Salvador wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> The call to get_user_pages_fast is only to get the pointer to a struct
> page of a given address; pinning it is the memory-poisoning handler's job,
> so drop the refcount grabbed by get_user_pages_fast.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Oscar Salvador <osalvador@suse.com>

Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 24+ messages in thread

* Re: [PATCH v4 12/15] mm,hwpoison: Rework soft offline for in-use pages
  2020-07-16 12:38 ` [PATCH v4 12/15] mm,hwpoison: Rework soft offline for in-use pages Oscar Salvador
@ 2020-07-17  6:55   ` HORIGUCHI NAOYA(堀口 直也)
       [not found]   ` <f7387d64d0024d15a1bc821a8e19b8f0@DB7PR04MB5180.eurprd04.prod.outlook.com>
  1 sibling, 0 replies; 24+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-07-17  6:55 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	linux-kernel, Oscar Salvador

On Thu, Jul 16, 2020 at 02:38:06PM +0200, Oscar Salvador wrote:
> This patch changes the way we set and handle in-use poisoned pages.
> Until now, poisoned pages were released to the buddy allocator, trusting
> that the checks that take place prior to delivering the page to its end
> user would act as a safety net and would skip that page.
> 
> This has proved to be wrong, as there are pfn walkers out there, like
> compaction, that only care that the page is PageBuddy and sits in a
> freelist.
> Although this might not be the only user, having poisoned pages
> in the buddy allocator seems a bad idea as we should only have
> free pages that are ready and meant to be used as such.
> 
> Before explaining the taken approach, let us break down the kind
> of pages we can soft offline.
> 
> - Anonymous THP (after the split, they end up being 4K pages)
> - Hugetlb
> - Order-0 pages (that can be either migrated or invalidated)
> 
> * Normal pages (order-0 and anon-THP)
> 
>   - If they are clean and unmapped page cache pages, we invalidate
>     them by means of invalidate_inode_page().
>   - If they are mapped/dirty, we do the isolate-and-migrate dance.
> 
>   Either way, do not call put_page directly from those paths.
>   Instead, we keep the page and send it to page_handle_poison to perform the
>   right handling.
> 
>   Among other things, page_handle_poison() sets the HWPoison flag and does the last
>   put_page.
>   This call to put_page is mainly to be able to call __page_cache_release,
>   since this function is not exported.
> 
>   Down the chain, we placed a check for HWPoison page in
>   free_pages_prepare, that just skips any poisoned page, so those pages
>   do not end up either in a pcplist or in buddy-freelist.
> 
>   After that, we set the refcount on the page to 1 and we increment
>   the poisoned pages counter.
> 
>   We could do as we do for free pages:
>   1) wait until the page hits buddy's freelists
>   2) take it off
>   3) flag it
> 
>   The problem is that we could race with an allocation, so by the time we
>   want to take the page off the buddy, the page is already allocated, so we
>   cannot soft-offline it.
>   This is not fatal of course, but it is better if we can close the race,
>   as doing so does not require a lot of code.
> 
> * Hugetlb pages
> 
>   - We isolate-and-migrate them
> 
>   There is no magic in here, we just isolate and migrate them.
>   A new set of internal functions has been added to flag a hugetlb page as
>   poisoned (SetPageHugePoisoned(), PageHugePoisoned(), ClearPageHugePoisoned()).
>   This allows us to flag the page when we migrate it, back in
>   move_hugetlb_state().
>
>   Later on we check whether the page is poisoned in __free_huge_page,
>   and we bail out in that case before sending the page to e.g. the active
>   free list.
>   This gives us full control of the page, and we can handle it in
>   page_handle_poison().
> 
>   In other words, we do not allow migrated hugepages to get back to the
>   freelists.
> 
>   Since now the page has no user and has been migrated, we can call
>   dissolve_free_huge_page, which will end up calling update_and_free_page.
>   In update_and_free_page(), we check for the page to be poisoned.
>   If so, we handle it as we handle gigantic pages, i.e. we break down
>   the page into order-0 pages and free them one by one.
>   Doing so allows free_pages_prepare to skip poisoned pages.
> 
> Because of the way we now handle in-use pages, we no longer need the
> put-as-isolation-migratetype dance, which was guarding against poisoned
> pages ending up in pcplists.

I ran Qian Cai's test program (https://github.com/cailca/linux-mm) on a
small (4GB memory) VM, and weirdly found that (1) the target hugepages
are not always dissolved and (2) dissolved hugepages are still counted
in "HugePages_Total:". See below:

    $ ./random 1
    - start: migrate_huge_offline
    - use NUMA nodes 0,1.
    - mmap and free 8388608 bytes hugepages on node 0
    - mmap and free 8388608 bytes hugepages on node 1
    madvise: Cannot allocate memory

    $ cat /proc/meminfo
    MemTotal:        4026772 kB
    MemFree:          976300 kB
    MemAvailable:     892840 kB
    Buffers:           20936 kB
    Cached:            99768 kB
    SwapCached:         5904 kB
    Active:            84332 kB
    Inactive:         116328 kB
    Active(anon):      27944 kB
    Inactive(anon):    68524 kB
    Active(file):      56388 kB
    Inactive(file):    47804 kB
    Unevictable:        7532 kB
    Mlocked:               0 kB
    SwapTotal:       2621436 kB
    SwapFree:        2609844 kB
    Dirty:                56 kB
    Writeback:             0 kB
    AnonPages:         81764 kB
    Mapped:            54348 kB
    Shmem:              8948 kB
    KReclaimable:      22744 kB
    Slab:              52056 kB
    SReclaimable:      22744 kB
    SUnreclaim:        29312 kB
    KernelStack:        3888 kB
    PageTables:         2804 kB
    NFS_Unstable:          0 kB
    Bounce:                0 kB
    WritebackTmp:          0 kB
    CommitLimit:     3260612 kB
    Committed_AS:     828196 kB
    VmallocTotal:   34359738367 kB
    VmallocUsed:       19260 kB
    VmallocChunk:          0 kB
    Percpu:             5120 kB
    HardwareCorrupted:  5368 kB
    AnonHugePages:     18432 kB
    ShmemHugePages:        0 kB
    ShmemPmdMapped:        0 kB
    FileHugePages:         0 kB
    FilePmdMapped:         0 kB
    CmaTotal:              0 kB
    CmaFree:               0 kB
    HugePages_Total:    1342     // still counted as hugetlb pages.
    HugePages_Free:        0     // all hugepage are still allocated (or leaked?)
    HugePages_Rsvd:        0
    HugePages_Surp:      762     // some are counted in surplus.
    Hugepagesize:       2048 kB
    Hugetlb:         2748416 kB
    DirectMap4k:      112480 kB
    DirectMap2M:     4081664 kB


    $ page-types -b hwpoison
                 flags      page-count       MB  symbolic-flags                     long-symbolic-flags
    0x0000000000080008             421        1  ___U_______________X_______________________      uptodate,hwpoison
    0x00000000000a8018               1        0  ___UD__________H_G_X_______________________      uptodate,dirty,compound_head,huge,hwpoison
    0x00000000000a801c             920        3  __RUD__________H_G_X_______________________      referenced,uptodate,dirty,compound_head,huge,hwpoison
                 total            1342        5

This means that some hugepages are dissolved but others are not,
which is probably not desirable.
I'll dig into this more later, but let me share it first.
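
For anyone who wants to cross-check the hugetlb counters shown above from a
script, a minimal user-space sketch follows. meminfo_field() and
hugetlb_accounting_matches() are hypothetical helpers written for this
example, and the check assumes a single huge page size, since the Hugetlb
line covers all sizes.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdlib.h>
#include <string.h>

/*
 * Look up "Key:" at the start of a line (leading blanks ignored) in
 * meminfo-style text and store the number that follows it.
 */
static bool meminfo_field(const char *text, const char *key,
			  unsigned long *out)
{
	size_t klen = strlen(key);

	assert(text && key && out);

	for (const char *line = text; line && *line; ) {
		const char *p = line + strspn(line, " \t");

		if (strncmp(p, key, klen) == 0 && p[klen] == ':') {
			*out = strtoul(p + klen + 1, NULL, 10);
			return true;
		}
		/* Advance to the character after the next newline, if any. */
		line = strchr(line, '\n');
		if (line)
			line++;
	}
	return false;
}

/* True when Hugetlb (kB) equals HugePages_Total * Hugepagesize (kB). */
static bool hugetlb_accounting_matches(const char *text)
{
	unsigned long total, size_kb, hugetlb_kb;

	if (!meminfo_field(text, "HugePages_Total", &total) ||
	    !meminfo_field(text, "Hugepagesize", &size_kb) ||
	    !meminfo_field(text, "Hugetlb", &hugetlb_kb))
		return false;

	return total * size_kb == hugetlb_kb;
}
```

With the numbers above, 1342 pages of 2048 kB give exactly the 2748416 kB
reported as Hugetlb, which is consistent with the dissolved pages still
being accounted as hugetlb pages.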

A few minor comments below ...

> 
> Signed-off-by: Oscar Salvador <osalvador@suse.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> ---
>  include/linux/page-flags.h |  5 ----
>  mm/hugetlb.c               | 60 +++++++++++++++++++++++++++++++++-----
>  mm/memory-failure.c        | 53 +++++++++++++--------------------
>  mm/migrate.c               | 11 ++-----
>  mm/page_alloc.c            | 38 +++++++-----------------
>  5 files changed, 86 insertions(+), 81 deletions(-)
> 
> diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> index 01baf6d426ff..2ac8bfa0cf20 100644
> --- a/include/linux/page-flags.h
> +++ b/include/linux/page-flags.h
> @@ -426,13 +426,8 @@ PAGEFLAG(HWPoison, hwpoison, PF_ANY)
>  TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
>  #define __PG_HWPOISON (1UL << PG_hwpoison)
>  extern bool take_page_off_buddy(struct page *page);
> -extern bool set_hwpoison_free_buddy_page(struct page *page);
>  #else
>  PAGEFLAG_FALSE(HWPoison)
> -static inline bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> -	return 0;
> -}
>  #define __PG_HWPOISON 0
>  #endif
>  
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 7badb01d15e3..1c6397936512 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -29,6 +29,7 @@
>  #include <linux/numa.h>
>  #include <linux/llist.h>
>  #include <linux/cma.h>
> +#include <linux/migrate.h>
>  
>  #include <asm/page.h>
>  #include <asm/pgalloc.h>
> @@ -1209,9 +1210,26 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
>  		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
>  		nr_nodes--)
>  
> -#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
> -static void destroy_compound_gigantic_page(struct page *page,
> -					unsigned int order)
> +static inline bool PageHugePoisoned(struct page *page)
> +{
> +	if (!PageHuge(page))
> +		return false;
> +
> +	return (unsigned long)page[3].mapping == -1U;
> +}
> +
> +static inline void SetPageHugePoisoned(struct page *page)
> +{
> +	page[3].mapping = (void *)-1U;
> +}
> +
> +static inline void ClearPageHugePoisoned(struct page *page)
> +{
> +	page[3].mapping = NULL;
> +}
> +
> +static void destroy_compound_gigantic_page(struct hstate *h, struct page *page,
> +					   unsigned int order)
>  {
>  	int i;
>  	int nr_pages = 1 << order;
> @@ -1222,14 +1240,19 @@ static void destroy_compound_gigantic_page(struct page *page,
>  		atomic_set(compound_pincount_ptr(page), 0);
>  
>  	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
> +		if (!hstate_is_gigantic(h))
> +			 p->mapping = NULL;
>  		clear_compound_head(p);
>  		set_page_refcounted(p);
>  	}
>  
> +	if (PageHugePoisoned(page))
> +		ClearPageHugePoisoned(page);
>  	set_compound_order(page, 0);
>  	__ClearPageHead(page);
>  }
>  
> +#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
>  static void free_gigantic_page(struct page *page, unsigned int order)
>  {
>  	/*
> @@ -1284,13 +1307,16 @@ static struct page *alloc_gigantic_page(struct hstate *h, gfp_t gfp_mask,
>  	return NULL;
>  }
>  static inline void free_gigantic_page(struct page *page, unsigned int order) { }
> -static inline void destroy_compound_gigantic_page(struct page *page,
> -						unsigned int order) { }
> +static inline void destroy_compound_gigantic_page(struct hstate *h,
> +						  struct page *page,
> +						  unsigned int order) { }
>  #endif
>  
>  static void update_and_free_page(struct hstate *h, struct page *page)
>  {
>  	int i;
> +	bool poisoned = PageHugePoisoned(page);
> +	unsigned int order = huge_page_order(h);
>  
>  	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
>  		return;
> @@ -1313,11 +1339,21 @@ static void update_and_free_page(struct hstate *h, struct page *page)
>  		 * we might block in free_gigantic_page().
>  		 */
>  		spin_unlock(&hugetlb_lock);
> -		destroy_compound_gigantic_page(page, huge_page_order(h));
> -		free_gigantic_page(page, huge_page_order(h));
> +		destroy_compound_gigantic_page(h, page, order);
> +		free_gigantic_page(page, order);
>  		spin_lock(&hugetlb_lock);
>  	} else {
> -		__free_pages(page, huge_page_order(h));
> +		if (unlikely(poisoned)) {
> +			/*
> +			 * If the hugepage is poisoned, do as we do for
> +			 * gigantic pages and free the pages as order-0.
> +			 * free_pages_prepare will skip over the poisoned ones.
> +			 */
> +			destroy_compound_gigantic_page(h, page, order);

Judging by its name, this function is for gigantic pages, so it shouldn't be
called for non-gigantic huge pages. Renaming it and/or introducing an inner
helper to factor out the common part would be better.

> +			free_contig_range(page_to_pfn(page), 1 << order);
> +		} else {
> +			__free_pages(page, huge_page_order(h));
> +		}
>  	}
>  }
>  
> @@ -1427,6 +1463,11 @@ static void __free_huge_page(struct page *page)
>  	if (restore_reserve)
>  		h->resv_huge_pages++;
>  
> +	if (PageHugePoisoned(page)) {
> +		spin_unlock(&hugetlb_lock);
> +		return;
> +	}
> +
>  	if (PageHugeTemporary(page)) {
>  		list_del(&page->lru);
>  		ClearPageHugeTemporary(page);
> @@ -5642,6 +5683,9 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
>  	hugetlb_cgroup_migrate(oldpage, newpage);
>  	set_page_owner_migrate_reason(newpage, reason);
>  
> +	if (reason == MR_MEMORY_FAILURE)
> +		SetPageHugePoisoned(oldpage);
> +
>  	/*
>  	 * transfer temporary state of the new huge page. This is
>  	 * reverse to other transitions because the newpage is going to
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index caf012d34607..c0ebab4eed4c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -65,9 +65,17 @@ int sysctl_memory_failure_recovery __read_mostly = 1;
>  
>  atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>  
> -static void page_handle_poison(struct page *page)
> +static void page_handle_poison(struct page *page, bool release, bool set_flag,
> +			       bool huge_flag)
>  {
> -	SetPageHWPoison(page);
> +	if (set_flag)
> +		SetPageHWPoison(page);
> +
> +        if (huge_flag)
> +		dissolve_free_huge_page(page);
> +        else if (release)
> +		put_page(page);
> +

Indentation seems to be broken; you can run checkpatch.pl for details.

Thanks,
Naoya Horiguchi

>  	page_ref_inc(page);
>  	num_poisoned_pages_inc();
>  }
> @@ -1717,7 +1725,7 @@ static int get_any_page(struct page *page, unsigned long pfn)
>  
>  static int soft_offline_huge_page(struct page *page)
>  {
> -	int ret;
> +	int ret = -EBUSY;
>  	unsigned long pfn = page_to_pfn(page);
>  	struct page *hpage = compound_head(page);
>  	LIST_HEAD(pagelist);
> @@ -1757,19 +1765,12 @@ static int soft_offline_huge_page(struct page *page)
>  			ret = -EIO;
>  	} else {
>  		/*
> -		 * We set PG_hwpoison only when the migration source hugepage
> -		 * was successfully dissolved, because otherwise hwpoisoned
> -		 * hugepage remains on free hugepage list, then userspace will
> -		 * find it as SIGBUS by allocation failure. That's not expected
> -		 * in soft-offlining.
> +		 * At this point the page cannot be in-use since we do not
> +		 * let the page to go back to hugetlb freelists.
> +		 * In that case we just need to dissolve it.
> +		 * page_handle_poison will take care of it.
>  		 */
> -		ret = dissolve_free_huge_page(page);
> -		if (!ret) {
> -			if (set_hwpoison_free_buddy_page(page))
> -				num_poisoned_pages_inc();
> -			else
> -				ret = -EBUSY;
> -		}
> +		page_handle_poison(page, true, true, true);
>  	}
>  	return ret;
>  }
> @@ -1804,10 +1805,8 @@ static int __soft_offline_page(struct page *page)
>  	 * would need to fix isolation locking first.
>  	 */
>  	if (ret == 1) {
> -		put_page(page);
>  		pr_info("soft_offline: %#lx: invalidated\n", pfn);
> -		SetPageHWPoison(page);
> -		num_poisoned_pages_inc();
> +		page_handle_poison(page, true, true, false);
>  		return 0;
>  	}
>  
> @@ -1838,7 +1837,9 @@ static int __soft_offline_page(struct page *page)
>  		list_add(&page->lru, &pagelist);
>  		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
>  					MIGRATE_SYNC, MR_MEMORY_FAILURE);
> -		if (ret) {
> +		if (!ret) {
> +			page_handle_poison(page, true, true, false);
> +		} else {
>  			if (!list_empty(&pagelist))
>  				putback_movable_pages(&pagelist);
>  
> @@ -1857,37 +1858,25 @@ static int __soft_offline_page(struct page *page)
>  static int soft_offline_in_use_page(struct page *page)
>  {
>  	int ret;
> -	int mt;
>  	struct page *hpage = compound_head(page);
>  
>  	if (!PageHuge(page) && PageTransHuge(hpage))
>  		if (try_to_split_thp_page(page, "soft offline") < 0)
>  			return -EBUSY;
>  
> -	/*
> -	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
> -	 * to free list immediately (not via pcplist) when released after
> -	 * successful page migration. Otherwise we can't guarantee that the
> -	 * page is really free after put_page() returns, so
> -	 * set_hwpoison_free_buddy_page() highly likely fails.
> -	 */
> -	mt = get_pageblock_migratetype(page);
> -	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>  	if (PageHuge(page))
>  		ret = soft_offline_huge_page(page);
>  	else
>  		ret = __soft_offline_page(page);
> -	set_pageblock_migratetype(page, mt);
>  	return ret;
>  }
>  
>  static int soft_offline_free_page(struct page *page)
>  {
>  	int rc = -EBUSY;
> -	int rc = dissolve_free_huge_page(page);
>  
>  	if (!dissolve_free_huge_page(page) && take_page_off_buddy(page)) {
> -		page_handle_poison(page);
> +		page_handle_poison(page, false, true, false);
>  		rc = 0;
>  	}
>  
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 75c10d81e833..a68d81d0ae6e 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1222,16 +1222,11 @@ static int unmap_and_move(new_page_t get_new_page,
>  	 * we want to retry.
>  	 */
>  	if (rc == MIGRATEPAGE_SUCCESS) {
> -		put_page(page);
> -		if (reason == MR_MEMORY_FAILURE) {
> +		if (reason != MR_MEMORY_FAILURE)
>  			/*
> -			 * Set PG_HWPoison on just freed page
> -			 * intentionally. Although it's rather weird,
> -			 * it's how HWPoison flag works at the moment.
> +			 * We handle poisoned pages in page_handle_poison.
>  			 */
> -			if (set_hwpoison_free_buddy_page(page))
> -				num_poisoned_pages_inc();
> -		}
> +			put_page(page);
>  	} else {
>  		if (rc != -EAGAIN) {
>  			if (likely(!__PageMovable(page))) {
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 4fa0e0887c07..11df51fc2718 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1175,6 +1175,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
>  
>  	trace_mm_page_free(page, order);
>  
> +	if (unlikely(PageHWPoison(page)) && !order) {
> +		/*
> +		 * Untie memcg state and reset page's owner
> +		 */
> +		if (memcg_kmem_enabled() && PageKmemcg(page))
> +			__memcg_kmem_uncharge_page(page, order);
> +		reset_page_owner(page, order);
> +		return false;
> +	}
> +
>  	/*
>  	 * Check tail pages before head page information is cleared to
>  	 * avoid checking PageCompound for order-0 pages.
> @@ -8844,32 +8854,4 @@ bool take_page_off_buddy(struct page *page)
>  	spin_unlock_irqrestore(&zone->lock, flags);
>  	return ret;
>  }
> -
> -/*
> - * Set PG_hwpoison flag if a given page is confirmed to be a free page.  This
> - * test is performed under the zone lock to prevent a race against page
> - * allocation.
> - */
> -bool set_hwpoison_free_buddy_page(struct page *page)
> -{
> -	struct zone *zone = page_zone(page);
> -	unsigned long pfn = page_to_pfn(page);
> -	unsigned long flags;
> -	unsigned int order;
> -	bool hwpoisoned = false;
> -
> -	spin_lock_irqsave(&zone->lock, flags);
> -	for (order = 0; order < MAX_ORDER; order++) {
> -		struct page *page_head = page - (pfn & ((1 << order) - 1));
> -
> -		if (PageBuddy(page_head) && page_order(page_head) >= order) {
> -			if (!TestSetPageHWPoison(page))
> -				hwpoisoned = true;
> -			break;
> -		}
> -	}
> -	spin_unlock_irqrestore(&zone->lock, flags);
> -
> -	return hwpoisoned;
> -}
>  #endif
> -- 
> 2.26.2
> 
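For readers following the page_alloc.c hunk quoted above, here is a minimal user-space sketch of its idea: a poisoned order-0 page is turned away in the free path, so it never reaches the pcplists or the buddy free lists. This is a toy model, not kernel code. `struct toy_page`, `toy_free_pages_prepare()` and `toy_free_page()` are hypothetical names, and the memcg uncharge and page-owner reset done by the real hunk are left out.

```c
#include <stdbool.h>

/* Toy stand-in for struct page; both fields are illustrative only. */
struct toy_page {
	bool hwpoison;     /* stands in for PageHWPoison() */
	bool on_freelist;  /* true once the page is back on a (toy) free list */
};

/*
 * Models the early return added to free_pages_prepare():
 * returns false when the page must not be handed back to the allocator.
 */
bool toy_free_pages_prepare(const struct toy_page *page, unsigned int order)
{
	if (page->hwpoison && order == 0)
		return false;	/* poisoned order-0 page: keep it contained */
	return true;
}

/* Toy free path: only pages that pass the prepare step get freed. */
void toy_free_page(struct toy_page *page, unsigned int order)
{
	if (!toy_free_pages_prepare(page, order))
		return;
	page->on_freelist = true;
}
```

Freeing a `toy_page` with `hwpoison` set leaves `on_freelist` false, which is the containment property the cover letter describes.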


* Re: [PATCH v4 00/15] Hwpoison soft-offline rework
  2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
                   ` (15 preceding siblings ...)
  2020-07-16 12:38 ` [PATCH] x86/speculation: Add basic IBPB (Indirect Branch Prediction Barrier) support Oscar Salvador
@ 2020-07-17 14:49 ` Qian Cai
  16 siblings, 0 replies; 24+ messages in thread
From: Qian Cai @ 2020-07-17 14:49 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, mhocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	naoya.horiguchi, linux-kernel

On Thu, Jul 16, 2020 at 02:37:54PM +0200, Oscar Salvador wrote:
> Hi all,
> 
> this is a follow-up version on [1].
> That version had some flaws wrt. handling hugetlb pages, so this version
> fixes it.
> I checked that the case reported by Qian seems to work fine now.

I am still getting EIO from madvise on some x86 NUMA systems with next-20200717
which includes this patchset.

# git clone https://gitlab.com/cailca/linux-mm
# cd linux-mm; make

# ./random 1
- start: migrate_huge_offline
- use NUMA nodes 0,3.
- mmap and free 8388608 bytes hugepages on node 0
- mmap and free 8388608 bytes hugepages on node 3
madvise: Input/output error

== serial console output ==
[  100.149531][ T1644] Soft offlining pfn 0x3a5000 at process virtual address 0x7f1bf7e00000
[  100.193804][ T1644] Soft offlining pfn 0x3a5200 at process virtual address 0x7f1bf8000000
[  100.263446][ T1644] Soft offlining pfn 0x1fa3e00 at process virtual address 0x7f1bf7e00000
[  100.302745][ T1644] __get_any_page: 0x1fa3e00 free huge page
[  100.330226][ T1644] Soft offlining pfn 0x3bd600 at process virtual address 0x7f1bf8000000
[  100.373717][ T1644] Soft offlining pfn 0x202dc00 at process virtual address 0x7f1bf7e00000
[  100.414605][ T1644] Soft offlining pfn 0x1fa3c00 at process virtual address 0x7f1bf8000000
[  100.457675][ T1644] Soft offlining pfn 0x1fd3a00 at process virtual address 0x7f1bf7e00000
[  100.498519][ T1644] Soft offlining pfn 0x1fd3800 at process virtual address 0x7f1bf8000000
[  100.541750][ T1644] Soft offlining pfn 0x1fa3800 at process virtual address 0x7f1bf7e00000
[  100.582207][ T1644] Soft offlining pfn 0x1ffde00 at process virtual address 0x7f1bf8000000
[  100.625221][ T1644] Soft offlining pfn 0x1f7fe00 at process virtual address 0x7f1bf7e00000
[  100.665768][ T1644] Soft offlining pfn 0x1f7fc00 at process virtual address 0x7f1bf8000000
[  100.712181][ T1644] Soft offlining pfn 0x202d800 at process virtual address 0x7f1bf7e00000
[  100.753815][ T1644] Soft offlining pfn 0x205da00 at process virtual address 0x7f1bf8000000
[  100.796807][ T1644] Soft offlining pfn 0x1f7fa00 at process virtual address 0x7f1bf7e00000
[  100.837403][ T1644] Soft offlining pfn 0x1f7f800 at process virtual address 0x7f1bf8000000
[  100.880442][ T1644] Soft offlining pfn 0x1ffd800 at process virtual address 0x7f1bf7e00000
[  100.921584][ T1644] Soft offlining pfn 0x1fd3600 at process virtual address 0x7f1bf8000000
[  100.964444][ T1644] Soft offlining pfn 0x1fa3600 at process virtual address 0x7f1bf7e00000
[  101.005009][ T1644] Soft offlining pfn 0x1fa3400 at process virtual address 0x7f1bf8000000
[  101.047984][ T1644] Soft offlining pfn 0x1f7f400 at process virtual address 0x7f1bf7e00000
[  101.088665][ T1644] Soft offlining pfn 0x205d600 at process virtual address 0x7f1bf8000000
[  101.131368][ T1644] Soft offlining pfn 0x202d600 at process virtual address 0x7f1bf7e00000
[  101.171717][ T1644] Soft offlining pfn 0x202d400 at process virtual address 0x7f1bf8000000
[  101.216780][ T1644] Soft offlining pfn 0x1fa3000 at process virtual address 0x7f1bf7e00000
[  101.258755][ T1644] Soft offlining pfn 0x1fd3200 at process virtual address 0x7f1bf8000000
[  101.302585][ T1644] Soft offlining pfn 0x1ffd600 at process virtual address 0x7f1bf7e00000
[  101.344729][ T1644] Soft offlining pfn 0x1ffd400 at process virtual address 0x7f1bf8000000
[  101.388958][ T1644] Soft offlining pfn 0x205d000 at process virtual address 0x7f1bf7e00000
[  101.430995][ T1644] Soft offlining pfn 0x1f7f200 at process virtual address 0x7f1bf8000000
[  101.474513][ T1644] Soft offlining pfn 0x1fd2e00 at process virtual address 0x7f1bf7e00000
[  101.515333][ T1644] Soft offlining pfn 0x1fd2c00 at process virtual address 0x7f1bf8000000
[  101.558119][ T1644] Soft offlining pfn 0x1fa2c00 at process virtual address 0x7f1bf7e00000
[  101.600051][ T1644] Soft offlining pfn 0x202d200 at process virtual address 0x7f1bf8000000
[  101.643046][ T1644] Soft offlining pfn 0x1ffd200 at process virtual address 0x7f1bf7e00000
[  101.683842][ T1644] Soft offlining pfn 0x1ffd000 at process virtual address 0x7f1bf8000000
[  101.730551][ T1644] Soft offlining pfn 0x205cc00 at process virtual address 0x7f1bf7e00000
[  101.772575][ T1644] Soft offlining pfn 0x1f7ee00 at process virtual address 0x7f1bf8000000
[  101.818438][ T1644] Soft offlining pfn 0x1fd2a00 at process virtual address 0x7f1bf7e00000
[  101.861488][ T1644] Soft offlining pfn 0x1fd2800 at process virtual address 0x7f1bf8000000
[  101.904410][ T1644] Soft offlining pfn 0x1fa2800 at process virtual address 0x7f1bf7e00000
[  101.946639][ T1644] Soft offlining pfn 0x202ce00 at process virtual address 0x7f1bf8000000
[  101.989523][ T1644] Soft offlining pfn 0x1ffce00 at process virtual address 0x7f1bf7e00000
[  102.030092][ T1644] Soft offlining pfn 0x1ffcc00 at process virtual address 0x7f1bf8000000
[  102.076592][ T1644] Soft offlining pfn 0x437600 at process virtual address 0x7f1bf7e00000
[  102.116941][ T1644] Soft offlining pfn 0x433800 at process virtual address 0x7f1bf8000000
[  102.161314][ T1644] Soft offlining pfn 0x1f7e800 at process virtual address 0x7f1bf7e00000
[  102.200495][ T1644] Soft offlining pfn 0x1fa2a00 at process virtual address 0x7f1bf8000000
[  102.247260][ T1644] Soft offlining pfn 0x1fd2600 at process virtual address 0x7f1bf7e00000
[  102.290189][ T1644] Soft offlining pfn 0x1fd2400 at process virtual address 0x7f1bf8000000
[  102.328558][ T1644] __get_any_page: 0x1fd2400: unknown zero refcount page type 3bfffc000000000
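A quick way to sanity-check the pfns in a log like the one above is to test whether each one sits on a hugepage boundary. The sketch below assumes 4 kB base pages (an assumption; the log does not state it) and 2048 kB hugepages (as in the Hugepagesize field of /proc/meminfo), which gives 512 (0x200) pfns per hugepage. The helper names are hypothetical.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumptions: 4 kB base pages, 2048 kB hugepages. */
#define BASE_PAGE_KB   4u
#define HUGE_PAGE_KB   2048u
#define PFNS_PER_HPAGE (HUGE_PAGE_KB / BASE_PAGE_KB)	/* 512 == 0x200 */

/* True if pfn is the first base page of a hugepage-sized, aligned block. */
bool pfn_is_hugepage_aligned(uint64_t pfn)
{
	return (pfn & (PFNS_PER_HPAGE - 1)) == 0;
}

/* Round a pfn down to the start of its hugepage-sized block. */
uint64_t pfn_to_hugepage_head(uint64_t pfn)
{
	return pfn & ~(uint64_t)(PFNS_PER_HPAGE - 1);
}
```

Under these assumptions, the pfns in the log (for example 0x3a5000 and 0x1fd2400) are all multiples of 0x200, which is consistent with each message referring to the head of a 2 MB hugepage.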

# numactl -H
available: 4 nodes (0-3)
node 0 cpus: 0 1 2 3 4 5 24 25 26 27 28 29
node 0 size: 15646 MB
node 0 free: 14779 MB
node 1 cpus: 6 7 8 9 10 11 30 31 32 33 34 35
node 1 size: 31966 MB
node 1 free: 29825 MB
node 2 cpus: 12 13 14 15 16 17 36 37 38 39 40 41
node 2 size: 32253 MB
node 2 free: 31029 MB
node 3 cpus: 18 19 20 21 22 23 42 43 44 45 46 47
node 3 size: 32252 MB
node 3 free: 31360 MB
node distances:
node   0   1   2   3 
  0:  10  21  21  21 
  1:  21  10  21  21 
  2:  21  21  10  21 
  3:  21  21  21  10

# cat /proc/meminfo 
MemTotal:       114809324 kB
MemFree:        109554164 kB
MemAvailable:   109256800 kB
Buffers:            5376 kB
Cached:           166932 kB
SwapCached:            0 kB
Active:           119804 kB
Inactive:         107788 kB
Active(anon):      56356 kB
Inactive(anon):     9308 kB
Active(file):      63448 kB
Inactive(file):    98480 kB
Unevictable:           0 kB
Mlocked:               0 kB
SwapTotal:       4194300 kB
SwapFree:        4194300 kB
Dirty:               108 kB
Writeback:             0 kB
AnonPages:         55296 kB
Mapped:            82252 kB
Shmem:             10380 kB
KReclaimable:     114784 kB
Slab:            4401424 kB
SReclaimable:     114784 kB
SUnreclaim:      4286640 kB
KernelStack:       18560 kB
PageTables:         5648 kB
NFS_Unstable:          0 kB
Bounce:                0 kB
WritebackTmp:          0 kB
CommitLimit:    61547760 kB
Committed_AS:     286384 kB
VmallocTotal:   34359738367 kB
VmallocUsed:      168220 kB
VmallocChunk:          0 kB
Percpu:            39424 kB
HardwareCorrupted:   196 kB
AnonHugePages:     10240 kB
ShmemHugePages:        0 kB
ShmemPmdMapped:        0 kB
FileHugePages:         0 kB
FilePmdMapped:         0 kB
HugePages_Total:      50
HugePages_Free:        2
HugePages_Rsvd:        0
HugePages_Surp:       25
Hugepagesize:       2048 kB
Hugetlb:          102400 kB
DirectMap4k:      578044 kB
DirectMap2M:    22358016 kB
DirectMap1G:    113246208 kB

> 
> Cover letter:
> 
> This patchset was initially based on Naoya's hwpoison rework [1], so
> thanks to him for the initial work.
> I would also like to thank Naoya for testing the patchset off-line
> and reporting any issues he found; that was quite helpful.
> 
> This patchset aims to fix some issues in soft-offline handling,
> but it also takes the chance to perform some cleanups
> and refactoring as well.
> 
> 
>  - Motivation:
> 
>    A customer and I were facing an issue where processes were killed
>    after having soft-offlined some of their pages.
>    This should not happen when soft-offlining, as it is meant to be non-disruptive.
>    I was able to reproduce the issue by stressing memory while
>    soft-offlining pages at the same time.
> 
>    After debugging the issue, I saw that the problem was that pages were returned
>    back to user-space after having offlined them properly.
>    So, when those pages were faulted in, the fault handler returned VM_FAULT_POISON
>    all the way down to the arch handler, and it simply killed the process.
> 
>    After further analysis, it became clear that the problem was that when
>    kcompactd kicked in to migrate pages over, compaction_alloc callback
>    was handing poisoned pages to the migrate routine.
> 
>    All this could happen because isolate_freepages_block and
>    fast_isolate_freepages just check for the page to be PageBuddy,
>    and since 1) poisoned pages can be part of a higher order page
>    and 2) poisoned pages are also Page Buddy, they can sneak in easily.
> 
>    I also saw some other problems with swap pages, but I suspected it
>    to be the same sort of problem, so I did not follow that trace.
> 
>    The above refers to soft-offline.
>    But I also saw problems with hard-offline, especially hugetlb corruption,
>    and some other weird stuff. (I could paste the logs)
> 
>    The full explanation referring to the soft-offline case can be found at [2].
> 
>  - Approach:
> 
>    The approach taken is to contain those pages and never let them hit
>    either pcplists or buddy freelists.
>    Only when they are completely out of reach do we flag them as poisoned.
> 
>    A full explanation of this can be found in patch#11 and patch#12.
> 
>  - Outcome:
> 
>    With this patchset, I no longer see the issues with soft-offline.
> 
> [1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
> [2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
> 
> Naoya Horiguchi (6):
>   mm,hwpoison: cleanup unused PageHuge() check
>   mm, hwpoison: remove recalculating hpage
>   mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
>   mm,hwpoison-inject: don't pin for hwpoison_filter
>   mm,hwpoison: remove MF_COUNT_INCREASED
>   mm,hwpoison: remove flag argument from soft offline functions
> 
> Oscar Salvador (9):
>   mm,madvise: Refactor madvise_inject_error
>   mm,hwpoison: Un-export get_hwpoison_page and make it static
>   mm,hwpoison: Kill put_hwpoison_page
>   mm,hwpoison: Unify THP handling for hard and soft offline
>   mm,hwpoison: Rework soft offline for free pages
>   mm,hwpoison: Rework soft offline for in-use pages
>   mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
>   mm,hwpoison: Return 0 if the page is already poisoned in soft-offline
>   mm,hwpoison: introduce MF_MSG_UNSPLIT_THP
> 
>  drivers/base/memory.c      |   2 +-
>  include/linux/mm.h         |  12 +-
>  include/linux/page-flags.h |   6 +-
>  include/ras/ras_event.h    |   3 +
>  mm/hugetlb.c               |  60 +++++++-
>  mm/hwpoison-inject.c       |  18 +--
>  mm/madvise.c               |  37 ++---
>  mm/memory-failure.c        | 307 +++++++++++++++----------------------
>  mm/migrate.c               |  11 +-
>  mm/page_alloc.c            |  70 +++++++--
>  10 files changed, 270 insertions(+), 256 deletions(-)
> 
> -- 
> 2.26.2
> 
> 


* Re: [PATCH v4 12/15] mm,hwpoison: Rework soft offline for in-use pages
       [not found]   ` <f7387d64d0024d15a1bc821a8e19b8f0@DB7PR04MB5180.eurprd04.prod.outlook.com>
@ 2020-07-20  8:27     ` osalvador
  2020-07-22  8:08       ` osalvador
  0 siblings, 1 reply; 24+ messages in thread
From: osalvador @ 2020-07-20  8:27 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, Michal Hocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	linux-kernel, Oscar Salvador

On 2020-07-17 08:55, HORIGUCHI NAOYA wrote:
> I ran Qian Cai's test program (https://github.com/cailca/linux-mm) on a
> small (4GB memory) VM, and weirdly found that (1) the target hugepages
> are not always dissolved and (2) dissolved hugepages are still counted
> in "HugePages_Total:". See below:
> 
>     $ ./random 1
>     - start: migrate_huge_offline
>     - use NUMA nodes 0,1.
>     - mmap and free 8388608 bytes hugepages on node 0
>     - mmap and free 8388608 bytes hugepages on node 1
>     madvise: Cannot allocate memory
> 
>     $ cat /proc/meminfo
>     MemTotal:        4026772 kB
>     MemFree:          976300 kB
>     MemAvailable:     892840 kB
>     Buffers:           20936 kB
>     Cached:            99768 kB
>     SwapCached:         5904 kB
>     Active:            84332 kB
>     Inactive:         116328 kB
>     Active(anon):      27944 kB
>     Inactive(anon):    68524 kB
>     Active(file):      56388 kB
>     Inactive(file):    47804 kB
>     Unevictable:        7532 kB
>     Mlocked:               0 kB
>     SwapTotal:       2621436 kB
>     SwapFree:        2609844 kB
>     Dirty:                56 kB
>     Writeback:             0 kB
>     AnonPages:         81764 kB
>     Mapped:            54348 kB
>     Shmem:              8948 kB
>     KReclaimable:      22744 kB
>     Slab:              52056 kB
>     SReclaimable:      22744 kB
>     SUnreclaim:        29312 kB
>     KernelStack:        3888 kB
>     PageTables:         2804 kB
>     NFS_Unstable:          0 kB
>     Bounce:                0 kB
>     WritebackTmp:          0 kB
>     CommitLimit:     3260612 kB
>     Committed_AS:     828196 kB
>     VmallocTotal:   34359738367 kB
>     VmallocUsed:       19260 kB
>     VmallocChunk:          0 kB
>     Percpu:             5120 kB
>     HardwareCorrupted:  5368 kB
>     AnonHugePages:     18432 kB
>     ShmemHugePages:        0 kB
>     ShmemPmdMapped:        0 kB
>     FileHugePages:         0 kB
>     FilePmdMapped:         0 kB
>     CmaTotal:              0 kB
>     CmaFree:               0 kB
>     HugePages_Total:    1342     // still counted as hugetlb pages.
>     HugePages_Free:        0     // all hugepages are still allocated (or leaked?)
>     HugePages_Rsvd:        0
>     HugePages_Surp:      762     // some are counted in surplus.
>     Hugepagesize:       2048 kB
>     Hugetlb:         2748416 kB
>     DirectMap4k:      112480 kB
>     DirectMap2M:     4081664 kB
> 
> 
>     $ page-types -b hwpoison
>                  flags      page-count       MB  symbolic-flags                                   long-symbolic-flags
>     0x0000000000080008             421        1  ___U_______________X_______________________      uptodate,hwpoison
>     0x00000000000a8018               1        0  ___UD__________H_G_X_______________________      uptodate,dirty,compound_head,huge,hwpoison
>     0x00000000000a801c             920        3  __RUD__________H_G_X_______________________      referenced,uptodate,dirty,compound_head,huge,hwpoison
>                  total            1342        5
> 
> This means that some hugepages are dissolved but others are not,
> which is probably not desirable.
> I'll dig into this more later, but let me share it first.
> 
> A few minor comments below ...


Uhm, weird.

I will be taking a look today.

Thanks
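The figures in the quoted report can be cross-checked with a little arithmetic: HugePages_Total times Hugepagesize should equal Hugetlb when only one hugepage size is in use, and HardwareCorrupted divided by the base page size gives the number of pages counted as poisoned. The sketch below is only an illustration; the struct and function names are hypothetical, and the 4 kB base page size is an assumption.

```c
#include <stdbool.h>
#include <stdint.h>

/* Field values copied by hand from a /proc/meminfo dump. */
struct hugetlb_meminfo {
	uint64_t hugepages_total;	/* HugePages_Total (a count) */
	uint64_t hugepagesize_kb;	/* Hugepagesize, in kB */
	uint64_t hugetlb_kb;		/* Hugetlb, in kB */
	uint64_t hw_corrupted_kb;	/* HardwareCorrupted, in kB */
};

/* True if the total count and size account for the Hugetlb figure. */
bool hugetlb_total_consistent(const struct hugetlb_meminfo *m)
{
	return m->hugepages_total * m->hugepagesize_kb == m->hugetlb_kb;
}

/* Number of base pages behind the HardwareCorrupted figure. */
uint64_t poisoned_base_pages(const struct hugetlb_meminfo *m,
			     uint64_t base_page_kb)
{
	return m->hw_corrupted_kb / base_page_kb;
}
```

With the values from the report (1342, 2048, 2748416, 5368) the first check holds, since 1342 * 2048 = 2748416, and 5368 kB / 4 kB gives 1342 pages, the same as the "total" line of the page-types output. So the dissolved hugepages do appear to be still accounted as hugetlb pages, as the report says.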



* Re: [PATCH v4 12/15] mm,hwpoison: Rework soft offline for in-use pages
  2020-07-20  8:27     ` osalvador
@ 2020-07-22  8:08       ` osalvador
  2020-07-23 10:19         ` Oscar Salvador
  0 siblings, 1 reply; 24+ messages in thread
From: osalvador @ 2020-07-22  8:08 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, Michal Hocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	linux-kernel, Oscar Salvador

On 2020-07-20 10:27, osalvador@suse.de wrote:
> On 2020-07-17 08:55, HORIGUCHI NAOYA wrote:
>> I ran Qian Cai's test program (https://github.com/cailca/linux-mm) on a
>> small (4GB memory) VM, and weirdly found that (1) the target hugepages
>> are not always dissolved and (2) dissolved hugepages are still counted
>> in "HugePages_Total:". See below:
>> 
>>     $ ./random 1
>>     - start: migrate_huge_offline
>>     - use NUMA nodes 0,1.
>>     - mmap and free 8388608 bytes hugepages on node 0
>>     - mmap and free 8388608 bytes hugepages on node 1
>>     madvise: Cannot allocate memory
>> 
>>     $ cat /proc/meminfo
>>     MemTotal:        4026772 kB
>>     MemFree:          976300 kB
>>     MemAvailable:     892840 kB
>>     Buffers:           20936 kB
>>     Cached:            99768 kB
>>     SwapCached:         5904 kB
>>     Active:            84332 kB
>>     Inactive:         116328 kB
>>     Active(anon):      27944 kB
>>     Inactive(anon):    68524 kB
>>     Active(file):      56388 kB
>>     Inactive(file):    47804 kB
>>     Unevictable:        7532 kB
>>     Mlocked:               0 kB
>>     SwapTotal:       2621436 kB
>>     SwapFree:        2609844 kB
>>     Dirty:                56 kB
>>     Writeback:             0 kB
>>     AnonPages:         81764 kB
>>     Mapped:            54348 kB
>>     Shmem:              8948 kB
>>     KReclaimable:      22744 kB
>>     Slab:              52056 kB
>>     SReclaimable:      22744 kB
>>     SUnreclaim:        29312 kB
>>     KernelStack:        3888 kB
>>     PageTables:         2804 kB
>>     NFS_Unstable:          0 kB
>>     Bounce:                0 kB
>>     WritebackTmp:          0 kB
>>     CommitLimit:     3260612 kB
>>     Committed_AS:     828196 kB
>>     VmallocTotal:   34359738367 kB
>>     VmallocUsed:       19260 kB
>>     VmallocChunk:          0 kB
>>     Percpu:             5120 kB
>>     HardwareCorrupted:  5368 kB
>>     AnonHugePages:     18432 kB
>>     ShmemHugePages:        0 kB
>>     ShmemPmdMapped:        0 kB
>>     FileHugePages:         0 kB
>>     FilePmdMapped:         0 kB
>>     CmaTotal:              0 kB
>>     CmaFree:               0 kB
>>     HugePages_Total:    1342     // still counted as hugetlb pages.
>>     HugePages_Free:        0     // all hugepages are still allocated (or leaked?)
>>     HugePages_Rsvd:        0
>>     HugePages_Surp:      762     // some are counted in surplus.
>>     Hugepagesize:       2048 kB
>>     Hugetlb:         2748416 kB
>>     DirectMap4k:      112480 kB
>>     DirectMap2M:     4081664 kB
>> 
>> 
>>     $ page-types -b hwpoison
>>                  flags      page-count       MB  symbolic-flags                                   long-symbolic-flags
>>     0x0000000000080008             421        1  ___U_______________X_______________________      uptodate,hwpoison
>>     0x00000000000a8018               1        0  ___UD__________H_G_X_______________________      uptodate,dirty,compound_head,huge,hwpoison
>>     0x00000000000a801c             920        3  __RUD__________H_G_X_______________________      referenced,uptodate,dirty,compound_head,huge,hwpoison
>>                  total            1342        5
>> 
>> This means that some hugepages are dissolved but others are not,
>> which is probably not desirable.
>> I'll dig into this more later, but let me share it first.
>> 
>> A few minor comments below ...
> 
> 
> Uhm, weird.
> 
> I will be taking a look today.

After some digging, I __think__ I found the problem.
I will try to fix it and then run some tests.

I might reach out to you once I am done because I remember you had a 
test-suite that worked quite well, so you can give it a spin there.

Thanks




* Re: [PATCH v4 12/15] mm,hwpoison: Rework soft offline for in-use pages
  2020-07-22  8:08       ` osalvador
@ 2020-07-23 10:19         ` Oscar Salvador
  0 siblings, 0 replies; 24+ messages in thread
From: Oscar Salvador @ 2020-07-23 10:19 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, Michal Hocko, linux-mm, mike.kravetz, david, aneesh.kumar,
	linux-kernel, Oscar Salvador

On Wed, Jul 22, 2020 at 10:08:59AM +0200, osalvador@suse.de wrote:
> On 2020-07-20 10:27, osalvador@suse.de wrote:
> > > This means that some hugepages are dissolved but others are not,
> > > which is probably not desirable.
> > > I'll dig into this more later, but let me share it first.
> > > 
> > > A few minor comments below ...
> > 
> > 
> > Uhm, weird.
> > 
> > I will be taking a look today.
> 
> After some digging, I __think__ I found the problem.
> I will try to fix it and then run some tests.

I found the problem.
I re-ran the tests with both small and large memory, and
the stats look correct this time.

After some more testing, I also fixed a list corruption that was happening
due to the same problem.

I am creating a git branch so you can re-run your tests on it as well.

-- 
Oscar Salvador
SUSE L3


end of thread, other threads:[~2020-07-23 10:19 UTC | newest]

Thread overview: 24+ messages
2020-07-16 12:37 [PATCH v4 00/15] Hwpoison soft-offline rework Oscar Salvador
2020-07-16 12:37 ` [PATCH v4 01/15] mm,hwpoison: cleanup unused PageHuge() check Oscar Salvador
2020-07-16 12:37 ` [PATCH v4 02/15] mm, hwpoison: remove recalculating hpage Oscar Salvador
2020-07-16 12:37 ` [PATCH v4 03/15] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
2020-07-16 23:15   ` Mike Kravetz
2020-07-16 12:37 ` [PATCH v4 04/15] mm,madvise: Refactor madvise_inject_error Oscar Salvador
2020-07-16 12:37 ` [PATCH v4 05/15] mm,hwpoison-inject: don't pin for hwpoison_filter Oscar Salvador
2020-07-16 12:38 ` [PATCH v4 06/15] mm,hwpoison: Un-export get_hwpoison_page and make it static Oscar Salvador
2020-07-16 12:38 ` [PATCH v4 07/15] mm,hwpoison: Kill put_hwpoison_page Oscar Salvador
2020-07-16 12:38 ` [PATCH v4 08/15] mm,hwpoison: remove MF_COUNT_INCREASED Oscar Salvador
2020-07-16 12:38 ` [PATCH v4 09/15] mm,hwpoison: remove flag argument from soft offline functions Oscar Salvador
2020-07-16 12:38 ` [PATCH v4 10/15] mm,hwpoison: Unify THP handling for hard and soft offline Oscar Salvador
2020-07-16 12:38 ` [PATCH v4 11/15] mm,hwpoison: Rework soft offline for free pages Oscar Salvador
2020-07-16 12:38 ` [PATCH v4 12/15] mm,hwpoison: Rework soft offline for in-use pages Oscar Salvador
2020-07-17  6:55   ` HORIGUCHI NAOYA(堀口 直也)
     [not found]   ` <f7387d64d0024d15a1bc821a8e19b8f0@DB7PR04MB5180.eurprd04.prod.outlook.com>
2020-07-20  8:27     ` osalvador
2020-07-22  8:08       ` osalvador
2020-07-23 10:19         ` Oscar Salvador
2020-07-16 12:38 ` [PATCH v4 13/15] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page Oscar Salvador
2020-07-16 12:38 ` [PATCH v4 14/15] mm,hwpoison: Return 0 if the page is already poisoned in soft-offline Oscar Salvador
2020-07-16 12:38 ` [PATCH v4 15/15] mm,hwpoison: introduce MF_MSG_UNSPLIT_THP Oscar Salvador
2020-07-16 12:38 ` [PATCH] x86/speculation: Add basic IBPB (Indirect Branch Prediction Barrier) support Oscar Salvador
2020-07-16 12:43   ` osalvador
2020-07-17 14:49 ` [PATCH v4 00/15] Hwpoison soft-offline rework Qian Cai
