linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH v4 0/7] HWpoison: further fixes and cleanups
@ 2020-09-17  8:10 Oscar Salvador
  2020-09-17  8:10 ` [PATCH v4 1/7] mm,hwpoison: take free pages off the buddy freelists Oscar Salvador
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Oscar Salvador @ 2020-09-17  8:10 UTC (permalink / raw)
  To: akpm
  Cc: aris, naoya.horiguchi, mhocko, tony.luck, cai, linux-kernel,
	linux-mm, Oscar Salvador

This patchset includes some fixups (patch#1,patch#2 and patch#3)
and some cleanups (patch#4-7).

Patch#1 is a fix to take off HWPoison pages off a buddy freelist since
it can lead us to having HWPoison pages back in the game without no one
noticing it.
So fix it (we did that already for soft_offline_page [1]).

Patch#2 is fixing a rebasing problem that made the call
to page_handle_poison from _soft_offline_page set the
wrong value for hugepage_or_freepage. [2]

Patch#3 is not really a fixup, but tries to re-handle a page
in case it was allocated under us.

[1] https://patchwork.kernel.org/cover/11704083/
[2] https://patchwork.kernel.org/comment/23619775/

Thanks

Oscar Salvador (7):
  mm,hwpoison: take free pages off the buddy freelists
  mm,hwpoison: Do not set hugepage_or_freepage unconditionally
  mm,hwpoison: Try to narrow window race for free pages
  mm,hwpoison: refactor madvise_inject_error
  mm,hwpoison: drain pcplists before bailing out for non-buddy
    zero-refcount page
  mm,hwpoison: drop unneeded pcplist draining
  mm,hwpoison: remove stale code

 mm/madvise.c        | 36 +++++++++++----------------
 mm/memory-failure.c | 59 +++++++++++++++++++++++++++++++++------------
 2 files changed, 58 insertions(+), 37 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 18+ messages in thread

* [PATCH v4 1/7] mm,hwpoison: take free pages off the buddy freelists
  2020-09-17  8:10 [PATCH v4 0/7] HWpoison: further fixes and cleanups Oscar Salvador
@ 2020-09-17  8:10 ` Oscar Salvador
  2020-09-25  2:22   ` HORIGUCHI NAOYA(堀口 直也)
  2020-09-17  8:10 ` [PATCH v4 2/7] mm,hwpoison: Do not set hugepage_or_freepage unconditionally Oscar Salvador
                   ` (6 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Oscar Salvador @ 2020-09-17  8:10 UTC (permalink / raw)
  To: akpm
  Cc: aris, naoya.horiguchi, mhocko, tony.luck, cai, linux-kernel,
	linux-mm, Oscar Salvador

The crux of the matter is that historically we left poisoned pages in the
buddy system because we have some checks in place when allocating a page
that a gatekeeper for poisoned pages.  Unfortunately, we do have other
users (e.g: compaction [1]) that scan buddy freelists and try to get a
page from there without checking whether the page is HWPoison.

As I stated already, I think it is fundamentally wrong to keep HWPoison
pages within the buddy systems, checks in place or not.

Let us fix this we same way we did for soft_offline [2], and take the page
off the buddy freelist, so it is completely unreachable.

Note that this is fairly simple to trigger, as we only need to poison free
buddy pages (madvise MADV_HWPOISON) and then we need to run some sort of
memory stress system.

Just for a matter of reference, I put a dump_page in compaction_alloc to
trigger for HWPoison patches:

kernel: page:0000000012b2982b refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1d5db
kernel: flags: 0xfffffc0800000(hwpoison)
kernel: raw: 000fffffc0800000 ffffea00007573c8 ffffc90000857de0 0000000000000000
kernel: raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000
kernel: page dumped because: compaction_alloc

kernel: CPU: 4 PID: 123 Comm: kcompactd0 Tainted: G            E     5.9.0-rc2-mm1-1-default+ #5
kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
kernel: Call Trace:
kernel:  dump_stack+0x6d/0x8b
kernel:  compaction_alloc+0xb2/0xc0
kernel:  migrate_pages+0x2a6/0x12a0
kernel:  ? isolate_freepages+0xc80/0xc80
kernel:  ? __ClearPageMovable+0xb0/0xb0
kernel:  compact_zone+0x5eb/0x11c0
kernel:  ? finish_task_switch+0x74/0x300
kernel:  ? lock_timer_base+0xa8/0x170
kernel:  proactive_compact_node+0x89/0xf0
kernel:  ? kcompactd+0x2d0/0x3a0
kernel:  kcompactd+0x2d0/0x3a0
kernel:  ? finish_wait+0x80/0x80
kernel:  ? kcompactd_do_work+0x350/0x350
kernel:  kthread+0x118/0x130
kernel:  ? kthread_associate_blkcg+0xa0/0xa0
kernel:  ret_from_fork+0x22/0x30

After that, if e.g: someone faults in the page, that someone will get
killed unexpectedly.

While we are at it, I also changed the action result for such cases.
I think that MF_DELAYED is a bit misleading, because in case we could
contain the page and take it off the buddy, such a page is not to be used
again unless it is unpoisoned, so we fixed the situation.

So unless I am missing something, I strongly think that we should report
MF_RECOVERED.

[1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
[2] https://patchwork.kernel.org/patch/11694847/

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 28 ++++++++++++++++++++++------
 1 file changed, 22 insertions(+), 6 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4eb3c42ffe35..74a53881f94b 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1323,8 +1323,9 @@ int memory_failure(unsigned long pfn, int flags)
 	struct page *hpage;
 	struct page *orig_head;
 	struct dev_pagemap *pgmap;
-	int res;
 	unsigned long page_flags;
+	int res = 0;
+	bool retry = true;
 
 	if (!sysctl_memory_failure_recovery)
 		panic("Memory failure on page %lx", pfn);
@@ -1364,10 +1365,21 @@ int memory_failure(unsigned long pfn, int flags)
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
+try_again:
 	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
 		if (is_free_buddy_page(p)) {
-			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
-			return 0;
+			if (take_page_off_buddy(p)) {
+				action_result(pfn, MF_MSG_BUDDY, MF_RECOVERED);
+				return 0;
+			} else {
+				/* We lost the race, try again */
+				if (retry) {
+					retry = false;
+					goto try_again;
+				}
+				action_result(pfn, MF_MSG_BUDDY, MF_FAILED);
+				return -EBUSY;
+			}
 		} else {
 			action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
 			return -EBUSY;
@@ -1393,11 +1405,15 @@ int memory_failure(unsigned long pfn, int flags)
 	shake_page(p, 0);
 	/* shake_page could have turned it free. */
 	if (!PageLRU(p) && is_free_buddy_page(p)) {
+		if (!take_page_off_buddy(p))
+			res = -EBUSY;
+
 		if (flags & MF_COUNT_INCREASED)
-			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
+			action_result(pfn, MF_MSG_BUDDY, res ? MF_FAILED : MF_RECOVERED);
 		else
-			action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
-		return 0;
+			action_result(pfn, MF_MSG_BUDDY_2ND, res ? MF_FAILED : MF_RECOVERED);
+
+		return res;
 	}
 
 	lock_page(p);
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 2/7] mm,hwpoison: Do not set hugepage_or_freepage unconditionally
  2020-09-17  8:10 [PATCH v4 0/7] HWpoison: further fixes and cleanups Oscar Salvador
  2020-09-17  8:10 ` [PATCH v4 1/7] mm,hwpoison: take free pages off the buddy freelists Oscar Salvador
@ 2020-09-17  8:10 ` Oscar Salvador
  2020-09-18 19:26   ` Aristeu Rozanski
  2020-09-17  8:10 ` [PATCH v4 3/7] mm,hwpoison: Try to narrow window race for free pages Oscar Salvador
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Oscar Salvador @ 2020-09-17  8:10 UTC (permalink / raw)
  To: akpm
  Cc: aris, naoya.horiguchi, mhocko, tony.luck, cai, linux-kernel,
	linux-mm, Oscar Salvador

Aristeu Rozanski reported that some hwpoison tests
were returning -EBUSY while before the rework they succeed [1] [2].

The root case is that during a rebase, the call to page_handle_poison was
setting hugepage_or_freepage = true unconditionally from __soft_offline_page.

Aristeu said that after this fix, everything works.

[1] https://patchwork.kernel.org/comment/23617301/
[2] https://patchwork.kernel.org/comment/23619535/

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reported-by: Aristeu Rozanski <aris@ruivo.org>
---
 mm/memory-failure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 74a53881f94b..db61bdee9734 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1852,7 +1852,7 @@ static int __soft_offline_page(struct page *page)
 		if (!ret) {
 			bool release = !huge;
 
-			if (!page_handle_poison(page, true, release))
+			if (!page_handle_poison(page, huge, release))
 				ret = -EBUSY;
 		} else {
 			if (!list_empty(&pagelist))
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 3/7] mm,hwpoison: Try to narrow window race for free pages
  2020-09-17  8:10 [PATCH v4 0/7] HWpoison: further fixes and cleanups Oscar Salvador
  2020-09-17  8:10 ` [PATCH v4 1/7] mm,hwpoison: take free pages off the buddy freelists Oscar Salvador
  2020-09-17  8:10 ` [PATCH v4 2/7] mm,hwpoison: Do not set hugepage_or_freepage unconditionally Oscar Salvador
@ 2020-09-17  8:10 ` Oscar Salvador
  2020-09-18 19:27   ` Aristeu Rozanski
  2020-09-17  8:10 ` [PATCH v4 4/7] mm,hwpoison: refactor madvise_inject_error Oscar Salvador
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Oscar Salvador @ 2020-09-17  8:10 UTC (permalink / raw)
  To: akpm
  Cc: aris, naoya.horiguchi, mhocko, tony.luck, cai, linux-kernel,
	linux-mm, Oscar Salvador

Aristeu Rozanski reported that a customer test case started
to report -EBUSY after the hwpoison report patchset.

There is a race window between spotting a free page and taking it off
its buddy freelist, so it might be that by the time we try to take it off,
the page has been already allocated.

This patch tries to handle such race window by trying to handle the new
type of page again if the page was allocated under us.

After this patch, Aristeu said the test cases work properly.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Reported-by: Aristeu Rozanski <aris@ruivo.org>
---
 mm/memory-failure.c | 7 ++++++-
 1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index db61bdee9734..a2ccd3ba4015 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1917,6 +1917,7 @@ int soft_offline_page(unsigned long pfn, int flags)
 {
 	int ret;
 	struct page *page;
+	bool try_again = true;
 
 	if (!pfn_valid(pfn))
 		return -ENXIO;
@@ -1932,6 +1933,7 @@ int soft_offline_page(unsigned long pfn, int flags)
 		return 0;
 	}
 
+retry:
 	get_online_mems();
 	ret = get_any_page(page, pfn, flags);
 	put_online_mems();
@@ -1939,7 +1941,10 @@ int soft_offline_page(unsigned long pfn, int flags)
 	if (ret > 0)
 		ret = soft_offline_in_use_page(page);
 	else if (ret == 0)
-		ret = soft_offline_free_page(page);
+		if (soft_offline_free_page(page) && try_again) {
+			try_again = false;
+			goto retry;
+		}
 
 	return ret;
 }
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 4/7] mm,hwpoison: refactor madvise_inject_error
  2020-09-17  8:10 [PATCH v4 0/7] HWpoison: further fixes and cleanups Oscar Salvador
                   ` (2 preceding siblings ...)
  2020-09-17  8:10 ` [PATCH v4 3/7] mm,hwpoison: Try to narrow window race for free pages Oscar Salvador
@ 2020-09-17  8:10 ` Oscar Salvador
  2020-09-17  8:10 ` [PATCH v4 5/7] mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page Oscar Salvador
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Oscar Salvador @ 2020-09-17  8:10 UTC (permalink / raw)
  To: akpm
  Cc: aris, naoya.horiguchi, mhocko, tony.luck, cai, linux-kernel,
	linux-mm, Oscar Salvador, Oscar Salvador

Make a proper if-else condition for {hard,soft}-offline.

[akpm: remove zone variable and refactor comment]
Signed-off-by: Oscar Salvador <osalvador@suse.com>
---
 mm/madvise.c | 32 ++++++++++++++------------------
 1 file changed, 14 insertions(+), 18 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 4d87b31a1576..4a48f7215195 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -869,8 +869,6 @@ static long madvise_remove(struct vm_area_struct *vma,
 static int madvise_inject_error(int behavior,
 		unsigned long start, unsigned long end)
 {
-	struct page *page;
-	struct zone *zone;
 	unsigned long size;
 
 	if (!capable(CAP_SYS_ADMIN))
@@ -879,6 +877,7 @@ static int madvise_inject_error(int behavior,
 
 	for (; start < end; start += size) {
 		unsigned long pfn;
+		struct page *page;
 		int ret;
 
 		ret = get_user_pages_fast(start, 1, 0, &page);
@@ -895,25 +894,22 @@ static int madvise_inject_error(int behavior,
 
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
-					pfn, start);
-
+				 pfn, start);
 			ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
-			if (ret)
-				return ret;
-			continue;
+		} else {
+			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
+				 pfn, start);
+			/*
+			 * Drop the page reference taken by
+			 * get_user_pages_fast(). In the absence of
+			 * MF_COUNT_INCREASED the memory_failure() routine is
+			 * responsible for pinning the page to prevent it
+			 * from being released back to the page allocator.
+			 */
+			put_page(page);
+			ret = memory_failure(pfn, 0);
 		}
 
-		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
-				pfn, start);
-
-		/*
-		 * Drop the page reference taken by get_user_pages_fast(). In
-		 * the absence of MF_COUNT_INCREASED the memory_failure()
-		 * routine is responsible for pinning the page to prevent it
-		 * from being released back to the page allocator.
-		 */
-		put_page(page);
-		ret = memory_failure(pfn, 0);
 		if (ret)
 			return ret;
 	}
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 5/7] mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page
  2020-09-17  8:10 [PATCH v4 0/7] HWpoison: further fixes and cleanups Oscar Salvador
                   ` (3 preceding siblings ...)
  2020-09-17  8:10 ` [PATCH v4 4/7] mm,hwpoison: refactor madvise_inject_error Oscar Salvador
@ 2020-09-17  8:10 ` Oscar Salvador
  2020-09-25  2:22   ` HORIGUCHI NAOYA(堀口 直也)
  2020-09-17  8:10 ` [PATCH v4 6/7] mm,hwpoison: drop unneeded pcplist draining Oscar Salvador
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Oscar Salvador @ 2020-09-17  8:10 UTC (permalink / raw)
  To: akpm
  Cc: aris, naoya.horiguchi, mhocko, tony.luck, cai, linux-kernel,
	linux-mm, Oscar Salvador

A page with 0-refcount and !PageBuddy could perfectly be a pcppage.
Currently, we bail out with an error if we encounter such a page, meaning
that we do not handle pcppages neither from hard-offline nor from
soft-offline path.

Fix this by draining pcplists whenever we find this kind of page and retry
the check again.  It might be that pcplists have been spilled into the
buddy allocator and so we can handle it.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 24 ++++++++++++++++++++++--
 1 file changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index a2ccd3ba4015..7fba4ba201d5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -950,13 +950,13 @@ static int page_action(struct page_state *ps, struct page *p,
 }
 
 /**
- * get_hwpoison_page() - Get refcount for memory error handling:
+ * __get_hwpoison_page() - Get refcount for memory error handling:
  * @page:	raw error page (hit by memory error)
  *
  * Return: return 0 if failed to grab the refcount, otherwise true (some
  * non-zero value.)
  */
-static int get_hwpoison_page(struct page *page)
+static int __get_hwpoison_page(struct page *page)
 {
 	struct page *head = compound_head(page);
 
@@ -986,6 +986,26 @@ static int get_hwpoison_page(struct page *page)
 	return 0;
 }
 
+static int get_hwpoison_page(struct page *p)
+{
+	int ret;
+	bool drained = false;
+
+retry:
+	ret = __get_hwpoison_page(p);
+	if (!ret && !is_free_buddy_page(p) && !page_count(p) && !drained) {
+		/*
+		 * The page might be in a pcplist, so try to drain those
+		 * and see if we are lucky.
+		 */
+		drain_all_pages(page_zone(p));
+		drained = true;
+		goto retry;
+	}
+
+	return ret;
+}
+
 /*
  * Do all that is necessary to remove user space mappings. Unmap
  * the pages and send SIGBUS to the processes if the data was dirty.
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 6/7] mm,hwpoison: drop unneeded pcplist draining
  2020-09-17  8:10 [PATCH v4 0/7] HWpoison: further fixes and cleanups Oscar Salvador
                   ` (4 preceding siblings ...)
  2020-09-17  8:10 ` [PATCH v4 5/7] mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page Oscar Salvador
@ 2020-09-17  8:10 ` Oscar Salvador
  2020-09-17  8:10 ` [PATCH v4 7/7] mm,hwpoison: remove stale code Oscar Salvador
  2020-09-17 11:39 ` [PATCH v4 0/7] HWpoison: further fixes and cleanups HORIGUCHI NAOYA(堀口 直也)
  7 siblings, 0 replies; 18+ messages in thread
From: Oscar Salvador @ 2020-09-17  8:10 UTC (permalink / raw)
  To: akpm
  Cc: aris, naoya.horiguchi, mhocko, tony.luck, cai, linux-kernel,
	linux-mm, Oscar Salvador

memory_failure and soft_offline_path paths now drain pcplists by calling
get_hwpoison_page.

memory_failure flags the page as HWPoison before, so that page cannot
longer go into a pcplist, and soft_offline_page only flags a page as
HWPoison if 1) we took the page off a buddy freelist 2) the page was
in-use and we migrated it 3) was a clean pagecache.

Because of that, a page cannot longer be poisoned and be in a pcplist.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/madvise.c | 4 ----
 1 file changed, 4 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 4a48f7215195..302f3a84d17c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -914,10 +914,6 @@ static int madvise_inject_error(int behavior,
 			return ret;
 	}
 
-	/* Ensure that all poisoned pages are removed from per-cpu lists */
-	for_each_populated_zone(zone)
-		drain_all_pages(zone);
-
 	return 0;
 }
 #endif
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* [PATCH v4 7/7] mm,hwpoison: remove stale code
  2020-09-17  8:10 [PATCH v4 0/7] HWpoison: further fixes and cleanups Oscar Salvador
                   ` (5 preceding siblings ...)
  2020-09-17  8:10 ` [PATCH v4 6/7] mm,hwpoison: drop unneeded pcplist draining Oscar Salvador
@ 2020-09-17  8:10 ` Oscar Salvador
  2020-09-17 11:39 ` [PATCH v4 0/7] HWpoison: further fixes and cleanups HORIGUCHI NAOYA(堀口 直也)
  7 siblings, 0 replies; 18+ messages in thread
From: Oscar Salvador @ 2020-09-17  8:10 UTC (permalink / raw)
  To: akpm
  Cc: aris, naoya.horiguchi, mhocko, tony.luck, cai, linux-kernel,
	linux-mm, Oscar Salvador

Currently we call shake_page and then check whether the page is Buddy
because shake_page calls drain_all_pages, which sends pcp-pages back to
the buddy freelists, so we could have a chance to handle free pages.

get_hwpoison_page already calls drain_all_pages, and we do call
get_hwpoison_page right before coming here, so we should be on the safe
side.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 12 ------------
 1 file changed, 12 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7fba4ba201d5..f68cb5e3b320 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1423,18 +1423,6 @@ int memory_failure(unsigned long pfn, int flags)
 	 * walked by the page reclaim code, however that's not a big loss.
 	 */
 	shake_page(p, 0);
-	/* shake_page could have turned it free. */
-	if (!PageLRU(p) && is_free_buddy_page(p)) {
-		if (!take_page_off_buddy(p))
-			res = -EBUSY;
-
-		if (flags & MF_COUNT_INCREASED)
-			action_result(pfn, MF_MSG_BUDDY, res ? MF_FAILED : MF_RECOVERED);
-		else
-			action_result(pfn, MF_MSG_BUDDY_2ND, res ? MF_FAILED : MF_RECOVERED);
-
-		return res;
-	}
 
 	lock_page(p);
 
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 0/7] HWpoison: further fixes and cleanups
  2020-09-17  8:10 [PATCH v4 0/7] HWpoison: further fixes and cleanups Oscar Salvador
                   ` (6 preceding siblings ...)
  2020-09-17  8:10 ` [PATCH v4 7/7] mm,hwpoison: remove stale code Oscar Salvador
@ 2020-09-17 11:39 ` HORIGUCHI NAOYA(堀口 直也)
  2020-09-17 13:09   ` Oscar Salvador
  7 siblings, 1 reply; 18+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-09-17 11:39 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: akpm, aris, mhocko, tony.luck, cai, linux-kernel, linux-mm

On Thu, Sep 17, 2020 at 10:10:42AM +0200, Oscar Salvador wrote:
> This patchset includes some fixups (patch#1,patch#2 and patch#3)
> and some cleanups (patch#4-7).
> 
> Patch#1 is a fix to take off HWPoison pages off a buddy freelist since
> it can lead us to having HWPoison pages back in the game without no one
> noticing it.
> So fix it (we did that already for soft_offline_page [1]).
> 
> Patch#2 is fixing a rebasing problem that made the call
> to page_handle_poison from _soft_offline_page set the
> wrong value for hugepage_or_freepage. [2]
> 
> Patch#3 is not really a fixup, but tries to re-handle a page
> in case it was allocated under us.

Thanks for the update.
This patchset triggers the following BUG_ON() with Aristeu's workload:

    [ 1010.400900] Soft offlining pfn 0xbff8c at process virtual address 0x7fe6c99c8000
    [ 1010.402931] page:00000000f5670686 refcount:1 mapcount:-128 mapping:0000000000000000 index:0x1 pfn:0xbff89
    [ 1010.405604] flags: 0xfffe000800000(hwpoison)
    [ 1010.406755] raw: 000fffe000800000 ffffcddf029ab848 ffffcddf02ff9448 0000000000000000
    [ 1010.408824] raw: 0000000000000001 0000000000000000 00000001ffffff7f 0000000000000000
    [ 1010.410877] page dumped because: VM_BUG_ON_PAGE(page_count(buddy) != 0)
    [ 1010.412673] ------------[ cut here ]------------
    [ 1010.413930] kernel BUG at mm/page_alloc.c:800!
    [ 1010.415143] invalid opcode: 0000 [#1] SMP PTI
    [ 1010.416320] CPU: 3 PID: 1340 Comm: kworker/3:0 Not tainted 5.9.0-rc2-mm1-v5.9-rc2-200917-1952-00212-gf1a0765b04cb+ #33
    [ 1010.419101] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-20190727_073836-buildvm-ppc64le-16.ppc.fedoraproject.org-3.fc31 04/01/2014
    [ 1010.422645] Workqueue: mm_percpu_wq drain_local_pages_wq
    [ 1010.424075] RIP: 0010:__free_one_page+0x552/0x580
    [ 1010.425344] Code: 48 c7 c6 90 6c 0f 84 4c 89 e7 e8 69 7e fd ff 0f 0b 0f 1f 44 00 00 e9 e5 fc ff ff 48 c7 c6 c8 f3 11 84 4c 89 f7 e8 4e 7e fd ff <0f> 0b 83 fb 08 0f 86 cb fc ff ff 48 83 c4 20 5b 5d 41 5c 41 5d 41
    [ 1010.430231] RSP: 0018:ffffaa96c171fda0 EFLAGS: 00010082
    [ 1010.431651] RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
    [ 1010.433598] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffff8dc8bbd18d08
    [ 1010.435627] RBP: 00000000000bff88 R08: ffff8dc8bbd18d00 R09: 6573756163656220
    [ 1010.437544] R10: 6163656220646570 R11: 6d75642065676170 R12: ffffcddf02ffe200
    [ 1010.439376] R13: 00000000000bff89 R14: ffffcddf02ffe240 R15: ffff8dc7bffd5680
    [ 1010.441271] FS:  0000000000000000(0000) GS:ffff8dc8bbd00000(0000) knlGS:0000000000000000
    [ 1010.443349] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
    [ 1010.444892] CR2: 00007f6b69f92000 CR3: 0000000139c4a000 CR4: 00000000001506e0
    [ 1010.446746] Call Trace:
    [ 1010.447424]  free_pcppages_bulk+0x1d4/0x2c0
    [ 1010.448553]  drain_pages_zone+0x42/0x50
    [ 1010.449585]  drain_local_pages_wq+0xe/0x10
    [ 1010.450702]  process_one_work+0x1b0/0x360
    [ 1010.451769]  worker_thread+0x50/0x3a0
    [ 1010.452940]  ? process_one_work+0x360/0x360
    [ 1010.454072]  kthread+0xfe/0x140
    [ 1010.454989]  ? kthread_park+0x90/0x90
    [ 1010.455970]  ret_from_fork+0x22/0x30

This message seems to show that the pages to be moved to buddy have refcount.
Could you review how changes in v3 -> v4 make it?

Here's my reproducer.

    [build1:~]$ cat test_ksm_madv_soft.c
    #include <stdio.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <unistd.h>
    #include <sys/types.h>
    #include <errno.h>
    #include <stdlib.h>
    
    #define MADV_SOFT_OFFLINE 101
    
    #define err(x) perror(x),exit(EXIT_FAILURE)
    
    int main() {
            int ret;
            int size = 100000*0x1000;
    
            char *p1 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            printf("p1 %p\n", p1);
            char *p2 = mmap(NULL, size, PROT_READ | PROT_WRITE, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
            printf("p2 %p\n", p2);
    
            ret = madvise(p1, size, MADV_MERGEABLE);
            printf("madvise(p1) %d\n", ret);
            ret = madvise(p2, size, MADV_MERGEABLE);
            printf("madvise(p2) %d\n", ret);
    
            printf("writing p1 ... ");
            memset(p1, 'a', size);
            printf("done\n");
            printf("writing p2 ... ");
            memset(p2, 'a', size);
            printf("done\n");
    
            usleep(10000000);
            printf("soft offline\n");
            ret = madvise(p1, size, MADV_SOFT_OFFLINE);
            printf("soft offline returns %d\n", ret);
            if (ret)
                    err("madvise");
    
            madvise(p1, size, MADV_UNMERGEABLE);
            madvise(p2, size, MADV_UNMERGEABLE);
    
            printf("OK\n");
    }
    
    [build1:~/upstream/mm_regression/lib]$ cat tmp_run_ksm_madv.sh
    
    rm test_ksm_madv_soft 2> /dev/null
    gcc -o test_ksm_madv_soft test_ksm_madv_soft.c || exit 1
    
    echo 0 > /sys/kernel/mm/ksm/sleep_millisecs
    echo 100000 > /sys/kernel/mm/ksm/pages_to_scan
    echo 100000 > /sys/kernel/mm/ksm/max_page_sharing
    echo 2 > /sys/kernel/mm/ksm/run
    echo 1 > /sys/kernel/mm/ksm/run
    
    ./test_ksm_madv_soft

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 0/7] HWpoison: further fixes and cleanups
  2020-09-17 11:39 ` [PATCH v4 0/7] HWpoison: further fixes and cleanups HORIGUCHI NAOYA(堀口 直也)
@ 2020-09-17 13:09   ` Oscar Salvador
  2020-09-17 13:40     ` Oscar Salvador
  0 siblings, 1 reply; 18+ messages in thread
From: Oscar Salvador @ 2020-09-17 13:09 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, aris, mhocko, tony.luck, cai, linux-kernel, linux-mm

On Thu, Sep 17, 2020 at 11:39:21AM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> Thanks for the update.
> This patchset triggers the following BUG_ON() with Aristeu's workload:

I just took a look, but I found more oddities.
The patchset you sent seems to be a bit buggy and it is missing some things
my patchset contains, e.g:

static __always_inline bool free_pages_prepare(struct page *page,
					unsigned int order, bool check_free)
{
	...
	if (unlikely(PageHWPoison(page)) && !order) {
	...
		return false;
}

Moreover, in page_handle_poison, you managed it wrong because:

static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
{
        if (release) {
                put_page(page);
                drain_all_pages(page_zone(page));
        }

	...
        SetPageHWPoison(page);
        page_ref_inc(page);

1) You are freeing the page first, which means it goes to buddy
2) Then you set it as poisoned and you update its refcount.

Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and that is quite wrong.

Honestly, I do not know how your patchset diverged so much from mine, but is
not right.

I will go over my patchset and yours and compare/fix things.

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 0/7] HWpoison: further fixes and cleanups
  2020-09-17 13:09   ` Oscar Salvador
@ 2020-09-17 13:40     ` Oscar Salvador
  2020-09-17 15:27       ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 18+ messages in thread
From: Oscar Salvador @ 2020-09-17 13:40 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, aris, mhocko, tony.luck, cai, linux-kernel, linux-mm

On Thu, Sep 17, 2020 at 03:09:52PM +0200, Oscar Salvador wrote:
> static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
> {
>         if (release) {
>                 put_page(page);
>                 drain_all_pages(page_zone(page));
>         }
> 
> 	...
>         SetPageHWPoison(page);
>         page_ref_inc(page);
> 
> 1) You are freeing the page first, which means it goes to buddy
> 2) Then you set it as poisoned and you update its refcount.
> 
> Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and that is quite wrong.

Hi Naoya,

Ok, I tested it and with the following changes on top I cannot reproduce the issue:

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f68cb5e3b320..4ffaaa5c2603 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -67,11 +67,6 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
 
 static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
 {
-	if (release) {
-		put_page(page);
-		drain_all_pages(page_zone(page));
-	}
-
 	if (hugepage_or_freepage) {
 		/*
 		 * Doing this check for free pages is also fine since dissolve_free_huge_page
@@ -89,6 +84,12 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
 	}
 
 	SetPageHWPoison(page);
+
+	if (release) {
+                put_page(page);
+                drain_all_pages(page_zone(page));
+        }
+
 	page_ref_inc(page);
 	num_poisoned_pages_inc();
 	return true;
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index 0d9f9bd0e06c..8a6ab05f074c 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1173,6 +1173,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
 
 	trace_mm_page_free(page, order);
 
+	if (unlikely(PageHWPoison(page)) && !order) {
+		/*
+		 * Untie memcg state and reset page's owner
+		 */
+		if (memcg_kmem_enabled() && PageKmemcg(page))
+			__memcg_kmem_uncharge_page(page, order);
+		reset_page_owner(page, order);
+		return false;
+	}
+
 	/*
 	 * Check tail pages before head page information is cleared to
 	 * avoid checking PageCompound for order-0 pages.


# sh tmp_run_ksm_madv.sh 
p1 0x7f6b6b08e000
p2 0x7f6b529ee000
madvise(p1) 0
madvise(p2) 0
writing p1 ... done
writing p2 ... done
soft offline
soft offline returns 0
OK


Can you try to re-run your tests and see if they come clean?
If they do, I will try to see if Andrew can squezee above changes into [1],
where they belong to.

Otherwise I will craft a v5 containing all fixups (pretty unfortunate).

[1] https://patchwork.kernel.org/patch/11704099/

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply related	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 0/7] HWpoison: further fixes and cleanups
  2020-09-17 13:40     ` Oscar Salvador
@ 2020-09-17 15:27       ` HORIGUCHI NAOYA(堀口 直也)
  2020-09-18  5:49         ` osalvador
  2020-09-18 19:25         ` aris
  0 siblings, 2 replies; 18+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-09-17 15:27 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: akpm, aris, mhocko, tony.luck, cai, linux-kernel, linux-mm

On Thu, Sep 17, 2020 at 03:40:06PM +0200, Oscar Salvador wrote:
> On Thu, Sep 17, 2020 at 03:09:52PM +0200, Oscar Salvador wrote:
> > static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
> > {
> >         if (release) {
> >                 put_page(page);
> >                 drain_all_pages(page_zone(page));
> >         }
> >
> > 	...
> >         SetPageHWPoison(page);
> >         page_ref_inc(page);
> >
> > 1) You are freeing the page first, which means it goes to buddy
> > 2) Then you set it as poisoned and you update its refcount.
> >
> > Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and that is quite wrong.
>
> Hi Naoya,
>
> Ok, I tested it and with the following changes on top I cannot reproduce the issue:
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index f68cb5e3b320..4ffaaa5c2603 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -67,11 +67,6 @@ atomic_long_t num_poisoned_pages __read_mostly = ATOMIC_LONG_INIT(0);
>
>  static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
>  {
> -	if (release) {
> -		put_page(page);
> -		drain_all_pages(page_zone(page));
> -	}
> -
>  	if (hugepage_or_freepage) {
>  		/*
>  		 * Doing this check for free pages is also fine since dissolve_free_huge_page
> @@ -89,6 +84,12 @@ static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, boo
>  	}
>
>  	SetPageHWPoison(page);
> +
> +	if (release) {
> +                put_page(page);
> +                drain_all_pages(page_zone(page));
> +        }
> +
>  	page_ref_inc(page);
>  	num_poisoned_pages_inc();
>  	return true;
> diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> index 0d9f9bd0e06c..8a6ab05f074c 100644
> --- a/mm/page_alloc.c
> +++ b/mm/page_alloc.c
> @@ -1173,6 +1173,16 @@ static __always_inline bool free_pages_prepare(struct page *page,
>
>  	trace_mm_page_free(page, order);
>
> +	if (unlikely(PageHWPoison(page)) && !order) {
> +		/*
> +		 * Untie memcg state and reset page's owner
> +		 */
> +		if (memcg_kmem_enabled() && PageKmemcg(page))
> +			__memcg_kmem_uncharge_page(page, order);
> +		reset_page_owner(page, order);
> +		return false;
> +	}
> +

Sorry, I modified the patches based on the different assumption from yours.
I firstly thought of taking page off after confirming the error page
is freed back to buddy. This approach leaves the possibility of reusing
the error page (which is acceptable), but simpler and less invasive one.

Your approach removes the error page from page allocator's control in
freeing time. It has no possibility of reusing the error page but changes
are tightly coupled with page free code.

This is a tradeoff between complexity and completeness of soft offline,
Now I'm not sure I could persist on my own opinion without providing
working code, and it's OK for me to take your one.

>  	/*
>  	 * Check tail pages before head page information is cleared to
>  	 * avoid checking PageCompound for order-0 pages.
>
>

...
(let me reply to older email here...)
> static bool page_handle_poison(struct page *page, bool hugepage_or_freepage, bool release)
> {
>         if (release) {
>                 put_page(page);
>                 drain_all_pages(page_zone(page));
>         }
>
>       ...
>         SetPageHWPoison(page);
>         page_ref_inc(page);
>
> 1) You are freeing the page first, which means it goes to buddy
> 2) Then you set it as poisoned and you update its refcount.
>
> Now we have a page sitting in Buddy with a refcount = 1 and poisoned, and that is quite wrong.

This order was correct in my approach. Isolation operation is done
after confirming it's in the free list. This prevents PageHWPoison pages
from being in pcpclists. But the page_ref_inc() may not be necessary in my approach.


> # sh tmp_run_ksm_madv.sh
> p1 0x7f6b6b08e000
> p2 0x7f6b529ee000
> madvise(p1) 0
> madvise(p2) 0
> writing p1 ... done
> writing p2 ... done
> soft offline
> soft offline returns 0
> OK
>
>
> Can you try to re-run your tests and see if they come clean?

The test passed in my environment, so this is fine.

> If they do, I will try to see if Andrew can squezee above changes into [1],
> where they belong to.

Yes, proposing the fix for mmhwpoison-rework-soft-offline-for-in-use-pages.patch
seems fine to me.

Again, sorry for modifying code without asking.

Naoya Horiguchi

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 0/7] HWpoison: further fixes and cleanups
  2020-09-17 15:27       ` HORIGUCHI NAOYA(堀口 直也)
@ 2020-09-18  5:49         ` osalvador
  2020-09-18 19:25         ` aris
  1 sibling, 0 replies; 18+ messages in thread
From: osalvador @ 2020-09-18  5:49 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, aris, mhocko, tony.luck, cai, linux-kernel, linux-mm

On 2020-09-17 17:27, HORIGUCHI NAOYA wrote:
> Sorry, I modified the patches based on the different assumption from 
> yours.
> I firstly thought of taking page off after confirming the error page
> is freed back to buddy. This approach leaves the possibility of reusing
> the error page (which is acceptable), but simpler and less invasive 
> one.
> 
> Your approach removes the error page from page allocator's control in
> freeing time. It has no possibility of reusing the error page but 
> changes
> are tightly coupled with page free code.
> 
> This is a tradeoff between complexity and completeness of soft offline,
> Now I'm not sure I could persist on my own opinion without providing
> working code, and it's OK for me to take your one.

Yeah, you are right it is a trade off.
I would suggest taking this path now, and if it proofs to be problematic 
in some way, we can always
do the:

free_page
  take_it_off_buddy
   OK: mark it as hwpoison and increment refcount
   NOT_OK (raced with allocation): oops, sorry

> The test passed in my environment, so this is fine.

Thanks for trying it out.

> 
>> If they do, I will try to see if Andrew can squezee above changes into 
>> [1],
>> where they belong to.
> 
> Yes, proposing the fix for 
> mmhwpoison-rework-soft-offline-for-in-use-pages.patch
> seems fine to me.
> 
> Again, sorry for modifying code without asking.

No worries, I wil do a couple of tests on my own and then I will talk to 
Andrew to see if we can squeeze the changes in there.



^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 0/7] HWpoison: further fixes and cleanups
  2020-09-17 15:27       ` HORIGUCHI NAOYA(堀口 直也)
  2020-09-18  5:49         ` osalvador
@ 2020-09-18 19:25         ` aris
  1 sibling, 0 replies; 18+ messages in thread
From: aris @ 2020-09-18 19:25 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Oscar Salvador, akpm, mhocko, tony.luck, cai, linux-kernel, linux-mm

On Thu, Sep 17, 2020 at 03:27:18PM +0000, HORIGUCHI NAOYA(堀口 直也) wrote:
> The test passed in my environment, so this is fine.

With everything applied, it works for me as well.
(apologies for the delay, the machine I was using stopped working and took a
while to get another one)

-- 
Aristeu


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 2/7] mm,hwpoison: Do not set hugepage_or_freepage unconditionally
  2020-09-17  8:10 ` [PATCH v4 2/7] mm,hwpoison: Do not set hugepage_or_freepage unconditionally Oscar Salvador
@ 2020-09-18 19:26   ` Aristeu Rozanski
  0 siblings, 0 replies; 18+ messages in thread
From: Aristeu Rozanski @ 2020-09-18 19:26 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, naoya.horiguchi, mhocko, tony.luck, cai, linux-kernel, linux-mm

On Thu, Sep 17, 2020 at 10:10:44AM +0200, Oscar Salvador wrote:
> Aristeu Rozanski reported that some hwpoison tests
> were returning -EBUSY while before the rework they succeed [1] [2].
> 
> The root case is that during a rebase, the call to page_handle_poison was
> setting hugepage_or_freepage = true unconditionally from __soft_offline_page.
> 
> Aristeu said that after this fix, everything works.
> 
> [1] https://patchwork.kernel.org/comment/23617301/
> [2] https://patchwork.kernel.org/comment/23619535/
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Reported-by: Aristeu Rozanski <aris@ruivo.org>

Tested-by: Aristeu Rozanski <aris@ruivo.org>

-- 
Aristeu


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 3/7] mm,hwpoison: Try to narrow window race for free pages
  2020-09-17  8:10 ` [PATCH v4 3/7] mm,hwpoison: Try to narrow window race for free pages Oscar Salvador
@ 2020-09-18 19:27   ` Aristeu Rozanski
  0 siblings, 0 replies; 18+ messages in thread
From: Aristeu Rozanski @ 2020-09-18 19:27 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, naoya.horiguchi, mhocko, tony.luck, cai, linux-kernel, linux-mm

On Thu, Sep 17, 2020 at 10:10:45AM +0200, Oscar Salvador wrote:
> Aristeu Rozanski reported that a customer test case started
> to report -EBUSY after the hwpoison report patchset.
> 
> There is a race window between spotting a free page and taking it off
> its buddy freelist, so it might be that by the time we try to take it off,
> the page has been already allocated.
> 
> This patch tries to handle such race window by trying to handle the new
> type of page again if the page was allocated under us.
> 
> After this patch, Aristeu said the test cases work properly.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Reported-by: Aristeu Rozanski <aris@ruivo.org>

Tested-by: Aristeu Rozanski <aris@ruivo.org>

-- 
Aristeu


^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 1/7] mm,hwpoison: take free pages off the buddy freelists
  2020-09-17  8:10 ` [PATCH v4 1/7] mm,hwpoison: take free pages off the buddy freelists Oscar Salvador
@ 2020-09-25  2:22   ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 18+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-09-25  2:22 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: akpm, aris, mhocko, tony.luck, cai, linux-kernel, linux-mm

On Thu, Sep 17, 2020 at 10:10:43AM +0200, Oscar Salvador wrote:
> The crux of the matter is that historically we left poisoned pages in the
> buddy system because we have some checks in place when allocating a page
> that a gatekeeper for poisoned pages.  Unfortunately, we do have other
> users (e.g: compaction [1]) that scan buddy freelists and try to get a
> page from there without checking whether the page is HWPoison.
> 
> As I stated already, I think it is fundamentally wrong to keep HWPoison
> pages within the buddy systems, checks in place or not.
> 
> Let us fix this we same way we did for soft_offline [2], and take the page
> off the buddy freelist, so it is completely unreachable.
> 
> Note that this is fairly simple to trigger, as we only need to poison free
> buddy pages (madvise MADV_HWPOISON) and then we need to run some sort of
> memory stress system.
> 
> Just for a matter of reference, I put a dump_page in compaction_alloc to
> trigger for HWPoison patches:
> 
> kernel: page:0000000012b2982b refcount:1 mapcount:0 mapping:0000000000000000 index:0x1 pfn:0x1d5db
> kernel: flags: 0xfffffc0800000(hwpoison)
> kernel: raw: 000fffffc0800000 ffffea00007573c8 ffffc90000857de0 0000000000000000
> kernel: raw: 0000000000000001 0000000000000000 00000001ffffffff 0000000000000000
> kernel: page dumped because: compaction_alloc
> 
> kernel: CPU: 4 PID: 123 Comm: kcompactd0 Tainted: G            E     5.9.0-rc2-mm1-1-default+ #5
> kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.10.2-0-g5f4c7b1-prebuilt.qemu-project.org 04/01/2014
> kernel: Call Trace:
> kernel:  dump_stack+0x6d/0x8b
> kernel:  compaction_alloc+0xb2/0xc0
> kernel:  migrate_pages+0x2a6/0x12a0
> kernel:  ? isolate_freepages+0xc80/0xc80
> kernel:  ? __ClearPageMovable+0xb0/0xb0
> kernel:  compact_zone+0x5eb/0x11c0
> kernel:  ? finish_task_switch+0x74/0x300
> kernel:  ? lock_timer_base+0xa8/0x170
> kernel:  proactive_compact_node+0x89/0xf0
> kernel:  ? kcompactd+0x2d0/0x3a0
> kernel:  kcompactd+0x2d0/0x3a0
> kernel:  ? finish_wait+0x80/0x80
> kernel:  ? kcompactd_do_work+0x350/0x350
> kernel:  kthread+0x118/0x130
> kernel:  ? kthread_associate_blkcg+0xa0/0xa0
> kernel:  ret_from_fork+0x22/0x30
> 
> After that, if e.g: someone faults in the page, that someone will get
> killed unexpectedly.
> 
> While we are at it, I also changed the action result for such cases.
> I think that MF_DELAYED is a bit misleading, because in case we could
> contain the page and take it off the buddy, such a page is not to be used
> again unless it is unpoisoned, so we fixed the situation.
> 
> So unless I am missing something, I strongly think that we should report
> MF_RECOVERED.
> 
> [1] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
> [2] https://patchwork.kernel.org/patch/11694847/
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

* Re: [PATCH v4 5/7] mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page
  2020-09-17  8:10 ` [PATCH v4 5/7] mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page Oscar Salvador
@ 2020-09-25  2:22   ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 18+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-09-25  2:22 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: akpm, aris, mhocko, tony.luck, cai, linux-kernel, linux-mm

On Thu, Sep 17, 2020 at 10:10:47AM +0200, Oscar Salvador wrote:
> A page with 0-refcount and !PageBuddy could perfectly be a pcppage.
> Currently, we bail out with an error if we encounter such a page, meaning
> that we do not handle pcppages neither from hard-offline nor from
> soft-offline path.
> 
> Fix this by draining pcplists whenever we find this kind of page and retry
> the check again.  It might be that pcplists have been spilled into the
> buddy allocator and so we can handle it.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

^ permalink raw reply	[flat|nested] 18+ messages in thread

end of thread, other threads:[~2020-09-25  2:22 UTC | newest]

Thread overview: 18+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-09-17  8:10 [PATCH v4 0/7] HWpoison: further fixes and cleanups Oscar Salvador
2020-09-17  8:10 ` [PATCH v4 1/7] mm,hwpoison: take free pages off the buddy freelists Oscar Salvador
2020-09-25  2:22   ` HORIGUCHI NAOYA(堀口 直也)
2020-09-17  8:10 ` [PATCH v4 2/7] mm,hwpoison: Do not set hugepage_or_freepage unconditionally Oscar Salvador
2020-09-18 19:26   ` Aristeu Rozanski
2020-09-17  8:10 ` [PATCH v4 3/7] mm,hwpoison: Try to narrow window race for free pages Oscar Salvador
2020-09-18 19:27   ` Aristeu Rozanski
2020-09-17  8:10 ` [PATCH v4 4/7] mm,hwpoison: refactor madvise_inject_error Oscar Salvador
2020-09-17  8:10 ` [PATCH v4 5/7] mm,hwpoison: drain pcplists before bailing out for non-buddy zero-refcount page Oscar Salvador
2020-09-25  2:22   ` HORIGUCHI NAOYA(堀口 直也)
2020-09-17  8:10 ` [PATCH v4 6/7] mm,hwpoison: drop unneeded pcplist draining Oscar Salvador
2020-09-17  8:10 ` [PATCH v4 7/7] mm,hwpoison: remove stale code Oscar Salvador
2020-09-17 11:39 ` [PATCH v4 0/7] HWpoison: further fixes and cleanups HORIGUCHI NAOYA(堀口 直也)
2020-09-17 13:09   ` Oscar Salvador
2020-09-17 13:40     ` Oscar Salvador
2020-09-17 15:27       ` HORIGUCHI NAOYA(堀口 直也)
2020-09-18  5:49         ` osalvador
2020-09-18 19:25         ` aris

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).