linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 0/7] HWPoison: Refactor get page interface
@ 2020-11-19 10:57 Oscar Salvador
  2020-11-19 10:57 ` [PATCH 1/7] mm,hwpoison: Refactor get_any_page Oscar Salvador
                   ` (7 more replies)
  0 siblings, 8 replies; 26+ messages in thread
From: Oscar Salvador @ 2020-11-19 10:57 UTC (permalink / raw)
  To: akpm; +Cc: n-horiguchi, linux-mm, linux-kernel, Oscar Salvador

Hi,

following up on previous fix-ups an refactors, this patchset simplifies
the get page interface and removes the MF_COUNT_INCREASED trick we have
for soft offline.

Please, note that this patchset is on top of [1] and [2].

This patchset does three things:

 1) Drops MF_COUNT_INCREASED trick
 2) Refactors get page interface
 3) Places a common entry for grabbin a page from both hard offline
    and soft offline guarded by zone_pcp_{disable/enable}, so we do not
    have to drain pcplists by ourself and retry again.

Note that the MF_COUNT_INCREASED trick was left because if get_hwpoison_page
races with put_page (e.g:)

CPU0                         CPU1
put_page (refcount decremented to 0)
 __put_single_page
  free_unref_page
   free_unref_page_prepare
    free_pcp_prepare
     free_pages_prepare                           soft_offline_page
     :page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP     get_any_page
                            			    get_hwpoison_page
   free_unref_page_commit
    free_one_page
     __free_one_page (place it in buddy)

get_hwpoison_page sees that page has a refcount of 0, but since it was not placed
in buddy yet we cannot really handle it.
We now have a sort of maximum passes in get_any_page, so in case we race
with either an allocation or a put_page, we retry again.

After an off-list discussion with Naoya, he agreed to proceed.

[1] https://patchwork.kernel.org/project/linux-mm/list/?series=364009
[2] https://patchwork.kernel.org/project/linux-mm/list/?series=381903

Naoya Horiguchi (3):
  mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  mm,hwpoison: remove MF_COUNT_INCREASED
  mm,hwpoison: remove flag argument from soft offline functions

Oscar Salvador (4):
  mm,hwpoison: Refactor get_any_page
  mm,hwpoison: Drop pfn parameter
  mm,hwpoison: Disable pcplists before grabbing a refcount
  mm,hwpoison: Remove drain_all_pages from shake_page

 drivers/base/memory.c |   2 +-
 include/linux/mm.h    |   9 +--
 mm/madvise.c          |  19 +++--
 mm/memory-failure.c   | 168 +++++++++++++++++-------------------------
 4 files changed, 85 insertions(+), 113 deletions(-)

-- 
2.26.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 1/7] mm,hwpoison: Refactor get_any_page
  2020-11-19 10:57 [PATCH 0/7] HWPoison: Refactor get page interface Oscar Salvador
@ 2020-11-19 10:57 ` Oscar Salvador
  2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
  2020-11-25 16:54   ` Vlastimil Babka
  2020-11-19 10:57 ` [PATCH 2/7] mm,hwpoison: Drop pfn parameter Oscar Salvador
                   ` (6 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Oscar Salvador @ 2020-11-19 10:57 UTC (permalink / raw)
  To: akpm; +Cc: n-horiguchi, linux-mm, linux-kernel, Oscar Salvador

When we want to grab a refcount via get_any_page, we call
__get_any_page that calls get_hwpoison_page to get the
actual refcount.
get_any_page is only there because we have a sort of retry
mechanism in case the page we met is unknown to us or
if we raced with an allocation.

Also __get_any_page prints some messages about the page type
in case the page was a free page or the page type was unknown,
but if anything, we only need to print a message in case the
pagetype was unknown, as that is reporting an error down the chain.

Let us merge get_any_page and __get_any_page, and let the message
be printed in soft_offline_page.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 91 ++++++++++++++++++---------------------------
 1 file changed, 37 insertions(+), 54 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 869ece2a1de2..0d2323ba4b8e 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1707,70 +1707,49 @@ EXPORT_SYMBOL(unpoison_memory);
 
 /*
  * Safely get reference count of an arbitrary page.
- * Returns 0 for a free page, -EIO for a zero refcount page
- * that is not free, and 1 for any other page type.
- * For 1 the page is returned with increased page count, otherwise not.
+ * Returns 0 for a free page, 1 for an in-use page, -EIO for a page-type we
+ * cannot handle and -EBUSY if we raced with an allocation.
+ * We only incremented refcount in case the page was already in-use and it is
+ * a known type we can handle.
  */
-static int __get_any_page(struct page *p, unsigned long pfn, int flags)
+static int get_any_page(struct page *p, unsigned long pfn, int flags)
 {
-	int ret;
+	int ret = 0, pass = 0;
 
 	if (flags & MF_COUNT_INCREASED)
 		return 1;
 
-	/*
-	 * When the target page is a free hugepage, just remove it
-	 * from free hugepage list.
-	 */
+try_again:
 	if (!get_hwpoison_page(p)) {
-		if (PageHuge(p)) {
-			pr_info("%s: %#lx free huge page\n", __func__, pfn);
-			ret = 0;
-		} else if (is_free_buddy_page(p)) {
-			pr_info("%s: %#lx free buddy page\n", __func__, pfn);
-			ret = 0;
-		} else if (page_count(p)) {
-			/* raced with allocation */
+		if (page_count(p)) {
+			/* We raced with an allocation, retry. */
+			if (pass++ < 3)
+				goto try_again;
 			ret = -EBUSY;
-		} else {
-			pr_info("%s: %#lx: unknown zero refcount page type %lx\n",
-				__func__, pfn, p->flags);
+		} else if (!PageHuge(p) && !is_free_buddy_page(p)) {
+			/* We raced with put_page, retry. */
+			if (pass++ < 3)
+				goto try_again;
 			ret = -EIO;
 		}
 	} else {
-		/* Not a free page */
-		ret = 1;
-	}
-	return ret;
-}
-
-static int get_any_page(struct page *page, unsigned long pfn, int flags)
-{
-	int ret = __get_any_page(page, pfn, flags);
-
-	if (ret == -EBUSY)
-		ret = __get_any_page(page, pfn, flags);
-
-	if (ret == 1 && !PageHuge(page) &&
-	    !PageLRU(page) && !__PageMovable(page)) {
-		/*
-		 * Try to free it.
-		 */
-		put_page(page);
-		shake_page(page, 1);
-
-		/*
-		 * Did it turn free?
-		 */
-		ret = __get_any_page(page, pfn, 0);
-		if (ret == 1 && !PageLRU(page)) {
-			/* Drop page reference which is from __get_any_page() */
-			put_page(page);
-			pr_info("soft_offline: %#lx: unknown non LRU page type %lx (%pGp)\n",
-				pfn, page->flags, &page->flags);
-			return -EIO;
+		if (PageHuge(p) || PageLRU(p) || __PageMovable(p)) {
+			ret = 1;
+		} else {
+			/*
+			 * A page we cannot handle. Check whether we can turn
+			 * it into something we can handle.
+			 */
+			if (pass++ < 3) {
+				put_page(p);
+				shake_page(p, 1);
+				goto try_again;
+			}
+			put_page(p);
+			ret = -EIO;
 		}
 	}
+
 	return ret;
 }
 
@@ -1939,7 +1918,7 @@ int soft_offline_page(unsigned long pfn, int flags)
 		return -EIO;
 
 	if (PageHWPoison(page)) {
-		pr_info("soft offline: %#lx page already poisoned\n", pfn);
+		pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
 		if (flags & MF_COUNT_INCREASED)
 			put_page(page);
 		return 0;
@@ -1950,13 +1929,17 @@ int soft_offline_page(unsigned long pfn, int flags)
 	ret = get_any_page(page, pfn, flags);
 	put_online_mems();
 
-	if (ret > 0)
+	if (ret > 0) {
 		ret = soft_offline_in_use_page(page);
-	else if (ret == 0)
+	} else if (ret == 0) {
 		if (soft_offline_free_page(page) && try_again) {
 			try_again = false;
 			goto retry;
 		}
+	} else if (ret == -EIO) {
+		pr_info("%s: %#lx: unknown page type: %lx (%pGP)\n",
+			 __func__, pfn, page->flags, &page->flags);
+	}
 
 	return ret;
 }
-- 
2.26.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 2/7] mm,hwpoison: Drop pfn parameter
  2020-11-19 10:57 [PATCH 0/7] HWPoison: Refactor get page interface Oscar Salvador
  2020-11-19 10:57 ` [PATCH 1/7] mm,hwpoison: Refactor get_any_page Oscar Salvador
@ 2020-11-19 10:57 ` Oscar Salvador
  2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
  2020-11-25 16:55   ` Vlastimil Babka
  2020-11-19 10:57 ` [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
                   ` (5 subsequent siblings)
  7 siblings, 2 replies; 26+ messages in thread
From: Oscar Salvador @ 2020-11-19 10:57 UTC (permalink / raw)
  To: akpm; +Cc: n-horiguchi, linux-mm, linux-kernel, Oscar Salvador

pfn parameter is no longer needed, drop it.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0d2323ba4b8e..04237fc04a00 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1712,7 +1712,7 @@ EXPORT_SYMBOL(unpoison_memory);
  * We only incremented refcount in case the page was already in-use and it is
  * a known type we can handle.
  */
-static int get_any_page(struct page *p, unsigned long pfn, int flags)
+static int get_any_page(struct page *p, int flags)
 {
 	int ret = 0, pass = 0;
 
@@ -1926,7 +1926,7 @@ int soft_offline_page(unsigned long pfn, int flags)
 
 retry:
 	get_online_mems();
-	ret = get_any_page(page, pfn, flags);
+	ret = get_any_page(page, flags);
 	put_online_mems();
 
 	if (ret > 0) {
-- 
2.26.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2020-11-19 10:57 [PATCH 0/7] HWPoison: Refactor get page interface Oscar Salvador
  2020-11-19 10:57 ` [PATCH 1/7] mm,hwpoison: Refactor get_any_page Oscar Salvador
  2020-11-19 10:57 ` [PATCH 2/7] mm,hwpoison: Drop pfn parameter Oscar Salvador
@ 2020-11-19 10:57 ` Oscar Salvador
  2020-11-25 18:20   ` Vlastimil Babka
  2020-11-19 10:57 ` [PATCH 4/7] mm,hwpoison: remove MF_COUNT_INCREASED Oscar Salvador
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2020-11-19 10:57 UTC (permalink / raw)
  To: akpm; +Cc: n-horiguchi, linux-mm, linux-kernel, Naoya Horiguchi, Oscar Salvador

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

The call to get_user_pages_fast is only to get the pointer to a struct
page of a given address, pinning it is memory-poisoning handler's job,
so drop the refcount grabbed by get_user_pages_fast().

Note that the target page is still pinned after this put_page() because
the current process should have refcount from mapping.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/madvise.c | 19 +++++++++++--------
 1 file changed, 11 insertions(+), 8 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index c6b5524add58..7a0f64b93635 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -900,20 +900,23 @@ static int madvise_inject_error(int behavior,
 		 */
 		size = page_size(compound_head(page));
 
+		/*
+		 * The get_user_pages_fast() is just to get the pfn of the
+		 * given address, and the refcount has nothing to do with
+		 * what we try to test, so it should be released immediately.
+		 * This is racy but it's intended because the real hardware
+		 * errors could happen at any moment and memory error handlers
+		 * must properly handle the race.
+		 */
+		put_page(page);
+
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
+			ret = soft_offline_page(pfn, 0);
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			/*
-			 * Drop the page reference taken by get_user_pages_fast(). In
-			 * the absence of MF_COUNT_INCREASED the memory_failure()
-			 * routine is responsible for pinning the page to prevent it
-			 * from being released back to the page allocator.
-			 */
-			put_page(page);
 			ret = memory_failure(pfn, 0);
 		}
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 4/7] mm,hwpoison: remove MF_COUNT_INCREASED
  2020-11-19 10:57 [PATCH 0/7] HWPoison: Refactor get page interface Oscar Salvador
                   ` (2 preceding siblings ...)
  2020-11-19 10:57 ` [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
@ 2020-11-19 10:57 ` Oscar Salvador
  2020-11-19 10:57 ` [PATCH 5/7] mm,hwpoison: remove flag argument from soft offline functions Oscar Salvador
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2020-11-19 10:57 UTC (permalink / raw)
  To: akpm; +Cc: n-horiguchi, linux-mm, linux-kernel, Naoya Horiguchi, Oscar Salvador

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

Now there's no user of MF_COUNT_INCREASED, so we can safely remove
it from all calling points.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/mm.h  | 7 +++----
 mm/memory-failure.c | 9 ++-------
 2 files changed, 5 insertions(+), 11 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index db6ae4d3fb4e..e2341d445ecb 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3025,10 +3025,9 @@ void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long nr_pages);
 
 enum mf_flags {
-	MF_COUNT_INCREASED = 1 << 0,
-	MF_ACTION_REQUIRED = 1 << 1,
-	MF_MUST_KILL = 1 << 2,
-	MF_SOFT_OFFLINE = 1 << 3,
+	MF_ACTION_REQUIRED = 1 << 0,
+	MF_MUST_KILL = 1 << 1,
+	MF_SOFT_OFFLINE = 1 << 2,
 };
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 04237fc04a00..37f4bb5a49d5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1185,7 +1185,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 
 	num_poisoned_pages_inc();
 
-	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
+	if (!get_hwpoison_page(p)) {
 		/*
 		 * Check "filter hit" and "race with other subpage."
 		 */
@@ -1387,7 +1387,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
-	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
+	if (!get_hwpoison_page(p)) {
 		if (is_free_buddy_page(p)) {
 			if (take_page_off_buddy(p)) {
 				page_ref_inc(p);
@@ -1716,9 +1716,6 @@ static int get_any_page(struct page *p, int flags)
 {
 	int ret = 0, pass = 0;
 
-	if (flags & MF_COUNT_INCREASED)
-		return 1;
-
 try_again:
 	if (!get_hwpoison_page(p)) {
 		if (page_count(p)) {
@@ -1919,8 +1916,6 @@ int soft_offline_page(unsigned long pfn, int flags)
 
 	if (PageHWPoison(page)) {
 		pr_info("%s: %#lx page already poisoned\n", __func__, pfn);
-		if (flags & MF_COUNT_INCREASED)
-			put_page(page);
 		return 0;
 	}
 
-- 
2.26.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 5/7] mm,hwpoison: remove flag argument from soft offline functions
  2020-11-19 10:57 [PATCH 0/7] HWPoison: Refactor get page interface Oscar Salvador
                   ` (3 preceding siblings ...)
  2020-11-19 10:57 ` [PATCH 4/7] mm,hwpoison: remove MF_COUNT_INCREASED Oscar Salvador
@ 2020-11-19 10:57 ` Oscar Salvador
  2020-11-19 10:57 ` [PATCH 6/7] mm,hwpoison: Disable pcplists before grabbing a refcount Oscar Salvador
                   ` (2 subsequent siblings)
  7 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2020-11-19 10:57 UTC (permalink / raw)
  To: akpm; +Cc: n-horiguchi, linux-mm, linux-kernel, Naoya Horiguchi, Oscar Salvador

From: Naoya Horiguchi <naoya.horiguchi@nec.com>

The argument @flag no longer affects the behavior of soft_offline_page()
and its variants, so let's remove them.

Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 drivers/base/memory.c | 2 +-
 include/linux/mm.h    | 2 +-
 mm/madvise.c          | 2 +-
 mm/memory-failure.c   | 9 ++++-----
 4 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index eef4ffb6122c..00a2f7191357 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -464,7 +464,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
 	if (kstrtoull(buf, 0, &pfn) < 0)
 		return -EINVAL;
 	pfn >>= PAGE_SHIFT;
-	ret = soft_offline_page(pfn, 0);
+	ret = soft_offline_page(pfn);
 	return ret == 0 ? count : ret;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index e2341d445ecb..48ba7deec46b 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -3037,7 +3037,7 @@ extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t num_poisoned_pages __read_mostly;
-extern int soft_offline_page(unsigned long pfn, int flags);
+extern int soft_offline_page(unsigned long pfn);
 
 
 /*
diff --git a/mm/madvise.c b/mm/madvise.c
index 7a0f64b93635..2bae426b01d5 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -913,7 +913,7 @@ static int madvise_inject_error(int behavior,
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			ret = soft_offline_page(pfn, 0);
+			ret = soft_offline_page(pfn);
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 37f4bb5a49d5..4abf5d5ffc96 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1572,7 +1572,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		if (!gotten)
 			break;
 		if (entry.flags & MF_SOFT_OFFLINE)
-			soft_offline_page(entry.pfn, entry.flags);
+			soft_offline_page(entry.pfn);
 		else
 			memory_failure(entry.pfn, entry.flags);
 	}
@@ -1712,7 +1712,7 @@ EXPORT_SYMBOL(unpoison_memory);
  * We only incremented refcount in case the page was already in-use and it is
  * a known type we can handle.
  */
-static int get_any_page(struct page *p, int flags)
+static int get_any_page(struct page *p)
 {
 	int ret = 0, pass = 0;
 
@@ -1882,7 +1882,6 @@ static int soft_offline_free_page(struct page *page)
 /**
  * soft_offline_page - Soft offline a page.
  * @pfn: pfn to soft-offline
- * @flags: flags. Same as memory_failure().
  *
  * Returns 0 on success, otherwise negated errno.
  *
@@ -1901,7 +1900,7 @@ static int soft_offline_free_page(struct page *page)
  * This is not a 100% solution for all memory, but tries to be
  * ``good enough'' for the majority of memory.
  */
-int soft_offline_page(unsigned long pfn, int flags)
+int soft_offline_page(unsigned long pfn)
 {
 	int ret;
 	struct page *page;
@@ -1921,7 +1920,7 @@ int soft_offline_page(unsigned long pfn, int flags)
 
 retry:
 	get_online_mems();
-	ret = get_any_page(page, flags);
+	ret = get_any_page(page);
 	put_online_mems();
 
 	if (ret > 0) {
-- 
2.26.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 6/7] mm,hwpoison: Disable pcplists before grabbing a refcount
  2020-11-19 10:57 [PATCH 0/7] HWPoison: Refactor get page interface Oscar Salvador
                   ` (4 preceding siblings ...)
  2020-11-19 10:57 ` [PATCH 5/7] mm,hwpoison: remove flag argument from soft offline functions Oscar Salvador
@ 2020-11-19 10:57 ` Oscar Salvador
  2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
  2020-11-26 13:45   ` Vlastimil Babka
  2020-11-19 10:57 ` [PATCH 7/7] mm,hwpoison: Remove drain_all_pages from shake_page Oscar Salvador
  2020-12-02 13:34 ` [PATCH 0/7] HWPoison: Refactor get page interface Qian Cai
  7 siblings, 2 replies; 26+ messages in thread
From: Oscar Salvador @ 2020-11-19 10:57 UTC (permalink / raw)
  To: akpm; +Cc: n-horiguchi, linux-mm, linux-kernel, Oscar Salvador

Currently, we have a sort of retry mechanism to make sure pages in
pcp-lists are spilled to the buddy system, so we can handle those.

We can save us this extra checks with the new disable-pcplist mechanism
that is available with [1].

zone_pcplist_disable makes sure to 1) disable pcplists, so any page
that is freed up from that point onwards will end up in the buddy
system and 2) drain pcplists, so those pages that already in pcplists
are spilled to buddy.

With that, we can make a common entry point for grabbing a refcount
from both soft_offline and memory_failure paths that is guarded by
zone_pcplist_disable/zone_pcplist_enable.

[1] https://patchwork.kernel.org/project/linux-mm/cover/20201111092812.11329-1-vbabka@suse.cz/

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 120 +++++++++++++++++++++-----------------------
 1 file changed, 58 insertions(+), 62 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 4abf5d5ffc96..512613e9a1bd 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -985,26 +985,67 @@ static int __get_hwpoison_page(struct page *page)
 	return 0;
 }
 
-static int get_hwpoison_page(struct page *p)
+/*
+ * Safely get reference count of an arbitrary page.
+ *
+ * Returns 0 for a free page, 1 for an in-use page,
+ * -EIO for a page-type we cannot handle and -EBUSY if we raced with an
+ * allocation.
+ * We only incremented refcount in case the page was already in-use and it
+ * is a known type we can handle.
+ */
+static int get_any_page(struct page *p)
 {
-	int ret;
-	bool drained = false;
+	int ret = 0, pass = 0;
 
-retry:
-	ret = __get_hwpoison_page(p);
-	if (!ret && !is_free_buddy_page(p) && !page_count(p) && !drained) {
-		/*
-		 * The page might be in a pcplist, so try to drain those
-		 * and see if we are lucky.
-		 */
-		drain_all_pages(page_zone(p));
-		drained = true;
-		goto retry;
+try_again:
+	if (!__get_hwpoison_page(p)) {
+		if (page_count(p)) {
+			/* We raced with an allocation, retry. */
+			if (pass++ < 3)
+				goto try_again;
+			ret = -EBUSY;
+		} else if (!PageHuge(p) && !is_free_buddy_page(p)) {
+			/* We raced with put_page, retry. */
+			if (pass++ < 3)
+				goto try_again;
+			ret = -EIO;
+		}
+	} else {
+		if (PageHuge(p) || PageLRU(p) || __PageMovable(p)) {
+			ret = 1;
+		} else {
+			/*
+			 * A page we cannot handle. Check whether we can turn
+			 * it into something we can handle.
+			 */
+			if (pass++ < 3) {
+				put_page(p);
+				shake_page(p, 1);
+				goto try_again;
+			}
+			put_page(p);
+			ret = -EIO;
+		}
 	}
 
 	return ret;
 }
 
+static int get_hwpoison_page(struct page *p, enum mf_flags ctxt)
+{
+	int ret;
+
+	zone_pcp_disable(page_zone(p));
+	if (ctxt == MF_SOFT_OFFLINE)
+		ret = get_any_page(p);
+	else
+		ret = __get_hwpoison_page(p);
+	zone_pcp_enable(page_zone(p));
+
+	return ret;
+}
+
 /*
  * Do all that is necessary to remove user space mappings. Unmap
  * the pages and send SIGBUS to the processes if the data was dirty.
@@ -1185,7 +1226,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 
 	num_poisoned_pages_inc();
 
-	if (!get_hwpoison_page(p)) {
+	if (!get_hwpoison_page(p, 0)) {
 		/*
 		 * Check "filter hit" and "race with other subpage."
 		 */
@@ -1387,7 +1428,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
-	if (!get_hwpoison_page(p)) {
+	if (!get_hwpoison_page(p, 0)) {
 		if (is_free_buddy_page(p)) {
 			if (take_page_off_buddy(p)) {
 				page_ref_inc(p);
@@ -1674,7 +1715,7 @@ int unpoison_memory(unsigned long pfn)
 		return 0;
 	}
 
-	if (!get_hwpoison_page(p)) {
+	if (!get_hwpoison_page(p, 0)) {
 		if (TestClearPageHWPoison(p))
 			num_poisoned_pages_dec();
 		unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
@@ -1705,51 +1746,6 @@ int unpoison_memory(unsigned long pfn)
 }
 EXPORT_SYMBOL(unpoison_memory);
 
-/*
- * Safely get reference count of an arbitrary page.
- * Returns 0 for a free page, 1 for an in-use page, -EIO for a page-type we
- * cannot handle and -EBUSY if we raced with an allocation.
- * We only incremented refcount in case the page was already in-use and it is
- * a known type we can handle.
- */
-static int get_any_page(struct page *p)
-{
-	int ret = 0, pass = 0;
-
-try_again:
-	if (!get_hwpoison_page(p)) {
-		if (page_count(p)) {
-			/* We raced with an allocation, retry. */
-			if (pass++ < 3)
-				goto try_again;
-			ret = -EBUSY;
-		} else if (!PageHuge(p) && !is_free_buddy_page(p)) {
-			/* We raced with put_page, retry. */
-			if (pass++ < 3)
-				goto try_again;
-			ret = -EIO;
-		}
-	} else {
-		if (PageHuge(p) || PageLRU(p) || __PageMovable(p)) {
-			ret = 1;
-		} else {
-			/*
-			 * A page we cannot handle. Check whether we can turn
-			 * it into something we can handle.
-			 */
-			if (pass++ < 3) {
-				put_page(p);
-				shake_page(p, 1);
-				goto try_again;
-			}
-			put_page(p);
-			ret = -EIO;
-		}
-	}
-
-	return ret;
-}
-
 static bool isolate_page(struct page *page, struct list_head *pagelist)
 {
 	bool isolated = false;
@@ -1920,7 +1916,7 @@ int soft_offline_page(unsigned long pfn)
 
 retry:
 	get_online_mems();
-	ret = get_any_page(page);
+	ret = get_hwpoison_page(page, MF_SOFT_OFFLINE);
 	put_online_mems();
 
 	if (ret > 0) {
-- 
2.26.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 7/7] mm,hwpoison: Remove drain_all_pages from shake_page
  2020-11-19 10:57 [PATCH 0/7] HWPoison: Refactor get page interface Oscar Salvador
                   ` (5 preceding siblings ...)
  2020-11-19 10:57 ` [PATCH 6/7] mm,hwpoison: Disable pcplists before grabbing a refcount Oscar Salvador
@ 2020-11-19 10:57 ` Oscar Salvador
  2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
  2020-11-26 13:52   ` Vlastimil Babka
  2020-12-02 13:34 ` [PATCH 0/7] HWPoison: Refactor get page interface Qian Cai
  7 siblings, 2 replies; 26+ messages in thread
From: Oscar Salvador @ 2020-11-19 10:57 UTC (permalink / raw)
  To: akpm; +Cc: n-horiguchi, linux-mm, linux-kernel, Oscar Salvador

get_hwpoison_page already drains pcplists, previously disabling
them when trying to grab a refcount.
We do not need shake_page to take care of it anymore.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 512613e9a1bd..ad976e1c3178 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -263,8 +263,8 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
 }
 
 /*
- * When a unknown page type is encountered drain as many buffers as possible
- * in the hope to turn the page into a LRU or free page, which we can handle.
+ * Unknown page type encountered. Try to check whether it can turn PageLRU by
+ * lru_add_drain_all, or a free page by reclaiming slabs when possible.
  */
 void shake_page(struct page *p, int access)
 {
@@ -275,9 +275,6 @@ void shake_page(struct page *p, int access)
 		lru_add_drain_all();
 		if (PageLRU(p))
 			return;
-		drain_all_pages(page_zone(p));
-		if (PageLRU(p) || is_free_buddy_page(p))
-			return;
 	}
 
 	/*
-- 
2.26.2


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/7] mm,hwpoison: Refactor get_any_page
  2020-11-19 10:57 ` [PATCH 1/7] mm,hwpoison: Refactor get_any_page Oscar Salvador
@ 2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
  2020-11-25 16:54   ` Vlastimil Babka
  1 sibling, 0 replies; 26+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-11-20  1:33 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: akpm, n-horiguchi, linux-mm, linux-kernel

On Thu, Nov 19, 2020 at 11:57:10AM +0100, Oscar Salvador wrote:
> When we want to grab a refcount via get_any_page, we call
> __get_any_page that calls get_hwpoison_page to get the
> actual refcount.
> get_any_page is only there because we have a sort of retry
> mechanism in case the page we met is unknown to us or
> if we raced with an allocation.
> 
> Also __get_any_page prints some messages about the page type
> in case the page was a free page or the page type was unknown,
> but if anything, we only need to print a message in case the
> pagetype was unknown, as that is reporting an error down the chain.
> 
> Let us merge get_any_page and __get_any_page, and let the message
> be printed in soft_offline_page.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Looks good to me, thank you.

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/7] mm,hwpoison: Drop pfn parameter
  2020-11-19 10:57 ` [PATCH 2/7] mm,hwpoison: Drop pfn parameter Oscar Salvador
@ 2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
  2020-11-25 16:55   ` Vlastimil Babka
  1 sibling, 0 replies; 26+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-11-20  1:33 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: akpm, n-horiguchi, linux-mm, linux-kernel

On Thu, Nov 19, 2020 at 11:57:11AM +0100, Oscar Salvador wrote:
> pfn parameter is no longer needed, drop it.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/7] mm,hwpoison: Disable pcplists before grabbing a refcount
  2020-11-19 10:57 ` [PATCH 6/7] mm,hwpoison: Disable pcplists before grabbing a refcount Oscar Salvador
@ 2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
  2020-11-26 13:45   ` Vlastimil Babka
  1 sibling, 0 replies; 26+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-11-20  1:33 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: akpm, n-horiguchi, linux-mm, linux-kernel

On Thu, Nov 19, 2020 at 11:57:15AM +0100, Oscar Salvador wrote:
> Currently, we have a sort of retry mechanism to make sure pages in
> pcp-lists are spilled to the buddy system, so we can handle those.
> 
> We can save us this extra checks with the new disable-pcplist mechanism
> that is available with [1].
> 
> zone_pcplist_disable makes sure to 1) disable pcplists, so any page
> that is freed up from that point onwards will end up in the buddy
> system and 2) drain pcplists, so those pages that already in pcplists
> are spilled to buddy.
> 
> With that, we can make a common entry point for grabbing a refcount
> from both soft_offline and memory_failure paths that is guarded by
> zone_pcplist_disable/zone_pcplist_enable.

This new mechanism provides a better solution to us.

> 
> [1] https://patchwork.kernel.org/project/linux-mm/cover/20201111092812.11329-1-vbabka@suse.cz/
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Thank you very much.

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] mm,hwpoison: Remove drain_all_pages from shake_page
  2020-11-19 10:57 ` [PATCH 7/7] mm,hwpoison: Remove drain_all_pages from shake_page Oscar Salvador
@ 2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
  2020-11-26 13:52   ` Vlastimil Babka
  1 sibling, 0 replies; 26+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-11-20  1:33 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: akpm, n-horiguchi, linux-mm, linux-kernel

On Thu, Nov 19, 2020 at 11:57:16AM +0100, Oscar Salvador wrote:
> get_hwpoison_page already drains pcplists, previously disabling
> them when trying to grab a refcount.
> We do not need shake_page to take care of it anymore.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 1/7] mm,hwpoison: Refactor get_any_page
  2020-11-19 10:57 ` [PATCH 1/7] mm,hwpoison: Refactor get_any_page Oscar Salvador
  2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
@ 2020-11-25 16:54   ` Vlastimil Babka
  1 sibling, 0 replies; 26+ messages in thread
From: Vlastimil Babka @ 2020-11-25 16:54 UTC (permalink / raw)
  To: Oscar Salvador, akpm; +Cc: n-horiguchi, linux-mm, linux-kernel

On 11/19/20 11:57 AM, Oscar Salvador wrote:
> When we want to grab a refcount via get_any_page, we call
> __get_any_page that calls get_hwpoison_page to get the
> actual refcount.
> get_any_page is only there because we have a sort of retry
> mechanism in case the page we met is unknown to us or
> if we raced with an allocation.
> 
> Also __get_any_page prints some messages about the page type
> in case the page was a free page or the page type was unknown,
> but if anything, we only need to print a message in case the
> pagetype was unknown, as that is reporting an error down the chain.
> 
> Let us merge get_any_page and __get_any_page, and let the message
> be printed in soft_offline_page.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Acked-by: Vlastimil Babka <Vbabka@suse.cz>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 2/7] mm,hwpoison: Drop pfn parameter
  2020-11-19 10:57 ` [PATCH 2/7] mm,hwpoison: Drop pfn parameter Oscar Salvador
  2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
@ 2020-11-25 16:55   ` Vlastimil Babka
  1 sibling, 0 replies; 26+ messages in thread
From: Vlastimil Babka @ 2020-11-25 16:55 UTC (permalink / raw)
  To: Oscar Salvador, akpm; +Cc: n-horiguchi, linux-mm, linux-kernel

On 11/19/20 11:57 AM, Oscar Salvador wrote:
> pfn parameter is no longer needed, drop it.

Could have been also part of previous patch.

> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Acked-by: Vlastimil Babka <Vbabka@suse.cz>

> ---
>   mm/memory-failure.c | 4 ++--
>   1 file changed, 2 insertions(+), 2 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 0d2323ba4b8e..04237fc04a00 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1712,7 +1712,7 @@ EXPORT_SYMBOL(unpoison_memory);
>    * We only incremented refcount in case the page was already in-use and it is
>    * a known type we can handle.
>    */
> -static int get_any_page(struct page *p, unsigned long pfn, int flags)
> +static int get_any_page(struct page *p, int flags)
>   {
>   	int ret = 0, pass = 0;
>   
> @@ -1926,7 +1926,7 @@ int soft_offline_page(unsigned long pfn, int flags)
>   
>   retry:
>   	get_online_mems();
> -	ret = get_any_page(page, pfn, flags);
> +	ret = get_any_page(page, flags);
>   	put_online_mems();
>   
>   	if (ret > 0) {
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2020-11-19 10:57 ` [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
@ 2020-11-25 18:20   ` Vlastimil Babka
  2020-12-01 11:35     ` Oscar Salvador
  0 siblings, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2020-11-25 18:20 UTC (permalink / raw)
  To: Oscar Salvador, akpm; +Cc: n-horiguchi, linux-mm, linux-kernel, Naoya Horiguchi

On 11/19/20 11:57 AM, Oscar Salvador wrote:
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> 
> The call to get_user_pages_fast is only to get the pointer to a struct
> page of a given address, pinning it is memory-poisoning handler's job,
> so drop the refcount grabbed by get_user_pages_fast().
> 
> Note that the target page is still pinned after this put_page() because
> the current process should have refcount from mapping.

Well, but can't it go away due to reclaim, migration or whatever?

> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   mm/madvise.c | 19 +++++++++++--------
>   1 file changed, 11 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c6b5524add58..7a0f64b93635 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -900,20 +900,23 @@ static int madvise_inject_error(int behavior,
>   		 */
>   		size = page_size(compound_head(page));
>   
> +		/*
> +		 * The get_user_pages_fast() is just to get the pfn of the
> +		 * given address, and the refcount has nothing to do with
> +		 * what we try to test, so it should be released immediately.
> +		 * This is racy but it's intended because the real hardware
> +		 * errors could happen at any moment and memory error handlers
> +		 * must properly handle the race.

Sure they have to. We might just be unexpectedly messing with other process' 
memory. Or does anything else prevent that?

> +		 */
> +		put_page(page);
> +
>   		if (behavior == MADV_SOFT_OFFLINE) {
>   			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
>   				 pfn, start);
> -			ret = soft_offline_page(pfn, MF_COUNT_INCREASED);
> +			ret = soft_offline_page(pfn, 0);
>   		} else {
>   			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
>   				 pfn, start);
> -			/*
> -			 * Drop the page reference taken by get_user_pages_fast(). In
> -			 * the absence of MF_COUNT_INCREASED the memory_failure()
> -			 * routine is responsible for pinning the page to prevent it
> -			 * from being released back to the page allocator.
> -			 */
> -			put_page(page);
>   			ret = memory_failure(pfn, 0);
>   		}
>   
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/7] mm,hwpoison: Disable pcplists before grabbing a refcount
  2020-11-19 10:57 ` [PATCH 6/7] mm,hwpoison: Disable pcplists before grabbing a refcount Oscar Salvador
  2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
@ 2020-11-26 13:45   ` Vlastimil Babka
  2020-11-28  0:51     ` Andrew Morton
  1 sibling, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2020-11-26 13:45 UTC (permalink / raw)
  To: Oscar Salvador, akpm; +Cc: n-horiguchi, linux-mm, linux-kernel

On 11/19/20 11:57 AM, Oscar Salvador wrote:
> Currently, we have a sort of retry mechanism to make sure pages in
> pcp-lists are spilled to the buddy system, so we can handle those.
> 
> We can save us this extra checks with the new disable-pcplist mechanism
> that is available with [1].
> 
> zone_pcplist_disable makes sure to 1) disable pcplists, so any page
> that is freed up from that point onwards will end up in the buddy
> system and 2) drain pcplists, so those pages that already in pcplists
> are spilled to buddy.
> 
> With that, we can make a common entry point for grabbing a refcount
> from both soft_offline and memory_failure paths that is guarded by
> zone_pcplist_disable/zone_pcplist_enable.
> 
> [1] https://patchwork.kernel.org/project/linux-mm/cover/20201111092812.11329-1-vbabka@suse.cz/
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>

Acked-by: Vlastimil Babka <vbabka@suse.cz>

Note as you say the series should go after [1] above, which means after 
mm-page_alloc-disable-pcplists-during-memory-offline.patch in mmots, but 
currently it's ordered before, where zone_pcp_disable() etc doesn't yet exist. 
Found out as I review using checked out this commit from -next.

> ---
>   mm/memory-failure.c | 120 +++++++++++++++++++++-----------------------
>   1 file changed, 58 insertions(+), 62 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 4abf5d5ffc96..512613e9a1bd 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -985,26 +985,67 @@ static int __get_hwpoison_page(struct page *page)
>   	return 0;
>   }
>   
> -static int get_hwpoison_page(struct page *p)
> +/*
> + * Safely get reference count of an arbitrary page.
> + *
> + * Returns 0 for a free page, 1 for an in-use page,
> + * -EIO for a page-type we cannot handle and -EBUSY if we raced with an
> + * allocation.
> + * We only incremented refcount in case the page was already in-use and it
> + * is a known type we can handle.
> + */
> +static int get_any_page(struct page *p)
>   {
> -	int ret;
> -	bool drained = false;
> +	int ret = 0, pass = 0;
>   
> -retry:
> -	ret = __get_hwpoison_page(p);
> -	if (!ret && !is_free_buddy_page(p) && !page_count(p) && !drained) {
> -		/*
> -		 * The page might be in a pcplist, so try to drain those
> -		 * and see if we are lucky.
> -		 */
> -		drain_all_pages(page_zone(p));
> -		drained = true;
> -		goto retry;
> +try_again:
> +	if (!__get_hwpoison_page(p)) {
> +		if (page_count(p)) {
> +			/* We raced with an allocation, retry. */
> +			if (pass++ < 3)
> +				goto try_again;
> +			ret = -EBUSY;
> +		} else if (!PageHuge(p) && !is_free_buddy_page(p)) {
> +			/* We raced with put_page, retry. */
> +			if (pass++ < 3)
> +				goto try_again;
> +			ret = -EIO;
> +		}
> +	} else {
> +		if (PageHuge(p) || PageLRU(p) || __PageMovable(p)) {
> +			ret = 1;
> +		} else {
> +			/*
> +			 * A page we cannot handle. Check whether we can turn
> +			 * it into something we can handle.
> +			 */
> +			if (pass++ < 3) {
> +				put_page(p);
> +				shake_page(p, 1);
> +				goto try_again;
> +			}
> +			put_page(p);
> +			ret = -EIO;
> +		}
>   	}
>   
>   	return ret;
>   }
>   
> +static int get_hwpoison_page(struct page *p, enum mf_flags ctxt)
> +{
> +	int ret;
> +
> +	zone_pcp_disable(page_zone(p));
> +	if (ctxt == MF_SOFT_OFFLINE)
> +		ret = get_any_page(p);
> +	else
> +		ret = __get_hwpoison_page(p);
> +	zone_pcp_enable(page_zone(p));
> +
> +	return ret;
> +}
> +
>   /*
>    * Do all that is necessary to remove user space mappings. Unmap
>    * the pages and send SIGBUS to the processes if the data was dirty.
> @@ -1185,7 +1226,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
>   
>   	num_poisoned_pages_inc();
>   
> -	if (!get_hwpoison_page(p)) {
> +	if (!get_hwpoison_page(p, 0)) {
>   		/*
>   		 * Check "filter hit" and "race with other subpage."
>   		 */
> @@ -1387,7 +1428,7 @@ int memory_failure(unsigned long pfn, int flags)
>   	 * In fact it's dangerous to directly bump up page count from 0,
>   	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>   	 */
> -	if (!get_hwpoison_page(p)) {
> +	if (!get_hwpoison_page(p, 0)) {
>   		if (is_free_buddy_page(p)) {
>   			if (take_page_off_buddy(p)) {
>   				page_ref_inc(p);
> @@ -1674,7 +1715,7 @@ int unpoison_memory(unsigned long pfn)
>   		return 0;
>   	}
>   
> -	if (!get_hwpoison_page(p)) {
> +	if (!get_hwpoison_page(p, 0)) {
>   		if (TestClearPageHWPoison(p))
>   			num_poisoned_pages_dec();
>   		unpoison_pr_info("Unpoison: Software-unpoisoned free page %#lx\n",
> @@ -1705,51 +1746,6 @@ int unpoison_memory(unsigned long pfn)
>   }
>   EXPORT_SYMBOL(unpoison_memory);
>   
> -/*
> - * Safely get reference count of an arbitrary page.
> - * Returns 0 for a free page, 1 for an in-use page, -EIO for a page-type we
> - * cannot handle and -EBUSY if we raced with an allocation.
> - * We only incremented refcount in case the page was already in-use and it is
> - * a known type we can handle.
> - */
> -static int get_any_page(struct page *p)
> -{
> -	int ret = 0, pass = 0;
> -
> -try_again:
> -	if (!get_hwpoison_page(p)) {
> -		if (page_count(p)) {
> -			/* We raced with an allocation, retry. */
> -			if (pass++ < 3)
> -				goto try_again;
> -			ret = -EBUSY;
> -		} else if (!PageHuge(p) && !is_free_buddy_page(p)) {
> -			/* We raced with put_page, retry. */
> -			if (pass++ < 3)
> -				goto try_again;
> -			ret = -EIO;
> -		}
> -	} else {
> -		if (PageHuge(p) || PageLRU(p) || __PageMovable(p)) {
> -			ret = 1;
> -		} else {
> -			/*
> -			 * A page we cannot handle. Check whether we can turn
> -			 * it into something we can handle.
> -			 */
> -			if (pass++ < 3) {
> -				put_page(p);
> -				shake_page(p, 1);
> -				goto try_again;
> -			}
> -			put_page(p);
> -			ret = -EIO;
> -		}
> -	}
> -
> -	return ret;
> -}
> -
>   static bool isolate_page(struct page *page, struct list_head *pagelist)
>   {
>   	bool isolated = false;
> @@ -1920,7 +1916,7 @@ int soft_offline_page(unsigned long pfn)
>   
>   retry:
>   	get_online_mems();
> -	ret = get_any_page(page);
> +	ret = get_hwpoison_page(page, MF_SOFT_OFFLINE);
>   	put_online_mems();
>   
>   	if (ret > 0) {
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] mm,hwpoison: Remove drain_all_pages from shake_page
  2020-11-19 10:57 ` [PATCH 7/7] mm,hwpoison: Remove drain_all_pages from shake_page Oscar Salvador
  2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
@ 2020-11-26 13:52   ` Vlastimil Babka
  2020-11-27  7:20     ` Oscar Salvador
  1 sibling, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2020-11-26 13:52 UTC (permalink / raw)
  To: Oscar Salvador, akpm; +Cc: n-horiguchi, linux-mm, linux-kernel

On 11/19/20 11:57 AM, Oscar Salvador wrote:
> get_hwpoison_page already drains pcplists, previously disabling
> them when trying to grab a refcount.
> We do not need shake_page to take care of it anymore.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>   mm/memory-failure.c | 7 ++-----
>   1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 512613e9a1bd..ad976e1c3178 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -263,8 +263,8 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
>   }
>   
>   /*
> - * When a unknown page type is encountered drain as many buffers as possible
> - * in the hope to turn the page into a LRU or free page, which we can handle.
> + * Unknown page type encountered. Try to check whether it can turn PageLRU by
> + * lru_add_drain_all, or a free page by reclaiming slabs when possible.
>    */
>   void shake_page(struct page *p, int access)
>   {
> @@ -275,9 +275,6 @@ void shake_page(struct page *p, int access)
>   		lru_add_drain_all();
>   		if (PageLRU(p))
>   			return;
> -		drain_all_pages(page_zone(p));
> -		if (PageLRU(p) || is_free_buddy_page(p))
> -			return;

I wonder if page in the lru pagevec can in fact become free after draining the 
lru - in that case we could keep the is_free_buddy_page() check.

>   	}
>   
>   	/*
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 7/7] mm,hwpoison: Remove drain_all_pages from shake_page
  2020-11-26 13:52   ` Vlastimil Babka
@ 2020-11-27  7:20     ` Oscar Salvador
  0 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2020-11-27  7:20 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: akpm, n-horiguchi, linux-mm, linux-kernel

On Thu, Nov 26, 2020 at 02:52:32PM +0100, Vlastimil Babka wrote:
> > @@ -275,9 +275,6 @@ void shake_page(struct page *p, int access)
> >   		lru_add_drain_all();
> >   		if (PageLRU(p))
> >   			return;
> > -		drain_all_pages(page_zone(p));
> > -		if (PageLRU(p) || is_free_buddy_page(p))
> > -			return;
> 
> I wonder if page in the lru pagevec can in fact become free after draining
> the lru - in that case we could keep the is_free_buddy_page() check.

Uhm, yes, I think it can happen.
After all, once the page hits one of the inactives' LRU it can be
reclaimed.

I am fine with joining the left PageLRU and is_free_buddy_page check.

Will do that in the next revision.

Thanks!

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 6/7] mm,hwpoison: Disable pcplists before grabbing a refcount
  2020-11-26 13:45   ` Vlastimil Babka
@ 2020-11-28  0:51     ` Andrew Morton
  0 siblings, 0 replies; 26+ messages in thread
From: Andrew Morton @ 2020-11-28  0:51 UTC (permalink / raw)
  To: Vlastimil Babka; +Cc: Oscar Salvador, n-horiguchi, linux-mm, linux-kernel

On Thu, 26 Nov 2020 14:45:30 +0100 Vlastimil Babka <vbabka@suse.cz> wrote:

> Note as you say the series should go after [1] above, which means after 
> mm-page_alloc-disable-pcplists-during-memory-offline.patch in mmots, but 
> currently it's ordered before, where zone_pcp_disable() etc doesn't yet exist. 
> Found out as I review using checked out this commit from -next.

Thanks, I reordered them.

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2020-11-25 18:20   ` Vlastimil Babka
@ 2020-12-01 11:35     ` Oscar Salvador
  2020-12-04 17:25       ` Vlastimil Babka
  0 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2020-12-01 11:35 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, n-horiguchi, linux-mm, linux-kernel, Naoya Horiguchi

On Wed, Nov 25, 2020 at 07:20:33PM +0100, Vlastimil Babka wrote:
> On 11/19/20 11:57 AM, Oscar Salvador wrote:
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > 
> > The call to get_user_pages_fast is only to get the pointer to a struct
> > page of a given address, pinning it is memory-poisoning handler's job,
> > so drop the refcount grabbed by get_user_pages_fast().
> > 
> > Note that the target page is still pinned after this put_page() because
> > the current process should have refcount from mapping.
> 
> Well, but can't it go away due to reclaim, migration or whatever?

Yes, it can.

> > @@ -900,20 +900,23 @@ static int madvise_inject_error(int behavior,
> >   		 */
> >   		size = page_size(compound_head(page));
> > +		/*
> > +		 * The get_user_pages_fast() is just to get the pfn of the
> > +		 * given address, and the refcount has nothing to do with
> > +		 * what we try to test, so it should be released immediately.
> > +		 * This is racy but it's intended because the real hardware
> > +		 * errors could happen at any moment and memory error handlers
> > +		 * must properly handle the race.
> 
> Sure they have to. We might just be unexpectedly messing with other process'
> memory. Or does anything else prevent that?

No, nothing does, and I have to confess that I managed to confuse myself here.
If we release such page and that page ends up in buddy, nothing prevents someone
else to get that page, and then we would be messing with other process memory.

I guess the right thing to do is just to make sure we got that page and that
that page remains pinned as long as the memory failure handling goes.

I will remove those patches from the patchset and re-submit with only the
refactoring and pcp-disabling.

Thanks Vlastimil

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/7] HWPoison: Refactor get page interface
  2020-11-19 10:57 [PATCH 0/7] HWPoison: Refactor get page interface Oscar Salvador
                   ` (6 preceding siblings ...)
  2020-11-19 10:57 ` [PATCH 7/7] mm,hwpoison: Remove drain_all_pages from shake_page Oscar Salvador
@ 2020-12-02 13:34 ` Qian Cai
  2020-12-02 13:41   ` Oscar Salvador
  7 siblings, 1 reply; 26+ messages in thread
From: Qian Cai @ 2020-12-02 13:34 UTC (permalink / raw)
  To: Oscar Salvador, akpm; +Cc: n-horiguchi, linux-mm, linux-kernel

On Thu, 2020-11-19 at 11:57 +0100, Oscar Salvador wrote:
> Hi,
> 
> following up on previous fix-ups an refactors, this patchset simplifies
> the get page interface and removes the MF_COUNT_INCREASED trick we have
> for soft offline.

Well, the madvise() EIO is back. I don't understand why we can't test it on a
NUMA system before posting this over and over again.

# git clone https://e.coding.net/cailca/linux/mm
# cd mm; make
# ./ranbug 1 
- start: migrate_huge_offline
- use NUMA nodes 0,3.
- mmap and free 8388608 bytes hugepages on node 0
- mmap and free 8388608 bytes hugepages on node 3
madvise: Input/output error

[ 1270.054919][ T7497] Soft offlining pfn 0x1958e00 at process virtual address 0x7f7d9ca00000
[ 1270.067318][ T7497] Soft offlining pfn 0x18d0600 at process virtual address 0x7f7d9c800000
[ 1270.078856][ T7497] Soft offlining pfn 0x1ac800 at process virtual address 0x7f7d9ca00000
[ 1270.091268][ T7497] Soft offlining pfn 0x1e10a00 at process virtual address 0x7f7d9c800000
[ 1270.101946][ T7497] Soft offlining pfn 0x18c800 at process virtual address 0x7f7d9ca00000
[ 1270.111678][ T7497] soft offline: 0x18c800: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.126133][ T7497] Soft offlining pfn 0x18b5400 at process virtual address 0x7f7d9c800000
[ 1270.136581][ T7497] Soft offlining pfn 0x211c00 at process virtual address 0x7f7d9ca00000
[ 1270.146214][ T7497] soft offline: 0x211c00: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.160624][ T7497] Soft offlining pfn 0x19bee00 at process virtual address 0x7f7d9c800000
[ 1270.170896][ T7497] Soft offlining pfn 0x1e21a00 at process virtual address 0x7f7d9ca00000
[ 1270.185011][ T7497] Soft offlining pfn 0x1fd1200 at process virtual address 0x7f7d9c800000
[ 1270.195341][ T7497] Soft offlining pfn 0x1882400 at process virtual address 0x7f7d9ca00000
[ 1270.480593][ T7497] Soft offlining pfn 0x18bc000 at process virtual address 0x7f7d9c800000
[ 1270.491961][ T7497] soft offline: 0x18bc000: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.506018][ T7497] Soft offlining pfn 0x1e76a00 at process virtual address 0x7f7d9c800000
[ 1270.590266][ T7497] Soft offlining pfn 0x1b3c00 at process virtual address 0x7f7d9ca00000
[ 1270.600207][ T7497] soft offline: 0x1b3c00: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.614316][ T7497] Soft offlining pfn 0x1882600 at process virtual address 0x7f7d9c800000
[ 1270.662427][ T7497] Soft offlining pfn 0x1b3c00 at process virtual address 0x7f7d9ca00000
[ 1270.744249][ T7497] Soft offlining pfn 0x18bc000 at process virtual address 0x7f7d9c800000
[ 1270.754314][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 0x7f7d9ca00000
[ 1270.765204][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.816653][ T7497] Soft offlining pfn 0x18d0400 at process virtual address 0x7f7d9c800000
[ 1270.827049][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 0x7f7d9ca00000
[ 1270.837997][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.852156][ T7497] Soft offlining pfn 0x186ca00 at process virtual address 0x7f7d9c800000
[ 1270.862350][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 0x7f7d9ca00000
[ 1270.872922][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.887133][ T7497] Soft offlining pfn 0x18ac200 at process virtual address 0x7f7d9c800000
[ 1270.897450][ T7497] Soft offlining pfn 0x211c00 at process virtual address 0x7f7d9ca00000
[ 1270.907416][ T7497] soft offline: 0x211c00: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.921365][ T7497] Soft offlining pfn 0x1e1cc00 at process virtual address 0x7f7d9c800000
[ 1270.931700][ T7497] Soft offlining pfn 0x18c800 at process virtual address 0x7f7d9ca00000
[ 1270.941580][ T7497] soft offline: 0x18c800: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.955649][ T7497] Soft offlining pfn 0x1e6ae00 at process virtual address 0x7f7d9c800000
[ 1270.966063][ T7497] Soft offlining pfn 0x211c00 at process virtual address 0x7f7d9ca00000
[ 1270.975965][ T7497] soft offline: 0x211c00: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1270.990059][ T7497] Soft offlining pfn 0x1e72e00 at process virtual address 0x7f7d9c800000
[ 1271.000323][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 0x7f7d9ca00000
[ 1271.011006][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1271.025152][ T7497] Soft offlining pfn 0x1e22200 at process virtual address 0x7f7d9c800000
[ 1271.035395][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 0x7f7d9ca00000
[ 1271.045916][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1271.060159][ T7497] Soft offlining pfn 0x1e6fe00 at process virtual address 0x7f7d9c800000
[ 1271.070695][ T7497] Soft offlining pfn 0x18c800 at process virtual address 0x7f7d9ca00000
[ 1271.080596][ T7497] soft offline: 0x18c800: hugepage isolation failed: 0, page count 2, type bfffc00001000e (referenced|uptodate|dirty|head)
[ 1271.094725][ T7497] Soft offlining pfn 0x1968200 at process virtual address 0x7f7d9c800000
[ 1271.105006][ T7497] Soft offlining pfn 0x18d1200 at process virtual address 0x7f7d9ca00000
[ 1271.115567][ T7497] soft offline: 0x18d1200: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1271.129775][ T7497] Soft offlining pfn 0x1e1ae00 at process virtual address 0x7f7d9c800000
[ 1271.140285][ T7497] Soft offlining pfn 0x18c800 at process virtual address 0x7f7d9ca00000
[ 1271.150185][ T7497] soft offline: 0x18c800: hugepage isolation failed: 0, page count 2, type bfffc[ 1271.468115][ T7497] Soft offlining pfn 0x1de4600 at process virtual address 0x7f7d9c800000
[ 1271.479348][ T7497] Soft offlining pfn 0x145e00 at process virtual address 0x7f7d9ca00000
[ 1271.489928][ T7497] soft offline: 0x145e00: hugepage isolation 1271.538433][ T7497] Soft offlining pfn 0x1fae00 at process virtual address 0x7f7d9c800000
[ 1271.548880][ T7497] Soft offlining pfn 0x1995e00 at process virtual address 0x7f7d9ca00000
[ 1271.558877][ T7497] soft offline: 0x1995e00: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1271.573055][ T7497] Soft offlining pfn 0x221e00 at process virtual address 0x7f7d9c800000
[ 1271.583453][ T7497] Soft offlining pfn 0x1901800 at process virtual address 0x7f7d9ca00000
[ 1271.593440][ T7497] soft offline: 0x1901800: hugepage isolation failed: 0, page count 2, type 3bfffc00001000e (referenced|uptodate|dirty|head)
[ 1271.610005][ T7497] Soft offlining pfn 0x232400 at process virtual address 0x7f7d9c800000
[ 1271.620439][ T7497] Soft offlinin[ 1272.005890][ T7497] Soft offlining pfn 0x230e00 at process virtual address 0x7f7d9c800000
[ 1272.017226][ T7497] Soft offlining pfn 0x185fe00 at process virtual address 0x7f7d9ca00000
[ 1272.029194][ T7497] Soft offlining pfn 0x1f1400 at process virtual address 0x7f7d9c800000
[ 1272.040088][ T7497] Soft offlining pfn 0x1f9e00 at process virtual address 0x7f7d9ca00000
[ 1272.052415][ T7497] Soft offlining pfn 0x1885a00 at process virtual address 0x7f7d9c800000
[ 1272.062510][ T7497] Soft offlining pfn 0x18b6000 at process virtual address 0x7f7d9ca00000
[ 1272.071931][ T7497] soft_offline_page: 0x18b6000: unknown page type: 3bfffc000000000 ((%pG?))

> 
> Please, note that this patchset is on top of [1] and [2].
> 
> This patchset does three things:
> 
>  1) Drops MF_COUNT_INCREASED trick
>  2) Refactors get page interface
>  3) Places a common entry for grabbin a page from both hard offline
>     and soft offline guarded by zone_pcp_{disable/enable}, so we do not
>     have to drain pcplists by ourself and retry again.
> 
> Note that the MF_COUNT_INCREASED trick was left because if get_hwpoison_page
> races with put_page (e.g:)
> 
> CPU0                         CPU1
> put_page (refcount decremented to 0)
>  __put_single_page
>   free_unref_page
>    free_unref_page_prepare
>     free_pcp_prepare
>      free_pages_prepare                           soft_offline_page
>      :page->flags &= ~PAGE_FLAGS_CHECK_AT_PREP     get_any_page
>                             			    get_hwpoison_page
>    free_unref_page_commit
>     free_one_page
>      __free_one_page (place it in buddy)
> 
> get_hwpoison_page sees that page has a refcount of 0, but since it was not
> placed
> in buddy yet we cannot really handle it.
> We now have a sort of maximum passes in get_any_page, so in case we race
> with either an allocation or a put_page, we retry again.
> 
> After an off-list discussion with Naoya, he agreed to proceed.
> 
> [1] https://patchwork.kernel.org/project/linux-mm/list/?series=364009
> [2] https://patchwork.kernel.org/project/linux-mm/list/?series=381903
> 
> Naoya Horiguchi (3):
>   mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
>   mm,hwpoison: remove MF_COUNT_INCREASED
>   mm,hwpoison: remove flag argument from soft offline functions
> 
> Oscar Salvador (4):
>   mm,hwpoison: Refactor get_any_page
>   mm,hwpoison: Drop pfn parameter
>   mm,hwpoison: Disable pcplists before grabbing a refcount
>   mm,hwpoison: Remove drain_all_pages from shake_page
> 
>  drivers/base/memory.c |   2 +-
>  include/linux/mm.h    |   9 +--
>  mm/madvise.c          |  19 +++--
>  mm/memory-failure.c   | 168 +++++++++++++++++-------------------------
>  4 files changed, 85 insertions(+), 113 deletions(-)
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 0/7] HWPoison: Refactor get page interface
  2020-12-02 13:34 ` [PATCH 0/7] HWPoison: Refactor get page interface Qian Cai
@ 2020-12-02 13:41   ` Oscar Salvador
  0 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2020-12-02 13:41 UTC (permalink / raw)
  To: Qian Cai; +Cc: akpm, n-horiguchi, linux-mm, linux-kernel

On Wed, Dec 02, 2020 at 08:34:57AM -0500, Qian Cai wrote:
> On Thu, 2020-11-19 at 11:57 +0100, Oscar Salvador wrote:
> > Hi,
> > 
> > following up on previous fix-ups an refactors, this patchset simplifies
> > the get page interface and removes the MF_COUNT_INCREASED trick we have
> > for soft offline.
> 
> Well, the madvise() EIO is back. I don't understand why we can't test it on a
> NUMA system before posting this over and over again.
> 
> # git clone https://e.coding.net/cailca/linux/mm
> # cd mm; make
> # ./ranbug 1 
> - start: migrate_huge_offline
> - use NUMA nodes 0,3.
> - mmap and free 8388608 bytes hugepages on node 0
> - mmap and free 8388608 bytes hugepages on node 3
> madvise: Input/output error

I tried it out myself enlarging the window race artificially but I was not able
to get -EIO anymore.
But as Vlastimil pointed out in the respective patch, it is better to keep the
page pinned for madvise.

I am planning to re-post leaving out the patches that remove the pinning.

Anyway, thanks for the report.

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2020-12-01 11:35     ` Oscar Salvador
@ 2020-12-04 17:25       ` Vlastimil Babka
  2020-12-05 15:34         ` Oscar Salvador
  0 siblings, 1 reply; 26+ messages in thread
From: Vlastimil Babka @ 2020-12-04 17:25 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: akpm, n-horiguchi, linux-mm, linux-kernel, Naoya Horiguchi, Dan Williams

On 12/1/20 12:35 PM, Oscar Salvador wrote:
> On Wed, Nov 25, 2020 at 07:20:33PM +0100, Vlastimil Babka wrote:
>> On 11/19/20 11:57 AM, Oscar Salvador wrote:
>> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>> > 
>> > The call to get_user_pages_fast is only to get the pointer to a struct
>> > page of a given address, pinning it is memory-poisoning handler's job,
>> > so drop the refcount grabbed by get_user_pages_fast().
>> > 
>> > Note that the target page is still pinned after this put_page() because
>> > the current process should have refcount from mapping.
>> 
>> Well, but can't it go away due to reclaim, migration or whatever?
> 
> Yes, it can.
> 
>> > @@ -900,20 +900,23 @@ static int madvise_inject_error(int behavior,
>> >   		 */
>> >   		size = page_size(compound_head(page));
>> > +		/*
>> > +		 * The get_user_pages_fast() is just to get the pfn of the
>> > +		 * given address, and the refcount has nothing to do with
>> > +		 * what we try to test, so it should be released immediately.
>> > +		 * This is racy but it's intended because the real hardware
>> > +		 * errors could happen at any moment and memory error handlers
>> > +		 * must properly handle the race.
>> 
>> Sure they have to. We might just be unexpectedly messing with other process'
>> memory. Or does anything else prevent that?
> 
> No, nothing does, and I have to confess that I managed to confuse myself here.
> If we release such page and that page ends up in buddy, nothing prevents someone
> else to get that page, and then we would be messing with other process memory.
> 
> I guess the right thing to do is just to make sure we got that page and that
> that page remains pinned as long as the memory failure handling goes.

OK, so that means we don't introduce this race for MADV_SOFT_OFFLINE, but it's
already (and still) there for MADV_HWPOISON since Dan's 23e7b5c2e271 ("mm,
madvise_inject_error: Let memory_failure() optionally take a page reference") no?

> I will remove those patches from the patchset and re-submit with only the
> refactoring and pcp-disabling.
> 
> Thanks Vlastimil
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2020-12-04 17:25       ` Vlastimil Babka
@ 2020-12-05 15:34         ` Oscar Salvador
  2020-12-07  2:34           ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2020-12-05 15:34 UTC (permalink / raw)
  To: Vlastimil Babka
  Cc: akpm, n-horiguchi, linux-mm, linux-kernel, Naoya Horiguchi, Dan Williams

On Fri, Dec 04, 2020 at 06:25:31PM +0100, Vlastimil Babka wrote:
> OK, so that means we don't introduce this race for MADV_SOFT_OFFLINE, but it's
> already (and still) there for MADV_HWPOISON since Dan's 23e7b5c2e271 ("mm,
> madvise_inject_error: Let memory_failure() optionally take a page reference") no?

What about the following?
CCing Dan as well.

From: Oscar Salvador <osalvador@suse.de>
Date: Sat, 5 Dec 2020 16:14:40 +0100
Subject: [PATCH] mm,memory_failure: Always pin the page in
 madvise_inject_error

madvise_inject_error() uses get_user_pages_fast to get the page
from the addr we specified.
After [1], we drop such extra reference for memory_failure() path.
That commit says that memory_failure wanted to keep the pin in order
to take the page out of circulation.

The truth is that we need to keep the page pinned, otherwise the
page might be re-used after the put_page(), and we can end up messing
with someone else's memory.
E.g:

CPU0
process X					CPU1
 madvise_inject_error
  get_user_pages
   put_page
					page gets reclaimed
					process Y allocates the page
  memory_failure
   // We mess with process Y memory

madvise() is meant to operate on a self address space, so messing with
pages that do not belong to us seems the wrong thing to do.
To avoid that, let us keep the page pinned for memory_failure as well.

Pages for DAX mappings will release this extra refcount in
memory_failure_dev_pagemap.

[1] ("23e7b5c2e271: mm, madvise_inject_error:
      Let memory_failure() optionally take a page reference")

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Suggested-by: Vlastimil Babka <vbabka@suse.cz>
Fixes: 23e7b5c2e271 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference")
---
 mm/madvise.c        | 9 +--------
 mm/memory-failure.c | 6 ++++++
 2 files changed, 7 insertions(+), 8 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index c6b5524add58..19edddba196d 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -907,14 +907,7 @@ static int madvise_inject_error(int behavior,
 		} else {
 			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				 pfn, start);
-			/*
-			 * Drop the page reference taken by get_user_pages_fast(). In
-			 * the absence of MF_COUNT_INCREASED the memory_failure()
-			 * routine is responsible for pinning the page to prevent it
-			 * from being released back to the page allocator.
-			 */
-			put_page(page);
-			ret = memory_failure(pfn, 0);
+			ret = memory_failure(pfn, MF_COUNT_INCREASED);
 		}
 
 		if (ret)
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 869ece2a1de2..ba861169c9ae 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1269,6 +1269,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
 	if (!cookie)
 		goto out;
 
+	if (flags & MF_COUNT_INCREASED)
+		/*
+		 * Drop the extra refcount in case we come from madvise().
+		 */
+		put_page(page);
+
 	if (hwpoison_filter(page)) {
 		rc = 0;
 		goto unlock;


-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2020-12-05 15:34         ` Oscar Salvador
@ 2020-12-07  2:34           ` HORIGUCHI NAOYA(堀口 直也)
  2020-12-07  7:24             ` Oscar Salvador
  0 siblings, 1 reply; 26+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2020-12-07  2:34 UTC (permalink / raw)
  To: Oscar Salvador
  Cc: Vlastimil Babka, akpm, n-horiguchi, linux-mm, linux-kernel, Dan Williams

On Sat, Dec 05, 2020 at 04:34:23PM +0100, Oscar Salvador wrote:
> On Fri, Dec 04, 2020 at 06:25:31PM +0100, Vlastimil Babka wrote:
> > OK, so that means we don't introduce this race for MADV_SOFT_OFFLINE, but it's
> > already (and still) there for MADV_HWPOISON since Dan's 23e7b5c2e271 ("mm,
> > madvise_inject_error: Let memory_failure() optionally take a page reference") no?
> 
> What about the following?
> CCing Dan as well.

Hi Oscar, Vlastimil,

Thanks for mentioning this. I agree with that direction.

> 
> From: Oscar Salvador <osalvador@suse.de>
> Date: Sat, 5 Dec 2020 16:14:40 +0100
> Subject: [PATCH] mm,memory_failure: Always pin the page in
>  madvise_inject_error
> 
> madvise_inject_error() uses get_user_pages_fast to get the page
> from the addr we specified.
> After [1], we drop such extra reference for memory_failure() path.
> That commit says that memory_failure wanted to keep the pin in order
> to take the page out of circulation.
> 
> The truth is that we need to keep the page pinned, otherwise the
> page might be re-used after the put_page(), and we can end up messing
> with someone else's memory.
> E.g:
> 
> CPU0
> process X					CPU1
>  madvise_inject_error
>   get_user_pages
>    put_page
> 					page gets reclaimed
> 					process Y allocates the page
>   memory_failure
>    // We mess with process Y memory
> 
> madvise() is meant to operate on a self address space, so messing with
> pages that do not belong to us seems the wrong thing to do.
> To avoid that, let us keep the page pinned for memory_failure as well.
> 
> Pages for DAX mappings will release this extra refcount in
> memory_failure_dev_pagemap.
> 
> [1] ("23e7b5c2e271: mm, madvise_inject_error:
>       Let memory_failure() optionally take a page reference")
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Suggested-by: Vlastimil Babka <vbabka@suse.cz>
> Fixes: 23e7b5c2e271 ("mm, madvise_inject_error: Let memory_failure() optionally take a page reference")
> ---
>  mm/madvise.c        | 9 +--------
>  mm/memory-failure.c | 6 ++++++
>  2 files changed, 7 insertions(+), 8 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index c6b5524add58..19edddba196d 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -907,14 +907,7 @@ static int madvise_inject_error(int behavior,
>  		} else {
>  			pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
>  				 pfn, start);
> -			/*
> -			 * Drop the page reference taken by get_user_pages_fast(). In
> -			 * the absence of MF_COUNT_INCREASED the memory_failure()
> -			 * routine is responsible for pinning the page to prevent it
> -			 * from being released back to the page allocator.
> -			 */
> -			put_page(page);
> -			ret = memory_failure(pfn, 0);
> +			ret = memory_failure(pfn, MF_COUNT_INCREASED);
>  		}
>  
>  		if (ret)
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 869ece2a1de2..ba861169c9ae 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1269,6 +1269,12 @@ static int memory_failure_dev_pagemap(unsigned long pfn, int flags,
>  	if (!cookie)
>  		goto out;
>  
> +	if (flags & MF_COUNT_INCREASED)
> +		/*
> +		 * Drop the extra refcount in case we come from madvise().
> +		 */
> +		put_page(page);
> +

Should this if-block come before dax_lock_page() block?
It seems that if dax_lock_page returns NULL, memory_failure_dev_pagemap()
returns without releasing the refcount.
memory_failure() on dev_pagemap doesn't use page refcount (unlike other
type of memory), so we can release it unconditionally.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2020-12-07  2:34           ` HORIGUCHI NAOYA(堀口 直也)
@ 2020-12-07  7:24             ` Oscar Salvador
  0 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2020-12-07  7:24 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Vlastimil Babka, akpm, n-horiguchi, linux-mm, linux-kernel, Dan Williams

On 2020-12-07 03:34, HORIGUCHI NAOYA wrote:
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 869ece2a1de2..ba861169c9ae 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1269,6 +1269,12 @@ static int memory_failure_dev_pagemap(unsigned 
>> long pfn, int flags,
>>  	if (!cookie)
>>  		goto out;
>> 
>> +	if (flags & MF_COUNT_INCREASED)
>> +		/*
>> +		 * Drop the extra refcount in case we come from madvise().
>> +		 */
>> +		put_page(page);
>> +
> 
> Should this if-block come before dax_lock_page() block?

Yeah, it should go first thing since as you noticed we kept the refcount 
if we fail.
Saturday brain... I will fix it.

Thanks Naoya

-- 
Oscar Salvador
SUSE L3

^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2020-12-07  7:25 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2020-11-19 10:57 [PATCH 0/7] HWPoison: Refactor get page interface Oscar Salvador
2020-11-19 10:57 ` [PATCH 1/7] mm,hwpoison: Refactor get_any_page Oscar Salvador
2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
2020-11-25 16:54   ` Vlastimil Babka
2020-11-19 10:57 ` [PATCH 2/7] mm,hwpoison: Drop pfn parameter Oscar Salvador
2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
2020-11-25 16:55   ` Vlastimil Babka
2020-11-19 10:57 ` [PATCH 3/7] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
2020-11-25 18:20   ` Vlastimil Babka
2020-12-01 11:35     ` Oscar Salvador
2020-12-04 17:25       ` Vlastimil Babka
2020-12-05 15:34         ` Oscar Salvador
2020-12-07  2:34           ` HORIGUCHI NAOYA(堀口 直也)
2020-12-07  7:24             ` Oscar Salvador
2020-11-19 10:57 ` [PATCH 4/7] mm,hwpoison: remove MF_COUNT_INCREASED Oscar Salvador
2020-11-19 10:57 ` [PATCH 5/7] mm,hwpoison: remove flag argument from soft offline functions Oscar Salvador
2020-11-19 10:57 ` [PATCH 6/7] mm,hwpoison: Disable pcplists before grabbing a refcount Oscar Salvador
2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
2020-11-26 13:45   ` Vlastimil Babka
2020-11-28  0:51     ` Andrew Morton
2020-11-19 10:57 ` [PATCH 7/7] mm,hwpoison: Remove drain_all_pages from shake_page Oscar Salvador
2020-11-20  1:33   ` HORIGUCHI NAOYA(堀口 直也)
2020-11-26 13:52   ` Vlastimil Babka
2020-11-27  7:20     ` Oscar Salvador
2020-12-02 13:34 ` [PATCH 0/7] HWPoison: Refactor get page interface Qian Cai
2020-12-02 13:41   ` Oscar Salvador

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).