linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [PATCH 00/10] Hwpoison soft-offline rework
@ 2019-09-10 10:30 Oscar Salvador
  2019-09-10 10:30 ` [PATCH 01/10] mm,hwpoison: cleanup unused PageHuge() check Oscar Salvador
                   ` (11 more replies)
  0 siblings, 12 replies; 26+ messages in thread
From: Oscar Salvador @ 2019-09-10 10:30 UTC (permalink / raw)
  To: n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, Oscar Salvador


This patchset was based on Naoya's hwpoison rework [1], so thanks to him
for the initial work.

This patchset aims to fix some issues laying in soft-offline handling,
but it also takes the chance and takes some further steps to perform 
cleanups and some refactoring as well.

 - Motivation:

   A customer and I were facing an issue where poisoned pages we returned
   back to user-space after having offlined them properly.
   This was only seend under some memory stress + soft offlining pages.
   After some anaylsis, it became clear that the problem was that
   when kcompactd kicked in to migrate pages over, compaction_alloc
   callback was handing poisoned pages to the migrate routine.
   Once this page was later on fault in, __do_page_fault returned
   VM_FAULT_HWPOISON making the process being killed.

   All this could happen because isolate_freepages_block and
   fast_isolate_freepages just check for the page to be PageBuddy,
   and since 1) poisoned pages can be part of a higher order page
   and 2) poisoned pages are also Page Buddy, they can sneak in easily.

   I also saw some problem with swap pages, but I suspected to be the
   same sort of problem, so I did not follow that trace.

   The full explanation can be see in [2].

 - Approach:

   The taken approach is to not let poisoned pages hit neither
   pcplists nor buddy freelists.
   This is achieved by:

In-use pages:

   * Normal pages

   1) do not release the last reference count after the
      invalidation/migration of the page.
   2) the page is being handed to page_set_poison, which does:
      2a) sets PageHWPoison flag
      2b) calls put_page (only to be able to call __page_cache_release)
          Since poisoned pages are skipped in free_pages_prepare,
          this put_page is safe.
      2c) Sets the refcount to 1
    
   * Hugetlb pages

   1) Hand the page to page_set_poison after migration
   2) page_set_poison does:
      2a) Calls dissolve_free_huge_page
      2b) If ranged to be dissolved contains poisoned pages,
          we free the rangeas order-0 pages (as we do with gigantic hugetlb page),
          so free_pages_prepare will skip them accordingly.
      2c) Sets the refcount to 1

Free pages:

   * Normal pages:

   1) Take the page off the buddy freelist
   2) Set PageHWPoison flag and set refcount to 1

   * Hugetlb pages

   1) Try to allocate a new hugetlb page to the pool
   2) Take off the pool the poisoned hugetlb


With this patchset, I no longer see the issues I faced before.

Note:
I presented this as RFC to open discussion of the taken aproach.
I think that furthers cleanups and refactors could be made, but I would
like to get some insight of the taken approach before touching more
code.

Thanks

[1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
[2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u

Naoya Horiguchi (5):
  mm,hwpoison: cleanup unused PageHuge() check
  mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  mm,hwpoison-inject: don't pin for hwpoison_filter
  mm,hwpoison: remove MF_COUNT_INCREASED
  mm: remove flag argument from soft offline functions

Oscar Salvador (5):
  mm,hwpoison: Unify THP handling for hard and soft offline
  mm,hwpoison: Rework soft offline for in-use pages
  mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
  mm,hwpoison: Rework soft offline for free pages
  mm,hwpoison: Use hugetlb_replace_page to replace free hugetlb pages

 drivers/base/memory.c      |   2 +-
 include/linux/mm.h         |   9 +-
 include/linux/page-flags.h |   5 -
 mm/hugetlb.c               |  51 +++++++-
 mm/hwpoison-inject.c       |  18 +--
 mm/madvise.c               |  25 ++--
 mm/memory-failure.c        | 319 +++++++++++++++++++++------------------------
 mm/migrate.c               |  11 +-
 mm/page_alloc.c            |  62 +++++++--
 9 files changed, 267 insertions(+), 235 deletions(-)

-- 
2.12.3



^ permalink raw reply	[flat|nested] 26+ messages in thread

* [PATCH 01/10] mm,hwpoison: cleanup unused PageHuge() check
  2019-09-10 10:30 [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
@ 2019-09-10 10:30 ` Oscar Salvador
  2019-09-11 10:17   ` David Hildenbrand
  2019-09-10 10:30 ` [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
                   ` (10 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2019-09-10 10:30 UTC (permalink / raw)
  To: n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, Oscar Salvador

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

memory_failure() forks to memory_failure_hugetlb() for hugetlb pages,
so a PageHuge() check after the fork should not be necessary.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
---
 mm/memory-failure.c | 5 +----
 1 file changed, 1 insertion(+), 4 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 7ef849da8278..e43b61462fd5 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1353,10 +1353,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * page_remove_rmap() in try_to_unmap_one(). So to determine page status
 	 * correctly, we save a copy of the page flags at this time.
 	 */
-	if (PageHuge(p))
-		page_flags = hpage->flags;
-	else
-		page_flags = p->flags;
+	page_flags = p->flags;
 
 	/*
 	 * unpoison always clear PG_hwpoison inside page lock
-- 
2.12.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2019-09-10 10:30 [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
  2019-09-10 10:30 ` [PATCH 01/10] mm,hwpoison: cleanup unused PageHuge() check Oscar Salvador
@ 2019-09-10 10:30 ` Oscar Salvador
  2019-09-11 10:23   ` David Hildenbrand
  2019-09-11 10:27   ` David Hildenbrand
  2019-09-10 10:30 ` [PATCH 03/10] mm,hwpoison-inject: don't pin for hwpoison_filter Oscar Salvador
                   ` (9 subsequent siblings)
  11 siblings, 2 replies; 26+ messages in thread
From: Oscar Salvador @ 2019-09-10 10:30 UTC (permalink / raw)
  To: n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, Oscar Salvador

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Currently madvise_inject_error() pins the target via get_user_pages_fast.
The call to get_user_pages_fast is only to get the respective page
of a given address, but it is the job of the memory-poisoning handler
to deal with races, so drop the refcount grabbed by get_user_pages_fast.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/madvise.c | 25 +++++++++++--------------
 1 file changed, 11 insertions(+), 14 deletions(-)

diff --git a/mm/madvise.c b/mm/madvise.c
index 6e023414f5c1..fbe6d402232c 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -883,6 +883,16 @@ static int madvise_inject_error(int behavior,
 		ret = get_user_pages_fast(start, 1, 0, &page);
 		if (ret != 1)
 			return ret;
+		/*
+		 * The get_user_pages_fast() is just to get the pfn of the
+		 * given address, and the refcount has nothing to do with
+		 * what we try to test, so it should be released immediately.
+		 * This is racy but it's intended because the real hardware
+		 * errors could happen at any moment and memory error handlers
+		 * must properly handle the race.
+		 */
+		put_page(page);
+
 		pfn = page_to_pfn(page);
 
 		/*
@@ -892,16 +902,11 @@ static int madvise_inject_error(int behavior,
 		 */
 		order = compound_order(compound_head(page));
 
-		if (PageHWPoison(page)) {
-			put_page(page);
-			continue;
-		}
-
 		if (behavior == MADV_SOFT_OFFLINE) {
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
 					pfn, start);
 
-			ret = soft_offline_page(page, MF_COUNT_INCREASED);
+			ret = soft_offline_page(page, 0);
 			if (ret)
 				return ret;
 			continue;
@@ -909,14 +914,6 @@ static int madvise_inject_error(int behavior,
 
 		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
 				pfn, start);
-
-		/*
-		 * Drop the page reference taken by get_user_pages_fast(). In
-		 * the absence of MF_COUNT_INCREASED the memory_failure()
-		 * routine is responsible for pinning the page to prevent it
-		 * from being released back to the page allocator.
-		 */
-		put_page(page);
 		ret = memory_failure(pfn, 0);
 		if (ret)
 			return ret;
-- 
2.12.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 03/10] mm,hwpoison-inject: don't pin for hwpoison_filter
  2019-09-10 10:30 [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
  2019-09-10 10:30 ` [PATCH 01/10] mm,hwpoison: cleanup unused PageHuge() check Oscar Salvador
  2019-09-10 10:30 ` [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
@ 2019-09-10 10:30 ` Oscar Salvador
  2019-09-16  7:41   ` David Hildenbrand
  2019-09-10 10:30 ` [PATCH 04/10] mm,hwpoison: remove MF_COUNT_INCREASED Oscar Salvador
                   ` (8 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2019-09-10 10:30 UTC (permalink / raw)
  To: n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, Oscar Salvador

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Another memory error injection interface debugfs:hwpoison/corrupt-pfn
also takes bogus refcount for hwpoison_filter(). It's justified
because this does a coarse filter, expecting that memory_failure()
redoes the check for sure.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/hwpoison-inject.c | 18 +++++-------------
 1 file changed, 5 insertions(+), 13 deletions(-)

diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
index 5b7430bd83a6..0c8cdb80fd7d 100644
--- a/mm/hwpoison-inject.c
+++ b/mm/hwpoison-inject.c
@@ -26,11 +26,6 @@ static int hwpoison_inject(void *data, u64 val)
 
 	p = pfn_to_page(pfn);
 	hpage = compound_head(p);
-	/*
-	 * This implies unable to support free buddy pages.
-	 */
-	if (!get_hwpoison_page(p))
-		return 0;
 
 	if (!hwpoison_filter_enable)
 		goto inject;
@@ -40,23 +35,20 @@ static int hwpoison_inject(void *data, u64 val)
 	 * This implies unable to support non-LRU pages.
 	 */
 	if (!PageLRU(hpage) && !PageHuge(p))
-		goto put_out;
+		return 0;
 
 	/*
-	 * do a racy check with elevated page count, to make sure PG_hwpoison
-	 * will only be set for the targeted owner (or on a free page).
+	 * do a racy check to make sure PG_hwpoison will only be set for
+	 * the targeted owner (or on a free page).
 	 * memory_failure() will redo the check reliably inside page lock.
 	 */
 	err = hwpoison_filter(hpage);
 	if (err)
-		goto put_out;
+		return 0;
 
 inject:
 	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
-	return memory_failure(pfn, MF_COUNT_INCREASED);
-put_out:
-	put_hwpoison_page(p);
-	return 0;
+	return memory_failure(pfn, 0);
 }
 
 static int hwpoison_unpoison(void *data, u64 val)
-- 
2.12.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 04/10] mm,hwpoison: remove MF_COUNT_INCREASED
  2019-09-10 10:30 [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
                   ` (2 preceding siblings ...)
  2019-09-10 10:30 ` [PATCH 03/10] mm,hwpoison-inject: don't pin for hwpoison_filter Oscar Salvador
@ 2019-09-10 10:30 ` Oscar Salvador
  2019-09-11 10:24   ` David Hildenbrand
  2019-09-10 10:30 ` [PATCH 05/10] mm: remove flag argument from soft offline functions Oscar Salvador
                   ` (7 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2019-09-10 10:30 UTC (permalink / raw)
  To: n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, Oscar Salvador

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

Now there's no user of MF_COUNT_INCREASED, so we can safely remove
it from all calling points.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/mm.h  |  7 +++----
 mm/memory-failure.c | 16 +++-------------
 2 files changed, 6 insertions(+), 17 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index ad6766a08f9b..fb36a4165a4e 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2814,10 +2814,9 @@ void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
 				  unsigned long nr_pages);
 
 enum mf_flags {
-	MF_COUNT_INCREASED = 1 << 0,
-	MF_ACTION_REQUIRED = 1 << 1,
-	MF_MUST_KILL = 1 << 2,
-	MF_SOFT_OFFLINE = 1 << 3,
+	MF_ACTION_REQUIRED = 1 << 0,
+	MF_MUST_KILL = 1 << 1,
+	MF_SOFT_OFFLINE = 1 << 2,
 };
 extern int memory_failure(unsigned long pfn, int flags);
 extern void memory_failure_queue(unsigned long pfn, int flags);
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index e43b61462fd5..1be785b25324 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1092,7 +1092,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 
 	num_poisoned_pages_inc();
 
-	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
+	if (!get_hwpoison_page(p)) {
 		/*
 		 * Check "filter hit" and "race with other subpage."
 		 */
@@ -1286,7 +1286,7 @@ int memory_failure(unsigned long pfn, int flags)
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
 	 */
-	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
+	if (!get_hwpoison_page(p)) {
 		if (is_free_buddy_page(p)) {
 			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
 			return 0;
@@ -1327,10 +1327,7 @@ int memory_failure(unsigned long pfn, int flags)
 	shake_page(p, 0);
 	/* shake_page could have turned it free. */
 	if (!PageLRU(p) && is_free_buddy_page(p)) {
-		if (flags & MF_COUNT_INCREASED)
-			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
-		else
-			action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
+		action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
 		return 0;
 	}
 
@@ -1618,9 +1615,6 @@ static int __get_any_page(struct page *p, unsigned long pfn, int flags)
 {
 	int ret;
 
-	if (flags & MF_COUNT_INCREASED)
-		return 1;
-
 	/*
 	 * When the target page is a free hugepage, just remove it
 	 * from free hugepage list.
@@ -1890,15 +1884,11 @@ int soft_offline_page(struct page *page, int flags)
 	if (is_zone_device_page(page)) {
 		pr_debug_ratelimited("soft_offline: %#lx page is device page\n",
 				pfn);
-		if (flags & MF_COUNT_INCREASED)
-			put_page(page);
 		return -EIO;
 	}
 
 	if (PageHWPoison(page)) {
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
-		if (flags & MF_COUNT_INCREASED)
-			put_hwpoison_page(page);
 		return -EBUSY;
 	}
 
-- 
2.12.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 05/10] mm: remove flag argument from soft offline functions
  2019-09-10 10:30 [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
                   ` (3 preceding siblings ...)
  2019-09-10 10:30 ` [PATCH 04/10] mm,hwpoison: remove MF_COUNT_INCREASED Oscar Salvador
@ 2019-09-10 10:30 ` Oscar Salvador
  2019-09-11 10:24   ` David Hildenbrand
  2019-09-10 10:30 ` [PATCH 06/10] mm,hwpoison: Unify THP handling for hard and soft offline Oscar Salvador
                   ` (6 subsequent siblings)
  11 siblings, 1 reply; 26+ messages in thread
From: Oscar Salvador @ 2019-09-10 10:30 UTC (permalink / raw)
  To: n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, Oscar Salvador

From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>

The argument @flag no longer affects the behavior of soft_offline_page()
and its variants, so let's remove them.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 drivers/base/memory.c |  2 +-
 include/linux/mm.h    |  2 +-
 mm/madvise.c          |  2 +-
 mm/memory-failure.c   | 27 +++++++++++++--------------
 4 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/base/memory.c b/drivers/base/memory.c
index 6bea4f3f8040..e5485c22ef77 100644
--- a/drivers/base/memory.c
+++ b/drivers/base/memory.c
@@ -540,7 +540,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
 	pfn >>= PAGE_SHIFT;
 	if (!pfn_valid(pfn))
 		return -ENXIO;
-	ret = soft_offline_page(pfn_to_page(pfn), 0);
+	ret = soft_offline_page(pfn_to_page(pfn));
 	return ret == 0 ? count : ret;
 }
 
diff --git a/include/linux/mm.h b/include/linux/mm.h
index fb36a4165a4e..3cc800d9f57a 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2827,7 +2827,7 @@ extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
 extern atomic_long_t num_poisoned_pages __read_mostly;
-extern int soft_offline_page(struct page *page, int flags);
+extern int soft_offline_page(struct page *page);
 
 
 /*
diff --git a/mm/madvise.c b/mm/madvise.c
index fbe6d402232c..ece128211400 100644
--- a/mm/madvise.c
+++ b/mm/madvise.c
@@ -906,7 +906,7 @@ static int madvise_inject_error(int behavior,
 			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
 					pfn, start);
 
-			ret = soft_offline_page(page, 0);
+			ret = soft_offline_page(page);
 			if (ret)
 				return ret;
 			continue;
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 1be785b25324..5071d39bdfef 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1478,7 +1478,7 @@ static void memory_failure_work_func(struct work_struct *work)
 		if (!gotten)
 			break;
 		if (entry.flags & MF_SOFT_OFFLINE)
-			soft_offline_page(pfn_to_page(entry.pfn), entry.flags);
+			soft_offline_page(pfn_to_page(entry.pfn));
 		else
 			memory_failure(entry.pfn, entry.flags);
 	}
@@ -1611,7 +1611,7 @@ static struct page *new_page(struct page *p, unsigned long private)
  * that is not free, and 1 for any other page type.
  * For 1 the page is returned with increased page count, otherwise not.
  */
-static int __get_any_page(struct page *p, unsigned long pfn, int flags)
+static int __get_any_page(struct page *p, unsigned long pfn)
 {
 	int ret;
 
@@ -1638,9 +1638,9 @@ static int __get_any_page(struct page *p, unsigned long pfn, int flags)
 	return ret;
 }
 
-static int get_any_page(struct page *page, unsigned long pfn, int flags)
+static int get_any_page(struct page *page, unsigned long pfn)
 {
-	int ret = __get_any_page(page, pfn, flags);
+	int ret = __get_any_page(page, pfn);
 
 	if (ret == 1 && !PageHuge(page) &&
 	    !PageLRU(page) && !__PageMovable(page)) {
@@ -1653,7 +1653,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
 		/*
 		 * Did it turn free?
 		 */
-		ret = __get_any_page(page, pfn, 0);
+		ret = __get_any_page(page, pfn);
 		if (ret == 1 && !PageLRU(page)) {
 			/* Drop page reference which is from __get_any_page() */
 			put_hwpoison_page(page);
@@ -1665,7 +1665,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
 	return ret;
 }
 
-static int soft_offline_huge_page(struct page *page, int flags)
+static int soft_offline_huge_page(struct page *page)
 {
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
@@ -1724,7 +1724,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
 	return ret;
 }
 
-static int __soft_offline_page(struct page *page, int flags)
+static int __soft_offline_page(struct page *page)
 {
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
@@ -1804,7 +1804,7 @@ static int __soft_offline_page(struct page *page, int flags)
 	return ret;
 }
 
-static int soft_offline_in_use_page(struct page *page, int flags)
+static int soft_offline_in_use_page(struct page *page)
 {
 	int ret;
 	int mt;
@@ -1834,9 +1834,9 @@ static int soft_offline_in_use_page(struct page *page, int flags)
 	mt = get_pageblock_migratetype(page);
 	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 	if (PageHuge(page))
-		ret = soft_offline_huge_page(page, flags);
+		ret = soft_offline_huge_page(page);
 	else
-		ret = __soft_offline_page(page, flags);
+		ret = __soft_offline_page(page);
 	set_pageblock_migratetype(page, mt);
 	return ret;
 }
@@ -1857,7 +1857,6 @@ static int soft_offline_free_page(struct page *page)
 /**
  * soft_offline_page - Soft offline a page.
  * @page: page to offline
- * @flags: flags. Same as memory_failure().
  *
  * Returns 0 on success, otherwise negated errno.
  *
@@ -1876,7 +1875,7 @@ static int soft_offline_free_page(struct page *page)
  * This is not a 100% solution for all memory, but tries to be
  * ``good enough'' for the majority of memory.
  */
-int soft_offline_page(struct page *page, int flags)
+int soft_offline_page(struct page *page)
 {
 	int ret;
 	unsigned long pfn = page_to_pfn(page);
@@ -1893,11 +1892,11 @@ int soft_offline_page(struct page *page, int flags)
 	}
 
 	get_online_mems();
-	ret = get_any_page(page, pfn, flags);
+	ret = get_any_page(page, pfn);
 	put_online_mems();
 
 	if (ret > 0)
-		ret = soft_offline_in_use_page(page, flags);
+		ret = soft_offline_in_use_page(page);
 	else if (ret == 0)
 		ret = soft_offline_free_page(page);
 
-- 
2.12.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 06/10] mm,hwpoison: Unify THP handling for hard and soft offline
  2019-09-10 10:30 [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
                   ` (4 preceding siblings ...)
  2019-09-10 10:30 ` [PATCH 05/10] mm: remove flag argument from soft offline functions Oscar Salvador
@ 2019-09-10 10:30 ` Oscar Salvador
  2019-09-10 10:30 ` [PATCH 07/10] mm,hwpoison: Rework soft offline for in-use pages Oscar Salvador
                   ` (5 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2019-09-10 10:30 UTC (permalink / raw)
  To: n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, Oscar Salvador

Place the THP's page handling in a helper and use it
from both hard and soft-offline machinery, so we get rid
of some duplicated code.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 48 ++++++++++++++++++++++--------------------------
 1 file changed, 22 insertions(+), 26 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5071d39bdfef..820742035402 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1077,6 +1077,25 @@ static int identify_page_state(unsigned long pfn, struct page *p,
 	return page_action(ps, p, pfn);
 }
 
+static int try_to_split_thp_page(struct page *page, const char *msg)
+{
+	lock_page(page);
+	if (!PageAnon(page) || unlikely(split_huge_page(page))) {
+		unsigned long pfn = page_to_pfn(page);
+
+		unlock_page(page);
+		if (!PageAnon(page))
+			pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
+		else
+			pr_info("%s: %#lx: thp split failed\n", msg, pfn);
+		put_hwpoison_page(page);
+		return -EBUSY;
+	}
+	unlock_page(page);
+
+	return 0;
+}
+
 static int memory_failure_hugetlb(unsigned long pfn, int flags)
 {
 	struct page *p = pfn_to_page(pfn);
@@ -1297,21 +1316,8 @@ int memory_failure(unsigned long pfn, int flags)
 	}
 
 	if (PageTransHuge(hpage)) {
-		lock_page(p);
-		if (!PageAnon(p) || unlikely(split_huge_page(p))) {
-			unlock_page(p);
-			if (!PageAnon(p))
-				pr_err("Memory failure: %#lx: non anonymous thp\n",
-					pfn);
-			else
-				pr_err("Memory failure: %#lx: thp split failed\n",
-					pfn);
-			if (TestClearPageHWPoison(p))
-				num_poisoned_pages_dec();
-			put_hwpoison_page(p);
+		if (try_to_split_thp_page(p, "Memory Failure") < 0)
 			return -EBUSY;
-		}
-		unlock_page(p);
 		VM_BUG_ON_PAGE(!page_count(p), p);
 		hpage = compound_head(p);
 	}
@@ -1810,19 +1816,9 @@ static int soft_offline_in_use_page(struct page *page)
 	int mt;
 	struct page *hpage = compound_head(page);
 
-	if (!PageHuge(page) && PageTransHuge(hpage)) {
-		lock_page(page);
-		if (!PageAnon(page) || unlikely(split_huge_page(page))) {
-			unlock_page(page);
-			if (!PageAnon(page))
-				pr_info("soft offline: %#lx: non anonymous thp\n", page_to_pfn(page));
-			else
-				pr_info("soft offline: %#lx: thp split failed\n", page_to_pfn(page));
-			put_hwpoison_page(page);
+	if (!PageHuge(page) && PageTransHuge(hpage))
+		if (try_to_split_thp_page(page, "soft offline") < 0)
 			return -EBUSY;
-		}
-		unlock_page(page);
-	}
 
 	/*
 	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
-- 
2.12.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 07/10] mm,hwpoison: Rework soft offline for in-use pages
  2019-09-10 10:30 [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
                   ` (5 preceding siblings ...)
  2019-09-10 10:30 ` [PATCH 06/10] mm,hwpoison: Unify THP handling for hard and soft offline Oscar Salvador
@ 2019-09-10 10:30 ` Oscar Salvador
  2019-09-10 10:30 ` [PATCH 08/10] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page Oscar Salvador
                   ` (4 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2019-09-10 10:30 UTC (permalink / raw)
  To: n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, Oscar Salvador

This patch changes the way we set and handle in-use poisoned pages.
Until now, poisoned pages were released to the buddy allocator, trusting
that the checks that take place prior to hand the page to userspace would
act as a safe net and would skip that page.

This has proved to be wrong, as we got some pfn walkers out there, like
compaction, that all they care is the page to be PageBuddy and be in a
freelist.
Although this might not be the only user, having poisoned pages
in the buddy allocator seems a bad idea as we should only have
free pages that are ready and meant to be used as such.

Before explainaing the taken approach, let us break down the kind
of pages we can soft offline.

- Anonymous THP (after the split, they end up being 4K pages)
- Hugetlb
- Order-0 pages (that can be either migrated or invalited)

The following will only refer to in-use pages, free pages will
be explained in patch#9.

* Normal pages (order-0 and anon-THP)

  - If they are clean and unmapped page cache pages, we detach
    the page from its mapping.
  - If they are mapped/dirty, we do the isolate-and-migrate dance.

  Either way, do not call put_page directly from those paths.
  Instead, we keep the page and send it to page_set_poison.

  page_set_poison sets the HWPoison flag and calls put_page.
  This call to put_page is mainly to be able to call __page_cache_release,
  since this function is not exported.

  Down the chain, we placed a check for HWPoison page in free_pages_prepare,
  that just skips any poisoned page, so those pages do not end up in any
  pcplist/freelist.

  [[Now, I think that we would be better off if we duplicated/exported
  __page_cache_release in/to the hwpoison code, so this last put_page
  could go]]

  After that, we set the refcount on the page to 1 and we increment
  the poisoned pages counter.

* Hugetlb pages

  - we isolate-and-migrate them

  After the migration has been succesful, we call page_set_poison
  that sets the HWPoison flag and actually calls
  dissolve_free_huge_page for hugetlb pages.

  When dissolving a non-gigantib hugetlb page and we know that
  the memory range contains poisoned pages, we free the pages
  as order-0 pages, so free_pages_prepare will skip them accordingly.
  poisoned page.
  Since the infrastructure is already there because that is the way
  we free gigantic hugetlb pages, it does not take any effort to adapt
  it for non-gigantic hugetlb pages.

Because of the way we handle now in-use pages, we can safely drop
the put-as-isolation-migratetype dance, that was guarding
for the poisoned pages to end up in pcplists.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/hugetlb.c        | 35 +++++++++++++++++++++++++------
 mm/memory-failure.c | 60 ++++++++++++++++++++++++++---------------------------
 mm/migrate.c        | 11 +++-------
 mm/page_alloc.c     |  3 +++
 4 files changed, 64 insertions(+), 45 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index ef37c85423a5..139e1c05c9a1 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1045,16 +1045,17 @@ static int hstate_next_node_to_free(struct hstate *h, nodemask_t *nodes_allowed)
 		((node = hstate_next_node_to_free(hs, mask)) || 1);	\
 		nr_nodes--)
 
-#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
-static void destroy_compound_gigantic_page(struct page *page,
-					unsigned int order)
+static void destroy_compound_page(struct page *page, unsigned int order)
 {
 	int i;
 	int nr_pages = 1 << order;
 	struct page *p = page + 1;
+	bool gigantic = order > MAX_ORDER - 1;
 
 	atomic_set(compound_mapcount_ptr(page), 0);
 	for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) {
+		if (!gigantic)
+			p->mapping = NULL;
 		clear_compound_head(p);
 		set_page_refcounted(p);
 	}
@@ -1063,6 +1064,13 @@ static void destroy_compound_gigantic_page(struct page *page,
 	__ClearPageHead(page);
 }
 
+#ifdef CONFIG_ARCH_HAS_GIGANTIC_PAGE
+static void destroy_compound_gigantic_page(struct page *page,
+					unsigned int order)
+{
+	destroy_compound_page(page, order);
+}
+
 static void free_gigantic_page(struct page *page, unsigned int order)
 {
 	free_contig_range(page_to_pfn(page), 1 << order);
@@ -1175,6 +1183,8 @@ static inline void destroy_compound_gigantic_page(struct page *page,
 static void update_and_free_page(struct hstate *h, struct page *page)
 {
 	int i;
+	bool poisoned = false;
+	unsigned int order = huge_page_order(h);
 
 	if (hstate_is_gigantic(h) && !gigantic_page_runtime_supported())
 		return;
@@ -1182,6 +1192,8 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	h->nr_huge_pages--;
 	h->nr_huge_pages_node[page_to_nid(page)]--;
 	for (i = 0; i < pages_per_huge_page(h); i++) {
+		if (unlikely(PageHWPoison(page)))
+			poisoned = true;
 		page[i].flags &= ~(1 << PG_locked | 1 << PG_error |
 				1 << PG_referenced | 1 << PG_dirty |
 				1 << PG_active | 1 << PG_private |
@@ -1191,10 +1203,21 @@ static void update_and_free_page(struct hstate *h, struct page *page)
 	set_compound_page_dtor(page, NULL_COMPOUND_DTOR);
 	set_page_refcounted(page);
 	if (hstate_is_gigantic(h)) {
-		destroy_compound_gigantic_page(page, huge_page_order(h));
-		free_gigantic_page(page, huge_page_order(h));
+		destroy_compound_gigantic_page(page, order);
+		free_gigantic_page(page, order);
 	} else {
-		__free_pages(page, huge_page_order(h));
+		if (poisoned) {
+			unsigned long pfn = page_to_pfn(page);
+			/*
+			 * If we have poisoned pages in the range,
+			 * we free them up as order-0 pages, so
+			 * free_pages_prepare will skip them accordingly.
+			 */
+			destroy_compound_page(page, order);
+			free_contig_range(pfn, 1 << order);
+		} else {
+			__free_pages(page, order);
+		}
 	}
 }
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 820742035402..d44dacb8e2ea 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -78,6 +78,24 @@ EXPORT_SYMBOL_GPL(hwpoison_filter_dev_minor);
 EXPORT_SYMBOL_GPL(hwpoison_filter_flags_mask);
 EXPORT_SYMBOL_GPL(hwpoison_filter_flags_value);
 
+static bool page_set_poison(struct page *page)
+{
+	SetPageHWPoison(page);
+
+	if (PageHuge(page) && dissolve_free_huge_page(page))
+		goto error;
+	else if (!PageHuge(page) && page_count(page))
+		put_page(page);
+
+	set_page_refcounted(page);
+	num_poisoned_pages_inc();
+
+	return true;
+error:
+	ClearPageHWPoison(page);
+	return false;
+}
+
 static int hwpoison_filter_dev(struct page *p)
 {
 	struct address_space *mapping;
@@ -1704,28 +1722,16 @@ static int soft_offline_huge_page(struct page *page)
 
 	ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
 				MIGRATE_SYNC, MR_MEMORY_FAILURE);
-	if (ret) {
+	if (!ret) {
+		if (!page_set_poison(page))
+			ret = -EBUSY;
+	} else {
 		pr_info("soft offline: %#lx: hugepage migration failed %d, type %lx (%pGp)\n",
 			pfn, ret, page->flags, &page->flags);
 		if (!list_empty(&pagelist))
 			putback_movable_pages(&pagelist);
 		if (ret > 0)
 			ret = -EIO;
-	} else {
-		/*
-		 * We set PG_hwpoison only when the migration source hugepage
-		 * was successfully dissolved, because otherwise hwpoisoned
-		 * hugepage remains on free hugepage list, then userspace will
-		 * find it as SIGBUS by allocation failure. That's not expected
-		 * in soft-offlining.
-		 */
-		ret = dissolve_free_huge_page(page);
-		if (!ret) {
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-			else
-				ret = -EBUSY;
-		}
 	}
 	return ret;
 }
@@ -1760,10 +1766,8 @@ static int __soft_offline_page(struct page *page)
 	 * would need to fix isolation locking first.
 	 */
 	if (ret == 1) {
-		put_hwpoison_page(page);
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
-		SetPageHWPoison(page);
-		num_poisoned_pages_inc();
+		page_set_poison(page);
 		return 0;
 	}
 
@@ -1794,7 +1798,12 @@ static int __soft_offline_page(struct page *page)
 		list_add(&page->lru, &pagelist);
 		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
 					MIGRATE_SYNC, MR_MEMORY_FAILURE);
-		if (ret) {
+		if (!ret) {
+			/*
+			 * Page was succesfully migrated.
+			 */
+			page_set_poison(page);
+		} else {
 			if (!list_empty(&pagelist))
 				putback_movable_pages(&pagelist);
 
@@ -1813,27 +1822,16 @@ static int __soft_offline_page(struct page *page)
 static int soft_offline_in_use_page(struct page *page)
 {
 	int ret;
-	int mt;
 	struct page *hpage = compound_head(page);
 
 	if (!PageHuge(page) && PageTransHuge(hpage))
 		if (try_to_split_thp_page(page, "soft offline") < 0)
 			return -EBUSY;
 
-	/*
-	 * Setting MIGRATE_ISOLATE here ensures that the page will be linked
-	 * to free list immediately (not via pcplist) when released after
-	 * successful page migration. Otherwise we can't guarantee that the
-	 * page is really free after put_page() returns, so
-	 * set_hwpoison_free_buddy_page() highly likely fails.
-	 */
-	mt = get_pageblock_migratetype(page);
-	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
 	if (PageHuge(page))
 		ret = soft_offline_huge_page(page);
 	else
 		ret = __soft_offline_page(page);
-	set_pageblock_migratetype(page, mt);
 	return ret;
 }
 
diff --git a/mm/migrate.c b/mm/migrate.c
index bdd1e95a459e..c396a019b2a4 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1223,16 +1223,11 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 	 * we want to retry.
 	 */
 	if (rc == MIGRATEPAGE_SUCCESS) {
-		put_page(page);
-		if (reason == MR_MEMORY_FAILURE) {
+		if (reason != MR_MEMORY_FAILURE)
 			/*
-			 * Set PG_HWPoison on just freed page
-			 * intentionally. Although it's rather weird,
-			 * it's how HWPoison flag works at the moment.
+			 * We handle poisoned pages in hwpoison code
 			 */
-			if (set_hwpoison_free_buddy_page(page))
-				num_poisoned_pages_inc();
-		}
+			put_page(page);
 	} else {
 		if (rc != -EAGAIN) {
 			if (likely(!__PageMovable(page))) {
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index c5d62f1c2851..fe38229d0a77 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1134,6 +1134,9 @@ static __always_inline bool free_pages_prepare(struct page *page,
 
 	trace_mm_page_free(page, order);
 
+	if (unlikely(PageHWPoison(page)))
+		return false;
+
 	/*
 	 * Check tail pages before head page information is cleared to
 	 * avoid checking PageCompound for order-0 pages.
-- 
2.12.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 08/10] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
  2019-09-10 10:30 [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
                   ` (6 preceding siblings ...)
  2019-09-10 10:30 ` [PATCH 07/10] mm,hwpoison: Rework soft offline for in-use pages Oscar Salvador
@ 2019-09-10 10:30 ` Oscar Salvador
  2019-09-10 10:30 ` [PATCH 09/10] mm,hwpoison: Rework soft offline for free pages Oscar Salvador
                   ` (3 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2019-09-10 10:30 UTC (permalink / raw)
  To: n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, Oscar Salvador

soft_offline_huge_page and __soft_offline_page share quite some code,
and it can be re-joined into one single function without losing neither
functionatilty nor readibility.

This allows us of getting rid of quite some duplicated code.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/memory-failure.c | 145 ++++++++++++++++++++--------------------------------
 1 file changed, 56 insertions(+), 89 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index d44dacb8e2ea..ce017a0d79a6 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1689,57 +1689,48 @@ static int get_any_page(struct page *page, unsigned long pfn)
 	return ret;
 }
 
-static int soft_offline_huge_page(struct page *page)
+static bool isolate_page(struct page *page, struct list_head *pagelist)
 {
-	int ret;
-	unsigned long pfn = page_to_pfn(page);
-	struct page *hpage = compound_head(page);
-	LIST_HEAD(pagelist);
+	bool isolated = false;
+	bool lru = PageLRU(page);
 
-	/*
-	 * This double-check of PageHWPoison is to avoid the race with
-	 * memory_failure(). See also comment in __soft_offline_page().
-	 */
-	lock_page(hpage);
-	if (PageHWPoison(hpage)) {
-		unlock_page(hpage);
-		put_hwpoison_page(hpage);
-		pr_info("soft offline: %#lx hugepage already poisoned\n", pfn);
-		return -EBUSY;
+	if (PageHuge(page)) {
+		isolated = isolate_huge_page(page, pagelist);
+	} else {
+		if (PageLRU(page))
+			isolated = !isolate_lru_page(page);
+		else
+			isolated = !isolate_movable_page(page, ISOLATE_UNEVICTABLE);
+
+		if (isolated) {
+			if (lru)
+				inc_node_page_state(page, NR_ISOLATED_ANON +
+						    page_is_file_cache(page));
+			list_add(&page->lru, pagelist);
+		}
 	}
-	unlock_page(hpage);
 
-	ret = isolate_huge_page(hpage, &pagelist);
 	/*
-	 * get_any_page() and isolate_huge_page() takes a refcount each,
-	 * so need to drop one here.
+	 * If we succeed to isolate the page, we grabbed another refcount on
+	 * the page, so we can safely drop the one we got from get_any_pages().
+	 * If we failed to isolate the page, it means that we cannot go further
+	 * any further and we will return an error, so drop the reference we got
+	 * from get_any_pages() as well.
 	 */
-	put_hwpoison_page(hpage);
-	if (!ret) {
-		pr_info("soft offline: %#lx hugepage failed to isolate\n", pfn);
-		return -EBUSY;
-	}
-
-	ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
-				MIGRATE_SYNC, MR_MEMORY_FAILURE);
-	if (!ret) {
-		if (!page_set_poison(page))
-			ret = -EBUSY;
-	} else {
-		pr_info("soft offline: %#lx: hugepage migration failed %d, type %lx (%pGp)\n",
-			pfn, ret, page->flags, &page->flags);
-		if (!list_empty(&pagelist))
-			putback_movable_pages(&pagelist);
-		if (ret > 0)
-			ret = -EIO;
-	}
-	return ret;
+	put_hwpoison_page(page);
+	return isolated;
 }
-
+/*
+ * This routine handles both hugetlb and normal pages
+ */
 static int __soft_offline_page(struct page *page)
 {
-	int ret;
+	int ret = 0;
 	unsigned long pfn = page_to_pfn(page);
+	struct page *hpage = compound_head(page);
+	const char *msg_page[] = { "page", "hugepage"};
+	bool huge = PageHuge(page);
+	LIST_HEAD(pagelist);
 
 	/*
 	 * Check PageHWPoison again inside page lock because PageHWPoison
@@ -1747,92 +1738,68 @@ static int __soft_offline_page(struct page *page)
 	 * memory_failure() also double-checks PageHWPoison inside page lock,
 	 * so there's no race between soft_offline_page() and memory_failure().
 	 */
-	lock_page(page);
-	wait_on_page_writeback(page);
-	if (PageHWPoison(page)) {
-		unlock_page(page);
-		put_hwpoison_page(page);
+	lock_page(hpage);
+	if (!PageHuge(page))
+		wait_on_page_writeback(page);
+	if (PageHWPoison(hpage)) {
+		unlock_page(hpage);
+		put_hwpoison_page(hpage);
 		pr_info("soft offline: %#lx page already poisoned\n", pfn);
 		return -EBUSY;
 	}
-	/*
-	 * Try to invalidate first. This should work for
-	 * non dirty unmapped page cache pages.
-	 */
-	ret = invalidate_inode_page(page);
+
+	if (!PageHuge(page))
+		/*
+		 * Try to invalidate first. This should work for
+		 * non dirty unmapped page cache pages.
+		 */
+		ret = invalidate_inode_page(page);
 	unlock_page(page);
 	/*
 	 * RED-PEN would be better to keep it isolated here, but we
 	 * would need to fix isolation locking first.
 	 */
-	if (ret == 1) {
+	if (ret) {
 		pr_info("soft_offline: %#lx: invalidated\n", pfn);
 		page_set_poison(page);
 		return 0;
 	}
 
-	/*
-	 * Simple invalidation didn't work.
-	 * Try to migrate to a new page instead. migrate.c
-	 * handles a large number of cases for us.
-	 */
-	if (PageLRU(page))
-		ret = isolate_lru_page(page);
-	else
-		ret = isolate_movable_page(page, ISOLATE_UNEVICTABLE);
-	/*
-	 * Drop page reference which is came from get_any_page()
-	 * successful isolate_lru_page() already took another one.
-	 */
-	put_hwpoison_page(page);
-	if (!ret) {
-		LIST_HEAD(pagelist);
-		/*
-		 * After isolated lru page, the PageLRU will be cleared,
-		 * so use !__PageMovable instead for LRU page's mapping
-		 * cannot have PAGE_MAPPING_MOVABLE.
-		 */
-		if (!__PageMovable(page))
-			inc_node_page_state(page, NR_ISOLATED_ANON +
-						page_is_file_cache(page));
-		list_add(&page->lru, &pagelist);
+	if (isolate_page(hpage, &pagelist)) {
 		ret = migrate_pages(&pagelist, new_page, NULL, MPOL_MF_MOVE_ALL,
-					MIGRATE_SYNC, MR_MEMORY_FAILURE);
+				    MIGRATE_SYNC, MR_MEMORY_FAILURE);
 		if (!ret) {
 			/*
 			 * Page was succesfully migrated.
 			 */
-			page_set_poison(page);
+			if (!page_set_poison(page))
+				ret = -EBUSY;
 		} else {
 			if (!list_empty(&pagelist))
 				putback_movable_pages(&pagelist);
 
-			pr_info("soft offline: %#lx: migration failed %d, type %lx (%pGp)\n",
-				pfn, ret, page->flags, &page->flags);
+			pr_info("soft offline: %#lx: %s migration failed %d, type %lx (%pGp)\n",
+				 pfn, msg_page[huge], ret, page->flags, &page->flags);
 			if (ret > 0)
 				ret = -EIO;
 		}
 	} else {
-		pr_info("soft offline: %#lx: isolation failed: %d, page count %d, type %lx (%pGp)\n",
-			pfn, ret, page_count(page), page->flags, &page->flags);
+		pr_info("soft offline: %#lx %s isolation failed: %d, page count %d, type %lx (%pGp)\n",
+			 pfn, msg_page[huge], ret, page_count(page), page->flags, &page->flags);
 	}
+
 	return ret;
 }
 
 static int soft_offline_in_use_page(struct page *page)
 {
-	int ret;
 	struct page *hpage = compound_head(page);
 
 	if (!PageHuge(page) && PageTransHuge(hpage))
 		if (try_to_split_thp_page(page, "soft offline") < 0)
 			return -EBUSY;
 
-	if (PageHuge(page))
-		ret = soft_offline_huge_page(page);
-	else
-		ret = __soft_offline_page(page);
-	return ret;
+	return __soft_offline_page(page);
 }
 
 static int soft_offline_free_page(struct page *page)
-- 
2.12.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 09/10] mm,hwpoison: Rework soft offline for free pages
  2019-09-10 10:30 [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
                   ` (7 preceding siblings ...)
  2019-09-10 10:30 ` [PATCH 08/10] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page Oscar Salvador
@ 2019-09-10 10:30 ` Oscar Salvador
  2019-09-10 10:30 ` [PATCH 10/10] mm,hwpoison: Use hugetlb_replace_page to replace free hugetlb pages Oscar Salvador
                   ` (2 subsequent siblings)
  11 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2019-09-10 10:30 UTC (permalink / raw)
  To: n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, Oscar Salvador

take_page_off_buddy will be used to take a page meant to be poisoned
off the buddy allocator.
take_page_off_buddy calls break_down_buddy_pages, which will split a
higher-order page in case our page belongs to one.

Once we grab the page, we call page_set_poison to set it as poisoned
and grab a refcount on it.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 include/linux/page-flags.h |  5 ----
 mm/memory-failure.c        |  6 +++--
 mm/page_alloc.c            | 59 +++++++++++++++++++++++++++++++++++++++-------
 3 files changed, 54 insertions(+), 16 deletions(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index f91cb8898ff0..21df81c9ea57 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -414,13 +414,8 @@ PAGEFLAG_FALSE(Uncached)
 PAGEFLAG(HWPoison, hwpoison, PF_ANY)
 TESTSCFLAG(HWPoison, hwpoison, PF_ANY)
 #define __PG_HWPOISON (1UL << PG_hwpoison)
-extern bool set_hwpoison_free_buddy_page(struct page *page);
 #else
 PAGEFLAG_FALSE(HWPoison)
-static inline bool set_hwpoison_free_buddy_page(struct page *page)
-{
-	return 0;
-}
 #define __PG_HWPOISON 0
 #endif
 
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index ce017a0d79a6..03f07015a106 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -78,6 +78,8 @@ EXPORT_SYMBOL_GPL(hwpoison_filter_dev_minor);
 EXPORT_SYMBOL_GPL(hwpoison_filter_flags_mask);
 EXPORT_SYMBOL_GPL(hwpoison_filter_flags_value);
 
+extern bool take_page_off_buddy(struct page *page);
+
 static bool page_set_poison(struct page *page)
 {
 	SetPageHWPoison(page);
@@ -1807,8 +1809,8 @@ static int soft_offline_free_page(struct page *page)
 	int rc = dissolve_free_huge_page(page);
 
 	if (!rc) {
-		if (set_hwpoison_free_buddy_page(page))
-			num_poisoned_pages_inc();
+		if (take_page_off_buddy(page))
+			page_set_poison(page);
 		else
 			rc = -EBUSY;
 	}
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index fe38229d0a77..68f6c2cda512 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -8605,30 +8605,71 @@ bool is_free_buddy_page(struct page *page)
 
 #ifdef CONFIG_MEMORY_FAILURE
 /*
- * Set PG_hwpoison flag if a given page is confirmed to be a free page.  This
- * test is performed under the zone lock to prevent a race against page
- * allocation.
+ * Break down a higher-order page in sub-pages, and keep our target out of
+ * buddy allocator.
  */
-bool set_hwpoison_free_buddy_page(struct page *page)
+static void break_down_buddy_pages(struct zone *zone, struct page *page,
+				   struct page *target, int low, int high,
+				   struct free_area *area, int migratetype)
+{
+	unsigned long size = 1 << high;
+	struct page *current_buddy, *next_page;
+
+	while (high > low) {
+		area--;
+		high--;
+		size >>= 1;
+
+		if (target >= &page[size]) {
+			next_page = page + size;
+			current_buddy = page;
+		} else {
+			next_page = page;
+			current_buddy = page + size;
+		}
+
+		if (set_page_guard(zone, current_buddy, high, migratetype))
+			continue;
+
+		if (current_buddy != target) {
+			add_to_free_area(current_buddy, area, migratetype);
+			set_page_order(current_buddy, high);
+			page = next_page;
+		}
+	}
+}
+
+/*
+ * Take a page that will be marked as poisoned off the buddy allocator.
+ */
+bool take_page_off_buddy(struct page *page)
 {
 	struct zone *zone = page_zone(page);
 	unsigned long pfn = page_to_pfn(page);
 	unsigned long flags;
 	unsigned int order;
-	bool hwpoisoned = false;
+	bool ret = false;
 
 	spin_lock_irqsave(&zone->lock, flags);
 	for (order = 0; order < MAX_ORDER; order++) {
 		struct page *page_head = page - (pfn & ((1 << order) - 1));
+		int buddy_order = page_order(page_head);
+		struct free_area *area = &(zone->free_area[buddy_order]);
+
+		if (PageBuddy(page_head) && buddy_order >= order) {
+			unsigned long pfn_head = page_to_pfn(page_head);
+			int migratetype = get_pfnblock_migratetype(page_head,
+								   pfn_head);
 
-		if (PageBuddy(page_head) && page_order(page_head) >= order) {
-			if (!TestSetPageHWPoison(page))
-				hwpoisoned = true;
+			del_page_from_free_area(page_head, area);
+			break_down_buddy_pages(zone, page_head, page, 0,
+					       buddy_order, area, migratetype);
+			ret = true;
 			break;
 		}
 	}
 	spin_unlock_irqrestore(&zone->lock, flags);
 
-	return hwpoisoned;
+	return ret;
 }
 #endif
-- 
2.12.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* [PATCH 10/10] mm,hwpoison: Use hugetlb_replace_page to replace free hugetlb pages
  2019-09-10 10:30 [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
                   ` (8 preceding siblings ...)
  2019-09-10 10:30 ` [PATCH 09/10] mm,hwpoison: Rework soft offline for free pages Oscar Salvador
@ 2019-09-10 10:30 ` Oscar Salvador
  2019-09-10 10:32 ` [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
  2019-09-11  5:29 ` Naoya Horiguchi
  11 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2019-09-10 10:30 UTC (permalink / raw)
  To: n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel, Oscar Salvador

When soft offlining a free hugtlb, try first to allocate a new hugetlb
to the pool and pass the old state to the new one by move_hugetlb_state().
Either we succeed or not, we dissolve the poisoned hugetlb page.

Worst-scenario case is that we cannot allocate a new fresh hugetlb page
as a replacement.

Signed-off-by: Oscar Salvador <osalvador@suse.de>
---
 mm/hugetlb.c        | 16 ++++++++++++++++
 mm/memory-failure.c | 34 ++++++++++++++++++++++++++++------
 2 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 139e1c05c9a1..d0844aec7531 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5154,3 +5154,19 @@ void move_hugetlb_state(struct page *oldpage, struct page *newpage, int reason)
 		spin_unlock(&hugetlb_lock);
 	}
 }
+
+#ifdef CONFIG_MEMORY_FAILURE
+int hugetlb_replace_page(struct page *page, int reason)
+{
+	int nid = page_to_nid(page);
+	struct hstate *h = page_hstate(page);
+	struct page *new_page;
+
+	new_page = alloc_huge_page_nodemask(h, nid, &node_states[N_MEMORY]);
+	if (!new_page)
+		return -ENOMEM;
+
+	move_hugetlb_state(page, new_page, reason);
+	return 0;
+}
+#endif
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 03f07015a106..fe73fe19c6e9 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -79,6 +79,7 @@ EXPORT_SYMBOL_GPL(hwpoison_filter_flags_mask);
 EXPORT_SYMBOL_GPL(hwpoison_filter_flags_value);
 
 extern bool take_page_off_buddy(struct page *page);
+extern int hugetlb_replace_page(struct page *page, int reason);
 
 static bool page_set_poison(struct page *page)
 {
@@ -1804,16 +1805,37 @@ static int soft_offline_in_use_page(struct page *page)
 	return __soft_offline_page(page);
 }
 
+static int soft_offline_free_huge_page(struct page *page)
+{
+	struct page *hpage = compound_head(page);
+
+	/*
+	 * Try to add a new hugetlb page to the pool
+	 */
+	if (hugetlb_replace_page(hpage, MR_MEMORY_FAILURE))
+		return -EBUSY;
+
+	/*
+	 * Remove old hugetlb from the pool
+	 */
+	if (!page_set_poison(hpage))
+		return -EBUSY;
+
+	return 0;
+}
+
 static int soft_offline_free_page(struct page *page)
 {
-	int rc = dissolve_free_huge_page(page);
+	int rc = -EBUSY;
 
-	if (!rc) {
-		if (take_page_off_buddy(page))
+	if (PageHuge(page))
+		rc = soft_offline_free_huge_page(page);
+	else
+		if (take_page_off_buddy(page)) {
 			page_set_poison(page);
-		else
-			rc = -EBUSY;
-	}
+			rc = 0;
+		}
+
 	return rc;
 }
 
-- 
2.12.3



^ permalink raw reply related	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/10] Hwpoison soft-offline rework
  2019-09-10 10:30 [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
                   ` (9 preceding siblings ...)
  2019-09-10 10:30 ` [PATCH 10/10] mm,hwpoison: Use hugetlb_replace_page to replace free hugetlb pages Oscar Salvador
@ 2019-09-10 10:32 ` Oscar Salvador
  2019-09-11  5:29 ` Naoya Horiguchi
  11 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2019-09-10 10:32 UTC (permalink / raw)
  To: n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel

On Tue, Sep 10, 2019 at 12:30:06PM +0200, Oscar Salvador wrote:
> 
> This patchset was based on Naoya's hwpoison rework [1], so thanks to him
> for the initial work.
> 
> This patchset aims to fix some issues laying in soft-offline handling,
> but it also takes the chance and takes some further steps to perform 
> cleanups and some refactoring as well.

Of course, this was meant to be a "RFC PATCH" and not a "PATCH", but
fat-fingers...

Sorry for the inconvenience.

-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/10] Hwpoison soft-offline rework
  2019-09-10 10:30 [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
                   ` (10 preceding siblings ...)
  2019-09-10 10:32 ` [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
@ 2019-09-11  5:29 ` Naoya Horiguchi
  2019-09-11  6:22   ` Naoya Horiguchi
  11 siblings, 1 reply; 26+ messages in thread
From: Naoya Horiguchi @ 2019-09-11  5:29 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel

Hi Oscar,

Thank you for your working on this.

My testing shows the following error:

  [ 1926.932435] ===> testcase 'mce_ksm_soft-offline_avoid_access.auto2' start
  [ 1927.155321] bash (15853): drop_caches: 3
  [ 1929.019094] page:ffffe5c384c4cd40 refcount:1 mapcount:0 mapping:0000000000000003 index:0x700000001
  [ 1929.021586] anon
  [ 1929.021588] flags: 0x57ffe00088000e(referenced|uptodate|dirty|swapbacked|hwpoison)
  [ 1929.024289] raw: 0057ffe00088000e dead000000000100 dead000000000122 0000000000000003
  [ 1929.026611] raw: 0000000700000001 0000000000000000 00000000ffffffff 0000000000000000
  [ 1929.028760] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page))
  [ 1929.030559] ------------[ cut here ]------------
  [ 1929.031684] kernel BUG at mm/internal.h:73!
  [ 1929.032738] invalid opcode: 0000 [#1] SMP PTI
  [ 1929.033941] CPU: 3 PID: 16052 Comm: mceinj.sh Not tainted 5.3.0-rc8-v5.3-rc8-190911-1025-00010-ga436dbce8674+ #18
  [ 1929.037137] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
  [ 1929.040066] RIP: 0010:page_set_poison+0xf9/0x160
  [ 1929.041665] Code: 63 02 7f 31 c0 5b 5d 41 5c c3 48 c7 c6 d0 d1 0c b0 48 89 df e8 88 bb f8 ff 0f 0b 48 c7 c6 f0 2a 0d b0 48 89 df e8 77 bb f8 ff <0f> 0b 48 8b 45 00 48 c1 e8 33 83 e0 07 83 f8 04 75 89 48 8b 45 08
  [ 1929.047773] RSP: 0018:ffffb4fb8a73bde0 EFLAGS: 00010246
  [ 1929.049511] RAX: 0000000000000039 RBX: ffffe5c384c4cd40 RCX: 0000000000000000
  [ 1929.051870] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffb00d1814
  [ 1929.054238] RBP: ffffe5c384c4cd40 R08: 0000000000000596 R09: 0000000000000048
  [ 1929.056599] R10: 0000000000000000 R11: ffffb4fb8a73bc58 R12: 0000000000000000
  [ 1929.058986] R13: ffffb4fb8a73be10 R14: 0000000000131335 R15: 0000000000000001
  [ 1929.061366] FS:  00007fc9e208d740(0000) GS:ffff9fa9bdb00000(0000) knlGS:0000000000000000
  [ 1929.063842] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 1929.065429] CR2: 000055946c05d192 CR3: 00000001365f2000 CR4: 00000000001406e0
  [ 1929.067373] Call Trace:
  [ 1929.068094]  soft_offline_page+0x2be/0x600
  [ 1929.069246]  soft_offline_page_store+0xdf/0x110
  [ 1929.070510]  kernfs_fop_write+0x116/0x190
  [ 1929.071618]  vfs_write+0xa5/0x1a0
  [ 1929.072614]  ksys_write+0x59/0xd0
  [ 1929.073548]  do_syscall_64+0x5f/0x1a0
  [ 1929.074554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [ 1929.075957] RIP: 0033:0x7fc9e217ded8

It seems that soft-offlining on ksm pages is affected by this changeset.
Could you try to handle this?

- Naoya

On Tue, Sep 10, 2019 at 12:30:06PM +0200, Oscar Salvador wrote:
>
> This patchset was based on Naoya's hwpoison rework [1], so thanks to him
> for the initial work.
>
> This patchset aims to fix some issues laying in soft-offline handling,
> but it also takes the chance and takes some further steps to perform
> cleanups and some refactoring as well.
>
>  - Motivation:
>
>    A customer and I were facing an issue where poisoned pages we returned
>    back to user-space after having offlined them properly.
>    This was only seend under some memory stress + soft offlining pages.
>    After some anaylsis, it became clear that the problem was that
>    when kcompactd kicked in to migrate pages over, compaction_alloc
>    callback was handing poisoned pages to the migrate routine.
>    Once this page was later on fault in, __do_page_fault returned
>    VM_FAULT_HWPOISON making the process being killed.
>
>    All this could happen because isolate_freepages_block and
>    fast_isolate_freepages just check for the page to be PageBuddy,
>    and since 1) poisoned pages can be part of a higher order page
>    and 2) poisoned pages are also Page Buddy, they can sneak in easily.
>
>    I also saw some problem with swap pages, but I suspected to be the
>    same sort of problem, so I did not follow that trace.
>
>    The full explanation can be see in [2].
>
>  - Approach:
>
>    The taken approach is to not let poisoned pages hit neither
>    pcplists nor buddy freelists.
>    This is achieved by:
>
> In-use pages:
>
>    * Normal pages
>
>    1) do not release the last reference count after the
>       invalidation/migration of the page.
>    2) the page is being handed to page_set_poison, which does:
>       2a) sets PageHWPoison flag
>       2b) calls put_page (only to be able to call __page_cache_release)
>           Since poisoned pages are skipped in free_pages_prepare,
>           this put_page is safe.
>       2c) Sets the refcount to 1
>
>    * Hugetlb pages
>
>    1) Hand the page to page_set_poison after migration
>    2) page_set_poison does:
>       2a) Calls dissolve_free_huge_page
>       2b) If ranged to be dissolved contains poisoned pages,
>           we free the rangeas order-0 pages (as we do with gigantic hugetlb page),
>           so free_pages_prepare will skip them accordingly.
>       2c) Sets the refcount to 1
>
> Free pages:
>
>    * Normal pages:
>
>    1) Take the page off the buddy freelist
>    2) Set PageHWPoison flag and set refcount to 1
>
>    * Hugetlb pages
>
>    1) Try to allocate a new hugetlb page to the pool
>    2) Take off the pool the poisoned hugetlb
>
>
> With this patchset, I no longer see the issues I faced before.
>
> Note:
> I presented this as RFC to open discussion of the taken aproach.
> I think that furthers cleanups and refactors could be made, but I would
> like to get some insight of the taken approach before touching more
> code.
>
> Thanks
>
> [1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
> [2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
>
> Naoya Horiguchi (5):
>   mm,hwpoison: cleanup unused PageHuge() check
>   mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
>   mm,hwpoison-inject: don't pin for hwpoison_filter
>   mm,hwpoison: remove MF_COUNT_INCREASED
>   mm: remove flag argument from soft offline functions
>
> Oscar Salvador (5):
>   mm,hwpoison: Unify THP handling for hard and soft offline
>   mm,hwpoison: Rework soft offline for in-use pages
>   mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
>   mm,hwpoison: Rework soft offline for free pages
>   mm,hwpoison: Use hugetlb_replace_page to replace free hugetlb pages
>
>  drivers/base/memory.c      |   2 +-
>  include/linux/mm.h         |   9 +-
>  include/linux/page-flags.h |   5 -
>  mm/hugetlb.c               |  51 +++++++-
>  mm/hwpoison-inject.c       |  18 +--
>  mm/madvise.c               |  25 ++--
>  mm/memory-failure.c        | 319 +++++++++++++++++++++------------------------
>  mm/migrate.c               |  11 +-
>  mm/page_alloc.c            |  62 +++++++--
>  9 files changed, 267 insertions(+), 235 deletions(-)
>
> --
> 2.12.3
>
>


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/10] Hwpoison soft-offline rework
  2019-09-11  5:29 ` Naoya Horiguchi
@ 2019-09-11  6:22   ` Naoya Horiguchi
  2019-09-11  6:35     ` osalvador
  0 siblings, 1 reply; 26+ messages in thread
From: Naoya Horiguchi @ 2019-09-11  6:22 UTC (permalink / raw)
  To: Oscar Salvador; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel

I found another panic ...

This testcase is testing the corner case where hugepage migration fails
by allocation failure on destination numa node which is caused by the
condition that all remaining free hugetlb are reserved.

  [ 2610.711713] ===> testcase 'page_migration/hugetlb_madv_soft_reserve_noovercommit.auto2' start
  [ 2610.995836] bash (15807): drop_caches: 3
  [ 2612.596154] Soft offlining pfn 0x1d000 at process virtual address 0x700000000000
  [ 2612.910245] bash (15807): drop_caches: 3
  [ 2613.099769] page:fffff4ba40740000 refcount:1 mapcount:-128 mapping:0000000000000000 index:0x0
  [ 2613.102099] flags: 0xfffe000800000(hwpoison)
  [ 2613.103424] raw: 000fffe000800000 ffff9c953ffd5af8 fffff4ba40e78008 0000000000000000
  [ 2613.105817] raw: 0000000000000000 0000000000000009 00000001ffffff7f 0000000000000000
  [ 2613.107703] page dumped because: VM_BUG_ON_PAGE(page_count(buddy) != 0)
  [ 2613.109485] ------------[ cut here ]------------
  [ 2613.110834] kernel BUG at mm/page_alloc.c:821!
  [ 2613.112015] invalid opcode: 0000 [#1] SMP PTI
  [ 2613.113245] CPU: 0 PID: 16195 Comm: sysctl Not tainted 5.3.0-rc8-v5.3-rc8-190911-1025-00010-ga436dbce8674+ #18
  [ 2613.115982] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
  [ 2613.118495] RIP: 0010:free_one_page+0x5f1/0x610
  [ 2613.119803] Code: 09 7e 81 49 8d b4 17 c0 00 00 00 8b 14 24 48 89 ef e8 83 75 00 00 e9 16 fe ff ff 48 c7 c6 c0 2b 0d 82 4c 89 e7 e8 9f c3 fd ff <0f> 0b 48 c7 c6 c0 2b 0d 82 e8 91 c3 fd ff 0f 0b 66 66 2e 0f 1f 84
  [ 2613.124751] RSP: 0018:ffffac7442727ca8 EFLAGS: 00010046
  [ 2613.126224] RAX: 000000000000003b RBX: 0000000000000009 RCX: 0000000000000006
  [ 2613.128261] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff820d1814
  [ 2613.130299] RBP: fffff4ba40748000 R08: 0000000000000710 R09: 000000000000004a
  [ 2613.132082] R10: 0000000000000000 R11: ffffac7442727b20 R12: fffff4ba40740000
  [ 2613.133821] R13: 000000000001d200 R14: 0000000000000000 R15: ffff9c953ffd5680
  [ 2613.135769] FS:  00007fbf2e6cb900(0000) GS:ffff9c953da00000(0000) knlGS:0000000000000000
  [ 2613.138077] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [ 2613.139721] CR2: 000055eef99d4d38 CR3: 000000007cb50000 CR4: 00000000001406f0
  [ 2613.141747] Call Trace:
  [ 2613.142472]  __free_pages_ok+0x175/0x4e0
  [ 2613.143454]  free_pool_huge_page+0xec/0x100
  [ 2613.144500]  __nr_hugepages_store_common+0x173/0x2e0
  [ 2613.145968]  ? __do_proc_doulongvec_minmax+0x3ae/0x440
  [ 2613.147444]  hugetlb_sysctl_handler_common+0xad/0xc0
  [ 2613.148867]  proc_sys_call_handler+0x1a5/0x1c0
  [ 2613.150104]  vfs_write+0xa5/0x1a0
  [ 2613.151081]  ksys_write+0x59/0xd0
  [ 2613.152052]  do_syscall_64+0x5f/0x1a0
  [ 2613.153071]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
  [ 2613.154327] RIP: 0033:0x7fbf2eb01ed8


On Wed, Sep 11, 2019 at 05:29:56AM +0000, Horiguchi Naoya(堀口 直也) wrote:
> Hi Oscar,
> 
> Thank you for your working on this.
> 
> My testing shows the following error:
> 
>   [ 1926.932435] ===> testcase 'mce_ksm_soft-offline_avoid_access.auto2' start
>   [ 1927.155321] bash (15853): drop_caches: 3
>   [ 1929.019094] page:ffffe5c384c4cd40 refcount:1 mapcount:0 mapping:0000000000000003 index:0x700000001
>   [ 1929.021586] anon
>   [ 1929.021588] flags: 0x57ffe00088000e(referenced|uptodate|dirty|swapbacked|hwpoison)
>   [ 1929.024289] raw: 0057ffe00088000e dead000000000100 dead000000000122 0000000000000003
>   [ 1929.026611] raw: 0000000700000001 0000000000000000 00000000ffffffff 0000000000000000
>   [ 1929.028760] page dumped because: VM_BUG_ON_PAGE(page_ref_count(page))
>   [ 1929.030559] ------------[ cut here ]------------
>   [ 1929.031684] kernel BUG at mm/internal.h:73!
>   [ 1929.032738] invalid opcode: 0000 [#1] SMP PTI
>   [ 1929.033941] CPU: 3 PID: 16052 Comm: mceinj.sh Not tainted 5.3.0-rc8-v5.3-rc8-190911-1025-00010-ga436dbce8674+ #18
>   [ 1929.037137] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.12.0-2.fc30 04/01/2014
>   [ 1929.040066] RIP: 0010:page_set_poison+0xf9/0x160
>   [ 1929.041665] Code: 63 02 7f 31 c0 5b 5d 41 5c c3 48 c7 c6 d0 d1 0c b0 48 89 df e8 88 bb f8 ff 0f 0b 48 c7 c6 f0 2a 0d b0 48 89 df e8 77 bb f8 ff <0f> 0b 48 8b 45 00 48 c1 e8 33 83 e0 07 83 f8 04 75 89 48 8b 45 08
>   [ 1929.047773] RSP: 0018:ffffb4fb8a73bde0 EFLAGS: 00010246
>   [ 1929.049511] RAX: 0000000000000039 RBX: ffffe5c384c4cd40 RCX: 0000000000000000
>   [ 1929.051870] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffb00d1814
>   [ 1929.054238] RBP: ffffe5c384c4cd40 R08: 0000000000000596 R09: 0000000000000048
>   [ 1929.056599] R10: 0000000000000000 R11: ffffb4fb8a73bc58 R12: 0000000000000000
>   [ 1929.058986] R13: ffffb4fb8a73be10 R14: 0000000000131335 R15: 0000000000000001
>   [ 1929.061366] FS:  00007fc9e208d740(0000) GS:ffff9fa9bdb00000(0000) knlGS:0000000000000000
>   [ 1929.063842] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [ 1929.065429] CR2: 000055946c05d192 CR3: 00000001365f2000 CR4: 00000000001406e0
>   [ 1929.067373] Call Trace:
>   [ 1929.068094]  soft_offline_page+0x2be/0x600
>   [ 1929.069246]  soft_offline_page_store+0xdf/0x110
>   [ 1929.070510]  kernfs_fop_write+0x116/0x190
>   [ 1929.071618]  vfs_write+0xa5/0x1a0
>   [ 1929.072614]  ksys_write+0x59/0xd0
>   [ 1929.073548]  do_syscall_64+0x5f/0x1a0
>   [ 1929.074554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>   [ 1929.075957] RIP: 0033:0x7fc9e217ded8
> 
> It seems that soft-offlining on ksm pages is affected by this changeset.
> Could you try to handle this?
> 
> - Naoya
> 
> On Tue, Sep 10, 2019 at 12:30:06PM +0200, Oscar Salvador wrote:
> >
> > This patchset was based on Naoya's hwpoison rework [1], so thanks to him
> > for the initial work.
> >
> > This patchset aims to fix some issues laying in soft-offline handling,
> > but it also takes the chance and takes some further steps to perform
> > cleanups and some refactoring as well.
> >
> >  - Motivation:
> >
> >    A customer and I were facing an issue where poisoned pages we returned
> >    back to user-space after having offlined them properly.
> >    This was only seend under some memory stress + soft offlining pages.
> >    After some anaylsis, it became clear that the problem was that
> >    when kcompactd kicked in to migrate pages over, compaction_alloc
> >    callback was handing poisoned pages to the migrate routine.
> >    Once this page was later on fault in, __do_page_fault returned
> >    VM_FAULT_HWPOISON making the process being killed.
> >
> >    All this could happen because isolate_freepages_block and
> >    fast_isolate_freepages just check for the page to be PageBuddy,
> >    and since 1) poisoned pages can be part of a higher order page
> >    and 2) poisoned pages are also Page Buddy, they can sneak in easily.
> >
> >    I also saw some problem with swap pages, but I suspected to be the
> >    same sort of problem, so I did not follow that trace.
> >
> >    The full explanation can be see in [2].
> >
> >  - Approach:
> >
> >    The taken approach is to not let poisoned pages hit neither
> >    pcplists nor buddy freelists.
> >    This is achieved by:
> >
> > In-use pages:
> >
> >    * Normal pages
> >
> >    1) do not release the last reference count after the
> >       invalidation/migration of the page.
> >    2) the page is being handed to page_set_poison, which does:
> >       2a) sets PageHWPoison flag
> >       2b) calls put_page (only to be able to call __page_cache_release)
> >           Since poisoned pages are skipped in free_pages_prepare,
> >           this put_page is safe.
> >       2c) Sets the refcount to 1
> >
> >    * Hugetlb pages
> >
> >    1) Hand the page to page_set_poison after migration
> >    2) page_set_poison does:
> >       2a) Calls dissolve_free_huge_page
> >       2b) If ranged to be dissolved contains poisoned pages,
> >           we free the rangeas order-0 pages (as we do with gigantic hugetlb page),
> >           so free_pages_prepare will skip them accordingly.
> >       2c) Sets the refcount to 1
> >
> > Free pages:
> >
> >    * Normal pages:
> >
> >    1) Take the page off the buddy freelist
> >    2) Set PageHWPoison flag and set refcount to 1
> >
> >    * Hugetlb pages
> >
> >    1) Try to allocate a new hugetlb page to the pool
> >    2) Take off the pool the poisoned hugetlb
> >
> >
> > With this patchset, I no longer see the issues I faced before.
> >
> > Note:
> > I presented this as RFC to open discussion of the taken aproach.
> > I think that furthers cleanups and refactors could be made, but I would
> > like to get some insight of the taken approach before touching more
> > code.
> >
> > Thanks
> >
> > [1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
> > [2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
> >
> > Naoya Horiguchi (5):
> >   mm,hwpoison: cleanup unused PageHuge() check
> >   mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
> >   mm,hwpoison-inject: don't pin for hwpoison_filter
> >   mm,hwpoison: remove MF_COUNT_INCREASED
> >   mm: remove flag argument from soft offline functions
> >
> > Oscar Salvador (5):
> >   mm,hwpoison: Unify THP handling for hard and soft offline
> >   mm,hwpoison: Rework soft offline for in-use pages
> >   mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
> >   mm,hwpoison: Rework soft offline for free pages
> >   mm,hwpoison: Use hugetlb_replace_page to replace free hugetlb pages
> >
> >  drivers/base/memory.c      |   2 +-
> >  include/linux/mm.h         |   9 +-
> >  include/linux/page-flags.h |   5 -
> >  mm/hugetlb.c               |  51 +++++++-
> >  mm/hwpoison-inject.c       |  18 +--
> >  mm/madvise.c               |  25 ++--
> >  mm/memory-failure.c        | 319 +++++++++++++++++++++------------------------
> >  mm/migrate.c               |  11 +-
> >  mm/page_alloc.c            |  62 +++++++--
> >  9 files changed, 267 insertions(+), 235 deletions(-)
> >
> > --
> > 2.12.3
> >
> >
> 


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/10] Hwpoison soft-offline rework
  2019-09-11  6:22   ` Naoya Horiguchi
@ 2019-09-11  6:35     ` osalvador
  2019-09-11  7:21       ` Naoya Horiguchi
  0 siblings, 1 reply; 26+ messages in thread
From: osalvador @ 2019-09-11  6:35 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel

On 2019-09-11 08:22, Naoya Horiguchi wrote:
> I found another panic ...

Hi Naoya,

Thanks for giving it a try. Are these testcase public?
I will definetely take a look and try to solve these cases.

Thanks!
> 
> This testcase is testing the corner case where hugepage migration fails
> by allocation failure on destination numa node which is caused by the
> condition that all remaining free hugetlb are reserved.
> 
>   [ 2610.711713] ===> testcase
> 'page_migration/hugetlb_madv_soft_reserve_noovercommit.auto2' start
>   [ 2610.995836] bash (15807): drop_caches: 3
>   [ 2612.596154] Soft offlining pfn 0x1d000 at process virtual address
> 0x700000000000
>   [ 2612.910245] bash (15807): drop_caches: 3
>   [ 2613.099769] page:fffff4ba40740000 refcount:1 mapcount:-128
> mapping:0000000000000000 index:0x0
>   [ 2613.102099] flags: 0xfffe000800000(hwpoison)
>   [ 2613.103424] raw: 000fffe000800000 ffff9c953ffd5af8
> fffff4ba40e78008 0000000000000000
>   [ 2613.105817] raw: 0000000000000000 0000000000000009
> 00000001ffffff7f 0000000000000000
>   [ 2613.107703] page dumped because: VM_BUG_ON_PAGE(page_count(buddy) 
> != 0)
>   [ 2613.109485] ------------[ cut here ]------------
>   [ 2613.110834] kernel BUG at mm/page_alloc.c:821!
>   [ 2613.112015] invalid opcode: 0000 [#1] SMP PTI
>   [ 2613.113245] CPU: 0 PID: 16195 Comm: sysctl Not tainted
> 5.3.0-rc8-v5.3-rc8-190911-1025-00010-ga436dbce8674+ #18
>   [ 2613.115982] Hardware name: QEMU Standard PC (i440FX + PIIX,
> 1996), BIOS 1.12.0-2.fc30 04/01/2014
>   [ 2613.118495] RIP: 0010:free_one_page+0x5f1/0x610
>   [ 2613.119803] Code: 09 7e 81 49 8d b4 17 c0 00 00 00 8b 14 24 48 89
> ef e8 83 75 00 00 e9 16 fe ff ff 48 c7 c6 c0 2b 0d 82 4c 89 e7 e8 9f
> c3 fd ff <0f> 0b 48 c7 c6 c0 2b 0d 82 e8 91 c3 fd ff 0f 0b 66 66 2e 0f
> 1f 84
>   [ 2613.124751] RSP: 0018:ffffac7442727ca8 EFLAGS: 00010046
>   [ 2613.126224] RAX: 000000000000003b RBX: 0000000000000009 RCX:
> 0000000000000006
>   [ 2613.128261] RDX: 0000000000000000 RSI: 0000000000000000 RDI:
> ffffffff820d1814
>   [ 2613.130299] RBP: fffff4ba40748000 R08: 0000000000000710 R09:
> 000000000000004a
>   [ 2613.132082] R10: 0000000000000000 R11: ffffac7442727b20 R12:
> fffff4ba40740000
>   [ 2613.133821] R13: 000000000001d200 R14: 0000000000000000 R15:
> ffff9c953ffd5680
>   [ 2613.135769] FS:  00007fbf2e6cb900(0000) GS:ffff9c953da00000(0000)
> knlGS:0000000000000000
>   [ 2613.138077] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>   [ 2613.139721] CR2: 000055eef99d4d38 CR3: 000000007cb50000 CR4:
> 00000000001406f0
>   [ 2613.141747] Call Trace:
>   [ 2613.142472]  __free_pages_ok+0x175/0x4e0
>   [ 2613.143454]  free_pool_huge_page+0xec/0x100
>   [ 2613.144500]  __nr_hugepages_store_common+0x173/0x2e0
>   [ 2613.145968]  ? __do_proc_doulongvec_minmax+0x3ae/0x440
>   [ 2613.147444]  hugetlb_sysctl_handler_common+0xad/0xc0
>   [ 2613.148867]  proc_sys_call_handler+0x1a5/0x1c0
>   [ 2613.150104]  vfs_write+0xa5/0x1a0
>   [ 2613.151081]  ksys_write+0x59/0xd0
>   [ 2613.152052]  do_syscall_64+0x5f/0x1a0
>   [ 2613.153071]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>   [ 2613.154327] RIP: 0033:0x7fbf2eb01ed8
> 
> 
> On Wed, Sep 11, 2019 at 05:29:56AM +0000, Horiguchi Naoya(堀口 直也) wrote:
>> Hi Oscar,
>> 
>> Thank you for your working on this.
>> 
>> My testing shows the following error:
>> 
>>   [ 1926.932435] ===> testcase 
>> 'mce_ksm_soft-offline_avoid_access.auto2' start
>>   [ 1927.155321] bash (15853): drop_caches: 3
>>   [ 1929.019094] page:ffffe5c384c4cd40 refcount:1 mapcount:0 
>> mapping:0000000000000003 index:0x700000001
>>   [ 1929.021586] anon
>>   [ 1929.021588] flags: 
>> 0x57ffe00088000e(referenced|uptodate|dirty|swapbacked|hwpoison)
>>   [ 1929.024289] raw: 0057ffe00088000e dead000000000100 
>> dead000000000122 0000000000000003
>>   [ 1929.026611] raw: 0000000700000001 0000000000000000 
>> 00000000ffffffff 0000000000000000
>>   [ 1929.028760] page dumped because: 
>> VM_BUG_ON_PAGE(page_ref_count(page))
>>   [ 1929.030559] ------------[ cut here ]------------
>>   [ 1929.031684] kernel BUG at mm/internal.h:73!
>>   [ 1929.032738] invalid opcode: 0000 [#1] SMP PTI
>>   [ 1929.033941] CPU: 3 PID: 16052 Comm: mceinj.sh Not tainted 
>> 5.3.0-rc8-v5.3-rc8-190911-1025-00010-ga436dbce8674+ #18
>>   [ 1929.037137] Hardware name: QEMU Standard PC (i440FX + PIIX, 
>> 1996), BIOS 1.12.0-2.fc30 04/01/2014
>>   [ 1929.040066] RIP: 0010:page_set_poison+0xf9/0x160
>>   [ 1929.041665] Code: 63 02 7f 31 c0 5b 5d 41 5c c3 48 c7 c6 d0 d1 0c 
>> b0 48 89 df e8 88 bb f8 ff 0f 0b 48 c7 c6 f0 2a 0d b0 48 89 df e8 77 
>> bb f8 ff <0f> 0b 48 8b 45 00 48 c1 e8 33 83 e0 07 83 f8 04 75 89 48 8b 
>> 45 08
>>   [ 1929.047773] RSP: 0018:ffffb4fb8a73bde0 EFLAGS: 00010246
>>   [ 1929.049511] RAX: 0000000000000039 RBX: ffffe5c384c4cd40 RCX: 
>> 0000000000000000
>>   [ 1929.051870] RDX: 0000000000000000 RSI: 0000000000000000 RDI: 
>> ffffffffb00d1814
>>   [ 1929.054238] RBP: ffffe5c384c4cd40 R08: 0000000000000596 R09: 
>> 0000000000000048
>>   [ 1929.056599] R10: 0000000000000000 R11: ffffb4fb8a73bc58 R12: 
>> 0000000000000000
>>   [ 1929.058986] R13: ffffb4fb8a73be10 R14: 0000000000131335 R15: 
>> 0000000000000001
>>   [ 1929.061366] FS:  00007fc9e208d740(0000) GS:ffff9fa9bdb00000(0000) 
>> knlGS:0000000000000000
>>   [ 1929.063842] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>   [ 1929.065429] CR2: 000055946c05d192 CR3: 00000001365f2000 CR4: 
>> 00000000001406e0
>>   [ 1929.067373] Call Trace:
>>   [ 1929.068094]  soft_offline_page+0x2be/0x600
>>   [ 1929.069246]  soft_offline_page_store+0xdf/0x110
>>   [ 1929.070510]  kernfs_fop_write+0x116/0x190
>>   [ 1929.071618]  vfs_write+0xa5/0x1a0
>>   [ 1929.072614]  ksys_write+0x59/0xd0
>>   [ 1929.073548]  do_syscall_64+0x5f/0x1a0
>>   [ 1929.074554]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>>   [ 1929.075957] RIP: 0033:0x7fc9e217ded8
>> 
>> It seems that soft-offlining on ksm pages is affected by this 
>> changeset.
>> Could you try to handle this?
>> 
>> - Naoya
>> 
>> On Tue, Sep 10, 2019 at 12:30:06PM +0200, Oscar Salvador wrote:
>> >
>> > This patchset was based on Naoya's hwpoison rework [1], so thanks to him
>> > for the initial work.
>> >
>> > This patchset aims to fix some issues laying in soft-offline handling,
>> > but it also takes the chance and takes some further steps to perform
>> > cleanups and some refactoring as well.
>> >
>> >  - Motivation:
>> >
>> >    A customer and I were facing an issue where poisoned pages we returned
>> >    back to user-space after having offlined them properly.
>> >    This was only seend under some memory stress + soft offlining pages.
>> >    After some anaylsis, it became clear that the problem was that
>> >    when kcompactd kicked in to migrate pages over, compaction_alloc
>> >    callback was handing poisoned pages to the migrate routine.
>> >    Once this page was later on fault in, __do_page_fault returned
>> >    VM_FAULT_HWPOISON making the process being killed.
>> >
>> >    All this could happen because isolate_freepages_block and
>> >    fast_isolate_freepages just check for the page to be PageBuddy,
>> >    and since 1) poisoned pages can be part of a higher order page
>> >    and 2) poisoned pages are also Page Buddy, they can sneak in easily.
>> >
>> >    I also saw some problem with swap pages, but I suspected to be the
>> >    same sort of problem, so I did not follow that trace.
>> >
>> >    The full explanation can be see in [2].
>> >
>> >  - Approach:
>> >
>> >    The taken approach is to not let poisoned pages hit neither
>> >    pcplists nor buddy freelists.
>> >    This is achieved by:
>> >
>> > In-use pages:
>> >
>> >    * Normal pages
>> >
>> >    1) do not release the last reference count after the
>> >       invalidation/migration of the page.
>> >    2) the page is being handed to page_set_poison, which does:
>> >       2a) sets PageHWPoison flag
>> >       2b) calls put_page (only to be able to call __page_cache_release)
>> >           Since poisoned pages are skipped in free_pages_prepare,
>> >           this put_page is safe.
>> >       2c) Sets the refcount to 1
>> >
>> >    * Hugetlb pages
>> >
>> >    1) Hand the page to page_set_poison after migration
>> >    2) page_set_poison does:
>> >       2a) Calls dissolve_free_huge_page
>> >       2b) If ranged to be dissolved contains poisoned pages,
>> >           we free the rangeas order-0 pages (as we do with gigantic hugetlb page),
>> >           so free_pages_prepare will skip them accordingly.
>> >       2c) Sets the refcount to 1
>> >
>> > Free pages:
>> >
>> >    * Normal pages:
>> >
>> >    1) Take the page off the buddy freelist
>> >    2) Set PageHWPoison flag and set refcount to 1
>> >
>> >    * Hugetlb pages
>> >
>> >    1) Try to allocate a new hugetlb page to the pool
>> >    2) Take off the pool the poisoned hugetlb
>> >
>> >
>> > With this patchset, I no longer see the issues I faced before.
>> >
>> > Note:
>> > I presented this as RFC to open discussion of the taken aproach.
>> > I think that furthers cleanups and refactors could be made, but I would
>> > like to get some insight of the taken approach before touching more
>> > code.
>> >
>> > Thanks
>> >
>> > [1] https://lore.kernel.org/linux-mm/1541746035-13408-1-git-send-email-n-horiguchi@ah.jp.nec.com/
>> > [2] https://lore.kernel.org/linux-mm/20190826104144.GA7849@linux/T/#u
>> >
>> > Naoya Horiguchi (5):
>> >   mm,hwpoison: cleanup unused PageHuge() check
>> >   mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
>> >   mm,hwpoison-inject: don't pin for hwpoison_filter
>> >   mm,hwpoison: remove MF_COUNT_INCREASED
>> >   mm: remove flag argument from soft offline functions
>> >
>> > Oscar Salvador (5):
>> >   mm,hwpoison: Unify THP handling for hard and soft offline
>> >   mm,hwpoison: Rework soft offline for in-use pages
>> >   mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page
>> >   mm,hwpoison: Rework soft offline for free pages
>> >   mm,hwpoison: Use hugetlb_replace_page to replace free hugetlb pages
>> >
>> >  drivers/base/memory.c      |   2 +-
>> >  include/linux/mm.h         |   9 +-
>> >  include/linux/page-flags.h |   5 -
>> >  mm/hugetlb.c               |  51 +++++++-
>> >  mm/hwpoison-inject.c       |  18 +--
>> >  mm/madvise.c               |  25 ++--
>> >  mm/memory-failure.c        | 319 +++++++++++++++++++++------------------------
>> >  mm/migrate.c               |  11 +-
>> >  mm/page_alloc.c            |  62 +++++++--
>> >  9 files changed, 267 insertions(+), 235 deletions(-)
>> >
>> > --
>> > 2.12.3
>> >
>> >
>> 



^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/10] Hwpoison soft-offline rework
  2019-09-11  6:35     ` osalvador
@ 2019-09-11  7:21       ` Naoya Horiguchi
  2019-09-12 13:16         ` Oscar Salvador
  0 siblings, 1 reply; 26+ messages in thread
From: Naoya Horiguchi @ 2019-09-11  7:21 UTC (permalink / raw)
  To: osalvador; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel

On Wed, Sep 11, 2019 at 08:35:26AM +0200, osalvador@suse.de wrote:
> On 2019-09-11 08:22, Naoya Horiguchi wrote:
> > I found another panic ...
> 
> Hi Naoya,
> 
> Thanks for giving it a try. Are these testcase public?
> I will definetely take a look and try to solve these cases.

It's available on https://github.com/Naoya-Horiguchi/mm_regression.
The README is a bit obsolete (sorry about that ...,) but you can run
the testcase like below:

  $ git clone https://github.com/Naoya-Horiguchi/mm_regression
  $ cd mm_regression
  mm_regression $ git clone https://github.com/Naoya-Horiguchi/test_core 
  mm_regression $ make
  // you might need to install some dependencies like numa library and mce-inject tool
  mm_regression $ make update_recipes

To run the single testcase, run the commands like below:

  mm_regression $ RECIPEFILES=cases/page_migration/hugetlb_migratepages_allocate1_noovercommit.auto2 bash run.sh
  mm_regression $ RECIPEFILES=cases/cases/mce_ksm_soft-offline_avoid_access.auto2 bash run.sh
  
You can run a set of many testcases with the commands like below:

  mm_regression $ RECIPEFILES=cases/cases/mce_ksm_* bash run.sh
  // run all ksm related testcases. I reproduced the panic with this command.

  mm_regression $ run_class=simple bash run.sh
  // run the set of minimum testcases I run for each releases.

Hopefully this will help you.

Thanks,
Naoya Horiguchi


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 01/10] mm,hwpoison: cleanup unused PageHuge() check
  2019-09-10 10:30 ` [PATCH 01/10] mm,hwpoison: cleanup unused PageHuge() check Oscar Salvador
@ 2019-09-11 10:17   ` David Hildenbrand
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2019-09-11 10:17 UTC (permalink / raw)
  To: Oscar Salvador, n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel

On 10.09.19 12:30, Oscar Salvador wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> memory_failure() forks to memory_failure_hugetlb() for hugetlb pages,
> so a PageHuge() check after the fork should not be necessary.
> 
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com>
> ---
>  mm/memory-failure.c | 5 +----
>  1 file changed, 1 insertion(+), 4 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 7ef849da8278..e43b61462fd5 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1353,10 +1353,7 @@ int memory_failure(unsigned long pfn, int flags)
>  	 * page_remove_rmap() in try_to_unmap_one(). So to determine page status
>  	 * correctly, we save a copy of the page flags at this time.
>  	 */
> -	if (PageHuge(p))
> -		page_flags = hpage->flags;
> -	else
> -		page_flags = p->flags;
> +	page_flags = p->flags;
>  
>  	/*
>  	 * unpoison always clear PG_hwpoison inside page lock
> 

Reviewed-by: David Hildenbrand <david@redhat.com>

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2019-09-10 10:30 ` [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
@ 2019-09-11 10:23   ` David Hildenbrand
  2019-09-12  1:28     ` Naoya Horiguchi
  2019-09-11 10:27   ` David Hildenbrand
  1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2019-09-11 10:23 UTC (permalink / raw)
  To: Oscar Salvador, n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel

On 10.09.19 12:30, Oscar Salvador wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> Currently madvise_inject_error() pins the target via get_user_pages_fast.
> The call to get_user_pages_fast is only to get the respective page
> of a given address, but it is the job of the memory-poisoning handler
> to deal with races, so drop the refcount grabbed by get_user_pages_fast.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  mm/madvise.c | 25 +++++++++++--------------
>  1 file changed, 11 insertions(+), 14 deletions(-)
> 
> diff --git a/mm/madvise.c b/mm/madvise.c
> index 6e023414f5c1..fbe6d402232c 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -883,6 +883,16 @@ static int madvise_inject_error(int behavior,
>  		ret = get_user_pages_fast(start, 1, 0, &page);
>  		if (ret != 1)
>  			return ret;
> +		/*
> +		 * The get_user_pages_fast() is just to get the pfn of the
> +		 * given address, and the refcount has nothing to do with
> +		 * what we try to test, so it should be released immediately.
> +		 * This is racy but it's intended because the real hardware
> +		 * errors could happen at any moment and memory error handlers
> +		 * must properly handle the race.
> +		 */
> +		put_page(page);
> +

I wonder if it would be clearer to do that after the page has been fully
used  - e.g. after getting the pfn and the order (and then e.g.,
symbolically setting the page pointer to 0).

I guess the important part of this patch is to not have an elevated
refcount while calling soft_offline_page().

>  		pfn = page_to_pfn(page);
>  
>  		/*
> @@ -892,16 +902,11 @@ static int madvise_inject_error(int behavior,
>  		 */
>  		order = compound_order(compound_head(page));
>  
> -		if (PageHWPoison(page)) {
> -			put_page(page);
> -			continue;
> -		}

This change is not reflected in the changelog. I would have expected
that only the put_page() would go. If this should go completely, I
suggest a separate patch.

> -
>  		if (behavior == MADV_SOFT_OFFLINE) {
>  			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
>  					pfn, start);
>  
> -			ret = soft_offline_page(page, MF_COUNT_INCREASED);
> +			ret = soft_offline_page(page, 0);
>  			if (ret)
>  				return ret;
>  			continue;
> @@ -909,14 +914,6 @@ static int madvise_inject_error(int behavior,
>  
>  		pr_info("Injecting memory failure for pfn %#lx at process virtual address %#lx\n",
>  				pfn, start);
> -
> -		/*
> -		 * Drop the page reference taken by get_user_pages_fast(). In
> -		 * the absence of MF_COUNT_INCREASED the memory_failure()
> -		 * routine is responsible for pinning the page to prevent it
> -		 * from being released back to the page allocator.
> -		 */
> -		put_page(page);
>  		ret = memory_failure(pfn, 0);
>  		if (ret)
>  			return ret;
> 


-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 04/10] mm,hwpoison: remove MF_COUNT_INCREASED
  2019-09-10 10:30 ` [PATCH 04/10] mm,hwpoison: remove MF_COUNT_INCREASED Oscar Salvador
@ 2019-09-11 10:24   ` David Hildenbrand
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2019-09-11 10:24 UTC (permalink / raw)
  To: Oscar Salvador, n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel

On 10.09.19 12:30, Oscar Salvador wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> Now there's no user of MF_COUNT_INCREASED, so we can safely remove
> it from all calling points.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  include/linux/mm.h  |  7 +++----
>  mm/memory-failure.c | 16 +++-------------
>  2 files changed, 6 insertions(+), 17 deletions(-)
> 
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index ad6766a08f9b..fb36a4165a4e 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2814,10 +2814,9 @@ void register_page_bootmem_memmap(unsigned long section_nr, struct page *map,
>  				  unsigned long nr_pages);
>  
>  enum mf_flags {
> -	MF_COUNT_INCREASED = 1 << 0,
> -	MF_ACTION_REQUIRED = 1 << 1,
> -	MF_MUST_KILL = 1 << 2,
> -	MF_SOFT_OFFLINE = 1 << 3,
> +	MF_ACTION_REQUIRED = 1 << 0,
> +	MF_MUST_KILL = 1 << 1,
> +	MF_SOFT_OFFLINE = 1 << 2,
>  };
>  extern int memory_failure(unsigned long pfn, int flags);
>  extern void memory_failure_queue(unsigned long pfn, int flags);
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index e43b61462fd5..1be785b25324 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1092,7 +1092,7 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
>  
>  	num_poisoned_pages_inc();
>  
> -	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
> +	if (!get_hwpoison_page(p)) {
>  		/*
>  		 * Check "filter hit" and "race with other subpage."
>  		 */
> @@ -1286,7 +1286,7 @@ int memory_failure(unsigned long pfn, int flags)
>  	 * In fact it's dangerous to directly bump up page count from 0,
>  	 * that may make page_ref_freeze()/page_ref_unfreeze() mismatch.
>  	 */
> -	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
> +	if (!get_hwpoison_page(p)) {
>  		if (is_free_buddy_page(p)) {
>  			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
>  			return 0;
> @@ -1327,10 +1327,7 @@ int memory_failure(unsigned long pfn, int flags)
>  	shake_page(p, 0);
>  	/* shake_page could have turned it free. */
>  	if (!PageLRU(p) && is_free_buddy_page(p)) {
> -		if (flags & MF_COUNT_INCREASED)
> -			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
> -		else
> -			action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
> +		action_result(pfn, MF_MSG_BUDDY_2ND, MF_DELAYED);
>  		return 0;
>  	}
>  
> @@ -1618,9 +1615,6 @@ static int __get_any_page(struct page *p, unsigned long pfn, int flags)
>  {
>  	int ret;
>  
> -	if (flags & MF_COUNT_INCREASED)
> -		return 1;
> -
>  	/*
>  	 * When the target page is a free hugepage, just remove it
>  	 * from free hugepage list.
> @@ -1890,15 +1884,11 @@ int soft_offline_page(struct page *page, int flags)
>  	if (is_zone_device_page(page)) {
>  		pr_debug_ratelimited("soft_offline: %#lx page is device page\n",
>  				pfn);
> -		if (flags & MF_COUNT_INCREASED)
> -			put_page(page);
>  		return -EIO;
>  	}
>  
>  	if (PageHWPoison(page)) {
>  		pr_info("soft offline: %#lx page already poisoned\n", pfn);
> -		if (flags & MF_COUNT_INCREASED)
> -			put_hwpoison_page(page);
>  		return -EBUSY;
>  	}
>  
> 

Acked-by: David Hildenbrand <david@redhat.com>

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 05/10] mm: remove flag argument from soft offline functions
  2019-09-10 10:30 ` [PATCH 05/10] mm: remove flag argument from soft offline functions Oscar Salvador
@ 2019-09-11 10:24   ` David Hildenbrand
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2019-09-11 10:24 UTC (permalink / raw)
  To: Oscar Salvador, n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel

On 10.09.19 12:30, Oscar Salvador wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> The argument @flag no longer affects the behavior of soft_offline_page()
> and its variants, so let's remove them.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  drivers/base/memory.c |  2 +-
>  include/linux/mm.h    |  2 +-
>  mm/madvise.c          |  2 +-
>  mm/memory-failure.c   | 27 +++++++++++++--------------
>  4 files changed, 16 insertions(+), 17 deletions(-)
> 
> diff --git a/drivers/base/memory.c b/drivers/base/memory.c
> index 6bea4f3f8040..e5485c22ef77 100644
> --- a/drivers/base/memory.c
> +++ b/drivers/base/memory.c
> @@ -540,7 +540,7 @@ static ssize_t soft_offline_page_store(struct device *dev,
>  	pfn >>= PAGE_SHIFT;
>  	if (!pfn_valid(pfn))
>  		return -ENXIO;
> -	ret = soft_offline_page(pfn_to_page(pfn), 0);
> +	ret = soft_offline_page(pfn_to_page(pfn));
>  	return ret == 0 ? count : ret;
>  }
>  
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index fb36a4165a4e..3cc800d9f57a 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -2827,7 +2827,7 @@ extern int sysctl_memory_failure_early_kill;
>  extern int sysctl_memory_failure_recovery;
>  extern void shake_page(struct page *p, int access);
>  extern atomic_long_t num_poisoned_pages __read_mostly;
> -extern int soft_offline_page(struct page *page, int flags);
> +extern int soft_offline_page(struct page *page);
>  
>  
>  /*
> diff --git a/mm/madvise.c b/mm/madvise.c
> index fbe6d402232c..ece128211400 100644
> --- a/mm/madvise.c
> +++ b/mm/madvise.c
> @@ -906,7 +906,7 @@ static int madvise_inject_error(int behavior,
>  			pr_info("Soft offlining pfn %#lx at process virtual address %#lx\n",
>  					pfn, start);
>  
> -			ret = soft_offline_page(page, 0);
> +			ret = soft_offline_page(page);
>  			if (ret)
>  				return ret;
>  			continue;
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 1be785b25324..5071d39bdfef 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1478,7 +1478,7 @@ static void memory_failure_work_func(struct work_struct *work)
>  		if (!gotten)
>  			break;
>  		if (entry.flags & MF_SOFT_OFFLINE)
> -			soft_offline_page(pfn_to_page(entry.pfn), entry.flags);
> +			soft_offline_page(pfn_to_page(entry.pfn));
>  		else
>  			memory_failure(entry.pfn, entry.flags);
>  	}
> @@ -1611,7 +1611,7 @@ static struct page *new_page(struct page *p, unsigned long private)
>   * that is not free, and 1 for any other page type.
>   * For 1 the page is returned with increased page count, otherwise not.
>   */
> -static int __get_any_page(struct page *p, unsigned long pfn, int flags)
> +static int __get_any_page(struct page *p, unsigned long pfn)
>  {
>  	int ret;
>  
> @@ -1638,9 +1638,9 @@ static int __get_any_page(struct page *p, unsigned long pfn, int flags)
>  	return ret;
>  }
>  
> -static int get_any_page(struct page *page, unsigned long pfn, int flags)
> +static int get_any_page(struct page *page, unsigned long pfn)
>  {
> -	int ret = __get_any_page(page, pfn, flags);
> +	int ret = __get_any_page(page, pfn);
>  
>  	if (ret == 1 && !PageHuge(page) &&
>  	    !PageLRU(page) && !__PageMovable(page)) {
> @@ -1653,7 +1653,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
>  		/*
>  		 * Did it turn free?
>  		 */
> -		ret = __get_any_page(page, pfn, 0);
> +		ret = __get_any_page(page, pfn);
>  		if (ret == 1 && !PageLRU(page)) {
>  			/* Drop page reference which is from __get_any_page() */
>  			put_hwpoison_page(page);
> @@ -1665,7 +1665,7 @@ static int get_any_page(struct page *page, unsigned long pfn, int flags)
>  	return ret;
>  }
>  
> -static int soft_offline_huge_page(struct page *page, int flags)
> +static int soft_offline_huge_page(struct page *page)
>  {
>  	int ret;
>  	unsigned long pfn = page_to_pfn(page);
> @@ -1724,7 +1724,7 @@ static int soft_offline_huge_page(struct page *page, int flags)
>  	return ret;
>  }
>  
> -static int __soft_offline_page(struct page *page, int flags)
> +static int __soft_offline_page(struct page *page)
>  {
>  	int ret;
>  	unsigned long pfn = page_to_pfn(page);
> @@ -1804,7 +1804,7 @@ static int __soft_offline_page(struct page *page, int flags)
>  	return ret;
>  }
>  
> -static int soft_offline_in_use_page(struct page *page, int flags)
> +static int soft_offline_in_use_page(struct page *page)
>  {
>  	int ret;
>  	int mt;
> @@ -1834,9 +1834,9 @@ static int soft_offline_in_use_page(struct page *page, int flags)
>  	mt = get_pageblock_migratetype(page);
>  	set_pageblock_migratetype(page, MIGRATE_ISOLATE);
>  	if (PageHuge(page))
> -		ret = soft_offline_huge_page(page, flags);
> +		ret = soft_offline_huge_page(page);
>  	else
> -		ret = __soft_offline_page(page, flags);
> +		ret = __soft_offline_page(page);
>  	set_pageblock_migratetype(page, mt);
>  	return ret;
>  }
> @@ -1857,7 +1857,6 @@ static int soft_offline_free_page(struct page *page)
>  /**
>   * soft_offline_page - Soft offline a page.
>   * @page: page to offline
> - * @flags: flags. Same as memory_failure().
>   *
>   * Returns 0 on success, otherwise negated errno.
>   *
> @@ -1876,7 +1875,7 @@ static int soft_offline_free_page(struct page *page)
>   * This is not a 100% solution for all memory, but tries to be
>   * ``good enough'' for the majority of memory.
>   */
> -int soft_offline_page(struct page *page, int flags)
> +int soft_offline_page(struct page *page)
>  {
>  	int ret;
>  	unsigned long pfn = page_to_pfn(page);
> @@ -1893,11 +1892,11 @@ int soft_offline_page(struct page *page, int flags)
>  	}
>  
>  	get_online_mems();
> -	ret = get_any_page(page, pfn, flags);
> +	ret = get_any_page(page, pfn);
>  	put_online_mems();
>  
>  	if (ret > 0)
> -		ret = soft_offline_in_use_page(page, flags);
> +		ret = soft_offline_in_use_page(page);
>  	else if (ret == 0)
>  		ret = soft_offline_free_page(page);
>  
> 

Acked-by: David Hildenbrand <david@redhat.com>

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2019-09-10 10:30 ` [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
  2019-09-11 10:23   ` David Hildenbrand
@ 2019-09-11 10:27   ` David Hildenbrand
  2019-09-12  1:37     ` Naoya Horiguchi
  1 sibling, 1 reply; 26+ messages in thread
From: David Hildenbrand @ 2019-09-11 10:27 UTC (permalink / raw)
  To: Oscar Salvador, n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel

On 10.09.19 12:30, Oscar Salvador wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> Currently madvise_inject_error() pins the target via get_user_pages_fast.
> The call to get_user_pages_fast is only to get the respective page
> of a given address, but it is the job of the memory-poisoning handler
> to deal with races, so drop the refcount grabbed by get_user_pages_fast.
> 

Oh, and another question "it is the job of the memory-poisoning handler"
- is that already properly implemented? (newbee question ¯\_(ツ)_/¯)

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2019-09-11 10:23   ` David Hildenbrand
@ 2019-09-12  1:28     ` Naoya Horiguchi
  0 siblings, 0 replies; 26+ messages in thread
From: Naoya Horiguchi @ 2019-09-12  1:28 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, mhocko, mike.kravetz, linux-mm, linux-kernel

Hi David,

On Wed, Sep 11, 2019 at 12:23:24PM +0200, David Hildenbrand wrote:
> On 10.09.19 12:30, Oscar Salvador wrote:
> > From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > 
> > Currently madvise_inject_error() pins the target via get_user_pages_fast.
> > The call to get_user_pages_fast is only to get the respective page
> > of a given address, but it is the job of the memory-poisoning handler
> > to deal with races, so drop the refcount grabbed by get_user_pages_fast.
> > 
> > Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > Signed-off-by: Oscar Salvador <osalvador@suse.de>
> > ---
> >  mm/madvise.c | 25 +++++++++++--------------
> >  1 file changed, 11 insertions(+), 14 deletions(-)
> > 
> > diff --git a/mm/madvise.c b/mm/madvise.c
> > index 6e023414f5c1..fbe6d402232c 100644
> > --- a/mm/madvise.c
> > +++ b/mm/madvise.c
> > @@ -883,6 +883,16 @@ static int madvise_inject_error(int behavior,
> >  		ret = get_user_pages_fast(start, 1, 0, &page);
> >  		if (ret != 1)
> >  			return ret;
> > +		/*
> > +		 * The get_user_pages_fast() is just to get the pfn of the
> > +		 * given address, and the refcount has nothing to do with
> > +		 * what we try to test, so it should be released immediately.
> > +		 * This is racy but it's intended because the real hardware
> > +		 * errors could happen at any moment and memory error handlers
> > +		 * must properly handle the race.
> > +		 */
> > +		put_page(page);
> > +
> 
> I wonder if it would be clearer to do that after the page has been fully
> used  - e.g. after getting the pfn and the order (and then e.g.,
> symbolically setting the page pointer to 0).

Yes, this could be called just after page_to_pfn() below.

> I guess the important part of this patch is to not have an elevated
> refcount while calling soft_offline_page().
> 

That's right.

> >  		pfn = page_to_pfn(page);
> >  
> >  		/*
> > @@ -892,16 +902,11 @@ static int madvise_inject_error(int behavior,
> >  		 */
> >  		order = compound_order(compound_head(page));
> >  
> > -		if (PageHWPoison(page)) {
> > -			put_page(page);
> > -			continue;
> > -		}
> 
> This change is not reflected in the changelog. I would have expected
> that only the put_page() would go. If this should go completely, I
> suggest a separate patch.
> 

I forget why I tried to remove the if block, and now I think only the
put_page() should go as you point out.

Thanks for the comment.

- Naoya


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2019-09-11 10:27   ` David Hildenbrand
@ 2019-09-12  1:37     ` Naoya Horiguchi
  2019-09-13  8:37       ` David Hildenbrand
  0 siblings, 1 reply; 26+ messages in thread
From: Naoya Horiguchi @ 2019-09-12  1:37 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Oscar Salvador, mhocko, mike.kravetz, linux-mm, linux-kernel

On Wed, Sep 11, 2019 at 12:27:22PM +0200, David Hildenbrand wrote:
> On 10.09.19 12:30, Oscar Salvador wrote:
> > From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> > 
> > Currently madvise_inject_error() pins the target via get_user_pages_fast.
> > The call to get_user_pages_fast is only to get the respective page
> > of a given address, but it is the job of the memory-poisoning handler
> > to deal with races, so drop the refcount grabbed by get_user_pages_fast.
> > 
> 
> Oh, and another question "it is the job of the memory-poisoning handler"
> - is that already properly implemented? (newbee question ¯\_(ツ)_/¯)

The above description might be confusing, sorry. It's intended likes

  The call to get_user_pages_fast is only to get the pointer to struct
  page of a given address, pinning it is memory-poisoning handler's job,
  so drop the refcount grabbed by get_user_pages_fast.

And pinning is done in get_hwpoison_page() for hard-offline and
get_any_page() for soft-offline.  For soft-offline case, the semantics of
refcount of poisoned pages is what this patchset tries to change/improve.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 00/10] Hwpoison soft-offline rework
  2019-09-11  7:21       ` Naoya Horiguchi
@ 2019-09-12 13:16         ` Oscar Salvador
  0 siblings, 0 replies; 26+ messages in thread
From: Oscar Salvador @ 2019-09-12 13:16 UTC (permalink / raw)
  To: Naoya Horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel

> It's available on https://github.com/Naoya-Horiguchi/mm_regression.
> The README is a bit obsolete (sorry about that ...,) but you can run
> the testcase like below:
> 
>   $ git clone https://github.com/Naoya-Horiguchi/mm_regression
>   $ cd mm_regression
>   mm_regression $ git clone https://github.com/Naoya-Horiguchi/test_c
> ore 
>   mm_regression $ make
>   // you might need to install some dependencies like numa library
> and mce-inject tool
>   mm_regression $ make update_recipes
> 
> To run the single testcase, run the commands like below:
> 
>   mm_regression $
> RECIPEFILES=cases/page_migration/hugetlb_migratepages_allocate1_noove
> rcommit.auto2 bash run.sh
>   mm_regression $ RECIPEFILES=cases/cases/mce_ksm_soft-
> offline_avoid_access.auto2 bash run.sh
>   
> You can run a set of many testcases with the commands like below:
> 
>   mm_regression $ RECIPEFILES=cases/cases/mce_ksm_* bash run.sh
>   // run all ksm related testcases. I reproduced the panic with this
> command.
> 
>   mm_regression $ run_class=simple bash run.sh
>   // run the set of minimum testcases I run for each releases.
> 
> Hopefully this will help you.

Great, I was able to reproduce it.

I will be working on solving it.

Thanks Naoya!

> 
> Thanks,
> Naoya Horiguchi
> 
-- 
Oscar Salvador
SUSE L3


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED
  2019-09-12  1:37     ` Naoya Horiguchi
@ 2019-09-13  8:37       ` David Hildenbrand
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2019-09-13  8:37 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Oscar Salvador, mhocko, mike.kravetz, linux-mm, linux-kernel

On 12.09.19 03:37, Naoya Horiguchi wrote:
> On Wed, Sep 11, 2019 at 12:27:22PM +0200, David Hildenbrand wrote:
>> On 10.09.19 12:30, Oscar Salvador wrote:
>>> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
>>>
>>> Currently madvise_inject_error() pins the target via get_user_pages_fast.
>>> The call to get_user_pages_fast is only to get the respective page
>>> of a given address, but it is the job of the memory-poisoning handler
>>> to deal with races, so drop the refcount grabbed by get_user_pages_fast.
>>>
>>
>> Oh, and another question "it is the job of the memory-poisoning handler"
>> - is that already properly implemented? (newbee question ¯\_(ツ)_/¯)
> 
> The above description might be confusing, sorry. It's intended likes
> 
>   The call to get_user_pages_fast is only to get the pointer to struct
>   page of a given address, pinning it is memory-poisoning handler's job,
>   so drop the refcount grabbed by get_user_pages_fast.
> 
> And pinning is done in get_hwpoison_page() for hard-offline and
> get_any_page() for soft-offline.  For soft-offline case, the semantics of
> refcount of poisoned pages is what this patchset tries to change/improve.

Thanks for the clarification!

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 26+ messages in thread

* Re: [PATCH 03/10] mm,hwpoison-inject: don't pin for hwpoison_filter
  2019-09-10 10:30 ` [PATCH 03/10] mm,hwpoison-inject: don't pin for hwpoison_filter Oscar Salvador
@ 2019-09-16  7:41   ` David Hildenbrand
  0 siblings, 0 replies; 26+ messages in thread
From: David Hildenbrand @ 2019-09-16  7:41 UTC (permalink / raw)
  To: Oscar Salvador, n-horiguchi; +Cc: mhocko, mike.kravetz, linux-mm, linux-kernel

On 10.09.19 12:30, Oscar Salvador wrote:
> From: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> 
> Another memory error injection interface debugfs:hwpoison/corrupt-pfn
> also takes bogus refcount for hwpoison_filter(). It's justified
> because this does a coarse filter, expecting that memory_failure()
> redoes the check for sure.
> 
> Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
> Signed-off-by: Oscar Salvador <osalvador@suse.de>
> ---
>  mm/hwpoison-inject.c | 18 +++++-------------
>  1 file changed, 5 insertions(+), 13 deletions(-)
> 
> diff --git a/mm/hwpoison-inject.c b/mm/hwpoison-inject.c
> index 5b7430bd83a6..0c8cdb80fd7d 100644
> --- a/mm/hwpoison-inject.c
> +++ b/mm/hwpoison-inject.c
> @@ -26,11 +26,6 @@ static int hwpoison_inject(void *data, u64 val)
>  
>  	p = pfn_to_page(pfn);
>  	hpage = compound_head(p);
> -	/*
> -	 * This implies unable to support free buddy pages.
> -	 */
> -	if (!get_hwpoison_page(p))
> -		return 0;
>  
>  	if (!hwpoison_filter_enable)
>  		goto inject;
> @@ -40,23 +35,20 @@ static int hwpoison_inject(void *data, u64 val)
>  	 * This implies unable to support non-LRU pages.
>  	 */
>  	if (!PageLRU(hpage) && !PageHuge(p))
> -		goto put_out;
> +		return 0;
>  
>  	/*
> -	 * do a racy check with elevated page count, to make sure PG_hwpoison
> -	 * will only be set for the targeted owner (or on a free page).
> +	 * do a racy check to make sure PG_hwpoison will only be set for
> +	 * the targeted owner (or on a free page).
>  	 * memory_failure() will redo the check reliably inside page lock.
>  	 */
>  	err = hwpoison_filter(hpage);
>  	if (err)
> -		goto put_out;
> +		return 0;
>  
>  inject:
>  	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
> -	return memory_failure(pfn, MF_COUNT_INCREASED);
> -put_out:
> -	put_hwpoison_page(p);
> -	return 0;
> +	return memory_failure(pfn, 0);
>  }
>  
>  static int hwpoison_unpoison(void *data, u64 val)
> 

After this change, get_hwpoison_page() is only used in
mm/memory-failure.c - you might be able to un-export it and make it static.

-- 

Thanks,

David / dhildenb


^ permalink raw reply	[flat|nested] 26+ messages in thread

end of thread, other threads:[~2019-09-16  7:42 UTC | newest]

Thread overview: 26+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2019-09-10 10:30 [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
2019-09-10 10:30 ` [PATCH 01/10] mm,hwpoison: cleanup unused PageHuge() check Oscar Salvador
2019-09-11 10:17   ` David Hildenbrand
2019-09-10 10:30 ` [PATCH 02/10] mm,madvise: call soft_offline_page() without MF_COUNT_INCREASED Oscar Salvador
2019-09-11 10:23   ` David Hildenbrand
2019-09-12  1:28     ` Naoya Horiguchi
2019-09-11 10:27   ` David Hildenbrand
2019-09-12  1:37     ` Naoya Horiguchi
2019-09-13  8:37       ` David Hildenbrand
2019-09-10 10:30 ` [PATCH 03/10] mm,hwpoison-inject: don't pin for hwpoison_filter Oscar Salvador
2019-09-16  7:41   ` David Hildenbrand
2019-09-10 10:30 ` [PATCH 04/10] mm,hwpoison: remove MF_COUNT_INCREASED Oscar Salvador
2019-09-11 10:24   ` David Hildenbrand
2019-09-10 10:30 ` [PATCH 05/10] mm: remove flag argument from soft offline functions Oscar Salvador
2019-09-11 10:24   ` David Hildenbrand
2019-09-10 10:30 ` [PATCH 06/10] mm,hwpoison: Unify THP handling for hard and soft offline Oscar Salvador
2019-09-10 10:30 ` [PATCH 07/10] mm,hwpoison: Rework soft offline for in-use pages Oscar Salvador
2019-09-10 10:30 ` [PATCH 08/10] mm,hwpoison: Refactor soft_offline_huge_page and __soft_offline_page Oscar Salvador
2019-09-10 10:30 ` [PATCH 09/10] mm,hwpoison: Rework soft offline for free pages Oscar Salvador
2019-09-10 10:30 ` [PATCH 10/10] mm,hwpoison: Use hugetlb_replace_page to replace free hugetlb pages Oscar Salvador
2019-09-10 10:32 ` [PATCH 00/10] Hwpoison soft-offline rework Oscar Salvador
2019-09-11  5:29 ` Naoya Horiguchi
2019-09-11  6:22   ` Naoya Horiguchi
2019-09-11  6:35     ` osalvador
2019-09-11  7:21       ` Naoya Horiguchi
2019-09-12 13:16         ` Oscar Salvador

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).