* [PATCH 0/4] hwpoison fixes for v4.2
@ 2015-05-12  9:46 ` Naoya Horiguchi
  0 siblings, 0 replies; 14+ messages in thread
From: Naoya Horiguchi @ 2015-05-12  9:46 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen
  Cc: Tony Luck, Kirill A. Shutemov, linux-mm, linux-kernel, Naoya Horiguchi

This patchset addresses some long-standing issues in hwpoison handling. The
details of each bug are explained in the individual patches. In summary:

Patch 1: fix wrong behavior when a thp split fails

Patch 2: fix the inconsistent refcounting problem on thp tail pages

Patch 3: fix isolation in soft offlining by keeping the refcount

Patch 4: potential fix for me_huge_page()

The user-visible effect of the bugs fixed by patches 1 to 3 is a kernel panic
via BUG_ON, so I believe this patchset makes hwpoison handling more reliable.

This series is based on v4.1-rc3 + Xie XiuQi's patch "memory-failure:
export page_type and action result".

Thanks,
Naoya Horiguchi
---
Tree: https://github.com/Naoya-Horiguchi/linux/tree/v4.1-rc3/hwpoison_for_v4.2
---
Summary:

Naoya Horiguchi (4):
      mm/memory-failure: split thp earlier in memory error handling
      mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling
      mm: soft-offline: don't free target page in successful page migration
      mm/memory-failure: me_huge_page() does nothing for thp

 include/linux/mm.h   |   1 +
 mm/hwpoison-inject.c |   4 +-
 mm/memory-failure.c  | 164 +++++++++++++++++++--------------------------------
 mm/migrate.c         |   9 ++-
 mm/swap.c            |   2 -
 5 files changed, 70 insertions(+), 110 deletions(-)

* [PATCH 4/4] mm/memory-failure: me_huge_page() does nothing for thp
  2015-05-12  9:46 ` Naoya Horiguchi
@ 2015-05-12  9:46   ` Naoya Horiguchi
  -1 siblings, 0 replies; 14+ messages in thread
From: Naoya Horiguchi @ 2015-05-12  9:46 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen
  Cc: Tony Luck, Kirill A. Shutemov, linux-mm, linux-kernel, Naoya Horiguchi

memory_failure() is not supposed to handle thp itself, but to split it first.
But if something goes wrong and page_action() is called on a thp, it is better
for me_huge_page() (the action routine for hugepages) to take no action rather
than the wrong action intended for hugetlb (which triggers BUG_ON()).

This change addresses a potential problem rather than an observed one, but it
makes sense to me because thp is an actively developed feature and this code
path could become reachable in the future.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/memory-failure.c | 4 ++++
 1 file changed, 4 insertions(+)

diff --git v4.1-rc3.orig/mm/memory-failure.c v4.1-rc3/mm/memory-failure.c
index 918256de15bf..6b5bdc575496 100644
--- v4.1-rc3.orig/mm/memory-failure.c
+++ v4.1-rc3/mm/memory-failure.c
@@ -743,6 +743,10 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 {
 	int res = 0;
 	struct page *hpage = compound_head(p);
+
+	if (!PageHuge(hpage))
+		return MF_DELAYED;
+
 	/*
 	 * We can safely recover from error on free or reserved (i.e.
 	 * not in-use) hugepage by dequeuing it from freelist.
-- 
2.1.0

* [PATCH 2/4] mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling
  2015-05-12  9:46 ` Naoya Horiguchi
@ 2015-05-12  9:46   ` Naoya Horiguchi
  -1 siblings, 0 replies; 14+ messages in thread
From: Naoya Horiguchi @ 2015-05-12  9:46 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen
  Cc: Tony Luck, Kirill A. Shutemov, linux-mm, linux-kernel, Naoya Horiguchi

memory_failure() can run in two different modes (selected by MF_COUNT_INCREASED)
from the page refcount perspective. When MF_COUNT_INCREASED is set,
memory_failure() assumes that the caller has already taken a refcount on the
target page. When it is cleared, memory_failure() takes the refcount itself.

In the current code, however, refcounting is done differently in each caller.
For example, madvise_hwpoison() uses get_user_pages_fast() while
hwpoison_inject() uses get_page_unless_zero(). This inconsistent refcounting
causes refcount failures, especially for thp tail pages. Typical user-visible
effects are a memory leak or VM_BUG_ON_PAGE(!page_count(page)) in
isolate_lru_page().
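
For illustration, a minimal sketch of the two conventions from the callers'
side (simplified and paraphrased, not the actual call sites):

	/* madvise(MADV_HWPOISON): page already pinned via GUP, so the flag is set */
	if (get_user_pages_fast(start, 1, 0, &page) != 1)
		return -EIO;
	memory_failure(page_to_pfn(page), 0, MF_COUNT_INCREASED);

	/* hwpoison-inject: starts from a raw pfn and pins the page itself */
	p = pfn_to_page(pfn);
	if (!get_page_unless_zero(compound_head(p)))
		return 0;
	memory_failure(pfn, 18, MF_COUNT_INCREASED);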

To fix this refcounting issue, this patch introduces get_hwpoison_page() so
that every caller of the hwpoison code handles thp tail pages in the same
manner.

There's a non-trivial change around unpoisoning, which currently returns
immediately for thp with the "MCE: Memory failure is now running on %#lx\n"
message. This is not right when split_huge_page() fails. So this patch also
allows unpoison_memory() to handle thp.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 include/linux/mm.h   |  1 +
 mm/hwpoison-inject.c |  4 ++--
 mm/memory-failure.c  | 50 ++++++++++++++++++++++++++++++++------------------
 mm/swap.c            |  2 --
 4 files changed, 35 insertions(+), 22 deletions(-)

diff --git v4.1-rc3.orig/include/linux/mm.h v4.1-rc3/include/linux/mm.h
index 0632deaefba0..cbcf7b9d21af 100644
--- v4.1-rc3.orig/include/linux/mm.h
+++ v4.1-rc3/include/linux/mm.h
@@ -2146,6 +2146,7 @@ enum mf_flags {
 extern int memory_failure(unsigned long pfn, int trapno, int flags);
 extern void memory_failure_queue(unsigned long pfn, int trapno, int flags);
 extern int unpoison_memory(unsigned long pfn);
+extern int get_hwpoison_page(struct page *page);
 extern int sysctl_memory_failure_early_kill;
 extern int sysctl_memory_failure_recovery;
 extern void shake_page(struct page *p, int access);
diff --git v4.1-rc3.orig/mm/hwpoison-inject.c v4.1-rc3/mm/hwpoison-inject.c
index 4ca5fe0042e1..bf73ac17dad4 100644
--- v4.1-rc3.orig/mm/hwpoison-inject.c
+++ v4.1-rc3/mm/hwpoison-inject.c
@@ -28,7 +28,7 @@ static int hwpoison_inject(void *data, u64 val)
 	/*
 	 * This implies unable to support free buddy pages.
 	 */
-	if (!get_page_unless_zero(hpage))
+	if (!get_hwpoison_page(p))
 		return 0;
 
 	if (!hwpoison_filter_enable)
@@ -58,7 +58,7 @@ static int hwpoison_inject(void *data, u64 val)
 	pr_info("Injecting memory failure at pfn %#lx\n", pfn);
 	return memory_failure(pfn, 18, MF_COUNT_INCREASED);
 put_out:
-	put_page(hpage);
+	put_page(p);
 	return 0;
 }
 
diff --git v4.1-rc3.orig/mm/memory-failure.c v4.1-rc3/mm/memory-failure.c
index 331c75b23ba2..6d4cc5442b1a 100644
--- v4.1-rc3.orig/mm/memory-failure.c
+++ v4.1-rc3/mm/memory-failure.c
@@ -885,6 +885,28 @@ static int page_action(struct page_state *ps, struct page *p,
 }
 
 /*
+ * Get refcount for memory error handling:
+ * - @page: raw page
+ */
+inline int get_hwpoison_page(struct page *page)
+{
+	struct page *head = compound_head(page);
+
+	if (PageHuge(head))
+		return get_page_unless_zero(head);
+	else if (PageTransHuge(head))
+		if (get_page_unless_zero(head)) {
+			if (PageTail(page))
+				get_page(page);
+			return 1;
+		} else {
+			return 0;
+		}
+	else
+		return get_page_unless_zero(page);
+}
+
+/*
  * Do all that is necessary to remove user space mappings. Unmap
  * the pages and send SIGBUS to the processes if the data was dirty.
  */
@@ -1066,8 +1088,7 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
 	 * In fact it's dangerous to directly bump up page count from 0,
 	 * that may make page_freeze_refs()/page_unfreeze_refs() mismatch.
 	 */
-	if (!(flags & MF_COUNT_INCREASED) &&
-		!get_page_unless_zero(hpage)) {
+	if (!(flags & MF_COUNT_INCREASED) && !get_hwpoison_page(p)) {
 		if (is_free_buddy_page(p)) {
 			action_result(pfn, MF_MSG_BUDDY, MF_DELAYED);
 			return 0;
@@ -1375,19 +1396,12 @@ int unpoison_memory(unsigned long pfn)
 		return 0;
 	}
 
-	/*
-	 * unpoison_memory() can encounter thp only when the thp is being
-	 * worked by memory_failure() and the page lock is not held yet.
-	 * In such case, we yield to memory_failure() and make unpoison fail.
-	 */
-	if (!PageHuge(page) && PageTransHuge(page)) {
-		pr_info("MCE: Memory failure is now running on %#lx\n", pfn);
-			return 0;
-	}
-
-	nr_pages = 1 << compound_order(page);
+	if (PageHuge(page))
+		nr_pages = 1 << compound_order(page);
+	else
+		nr_pages = 1;
 
-	if (!get_page_unless_zero(page)) {
+	if (!get_hwpoison_page(p)) {
 		/*
 		 * Since HWPoisoned hugepage should have non-zero refcount,
 		 * race between memory failure and unpoison seems to happen.
@@ -1411,7 +1425,7 @@ int unpoison_memory(unsigned long pfn)
 	 * the PG_hwpoison page will be caught and isolated on the entrance to
 	 * the free buddy page pool.
 	 */
-	if (TestClearPageHWPoison(page)) {
+	if (TestClearPageHWPoison(p)) {
 		pr_info("MCE: Software-unpoisoned page %#lx\n", pfn);
 		atomic_long_sub(nr_pages, &num_poisoned_pages);
 		freeit = 1;
@@ -1420,9 +1434,9 @@ int unpoison_memory(unsigned long pfn)
 	}
 	unlock_page(page);
 
-	put_page(page);
+	put_page(p);
 	if (freeit && !(pfn == my_zero_pfn(0) && page_count(p) == 1))
-		put_page(page);
+		put_page(p);
 
 	return 0;
 }
@@ -1455,7 +1469,7 @@ static int __get_any_page(struct page *p, unsigned long pfn, int flags)
 	 * When the target page is a free hugepage, just remove it
 	 * from free hugepage list.
 	 */
-	if (!get_page_unless_zero(compound_head(p))) {
+	if (!get_hwpoison_page(p)) {
 		if (PageHuge(p)) {
 			pr_info("%s: %#lx free huge page\n", __func__, pfn);
 			ret = 0;
diff --git v4.1-rc3.orig/mm/swap.c v4.1-rc3/mm/swap.c
index a7251a8ed532..c303c1c0e4f3 100644
--- v4.1-rc3.orig/mm/swap.c
+++ v4.1-rc3/mm/swap.c
@@ -210,8 +210,6 @@ void put_refcounted_compound_page(struct page *page_head, struct page *page)
 		 */
 		if (put_page_testzero(page_head))
 			VM_BUG_ON_PAGE(1, page_head);
-		/* __split_huge_page_refcount will wait now */
-		VM_BUG_ON_PAGE(page_mapcount(page) <= 0, page);
 		atomic_dec(&page->_mapcount);
 		VM_BUG_ON_PAGE(atomic_read(&page_head->_count) <= 0, page_head);
 		VM_BUG_ON_PAGE(atomic_read(&page->_count) != 0, page);
-- 
2.1.0

* [PATCH 1/4] mm/memory-failure: split thp earlier in memory error handling
  2015-05-12  9:46 ` Naoya Horiguchi
@ 2015-05-12  9:46   ` Naoya Horiguchi
  -1 siblings, 0 replies; 14+ messages in thread
From: Naoya Horiguchi @ 2015-05-12  9:46 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen
  Cc: Tony Luck, Kirill A. Shutemov, linux-mm, linux-kernel, Naoya Horiguchi

memory_failure() doesn't handle thp itself at this time and needs to split it
before doing isolation. Currently thp is split in the middle of
hwpoison_user_mappings(), but there are corner cases where memory_failure()
wrongly tries to handle a thp without splitting it:
  1) a "non anonymous" thp, which is not a normal operating mode of thp, but
a memory error could hit a thp before its anon_vma is initialized. In such a
case, split_huge_page() fails and me_huge_page() (intended for hugetlb) is
called for the thp, which triggers a BUG_ON in page_hstate().
  2) the !PageLRU case, where hwpoison_user_mappings() returns with
SWAP_SUCCESS and the result is the same as in case 1.

memory_failure() can't avoid splitting, so let's split the thp earlier, which
also reduces the code that has to be prepared for both normal pages and thp.

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/memory-failure.c | 88 +++++++++++++++--------------------------------------
 1 file changed, 25 insertions(+), 63 deletions(-)

diff --git v4.1-rc3.orig/mm/memory-failure.c v4.1-rc3/mm/memory-failure.c
index 9e9d04843d52..331c75b23ba2 100644
--- v4.1-rc3.orig/mm/memory-failure.c
+++ v4.1-rc3/mm/memory-failure.c
@@ -897,7 +897,6 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	int ret;
 	int kill = 1, forcekill;
 	struct page *hpage = *hpagep;
-	struct page *ppage;
 
 	/*
 	 * Here we are interested only in user-mapped pages, so skip any
@@ -947,59 +946,6 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	}
 
 	/*
-	 * ppage: poisoned page
-	 *   if p is regular page(4k page)
-	 *        ppage == real poisoned page;
-	 *   else p is hugetlb or THP, ppage == head page.
-	 */
-	ppage = hpage;
-
-	if (PageTransHuge(hpage)) {
-		/*
-		 * Verify that this isn't a hugetlbfs head page, the check for
-		 * PageAnon is just for avoid tripping a split_huge_page
-		 * internal debug check, as split_huge_page refuses to deal with
-		 * anything that isn't an anon page. PageAnon can't go away fro
-		 * under us because we hold a refcount on the hpage, without a
-		 * refcount on the hpage. split_huge_page can't be safely called
-		 * in the first place, having a refcount on the tail isn't
-		 * enough * to be safe.
-		 */
-		if (!PageHuge(hpage) && PageAnon(hpage)) {
-			if (unlikely(split_huge_page(hpage))) {
-				/*
-				 * FIXME: if splitting THP is failed, it is
-				 * better to stop the following operation rather
-				 * than causing panic by unmapping. System might
-				 * survive if the page is freed later.
-				 */
-				printk(KERN_INFO
-					"MCE %#lx: failed to split THP\n", pfn);
-
-				BUG_ON(!PageHWPoison(p));
-				return SWAP_FAIL;
-			}
-			/*
-			 * We pinned the head page for hwpoison handling,
-			 * now we split the thp and we are interested in
-			 * the hwpoisoned raw page, so move the refcount
-			 * to it. Similarly, page lock is shifted.
-			 */
-			if (hpage != p) {
-				if (!(flags & MF_COUNT_INCREASED)) {
-					put_page(hpage);
-					get_page(p);
-				}
-				lock_page(p);
-				unlock_page(hpage);
-				*hpagep = p;
-			}
-			/* THP is split, so ppage should be the real poisoned page. */
-			ppage = p;
-		}
-	}
-
-	/*
 	 * First collect all the processes that have the page
 	 * mapped in dirty form.  This has to be done before try_to_unmap,
 	 * because ttu takes the rmap data structures down.
@@ -1008,12 +954,12 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * there's nothing that can be done.
 	 */
 	if (kill)
-		collect_procs(ppage, &tokill, flags & MF_ACTION_REQUIRED);
+		collect_procs(hpage, &tokill, flags & MF_ACTION_REQUIRED);
 
-	ret = try_to_unmap(ppage, ttu);
+	ret = try_to_unmap(hpage, ttu);
 	if (ret != SWAP_SUCCESS)
 		printk(KERN_ERR "MCE %#lx: failed to unmap page (mapcount=%d)\n",
-				pfn, page_mapcount(ppage));
+				pfn, page_mapcount(hpage));
 
 	/*
 	 * Now that the dirty bit has been propagated to the
@@ -1025,7 +971,7 @@ static int hwpoison_user_mappings(struct page *p, unsigned long pfn,
 	 * use a more force-full uncatchable kill to prevent
 	 * any accesses to the poisoned memory.
 	 */
-	forcekill = PageDirty(ppage) || (flags & MF_MUST_KILL);
+	forcekill = PageDirty(hpage) || (flags & MF_MUST_KILL);
 	kill_procs(&tokill, forcekill, trapno,
 		      ret != SWAP_SUCCESS, p, pfn, flags);
 
@@ -1071,6 +1017,7 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
 	struct page_state *ps;
 	struct page *p;
 	struct page *hpage;
+	struct page *orig_head;
 	int res;
 	unsigned int nr_pages;
 	unsigned long page_flags;
@@ -1086,7 +1033,7 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
 	}
 
 	p = pfn_to_page(pfn);
-	hpage = compound_head(p);
+	orig_head = hpage = compound_head(p);
 	if (TestSetPageHWPoison(p)) {
 		printk(KERN_ERR "MCE %#lx: already hardware poisoned\n", pfn);
 		return 0;
@@ -1149,6 +1096,21 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
 		}
 	}
 
+	if (!PageHuge(p) && PageTransHuge(hpage)) {
+		if (!PageAnon(hpage)) {
+			pr_info("MCE: %#lx: non anonymous thp", pfn);
+			put_page(p);
+			return -EBUSY;
+		}
+		if (unlikely(split_huge_page(hpage))) {
+			pr_info("MCE: %#lx: thp split failed", pfn);
+			put_page(p);
+			return -EBUSY;
+		}
+		VM_BUG_ON_PAGE(!page_count(p), p);
+		hpage = compound_head(p);
+	}
+
 	/*
 	 * We ignore non-LRU pages for good reasons.
 	 * - PG_locked is only well defined for LRU pages and a few others
@@ -1158,9 +1120,9 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
 	 * walked by the page reclaim code, however that's not a big loss.
 	 */
 	if (!PageHuge(p)) {
-		if (!PageLRU(hpage))
-			shake_page(hpage, 0);
-		if (!PageLRU(hpage)) {
+		if (!PageLRU(p))
+			shake_page(p, 0);
+		if (!PageLRU(p)) {
 			/*
 			 * shake_page could have turned it free.
 			 */
@@ -1181,7 +1143,7 @@ int memory_failure(unsigned long pfn, int trapno, int flags)
 	 * The page could have changed compound pages during the locking.
 	 * If this happens just bail out.
 	 */
-	if (compound_head(p) != hpage) {
+	if (PageCompound(p) && compound_head(p) != orig_head) {
 		action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
 		res = -EBUSY;
 		goto out;
-- 
2.1.0

* [PATCH 3/4] mm: soft-offline: don't free target page in successful page migration
  2015-05-12  9:46 ` Naoya Horiguchi
@ 2015-05-12  9:46   ` Naoya Horiguchi
  -1 siblings, 0 replies; 14+ messages in thread
From: Naoya Horiguchi @ 2015-05-12  9:46 UTC (permalink / raw)
  To: Andrew Morton, Andi Kleen
  Cc: Tony Luck, Kirill A. Shutemov, linux-mm, linux-kernel, Naoya Horiguchi

Stress testing showed that soft offline events for a process iterating
"mmap-pagefault-munmap" loop can trigger VM_BUG_ON(PAGE_FLAGS_CHECK_AT_PREP)
in __free_one_page():

  [   14.025761] Soft offlining page 0x70fe1 at 0x70100008d000
  [   14.029400] Soft offlining page 0x705fb at 0x70300008d000
  [   14.030208] page:ffffea0001c3f840 count:0 mapcount:0 mapping:          (null) index:0x2
  [   14.031186] flags: 0x1fffff80800000(hwpoison)
  [   14.031186] page dumped because: VM_BUG_ON_PAGE(page->flags & ((1 << 25) - 1))
  [   14.031186] ------------[ cut here ]------------
  [   14.031186] kernel BUG at /src/linux-dev/mm/page_alloc.c:585!
  [   14.031186] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC
  [   14.031186] Modules linked in: cfg80211 rfkill crc32c_intel microcode ppdev parport_pc pcspkr serio_raw virtio_balloon parport i2c_piix4 virtio_blk virtio_net ata_generic pata_acpi floppy
  [   14.031186] CPU: 3 PID: 1779 Comm: test_base_madv_ Not tainted 4.0.0-v4.0-150511-1451-00009-g82360a3730e6 #139
  [   14.031186] Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  [   14.031186] task: ffff88007d33bae0 ti: ffff88007a114000 task.ti: ffff88007a114000
  [   14.031186] RIP: 0010:[<ffffffff811a806a>]  [<ffffffff811a806a>] free_pcppages_bulk+0x52a/0x6f0
  [   14.031186] RSP: 0018:ffff88007a117d28  EFLAGS: 00010096
  [   14.031186] RAX: 0000000000000042 RBX: ffffea0001c3f860 RCX: 0000000000000006
  [   14.031186] RDX: 0000000000000007 RSI: 0000000000000000 RDI: ffff88011f50d3d0
  [   14.031186] RBP: ffff88007a117da8 R08: 000000000000000a R09: 00000000fffffffe
  [   14.031186] R10: 0000000000001d3e R11: 0000000000000002 R12: 0000000000070fe1
  [   14.031186] R13: 0000000000000000 R14: 0000000000000000 R15: ffffea0001c3f840
  [   14.031186] FS:  00007f8a8e3e1740(0000) GS:ffff88011f500000(0000) knlGS:0000000000000000
  [   14.031186] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
  [   14.031186] CR2: 00007f78c7341258 CR3: 000000007bb08000 CR4: 00000000000007e0
  [   14.031186] Stack:
  [   14.031186]  ffff88011f5189c8 ffff88011f5189b8 ffffea0001c3f840 ffff88011f518998
  [   14.031186]  ffffea0001d30cc0 0000001200000002 0000000200000012 0000000000000003
  [   14.031186]  ffff88007ffda6c0 000000000000000a ffff88007a117dd8 ffff88011f518998
  [   14.031186] Call Trace:
  [   14.031186]  [<ffffffff811a8380>] ? page_alloc_cpu_notify+0x50/0x50
  [   14.031186]  [<ffffffff811a82bd>] drain_pages_zone+0x3d/0x50
  [   14.031186]  [<ffffffff811a839d>] drain_local_pages+0x1d/0x30
  [   14.031186]  [<ffffffff81122a96>] on_each_cpu_mask+0x46/0x80
  [   14.031186]  [<ffffffff811a5e8b>] drain_all_pages+0x14b/0x1e0
  [   14.031186]  [<ffffffff812151a2>] soft_offline_page+0x432/0x6e0
  [   14.031186]  [<ffffffff811e2dac>] SyS_madvise+0x73c/0x780
  [   14.031186]  [<ffffffff810dcb3f>] ? put_prev_task_fair+0x2f/0x50
  [   14.031186]  [<ffffffff81143f74>] ? __audit_syscall_entry+0xc4/0x120
  [   14.031186]  [<ffffffff8105bdac>] ? do_audit_syscall_entry+0x6c/0x70
  [   14.031186]  [<ffffffff8105cc63>] ? syscall_trace_enter_phase1+0x103/0x170
  [   14.031186]  [<ffffffff816f908e>] ? int_check_syscall_exit_work+0x34/0x3d
  [   14.031186]  [<ffffffff816f8e72>] system_call_fastpath+0x12/0x17
  [   14.031186] Code: ff 89 45 b4 48 8b 45 c0 48 83 b8 a8 00 00 00 00 0f 85 e3 fb ff ff 0f 1f 00 0f 0b 48 8b 7d 90 48 c7 c6 e8 95 a6 81 e8 e6 32 02 00 <0f> 0b 8b 45 cc 49 89 47 30 41 8b 47 18 83 f8 ff 0f 85 10 ff ff
  [   14.031186] RIP  [<ffffffff811a806a>] free_pcppages_bulk+0x52a/0x6f0
  [   14.031186]  RSP <ffff88007a117d28>
  [   14.031186] ---[ end trace 53926436e76d1f35 ]---

When soft offline successfully migrates a page, the source page is supposed to
be freed. But there is a race condition where a source page looks isolated
(i.e. the refcount is 0 and PageHWPoison is set) but is still linked to a
pcplist. Then another soft offline event calls drain_all_pages() and tries to
free such a hwpoisoned page, which is forbidden.

This odd page state seems to be caused by a race between put_page() in
putback_lru_page() and __pagevec_lru_add_fn(). But I don't want to keep
tweaking the drain code as was done in commit 9ab3b598d2df ("mm: hwpoison: drop
lru_add_drain_all() in __soft_offline_page()"), or to change the page freeing
code just for soft offline's purposes.

Instead, let's think about the difference between hard offline and soft
offline. They differ in how an in-use page is isolated: hard offline marks
PageHWPoison on the target page first and never frees it, keeping its refcount
at 1, whereas soft offline tries to free the target page and only then marks
PageHWPoison. This difference is likely the source of the complexity and
results in bugs like the one above. So making soft offline isolate the page
while keeping its refcount can be a solution to this problem.

The page migration code can already be told the "reason" identifying its
caller, so let's make more use of it and avoid calling putback_lru_page() when
called from soft offline, which effectively performs the isolation for soft
offline. With this change, soft offline's target pages are never reused
without changing the migratetype, so this patch also removes the related code.
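
As an illustration of the intended flow, here is a rough sketch of the
soft-offline side; the migrate_pages() call below is paraphrased, not copied
from this series:

	/* __soft_offline_page(): the reason lets migration skip putback */
	ret = migrate_pages(&pagelist, new_page, NULL, MF_COUNT_INCREASED,
				MIGRATE_SYNC, MR_MEMORY_FAILURE);
	if (!ret) {
		/* page was not put back, so we still hold the isolated page */
		SetPageHWPoison(page);
		atomic_long_inc(&num_poisoned_pages);
	}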

Signed-off-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com>
---
 mm/memory-failure.c | 22 ----------------------
 mm/migrate.c        |  9 ++++++---
 2 files changed, 6 insertions(+), 25 deletions(-)

diff --git v4.1-rc3.orig/mm/memory-failure.c v4.1-rc3/mm/memory-failure.c
index 6d4cc5442b1a..918256de15bf 100644
--- v4.1-rc3.orig/mm/memory-failure.c
+++ v4.1-rc3/mm/memory-failure.c
@@ -1640,20 +1640,7 @@ static int __soft_offline_page(struct page *page, int flags)
 			if (ret > 0)
 				ret = -EIO;
 		} else {
-			/*
-			 * After page migration succeeds, the source page can
-			 * be trapped in pagevec and actual freeing is delayed.
-			 * Freeing code works differently based on PG_hwpoison,
-			 * so there's a race. We need to make sure that the
-			 * source page should be freed back to buddy before
-			 * setting PG_hwpoison.
-			 */
-			if (!is_free_buddy_page(page))
-				drain_all_pages(page_zone(page));
 			SetPageHWPoison(page);
-			if (!is_free_buddy_page(page))
-				pr_info("soft offline: %#lx: page leaked\n",
-					pfn);
 			atomic_long_inc(&num_poisoned_pages);
 		}
 	} else {
@@ -1705,14 +1692,6 @@ int soft_offline_page(struct page *page, int flags)
 
 	get_online_mems();
 
-	/*
-	 * Isolate the page, so that it doesn't get reallocated if it
-	 * was free. This flag should be kept set until the source page
-	 * is freed and PG_hwpoison on it is set.
-	 */
-	if (get_pageblock_migratetype(page) != MIGRATE_ISOLATE)
-		set_migratetype_isolate(page, true);
-
 	ret = get_any_page(page, pfn, flags);
 	put_online_mems();
 	if (ret > 0) { /* for in-use pages */
@@ -1731,6 +1710,5 @@ int soft_offline_page(struct page *page, int flags)
 				atomic_long_inc(&num_poisoned_pages);
 		}
 	}
-	unset_migratetype_isolate(page, MIGRATE_MOVABLE);
 	return ret;
 }
diff --git v4.1-rc3.orig/mm/migrate.c v4.1-rc3/mm/migrate.c
index f53838fe3dfe..d4fe1f94120b 100644
--- v4.1-rc3.orig/mm/migrate.c
+++ v4.1-rc3/mm/migrate.c
@@ -918,7 +918,8 @@ static int __unmap_and_move(struct page *page, struct page *newpage,
 static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 				   free_page_t put_new_page,
 				   unsigned long private, struct page *page,
-				   int force, enum migrate_mode mode)
+				   int force, enum migrate_mode mode,
+				   enum migrate_reason reason)
 {
 	int rc = 0;
 	int *result = NULL;
@@ -949,7 +950,8 @@ static ICE_noinline int unmap_and_move(new_page_t get_new_page,
 		list_del(&page->lru);
 		dec_zone_page_state(page, NR_ISOLATED_ANON +
 				page_is_file_cache(page));
-		putback_lru_page(page);
+		if (reason != MR_MEMORY_FAILURE)
+			putback_lru_page(page);
 	}
 
 	/*
@@ -1122,7 +1124,8 @@ int migrate_pages(struct list_head *from, new_page_t get_new_page,
 						pass > 2, mode);
 			else
 				rc = unmap_and_move(get_new_page, put_new_page,
-						private, page, pass > 2, mode);
+						private, page, pass > 2, mode,
+						reason);
 
 			switch(rc) {
 			case -ENOMEM:
-- 
2.1.0

* Re: [PATCH 2/4] mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling
  2015-05-12  9:46   ` Naoya Horiguchi
@ 2015-05-12 22:00     ` Andrew Morton
  -1 siblings, 0 replies; 14+ messages in thread
From: Andrew Morton @ 2015-05-12 22:00 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Andi Kleen, Tony Luck, Kirill A. Shutemov, linux-mm,
	linux-kernel, Naoya Horiguchi

On Tue, 12 May 2015 09:46:47 +0000 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:

> memory_failrue() can run in 2 different mode (specified by MF_COUNT_INCREASED)
> in page refcount perspective. When MF_COUNT_INCREASED is set, memory_failrue()
> assumes that the caller takes a refcount of the target page. And if cleared,
> memory_failure() takes it in it's own.
> 
> In current code, however, refcounting is done differently in each caller. For
> example, madvise_hwpoison() uses get_user_pages_fast() and hwpoison_inject()
> uses get_page_unless_zero(). So this inconsistent refcounting causes refcount
> failure especially for thp tail pages. Typical user visible effects are like
> memory leak or VM_BUG_ON_PAGE(!page_count(page)) in isolate_lru_page().
> 
> To fix this refcounting issue, this patch introduces get_hwpoison_page() to
> handle thp tail pages in the same manner for each caller of hwpoison code.
> 
> There's a non-trivial change around unpoisoning, which now returns immediately
> for thp with "MCE: Memory failure is now running on %#lx\n" message. This is
> not right when split_huge_page() fails. So this patch also allows
> unpoison_memory() to handle thp.
>
> ...
>
>  /*
> + * Get refcount for memory error handling:
> + * - @page: raw page
> + */
> +inline int get_hwpoison_page(struct page *page)
> +{
> +	struct page *head = compound_head(page);
> +
> +	if (PageHuge(head))
> +		return get_page_unless_zero(head);
> +	else if (PageTransHuge(head))
> +		if (get_page_unless_zero(head)) {
> +			if (PageTail(page))
> +				get_page(page);
> +			return 1;
> +		} else {
> +			return 0;
> +		}
> +	else
> +		return get_page_unless_zero(page);
> +}

This function is a bit weird.

- The comment looks like kerneldoc but isn't kerneldoc

- Why the inline?  It isn't fastpath?

- The layout is rather painful.  It could be

	if (PageHuge(head))
		return get_page_unless_zero(head);

	if (PageTransHuge(head)) {
		if (get_page_unless_zero(head)) {
			if (PageTail(page))
				get_page(page);
			return 1;
		} else {
			return 0;
		}
	}

	return get_page_unless_zero(page);

- Some illuminating comments would be nice.  In particular that code
  path where it grabs a ref on the tail page as well as on the head
  page.  What's going on there?



* Re: [PATCH 2/4] mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling
  2015-05-12 22:00     ` Andrew Morton
@ 2015-05-13  0:11       ` Naoya Horiguchi
  -1 siblings, 0 replies; 14+ messages in thread
From: Naoya Horiguchi @ 2015-05-13  0:11 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Andi Kleen, Tony Luck, Kirill A. Shutemov, linux-mm,
	linux-kernel, Naoya Horiguchi

On Tue, May 12, 2015 at 03:00:17PM -0700, Andrew Morton wrote:
> On Tue, 12 May 2015 09:46:47 +0000 Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> wrote:
> 
> > memory_failure() can run in two different modes (specified by MF_COUNT_INCREASED)
> > from the page refcount perspective. When MF_COUNT_INCREASED is set,
> > memory_failure() assumes that the caller holds a refcount on the target page;
> > when it is cleared, memory_failure() takes the refcount on its own.
> > 
> > In the current code, however, refcounting is done differently by each caller.
> > For example, madvise_hwpoison() uses get_user_pages_fast() while
> > hwpoison_inject() uses get_page_unless_zero(). This inconsistent refcounting
> > causes refcount failures, especially for thp tail pages. Typical user-visible
> > effects are a memory leak or VM_BUG_ON_PAGE(!page_count(page)) in
> > isolate_lru_page().
> > 
> > To fix this refcounting issue, this patch introduces get_hwpoison_page() to
> > handle thp tail pages in the same manner for each caller of hwpoison code.
> > 
> > There's a non-trivial change around unpoisoning, which currently returns
> > immediately for thp with the message "MCE: Memory failure is now running on
> > %#lx\n". This is not right when split_huge_page() fails, so this patch also
> > allows unpoison_memory() to handle thp.
> >
> > ...
> >
> >  /*
> > + * Get refcount for memory error handling:
> > + * - @page: raw page
> > + */
> > +inline int get_hwpoison_page(struct page *page)
> > +{
> > +	struct page *head = compound_head(page);
> > +
> > +	if (PageHuge(head))
> > +		return get_page_unless_zero(head);
> > +	else if (PageTransHuge(head))
> > +		if (get_page_unless_zero(head)) {
> > +			if (PageTail(page))
> > +				get_page(page);
> > +			return 1;
> > +		} else {
> > +			return 0;
> > +		}
> > +	else
> > +		return get_page_unless_zero(page);
> > +}
> 
> This function is a bit weird.
> 
> - The comment looks like kerneldoc but isn't kerneldoc

OK, will fix it.

> - Why the inline?  It isn't fastpath?

No, so I'll drop 'inline'.

> - The layout is rather painful.  It could be
> 
> 	if (PageHuge(head))
> 		return get_page_unless_zero(head);
> 
> 	if (PageTransHuge(head)) {
> 		if (get_page_unless_zero(head)) {
> 			if (PageTail(page))
> 				get_page(page);
> 			return 1;
> 		} else {
> 			return 0;
> 		}
> 	}
> 
> 	return get_page_unless_zero(page);

OK, will do like this.

> - Some illuminating comments would be nice.  In particular that code
>   path where it grabs a ref on the tail page as well as on the head
>   page.  What's going on there?

We can't call get_page_unless_zero() directly on thp tail pages because that
breaks thp's refcounting rule (the refcount of tail pages is stored in
->_mapcount). This code makes the hwpoison path comply with that rule as well,
so I'll add a comment explaining it.
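
For illustration, the comment could read something like the following (a sketch
only; the exact wording will be settled in the next version):

	/*
	 * thp tail pages keep their refcount in ->_mapcount, so
	 * get_page_unless_zero() must only be called on the head page.
	 * The raw error page still needs its own pin, hence the extra
	 * get_page() on the tail below.
	 */
	if (PageTransHuge(head)) {
		...
	}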

Hmm, I just noticed that I forgot to call put_page(head) in the if (PageTail)
block, which leaks the head page after a thp split.
This will be fixed in the next version.
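
A rough sketch of that fix, for reference (the actual next version may differ):

	if (PageTransHuge(head)) {
		if (get_page_unless_zero(head)) {
			if (PageTail(page)) {
				/*
				 * The caller needs a pin on the raw error
				 * page only; drop the head reference taken
				 * above so the head page is not leaked once
				 * the thp is split.
				 */
				get_page(page);
				put_page(head);
			}
			return 1;
		}
		return 0;
	}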

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 14+ messages in thread

end of thread, other threads:[~2015-05-13  0:22 UTC | newest]

Thread overview: 14+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2015-05-12  9:46 [PATCH 0/4] hwpoison fixes for v4.2 Naoya Horiguchi
2015-05-12  9:46 ` Naoya Horiguchi
2015-05-12  9:46 ` [PATCH 2/4] mm/memory-failure: introduce get_hwpoison_page() for consistent refcount handling Naoya Horiguchi
2015-05-12  9:46   ` Naoya Horiguchi
2015-05-12 22:00   ` Andrew Morton
2015-05-12 22:00     ` Andrew Morton
2015-05-13  0:11     ` Naoya Horiguchi
2015-05-13  0:11       ` Naoya Horiguchi
2015-05-12  9:46 ` [PATCH 1/4] mm/memory-failure: split thp earlier in memory error handling Naoya Horiguchi
2015-05-12  9:46   ` Naoya Horiguchi
2015-05-12  9:46 ` [PATCH 3/4] mm: soft-offline: don't free target page in successful page migration Naoya Horiguchi
2015-05-12  9:46   ` Naoya Horiguchi
2015-05-12  9:46 ` [PATCH 4/4] mm/memory-failure: me_huge_page() does nothing for thp Naoya Horiguchi
2015-05-12  9:46   ` Naoya Horiguchi
