linux-mm.kvack.org archive mirror
* [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs)
@ 2021-10-14 19:16 Yang Shi
  2021-10-14 19:16 ` [v4 PATCH 1/6] mm: hwpoison: remove the unnecessary THP check Yang Shi
                   ` (7 more replies)
  0 siblings, 8 replies; 18+ messages in thread
From: Yang Shi @ 2021-10-14 19:16 UTC (permalink / raw)
  To: naoya.horiguchi, hughd, kirill.shutemov, willy, peterx, osalvador, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel


When discussing the patch that splits page cache THP in order to offline the
poisoned page, Naoya mentioned there is a bigger problem [1] that prevents this
from working since the page cache page will be truncated if uncorrectable
errors happen.  Looking into this more deeply, it turns out this approach
(truncating the poisoned page) may incur silent data loss for all non-readonly
filesystems if the page is dirty.  It is even worse for in-memory filesystems,
e.g. shmem/tmpfs, since the data blocks are actually gone.

To solve this problem we could keep the poisoned dirty page in page cache and
notify users on any later access, e.g. page fault, read/write, etc.  Clean
pages can be truncated as they are now, since they can be reread from disk
later on.

The consequence is that filesystems may find a poisoned page and manipulate it
as a healthy page, since filesystems don't actually check whether the page is
poisoned in any of the relevant paths except page fault.  In general, we need
to make filesystems aware of poisoned pages before we can keep poisoned pages
in page cache in order to solve the data loss problem.

To make filesystems aware of poisoned pages we should consider:
- The page should not be written back: clearing the dirty flag prevents
  writeback.
- The page should not be dropped (it shows up as a clean page) by drop_caches
  or other callers: the refcount pin from hwpoison prevents invalidation
  (called by cache drop, inode cache shrinking, etc.), but it doesn't prevent
  invalidation in the DIO path.
- The page should still be able to get truncated/hole punched/unlinked: this
  works as is.
- Users should be notified when the page is accessed, e.g. read/write, page
  fault and other paths (compression, encryption, etc).

The scope of the last one is huge since almost all filesystems need to do it
once a page is returned from a page cache lookup.  There are a couple of
options to do it:

1. Check the hwpoison flag on every path; the most straightforward way.
2. Return NULL for a poisoned page from page cache lookup; most callsites
   already check for NULL, so this should require the least work.  But the
   error handling in filesystems would just return -ENOMEM, and that error
   code would obviously confuse users.
3. To improve on #2, we could return an error pointer, e.g. ERR_PTR(-EIO),
   but this involves a significant amount of code change as well, since all
   the paths need to check whether the pointer is an ERR or not, just like
   option #1.

I prototyped both #1 and #3, but #3 seems to require more changes than #1.
For #3 an ERR_PTR is returned, so all the callers need to check the return
value, otherwise an invalid pointer may be dereferenced.  But not all
callers really care about the content of the page; for example, a partial
truncate just zeroes the truncated range in one page.  Such paths would need
additional modification if an ERR_PTR were returned.  And if the callers
have their own way to handle the problematic pages, we would need to add a
new FGP flag to tell the FGP functions to return the pointer to the page.
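
For illustration, a minimal sketch of the option #1 pattern at a hypothetical
page cache consumer (find_get_page() and PageHWPoison() are existing helpers;
the real checks for shmem are added in patch 5/6):

	page = find_get_page(mapping, index);
	if (page && PageHWPoison(page)) {
		/*
		 * The poisoned page stays in page cache; fail the
		 * access instead of touching its content.
		 */
		put_page(page);
		return -EIO;
	}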

This may happen very rarely, but once it happens the consequence (data
corruption) could be very bad and is very hard to debug.  It seems this
problem was briefly discussed before, but no action was taken at that
time [2].

As the above investigation shows, it would take a huge amount of work to
solve the potential data loss for all filesystems.  But it is much easier
for in-memory filesystems, and such filesystems actually suffer more than
others since even the data blocks are gone due to truncating.  So this
patchset starts with shmem/tmpfs by taking option #1.

TODO:
* Unpoison has been broken since commit 0ed950d1f281 ("mm,hwpoison: make
  get_hwpoison_page() call get_any_page()"), and this patch series makes the
  refcount check for unpoisoning a shmem page fail.
* Expand to other filesystems.  But I haven't heard feedback from filesystem
  developers yet.

Patch breakdown:
Patch #1: cleanup, depended on by patch #2.
Patch #2: fix the bug that a THP with hwpoisoned subpage(s) may be wrongly
          PMD mapped.
Patch #3: coding style cleanup.
Patch #4: refactoring and preparation.
Patch #5: keep the poisoned page in page cache and handle such a case in all
          the paths.
Patch #6: the previous patches unblock page cache THP split, so this patch
          adds page cache THP split support.


Changelog
v3 --> v4:
  * Separated coding style cleanup from patch 2/5 by adding a new patch
    (patch 3/6) per Kirill.
  * Moved setting PageHasHWPoisoned flag to proper place (patch 2/6) per
    Peter Xu.
  * Elaborated why soft offline doesn't need to set this flag in the commit
    message (patch 2/6) per Peter Xu.
  * Renamed "dec" parameter to "extra_pins" for has_extra_refcount() (patch 4/6)
    per Peter Xu.
  * Adopted the suggestions for comment and coding style (patch 5/6) per
    Naoya.
  * Checked if page is hwpoison or not for shmem_get_link() (patch 5/6) per
    Peter Xu.
  * Collected acks.
v2 --> v3:
  * Incorporated the comments from Kirill.
  * Reordered the series to reflect the right dependency (patch #3 from v2
    is patch #1 in this revision, patch #1 from v2 is patch #2 in this
    revision).
  * After the reorder, patch #2 depends on patch #1 and both need to be
    backported to -stable.
v1 --> v2:
  * Incorporated the suggestion from Kirill to use a new page flag to
    indicate there is hwpoisoned subpage(s) in a THP. (patch #1)
  * Dropped patch #2 of v1.
  * Refactored the page refcount check logic of hwpoison per Naoya. (patch #2)
  * Removed unnecessary THP check per Naoya. (patch #3)
  * Incorporated the other comments for shmem from Naoya. (patch #4)


Yang Shi (6):
      mm: hwpoison: remove the unnecessary THP check
      mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
      mm: filemap: coding style cleanup for filemap_map_pmd()
      mm: hwpoison: refactor refcount check handling
      mm: shmem: don't truncate page if memory failure happens
      mm: hwpoison: handle non-anonymous THP correctly

 include/linux/page-flags.h |  23 ++++++++++++
 mm/filemap.c               |  12 +++----
 mm/huge_memory.c           |   2 ++
 mm/memory-failure.c        | 136 ++++++++++++++++++++++++++++++++++++++++++++-------------------------
 mm/memory.c                |   9 +++++
 mm/page_alloc.c            |   4 ++-
 mm/shmem.c                 |  37 +++++++++++++++++--
 mm/userfaultfd.c           |   5 +++
 8 files changed, 170 insertions(+), 58 deletions(-)


[1] https://lore.kernel.org/linux-mm/CAHbLzkqNPBh_sK09qfr4yu4WTFOzRy+MKj+PA7iG-adzi9zGsg@mail.gmail.com/T/#m0e959283380156f1d064456af01ae51fdff91265
[2] https://lore.kernel.org/lkml/20210318183350.GT3420@casper.infradead.org/




* [v4 PATCH 1/6] mm: hwpoison: remove the unnecessary THP check
  2021-10-14 19:16 [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
@ 2021-10-14 19:16 ` Yang Shi
  2021-10-14 19:16 ` [v4 PATCH 2/6] mm: filemap: check if THP has hwpoisoned subpage for PMD page fault Yang Shi
                   ` (6 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Yang Shi @ 2021-10-14 19:16 UTC (permalink / raw)
  To: naoya.horiguchi, hughd, kirill.shutemov, willy, peterx, osalvador, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

When handling a THP, hwpoison checked whether the THP was in the allocation
or free stage, since hwpoison may mistreat it as a hugetlb page.  After
commit 415c64c1453a ("mm/memory-failure: split thp earlier in memory error
handling") that problem has been fixed, so this check is no longer needed.
Remove it.  The side effect of the removal is that hwpoison may report
"unsplit THP" instead of "unknown error" for shmem THP.  That does not seem
like a big deal.

The following patch depends on this one; it fixes the bug that a shmem THP
with hwpoisoned subpage(s) may be wrongly PMD mapped.  So this patch needs
to be backported to -stable as well.

Cc: <stable@vger.kernel.org>
Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/memory-failure.c | 14 --------------
 1 file changed, 14 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3e6449f2102a..73f68699e7ab 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1147,20 +1147,6 @@ static int __get_hwpoison_page(struct page *page)
 	if (!HWPoisonHandlable(head))
 		return -EBUSY;
 
-	if (PageTransHuge(head)) {
-		/*
-		 * Non anonymous thp exists only in allocation/free time. We
-		 * can't handle such a case correctly, so let's give it up.
-		 * This should be better than triggering BUG_ON when kernel
-		 * tries to touch the "partially handled" page.
-		 */
-		if (!PageAnon(head)) {
-			pr_err("Memory failure: %#lx: non anonymous thp\n",
-				page_to_pfn(page));
-			return 0;
-		}
-	}
-
 	if (get_page_unless_zero(head)) {
 		if (head == compound_head(page))
 			return 1;
-- 
2.26.2




* [v4 PATCH 2/6] mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
  2021-10-14 19:16 [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
  2021-10-14 19:16 ` [v4 PATCH 1/6] mm: hwpoison: remove the unnecessary THP check Yang Shi
@ 2021-10-14 19:16 ` Yang Shi
  2021-10-19  5:50   ` Naoya Horiguchi
  2021-10-14 19:16 ` [v4 PATCH 3/6] mm: filemap: coding style cleanup for filemap_map_pmd() Yang Shi
                   ` (5 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Yang Shi @ 2021-10-14 19:16 UTC (permalink / raw)
  To: naoya.horiguchi, hughd, kirill.shutemov, willy, peterx, osalvador, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

When handling a shmem page fault, a THP with a corrupted subpage could be
PMD mapped if certain conditions are satisfied.  But the kernel is supposed
to send SIGBUS when trying to map a hwpoisoned page.

There are two paths which may do the PMD map: fault around and regular fault.

Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
things were even worse in the fault around path: the THP could be PMD mapped
as long as the VMA fit, regardless of which subpage was accessed and
corrupted.  After this commit, the THP can be PMD mapped as long as the head
page is not corrupted.

In the regular fault path the THP could be PMD mapped as long as the
corrupted page is not accessed and the VMA fits.

This loophole could be closed by iterating over every subpage to check
whether any of them is hwpoisoned, but that is somewhat costly in the page
fault path.
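
For reference only, the rejected per-subpage scan would look roughly like the
hypothetical snippet below (HPAGE_PMD_NR, compound_head() and PageHWPoison()
are existing helpers; this code is not part of the series):

	struct page *head = compound_head(page);
	int i;

	/*
	 * Up to HPAGE_PMD_NR (512 with 4K pages on x86-64) flag checks
	 * on every PMD fault.
	 */
	for (i = 0; i < HPAGE_PMD_NR; i++) {
		if (PageHWPoison(head + i))
			return VM_FAULT_FALLBACK;	/* fall back to PTE map */
	}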

So introduce a new page flag, called HasHWPoisoned, on the first tail page.
It indicates that the THP has hwpoisoned subpage(s).  It is set if any
subpage of the THP is found hwpoisoned by memory failure, after the refcount
is bumped successfully, and cleared when the THP is freed or split.

The soft offline path doesn't need this since the soft offline handler just
marks a subpage hwpoisoned when the subpage is migrated successfully.  But
shmem THPs did not get split then migrated at all.

Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
Cc: <stable@vger.kernel.org>
Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 include/linux/page-flags.h | 23 +++++++++++++++++++++++
 mm/huge_memory.c           |  2 ++
 mm/memory-failure.c        | 14 ++++++++++++++
 mm/memory.c                |  9 +++++++++
 mm/page_alloc.c            |  4 +++-
 5 files changed, 51 insertions(+), 1 deletion(-)

diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
index a558d67ee86f..901723d75677 100644
--- a/include/linux/page-flags.h
+++ b/include/linux/page-flags.h
@@ -171,6 +171,15 @@ enum pageflags {
 	/* Compound pages. Stored in first tail page's flags */
 	PG_double_map = PG_workingset,
 
+#ifdef CONFIG_MEMORY_FAILURE
+	/*
+	 * Compound pages. Stored in first tail page's flags.
+	 * Indicates that at least one subpage is hwpoisoned in the
+	 * THP.
+	 */
+	PG_has_hwpoisoned = PG_mappedtodisk,
+#endif
+
 	/* non-lru isolated movable page */
 	PG_isolated = PG_reclaim,
 
@@ -668,6 +677,20 @@ PAGEFLAG_FALSE(DoubleMap)
 	TESTSCFLAG_FALSE(DoubleMap)
 #endif
 
+#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
+/*
+ * PageHasHWPoisoned indicates that at least on subpage is hwpoisoned in the
+ * compound page.
+ *
+ * This flag is set by hwpoison handler.  Cleared by THP split or free page.
+ */
+PAGEFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
+	TESTSCFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
+#else
+PAGEFLAG_FALSE(HasHWPoisoned)
+	TESTSCFLAG_FALSE(HasHWPoisoned)
+#endif
+
 /*
  * Check if a page is currently marked HWPoisoned. Note that this check is
  * best effort only and inherently racy: there is no way to synchronize with
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5e9ef0fc261e..0574b1613714 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2426,6 +2426,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
 	/* lock lru list/PageCompound, ref frozen by page_ref_freeze */
 	lruvec = lock_page_lruvec(head);
 
+	ClearPageHasHWPoisoned(head);
+
 	for (i = nr - 1; i >= 1; i--) {
 		__split_huge_page_tail(head, i, lruvec, list);
 		/* Some pages can be beyond EOF: drop them from page cache */
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 73f68699e7ab..2809d12f16af 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1694,6 +1694,20 @@ int memory_failure(unsigned long pfn, int flags)
 	}
 
 	if (PageTransHuge(hpage)) {
+		/*
+		 * The flag must be set after the refcount is bumpped
+		 * otherwise it may race with THP split.
+		 * And the flag can't be set in get_hwpoison_page() since
+		 * it is called by soft offline too and it is just called
+		 * for !MF_COUNT_INCREASE.  So here seems to be the best
+		 * place.
+		 *
+		 * Don't need care about the above error handling paths for
+		 * get_hwpoison_page() since they handle either free page
+		 * or unhandlable page.  The refcount is bumpped iff the
+		 * page is a valid handlable page.
+		 */
+		SetPageHasHWPoisoned(hpage);
 		if (try_to_split_thp_page(p, "Memory Failure") < 0) {
 			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
 			res = -EBUSY;
diff --git a/mm/memory.c b/mm/memory.c
index adf9b9ef8277..c52be6d6b605 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3906,6 +3906,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	if (compound_order(page) != HPAGE_PMD_ORDER)
 		return ret;
 
+	/*
+	 * Just back off if any subpage of a THP is corrupted, otherwise
+	 * the corrupted page may be mapped by PMD silently to escape the
+	 * check.  This kind of THP can only be PTE mapped.  Access to
+	 * the corrupted subpage should trigger SIGBUS as expected.
+	 */
+	if (unlikely(PageHasHWPoisoned(page)))
+		return ret;
+
 	/*
 	 * Archs like ppc64 need additional space to store information
 	 * related to pte entry. Use the preallocated table for that.
diff --git a/mm/page_alloc.c b/mm/page_alloc.c
index b37435c274cf..7f37652f0287 100644
--- a/mm/page_alloc.c
+++ b/mm/page_alloc.c
@@ -1312,8 +1312,10 @@ static __always_inline bool free_pages_prepare(struct page *page,
 
 		VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
 
-		if (compound)
+		if (compound) {
 			ClearPageDoubleMap(page);
+			ClearPageHasHWPoisoned(page);
+		}
 		for (i = 1; i < (1 << order); i++) {
 			if (compound)
 				bad += free_tail_pages_check(page, page + i);
-- 
2.26.2




* [v4 PATCH 3/6] mm: filemap: coding style cleanup for filemap_map_pmd()
  2021-10-14 19:16 [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
  2021-10-14 19:16 ` [v4 PATCH 1/6] mm: hwpoison: remove the unnecessary THP check Yang Shi
  2021-10-14 19:16 ` [v4 PATCH 2/6] mm: filemap: check if THP has hwpoisoned subpage for PMD page fault Yang Shi
@ 2021-10-14 19:16 ` Yang Shi
  2021-10-19  5:51   ` Naoya Horiguchi
  2021-10-14 19:16 ` [v4 PATCH 4/6] mm: hwpoison: refactor refcount check handling Yang Shi
                   ` (4 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Yang Shi @ 2021-10-14 19:16 UTC (permalink / raw)
  To: naoya.horiguchi, hughd, kirill.shutemov, willy, peterx, osalvador, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

A minor cleanup to the indent.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/filemap.c | 12 ++++++------
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293b5d..2acc2b977f66 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3195,12 +3195,12 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 	}
 
 	if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
-	    vm_fault_t ret = do_set_pmd(vmf, page);
-	    if (!ret) {
-		    /* The page is mapped successfully, reference consumed. */
-		    unlock_page(page);
-		    return true;
-	    }
+		vm_fault_t ret = do_set_pmd(vmf, page);
+		if (!ret) {
+			/* The page is mapped successfully, reference consumed. */
+			unlock_page(page);
+			return true;
+		}
 	}
 
 	if (pmd_none(*vmf->pmd)) {
-- 
2.26.2




* [v4 PATCH 4/6] mm: hwpoison: refactor refcount check handling
  2021-10-14 19:16 [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
                   ` (2 preceding siblings ...)
  2021-10-14 19:16 ` [v4 PATCH 3/6] mm: filemap: coding style cleanup for filemap_map_pmd() Yang Shi
@ 2021-10-14 19:16 ` Yang Shi
  2021-10-14 19:16 ` [v4 PATCH 5/6] mm: shmem: don't truncate page if memory failure happens Yang Shi
                   ` (3 subsequent siblings)
  7 siblings, 0 replies; 18+ messages in thread
From: Yang Shi @ 2021-10-14 19:16 UTC (permalink / raw)
  To: naoya.horiguchi, hughd, kirill.shutemov, willy, peterx, osalvador, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Memory failure reports failure if the page still has an extra pinned
refcount, other than from hwpoison itself, after the handler is done.
Actually the check is not necessary for all handlers, so move the check into
the specific handlers.  This makes the following patch, which keeps the
shmem page in page cache, easier.

There may be an expected extra pin in some cases, for example, when the page
is dirty and in swapcache.

Suggested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/memory-failure.c | 93 +++++++++++++++++++++++++++++++--------------
 1 file changed, 64 insertions(+), 29 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 2809d12f16af..cdf8ccd0865f 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -806,12 +806,44 @@ static int truncate_error_page(struct page *p, unsigned long pfn,
 	return ret;
 }
 
+struct page_state {
+	unsigned long mask;
+	unsigned long res;
+	enum mf_action_page_type type;
+
+	/* Callback ->action() has to unlock the relevant page inside it. */
+	int (*action)(struct page_state *ps, struct page *p);
+};
+
+/*
+ * Return true if page is still referenced by others, otherwise return
+ * false.
+ *
+ * The extra_pins is true when one extra refcount is expected.
+ */
+static bool has_extra_refcount(struct page_state *ps, struct page *p,
+			       bool extra_pins)
+{
+	int count = page_count(p) - 1;
+
+	if (extra_pins)
+		count -= 1;
+
+	if (count > 0) {
+		pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
+		       page_to_pfn(p), action_page_types[ps->type], count);
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * Error hit kernel page.
  * Do nothing, try to be lucky and not touch this instead. For a few cases we
  * could be more sophisticated.
  */
-static int me_kernel(struct page *p, unsigned long pfn)
+static int me_kernel(struct page_state *ps, struct page *p)
 {
 	unlock_page(p);
 	return MF_IGNORED;
@@ -820,9 +852,9 @@ static int me_kernel(struct page *p, unsigned long pfn)
 /*
  * Page in unknown state. Do nothing.
  */
-static int me_unknown(struct page *p, unsigned long pfn)
+static int me_unknown(struct page_state *ps, struct page *p)
 {
-	pr_err("Memory failure: %#lx: Unknown page state\n", pfn);
+	pr_err("Memory failure: %#lx: Unknown page state\n", page_to_pfn(p));
 	unlock_page(p);
 	return MF_FAILED;
 }
@@ -830,7 +862,7 @@ static int me_unknown(struct page *p, unsigned long pfn)
 /*
  * Clean (or cleaned) page cache page.
  */
-static int me_pagecache_clean(struct page *p, unsigned long pfn)
+static int me_pagecache_clean(struct page_state *ps, struct page *p)
 {
 	int ret;
 	struct address_space *mapping;
@@ -867,9 +899,13 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
 	 *
 	 * Open: to take i_rwsem or not for this? Right now we don't.
 	 */
-	ret = truncate_error_page(p, pfn, mapping);
+	ret = truncate_error_page(p, page_to_pfn(p), mapping);
 out:
 	unlock_page(p);
+
+	if (has_extra_refcount(ps, p, false))
+		ret = MF_FAILED;
+
 	return ret;
 }
 
@@ -878,7 +914,7 @@ static int me_pagecache_clean(struct page *p, unsigned long pfn)
  * Issues: when the error hit a hole page the error is not properly
  * propagated.
  */
-static int me_pagecache_dirty(struct page *p, unsigned long pfn)
+static int me_pagecache_dirty(struct page_state *ps, struct page *p)
 {
 	struct address_space *mapping = page_mapping(p);
 
@@ -922,7 +958,7 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn)
 		mapping_set_error(mapping, -EIO);
 	}
 
-	return me_pagecache_clean(p, pfn);
+	return me_pagecache_clean(ps, p);
 }
 
 /*
@@ -944,9 +980,10 @@ static int me_pagecache_dirty(struct page *p, unsigned long pfn)
  * Clean swap cache pages can be directly isolated. A later page fault will
  * bring in the known good data from disk.
  */
-static int me_swapcache_dirty(struct page *p, unsigned long pfn)
+static int me_swapcache_dirty(struct page_state *ps, struct page *p)
 {
 	int ret;
+	bool extra_pins = false;
 
 	ClearPageDirty(p);
 	/* Trigger EIO in shmem: */
@@ -954,10 +991,17 @@ static int me_swapcache_dirty(struct page *p, unsigned long pfn)
 
 	ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
 	unlock_page(p);
+
+	if (ret == MF_DELAYED)
+		extra_pins = true;
+
+	if (has_extra_refcount(ps, p, extra_pins))
+		ret = MF_FAILED;
+
 	return ret;
 }
 
-static int me_swapcache_clean(struct page *p, unsigned long pfn)
+static int me_swapcache_clean(struct page_state *ps, struct page *p)
 {
 	int ret;
 
@@ -965,6 +1009,10 @@ static int me_swapcache_clean(struct page *p, unsigned long pfn)
 
 	ret = delete_from_lru_cache(p) ? MF_FAILED : MF_RECOVERED;
 	unlock_page(p);
+
+	if (has_extra_refcount(ps, p, false))
+		ret = MF_FAILED;
+
 	return ret;
 }
 
@@ -974,7 +1022,7 @@ static int me_swapcache_clean(struct page *p, unsigned long pfn)
  * - Error on hugepage is contained in hugepage unit (not in raw page unit.)
  *   To narrow down kill region to one page, we need to break up pmd.
  */
-static int me_huge_page(struct page *p, unsigned long pfn)
+static int me_huge_page(struct page_state *ps, struct page *p)
 {
 	int res;
 	struct page *hpage = compound_head(p);
@@ -985,7 +1033,7 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 
 	mapping = page_mapping(hpage);
 	if (mapping) {
-		res = truncate_error_page(hpage, pfn, mapping);
+		res = truncate_error_page(hpage, page_to_pfn(p), mapping);
 		unlock_page(hpage);
 	} else {
 		res = MF_FAILED;
@@ -1003,6 +1051,9 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 		}
 	}
 
+	if (has_extra_refcount(ps, p, false))
+		res = MF_FAILED;
+
 	return res;
 }
 
@@ -1028,14 +1079,7 @@ static int me_huge_page(struct page *p, unsigned long pfn)
 #define slab		(1UL << PG_slab)
 #define reserved	(1UL << PG_reserved)
 
-static struct page_state {
-	unsigned long mask;
-	unsigned long res;
-	enum mf_action_page_type type;
-
-	/* Callback ->action() has to unlock the relevant page inside it. */
-	int (*action)(struct page *p, unsigned long pfn);
-} error_states[] = {
+static struct page_state error_states[] = {
 	{ reserved,	reserved,	MF_MSG_KERNEL,	me_kernel },
 	/*
 	 * free pages are specially detected outside this table:
@@ -1095,19 +1139,10 @@ static int page_action(struct page_state *ps, struct page *p,
 			unsigned long pfn)
 {
 	int result;
-	int count;
 
 	/* page p should be unlocked after returning from ps->action().  */
-	result = ps->action(p, pfn);
+	result = ps->action(ps, p);
 
-	count = page_count(p) - 1;
-	if (ps->action == me_swapcache_dirty && result == MF_DELAYED)
-		count--;
-	if (count > 0) {
-		pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
-		       pfn, action_page_types[ps->type], count);
-		result = MF_FAILED;
-	}
 	action_result(pfn, ps->type, result);
 
 	/* Could do more checks here if page looks ok */
-- 
2.26.2




* [v4 PATCH 5/6] mm: shmem: don't truncate page if memory failure happens
  2021-10-14 19:16 [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
                   ` (3 preceding siblings ...)
  2021-10-14 19:16 ` [v4 PATCH 4/6] mm: hwpoison: refactor refcount check handling Yang Shi
@ 2021-10-14 19:16 ` Yang Shi
  2021-10-19  5:52   ` Naoya Horiguchi
  2021-10-14 19:16 ` [v4 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly Yang Shi
                   ` (2 subsequent siblings)
  7 siblings, 1 reply; 18+ messages in thread
From: Yang Shi @ 2021-10-14 19:16 UTC (permalink / raw)
  To: naoya.horiguchi, hughd, kirill.shutemov, willy, peterx, osalvador, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

The current behavior of memory failure is to truncate the page cache
regardless of dirty or clean.  If the page is dirty, later accesses will get
obsolete data from disk without any notification to the users.  This may
cause silent data loss.  It is even worse for shmem: since shmem is an
in-memory filesystem, truncating the page cache means discarding the data
blocks, so a later read would return all zeroes.

The right approach is to keep the corrupted page in page cache; any later
access would return an error for syscalls or SIGBUS for page faults, until
the file is truncated, hole punched or removed.  Regular storage-backed
filesystems would be more complicated, so this patch is focused on shmem.
This also unblocks the support for soft offlining shmem THPs.
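
For illustration, the intended user-visible behavior could be exercised with
a sketch like the one below (assumptions: root privileges,
CONFIG_MEMORY_FAILURE enabled, 4K pages, /dev/shm backed by tmpfs; the file
name is made up for the demo):

	#include <errno.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <sys/mman.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[4096];
		int fd = open("/dev/shm/hwpoison-demo", O_RDWR | O_CREAT, 0600);

		memset(buf, 0xaa, sizeof(buf));
		if (write(fd, buf, sizeof(buf)) != sizeof(buf))	/* dirty one shmem page */
			return 1;

		char *p = mmap(NULL, sizeof(buf), PROT_READ | PROT_WRITE,
			       MAP_SHARED, fd, 0);
		if (p == MAP_FAILED)
			return 1;

		/* Inject an uncorrectable memory error on the backing page. */
		if (madvise(p, sizeof(buf), MADV_HWPOISON))
			perror("madvise(MADV_HWPOISON)");

		/* Without this patch the page is truncated and a later read
		 * returns zeroes; with it, the read fails with EIO. */
		if (pread(fd, buf, sizeof(buf), 0) < 0 && errno == EIO)
			printf("EIO: corruption reported, no silent data loss\n");
		return 0;
	}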

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/memory-failure.c | 10 +++++++++-
 mm/shmem.c          | 37 ++++++++++++++++++++++++++++++++++---
 mm/userfaultfd.c    |  5 +++++
 3 files changed, 48 insertions(+), 4 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index cdf8ccd0865f..f5eab593b2a7 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -57,6 +57,7 @@
 #include <linux/ratelimit.h>
 #include <linux/page-isolation.h>
 #include <linux/pagewalk.h>
+#include <linux/shmem_fs.h>
 #include "internal.h"
 #include "ras/ras_event.h"
 
@@ -866,6 +867,7 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
 {
 	int ret;
 	struct address_space *mapping;
+	bool extra_pins;
 
 	delete_from_lru_cache(p);
 
@@ -894,6 +896,12 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
 		goto out;
 	}
 
+	/*
+	 * The shmem page is kept in page cache instead of being truncated,
+	 * so it is expected to have an extra refcount after error handling.
+	 */
+	extra_pins = shmem_mapping(mapping);
+
 	/*
 	 * Truncation is a bit tricky. Enable it per file system for now.
 	 *
@@ -903,7 +911,7 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
 out:
 	unlock_page(p);
 
-	if (has_extra_refcount(ps, p, false))
+	if (has_extra_refcount(ps, p, extra_pins))
 		ret = MF_FAILED;
 
 	return ret;
diff --git a/mm/shmem.c b/mm/shmem.c
index b5860f4a2738..69eaf65409e6 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2456,6 +2456,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 	struct inode *inode = mapping->host;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	pgoff_t index = pos >> PAGE_SHIFT;
+	int ret = 0;
 
 	/* i_rwsem is held by caller */
 	if (unlikely(info->seals & (F_SEAL_GROW |
@@ -2466,7 +2467,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 			return -EPERM;
 	}
 
-	return shmem_getpage(inode, index, pagep, SGP_WRITE);
+	ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
+
+	if (*pagep && PageHWPoison(*pagep)) {
+		unlock_page(*pagep);
+		put_page(*pagep);
+		ret = -EIO;
+	}
+
+	return ret;
 }
 
 static int
@@ -2555,6 +2564,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			unlock_page(page);
 		}
 
+		if (page && PageHWPoison(page)) {
+			error = -EIO;
+			break;
+		}
+
 		/*
 		 * We must evaluate after, since reads (unlike writes)
 		 * are called without i_rwsem protection against truncate
@@ -3114,7 +3128,8 @@ static const char *shmem_get_link(struct dentry *dentry,
 		page = find_get_page(inode->i_mapping, 0);
 		if (!page)
 			return ERR_PTR(-ECHILD);
-		if (!PageUptodate(page)) {
+		if (PageHWPoison(page) ||
+		    !PageUptodate(page)) {
 			put_page(page);
 			return ERR_PTR(-ECHILD);
 		}
@@ -3122,6 +3137,11 @@ static const char *shmem_get_link(struct dentry *dentry,
 		error = shmem_getpage(inode, 0, &page, SGP_READ);
 		if (error)
 			return ERR_PTR(error);
+		if (page && PageHWPoison(page)) {
+			unlock_page(page);
+			put_page(page);
+			return ERR_PTR(-ECHILD);
+		}
 		unlock_page(page);
 	}
 	set_delayed_call(done, shmem_put_link, page);
@@ -3772,6 +3792,13 @@ static void shmem_destroy_inodecache(void)
 	kmem_cache_destroy(shmem_inode_cachep);
 }
 
+/* Keep the page in page cache instead of truncating it */
+static int shmem_error_remove_page(struct address_space *mapping,
+				   struct page *page)
+{
+	return 0;
+}
+
 const struct address_space_operations shmem_aops = {
 	.writepage	= shmem_writepage,
 	.set_page_dirty	= __set_page_dirty_no_writeback,
@@ -3782,7 +3809,7 @@ const struct address_space_operations shmem_aops = {
 #ifdef CONFIG_MIGRATION
 	.migratepage	= migrate_page,
 #endif
-	.error_remove_page = generic_error_remove_page,
+	.error_remove_page = shmem_error_remove_page,
 };
 EXPORT_SYMBOL(shmem_aops);
 
@@ -4193,6 +4220,10 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 		page = ERR_PTR(error);
 	else
 		unlock_page(page);
+
+	if (PageHWPoison(page))
+		page = ERR_PTR(-EIO);
+
 	return page;
 #else
 	/*
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index 7a9008415534..b688d5327177 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -233,6 +233,11 @@ static int mcontinue_atomic_pte(struct mm_struct *dst_mm,
 		goto out;
 	}
 
+	if (PageHWPoison(page)) {
+		ret = -EIO;
+		goto out_release;
+	}
+
 	ret = mfill_atomic_install_pte(dst_mm, dst_pmd, dst_vma, dst_addr,
 				       page, false, wp_copy);
 	if (ret)
-- 
2.26.2




* [v4 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly
  2021-10-14 19:16 [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
                   ` (4 preceding siblings ...)
  2021-10-14 19:16 ` [v4 PATCH 5/6] mm: shmem: don't truncate page if memory failure happens Yang Shi
@ 2021-10-14 19:16 ` Yang Shi
  2021-10-15 20:28 ` [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Andrew Morton
  2021-10-19  5:53 ` Naoya Horiguchi
  7 siblings, 0 replies; 18+ messages in thread
From: Yang Shi @ 2021-10-14 19:16 UTC (permalink / raw)
  To: naoya.horiguchi, hughd, kirill.shutemov, willy, peterx, osalvador, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Currently hwpoison doesn't handle non-anonymous THPs, but since v4.8 THP
support for tmpfs and read-only file cache has been added.  They can be
offlined by splitting the THP, just like anonymous THPs.

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/memory-failure.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index f5eab593b2a7..3db738c02ab2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1443,14 +1443,11 @@ static int identify_page_state(unsigned long pfn, struct page *p,
 static int try_to_split_thp_page(struct page *page, const char *msg)
 {
 	lock_page(page);
-	if (!PageAnon(page) || unlikely(split_huge_page(page))) {
+	if (unlikely(split_huge_page(page))) {
 		unsigned long pfn = page_to_pfn(page);
 
 		unlock_page(page);
-		if (!PageAnon(page))
-			pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
-		else
-			pr_info("%s: %#lx: thp split failed\n", msg, pfn);
+		pr_info("%s: %#lx: thp split failed\n", msg, pfn);
 		put_page(page);
 		return -EBUSY;
 	}
-- 
2.26.2




* Re: [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs)
  2021-10-14 19:16 [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
                   ` (5 preceding siblings ...)
  2021-10-14 19:16 ` [v4 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly Yang Shi
@ 2021-10-15 20:28 ` Andrew Morton
  2021-10-15 21:48   ` Yang Shi
  2021-10-19  5:53 ` Naoya Horiguchi
  7 siblings, 1 reply; 18+ messages in thread
From: Andrew Morton @ 2021-10-15 20:28 UTC (permalink / raw)
  To: Yang Shi
  Cc: naoya.horiguchi, hughd, kirill.shutemov, willy, peterx,
	osalvador, linux-mm, linux-fsdevel, linux-kernel

On Thu, 14 Oct 2021 12:16:09 -0700 Yang Shi <shy828301@gmail.com> wrote:

[...]

Is the "RFC" still accurate, or might it be an accidental leftover?

I grabbed this series as-is for some testing, but I do think it would
be better if it was delivered as two separate series - one series for
the -stable material and one series for the 5.16-rc1 material.




* Re: [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs)
  2021-10-15 20:28 ` [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Andrew Morton
@ 2021-10-15 21:48   ` Yang Shi
  0 siblings, 0 replies; 18+ messages in thread
From: Yang Shi @ 2021-10-15 21:48 UTC (permalink / raw)
  To: Andrew Morton
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Peter Xu,
	Oscar Salvador, Linux MM, Linux FS-devel Mailing List,
	Linux Kernel Mailing List

On Fri, Oct 15, 2021 at 1:28 PM Andrew Morton <akpm@linux-foundation.org> wrote:
>
> On Thu, 14 Oct 2021 12:16:09 -0700 Yang Shi <shy828301@gmail.com> wrote:
>
> > [...]
>
> Is the "RFC" still accurate, or might it be an accidental leftover?

Yeah, I think it can be removed.

>
> I grabbed this series as-is for some testing, but I do think it would
> be better if it was delivered as two separate series - one series for
> the -stable material and one series for the 5.16-rc1 material.

Yeah, patches 1/6 and 2/6 should go to -stable, and the remaining patches
are for 5.16-rc1.  Thanks for taking them.

>



* Re: [v4 PATCH 2/6] mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
  2021-10-14 19:16 ` [v4 PATCH 2/6] mm: filemap: check if THP has hwpoisoned subpage for PMD page fault Yang Shi
@ 2021-10-19  5:50   ` Naoya Horiguchi
  2021-10-19 17:13     ` Yang Shi
  0 siblings, 1 reply; 18+ messages in thread
From: Naoya Horiguchi @ 2021-10-19  5:50 UTC (permalink / raw)
  To: Yang Shi
  Cc: naoya.horiguchi, hughd, kirill.shutemov, willy, peterx,
	osalvador, akpm, linux-mm, linux-fsdevel, linux-kernel

On Thu, Oct 14, 2021 at 12:16:11PM -0700, Yang Shi wrote:
[...]
> +#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> +/*
> + * PageHasHWPoisoned indicates that at least on subpage is hwpoisoned in the

At least "one" subpage?

[...]
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 73f68699e7ab..2809d12f16af 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1694,6 +1694,20 @@ int memory_failure(unsigned long pfn, int flags)
>  	}
>  
>  	if (PageTransHuge(hpage)) {
> +		/*
> +		 * The flag must be set after the refcount is bumpped

s/bumpped/bumped/ ?

> +		 * otherwise it may race with THP split.
> +		 * And the flag can't be set in get_hwpoison_page() since
> +		 * it is called by soft offline too and it is just called
> +		 * for !MF_COUNT_INCREASE.  So here seems to be the best
> +		 * place.
> +		 *
> +		 * Don't need care about the above error handling paths for
> +		 * get_hwpoison_page() since they handle either free page
> +		 * or unhandlable page.  The refcount is bumpped iff the

There's another "bumpped".

Otherwise looks good to me.

Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>



* Re: [v4 PATCH 3/6] mm: filemap: coding style cleanup for filemap_map_pmd()
  2021-10-14 19:16 ` [v4 PATCH 3/6] mm: filemap: coding style cleanup for filemap_map_pmd() Yang Shi
@ 2021-10-19  5:51   ` Naoya Horiguchi
  0 siblings, 0 replies; 18+ messages in thread
From: Naoya Horiguchi @ 2021-10-19  5:51 UTC (permalink / raw)
  To: Yang Shi
  Cc: naoya.horiguchi, hughd, kirill.shutemov, willy, peterx,
	osalvador, akpm, linux-mm, linux-fsdevel, linux-kernel

On Thu, Oct 14, 2021 at 12:16:12PM -0700, Yang Shi wrote:
> A minor cleanup to the indent.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>

Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>



* Re: [v4 PATCH 5/6] mm: shmem: don't truncate page if memory failure happens
  2021-10-14 19:16 ` [v4 PATCH 5/6] mm: shmem: don't truncate page if memory failure happens Yang Shi
@ 2021-10-19  5:52   ` Naoya Horiguchi
  2021-10-19 17:29     ` Yang Shi
  2021-10-20 18:32     ` Yang Shi
  0 siblings, 2 replies; 18+ messages in thread
From: Naoya Horiguchi @ 2021-10-19  5:52 UTC (permalink / raw)
  To: Yang Shi
  Cc: naoya.horiguchi, hughd, kirill.shutemov, willy, peterx,
	osalvador, akpm, linux-mm, linux-fsdevel, linux-kernel

On Thu, Oct 14, 2021 at 12:16:14PM -0700, Yang Shi wrote:
[...]
> diff --git a/mm/shmem.c b/mm/shmem.c
> index b5860f4a2738..69eaf65409e6 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2456,6 +2456,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>  	struct inode *inode = mapping->host;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	pgoff_t index = pos >> PAGE_SHIFT;
> +	int ret = 0;
>  
>  	/* i_rwsem is held by caller */
>  	if (unlikely(info->seals & (F_SEAL_GROW |
> @@ -2466,7 +2467,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>  			return -EPERM;
>  	}
>  
> -	return shmem_getpage(inode, index, pagep, SGP_WRITE);
> +	ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> +
> +	if (*pagep && PageHWPoison(*pagep)) {

shmem_getpage() could return with *pagep == NULL, so you need to check ret
first to avoid a NULL pointer dereference.

> +		unlock_page(*pagep);
> +		put_page(*pagep);
> +		ret = -EIO;
> +	}
> +
> +	return ret;
>  }
>  
>  static int
> @@ -2555,6 +2564,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  			unlock_page(page);
>  		}
>  
> +		if (page && PageHWPoison(page)) {
> +			error = -EIO;

Is it cleaner to add the PageHWPoison() check in the existing "if (page)"
block just above?  Then you don't have to check "page != NULL" twice.

@@ -2562,7 +2562,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
                        if (sgp == SGP_CACHE)
                                set_page_dirty(page);
                        unlock_page(page);

+                       if (PageHWPoison(page)) {
+                               error = -EIO;
+                               break;
+                       }
                }

                /*


Thanks,
Naoya Horiguchi



* Re: [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs)
  2021-10-14 19:16 [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
                   ` (6 preceding siblings ...)
  2021-10-15 20:28 ` [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Andrew Morton
@ 2021-10-19  5:53 ` Naoya Horiguchi
  2021-10-19 17:32   ` Yang Shi
  7 siblings, 1 reply; 18+ messages in thread
From: Naoya Horiguchi @ 2021-10-19  5:53 UTC (permalink / raw)
  To: Yang Shi
  Cc: naoya.horiguchi, hughd, kirill.shutemov, willy, peterx,
	osalvador, akpm, linux-mm, linux-fsdevel, linux-kernel

On Thu, Oct 14, 2021 at 12:16:09PM -0700, Yang Shi wrote:
> 
[...]
> 
> As the above investigation shows, it would take a huge amount of work to
> solve the potential data loss for all filesystems.  But it is much easier
> for in-memory filesystems, and such filesystems actually suffer more than
> others since even the data blocks are gone due to truncating.  So this
> patchset starts with shmem/tmpfs by taking option #1.

Thank you for the work.  I have a few comments on the TODO items...

> 
> TODO:
> * Unpoison has been broken since commit 0ed950d1f281 ("mm,hwpoison: make
>   get_hwpoison_page() call get_any_page()"), and this patch series makes the
>   refcount check for unpoisoning a shmem page fail.

It's OK to leave unpoison unsolved for now.  I'm working on this (revising
the v1 patch [1]), but I'm facing a race issue causing a kernel panic with a
kernel-mode page fault, so I need to solve it.

[1] https://lore.kernel.org/linux-mm/20210614021212.223326-1-nao.horiguchi@gmail.com/

> * Expand to other filesystems.  But I haven't heard feedback from filesystem
>   developers yet.

I think that hugetlbfs can be a good next target because it's similar to
shmem in that it's an in-memory filesystem.

Thanks,
Naoya Horiguchi



* Re: [v4 PATCH 2/6] mm: filemap: check if THP has hwpoisoned subpage for PMD page fault
  2021-10-19  5:50   ` Naoya Horiguchi
@ 2021-10-19 17:13     ` Yang Shi
  0 siblings, 0 replies; 18+ messages in thread
From: Yang Shi @ 2021-10-19 17:13 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Peter Xu,
	Oscar Salvador, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Mon, Oct 18, 2021 at 10:51 PM Naoya Horiguchi
<naoya.horiguchi@linux.dev> wrote:
>
> On Thu, Oct 14, 2021 at 12:16:11PM -0700, Yang Shi wrote:
> > When handling a shmem page fault, a THP with a corrupted subpage could be PMD
> > mapped if certain conditions are satisfied.  But the kernel is supposed to
> > send SIGBUS when trying to map a hwpoisoned page.
> >
> > There are two paths which may PMD-map the THP: fault around and regular fault.
> >
> > Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault() codepaths")
> > the situation was even worse in the fault-around path: the THP could be PMD
> > mapped as long as the VMA fit, regardless of which subpage was accessed and
> > corrupted.  After that commit, the THP can still be PMD mapped as long as the
> > head page is not corrupted.
> >
> > In the regular fault path, the THP can be PMD mapped as long as the corrupted
> > subpage is not the one accessed and the VMA fits.
> >
> > This loophole could be fixed by iterating over every subpage to check whether
> > any of them is hwpoisoned, but that is somewhat costly in the page fault path.
> >
> > So introduce a new page flag, called HasHWPoisoned, on the first tail page.  It
> > indicates that the THP has hwpoisoned subpage(s).  It is set once any subpage of
> > the THP is found hwpoisoned by memory failure and the refcount has been bumped
> > successfully, and cleared when the THP is freed or split.
> >
> > The soft offline path doesn't need this, since the soft offline handler just
> > marks a subpage hwpoisoned when the subpage is migrated successfully.  But
> > shmem THPs didn't get split and then migrated at all.
> >
> > Fixes: 800d8c63b2e9 ("shmem: add huge pages support")
> > Cc: <stable@vger.kernel.org>
> > Suggested-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  include/linux/page-flags.h | 23 +++++++++++++++++++++++
> >  mm/huge_memory.c           |  2 ++
> >  mm/memory-failure.c        | 14 ++++++++++++++
> >  mm/memory.c                |  9 +++++++++
> >  mm/page_alloc.c            |  4 +++-
> >  5 files changed, 51 insertions(+), 1 deletion(-)
> >
> > diff --git a/include/linux/page-flags.h b/include/linux/page-flags.h
> > index a558d67ee86f..901723d75677 100644
> > --- a/include/linux/page-flags.h
> > +++ b/include/linux/page-flags.h
> > @@ -171,6 +171,15 @@ enum pageflags {
> >       /* Compound pages. Stored in first tail page's flags */
> >       PG_double_map = PG_workingset,
> >
> > +#ifdef CONFIG_MEMORY_FAILURE
> > +     /*
> > +      * Compound pages. Stored in first tail page's flags.
> > +      * Indicates that at least one subpage is hwpoisoned in the
> > +      * THP.
> > +      */
> > +     PG_has_hwpoisoned = PG_mappedtodisk,
> > +#endif
> > +
> >       /* non-lru isolated movable page */
> >       PG_isolated = PG_reclaim,
> >
> > @@ -668,6 +677,20 @@ PAGEFLAG_FALSE(DoubleMap)
> >       TESTSCFLAG_FALSE(DoubleMap)
> >  #endif
> >
> > +#if defined(CONFIG_MEMORY_FAILURE) && defined(CONFIG_TRANSPARENT_HUGEPAGE)
> > +/*
> > + * PageHasHWPoisoned indicates that at least on subpage is hwpoisoned in the
>
> At least "one" subpage?
>
> > + * compound page.
> > + *
> > + * This flag is set by hwpoison handler.  Cleared by THP split or free page.
> > + */
> > +PAGEFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
> > +     TESTSCFLAG(HasHWPoisoned, has_hwpoisoned, PF_SECOND)
> > +#else
> > +PAGEFLAG_FALSE(HasHWPoisoned)
> > +     TESTSCFLAG_FALSE(HasHWPoisoned)
> > +#endif
> > +
> >  /*
> >   * Check if a page is currently marked HWPoisoned. Note that this check is
> >   * best effort only and inherently racy: there is no way to synchronize with
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 5e9ef0fc261e..0574b1613714 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2426,6 +2426,8 @@ static void __split_huge_page(struct page *page, struct list_head *list,
> >       /* lock lru list/PageCompound, ref frozen by page_ref_freeze */
> >       lruvec = lock_page_lruvec(head);
> >
> > +     ClearPageHasHWPoisoned(head);
> > +
> >       for (i = nr - 1; i >= 1; i--) {
> >               __split_huge_page_tail(head, i, lruvec, list);
> >               /* Some pages can be beyond EOF: drop them from page cache */
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 73f68699e7ab..2809d12f16af 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1694,6 +1694,20 @@ int memory_failure(unsigned long pfn, int flags)
> >       }
> >
> >       if (PageTransHuge(hpage)) {
> > +             /*
> > +              * The flag must be set after the refcount is bumpped
>
> s/bumpped/bumped/ ?
>
> > +              * otherwise it may race with THP split.
> > +              * And the flag can't be set in get_hwpoison_page() since
> > +              * it is called by soft offline too and it is just called
> > +              * for !MF_COUNT_INCREASE.  So here seems to be the best
> > +              * place.
> > +              *
> > +              * Don't need care about the above error handling paths for
> > +              * get_hwpoison_page() since they handle either free page
> > +              * or unhandlable page.  The refcount is bumpped iff the
>
> There's another "bumpped".

Thanks for catching these typos, will fix them in the next version.
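
(For context on the PF_SECOND policy used by the new PAGEFLAG above: it means
the flag is stored on the first tail page of the compound page, so roughly --
illustrative only, untested:

	/* what PageHasHWPoisoned(head) boils down to */
	struct page *first_tail = &head[1];
	bool poisoned = test_bit(PG_has_hwpoisoned, &first_tail->flags);

which is why it can only be used on compound pages and has to be cleared
before the THP is split or freed.)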

>
> Otherwise looks good to me.
>
> Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
>
> > +              * page is a valid handlable page.
> > +              */
> > +             SetPageHasHWPoisoned(hpage);
> >               if (try_to_split_thp_page(p, "Memory Failure") < 0) {
> >                       action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> >                       res = -EBUSY;
> > diff --git a/mm/memory.c b/mm/memory.c
> > index adf9b9ef8277..c52be6d6b605 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3906,6 +3906,15 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> >       if (compound_order(page) != HPAGE_PMD_ORDER)
> >               return ret;
> >
> > +     /*
> > +      * Just backoff if any subpage of a THP is corrupted otherwise
> > +      * the corrupted page may mapped by PMD silently to escape the
> > +      * check.  This kind of THP just can be PTE mapped.  Access to
> > +      * the corrupted subpage should trigger SIGBUS as expected.
> > +      */
> > +     if (unlikely(PageHasHWPoisoned(page)))
> > +             return ret;
> > +
> >       /*
> >        * Archs like ppc64 need additional space to store information
> >        * related to pte entry. Use the preallocated table for that.
> > diff --git a/mm/page_alloc.c b/mm/page_alloc.c
> > index b37435c274cf..7f37652f0287 100644
> > --- a/mm/page_alloc.c
> > +++ b/mm/page_alloc.c
> > @@ -1312,8 +1312,10 @@ static __always_inline bool free_pages_prepare(struct page *page,
> >
> >               VM_BUG_ON_PAGE(compound && compound_order(page) != order, page);
> >
> > -             if (compound)
> > +             if (compound) {
> >                       ClearPageDoubleMap(page);
> > +                     ClearPageHasHWPoisoned(page);
> > +             }
> >               for (i = 1; i < (1 << order); i++) {
> >                       if (compound)
> >                               bad += free_tail_pages_check(page, page + i);
> > --
> > 2.26.2
> >



* Re: [v4 PATCH 5/6] mm: shmem: don't truncate page if memory failure happens
  2021-10-19  5:52   ` Naoya Horiguchi
@ 2021-10-19 17:29     ` Yang Shi
  2021-10-19 22:30       ` Naoya Horiguchi
  2021-10-20 18:32     ` Yang Shi
  1 sibling, 1 reply; 18+ messages in thread
From: Yang Shi @ 2021-10-19 17:29 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Peter Xu,
	Oscar Salvador, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Mon, Oct 18, 2021 at 10:52 PM Naoya Horiguchi
<naoya.horiguchi@linux.dev> wrote:
>
> On Thu, Oct 14, 2021 at 12:16:14PM -0700, Yang Shi wrote:
> > The current behavior of memory failure is to truncate the page cache
> > regardless of whether the page is dirty or clean.  If the page is dirty,
> > a later access will get the obsolete data from disk without any notification
> > to the users.  This may cause silent data loss.  It is even worse for shmem,
> > since shmem is an in-memory filesystem: truncating the page cache means
> > discarding the data blocks, so a later read would return all zeroes.
> >
> > The right approach is to keep the corrupted page in the page cache; any
> > later access would return an error for syscalls or SIGBUS for page faults,
> > until the file is truncated, hole punched or removed.  Regular
> > storage-backed filesystems would be more complicated, so this patch
> > is focused on shmem.  This also unblocks the support for soft
> > offlining shmem THPs.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/memory-failure.c | 10 +++++++++-
> >  mm/shmem.c          | 37 ++++++++++++++++++++++++++++++++++---
> >  mm/userfaultfd.c    |  5 +++++
> >  3 files changed, 48 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index cdf8ccd0865f..f5eab593b2a7 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -57,6 +57,7 @@
> >  #include <linux/ratelimit.h>
> >  #include <linux/page-isolation.h>
> >  #include <linux/pagewalk.h>
> > +#include <linux/shmem_fs.h>
> >  #include "internal.h"
> >  #include "ras/ras_event.h"
> >
> > @@ -866,6 +867,7 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
> >  {
> >       int ret;
> >       struct address_space *mapping;
> > +     bool extra_pins;
> >
> >       delete_from_lru_cache(p);
> >
> > @@ -894,6 +896,12 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
> >               goto out;
> >       }
> >
> > +     /*
> > +      * The shmem page is kept in page cache instead of truncating
> > +      * so is expected to have an extra refcount after error-handling.
> > +      */
> > +     extra_pins = shmem_mapping(mapping);
> > +
> >       /*
> >        * Truncation is a bit tricky. Enable it per file system for now.
> >        *
> > @@ -903,7 +911,7 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
> >  out:
> >       unlock_page(p);
> >
> > -     if (has_extra_refcount(ps, p, false))
> > +     if (has_extra_refcount(ps, p, extra_pins))
> >               ret = MF_FAILED;
> >
> >       return ret;
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index b5860f4a2738..69eaf65409e6 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2456,6 +2456,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> >       struct inode *inode = mapping->host;
> >       struct shmem_inode_info *info = SHMEM_I(inode);
> >       pgoff_t index = pos >> PAGE_SHIFT;
> > +     int ret = 0;
> >
> >       /* i_rwsem is held by caller */
> >       if (unlikely(info->seals & (F_SEAL_GROW |
> > @@ -2466,7 +2467,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> >                       return -EPERM;
> >       }
> >
> > -     return shmem_getpage(inode, index, pagep, SGP_WRITE);
> > +     ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> > +
> > +     if (*pagep && PageHWPoison(*pagep)) {
>
> shmem_getpage() could return with pagep == NULL, so you need to check ret first
> to avoid a NULL pointer dereference.

Really? IIUC pagep can't be NULL.  It is a pointer to a pointer passed in
by the caller, for example, generic_perform_write().  Of course,
"*pagep" could be NULL.

>
> > +             unlock_page(*pagep);
> > +             put_page(*pagep);
> > +             ret = -EIO;
> > +     }
> > +
> > +     return ret;
> >  }
> >
> >  static int
> > @@ -2555,6 +2564,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >                       unlock_page(page);
> >               }
> >
> > +             if (page && PageHWPoison(page)) {
> > +                     error = -EIO;
>
> Is it cleaner to add the PageHWPoison() check in the existing "if (page)" block
> just above?  Then you don't have to check "page != NULL" twice.
>
> @@ -2562,7 +2562,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>                         if (sgp == SGP_CACHE)
>                                 set_page_dirty(page);
>                         unlock_page(page);
>
> +                       if (PageHWPoison(page)) {
> +                               error = -EIO;
> +                               break;
> +                       }

Yeah, it looks better indeed.

>                 }
>
>                 /*
>
>
> Thanks,
> Naoya Horiguchi



* Re: [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs)
  2021-10-19  5:53 ` Naoya Horiguchi
@ 2021-10-19 17:32   ` Yang Shi
  0 siblings, 0 replies; 18+ messages in thread
From: Yang Shi @ 2021-10-19 17:32 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Peter Xu,
	Oscar Salvador, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Mon, Oct 18, 2021 at 10:53 PM Naoya Horiguchi
<naoya.horiguchi@linux.dev> wrote:
>
> On Thu, Oct 14, 2021 at 12:16:09PM -0700, Yang Shi wrote:
> >
> > When discussing the patch that splits page cache THPs in order to offline the
> > poisoned page, Naoya mentioned there is a bigger problem [1] that prevents this
> > from working, since the page cache page will be truncated if uncorrectable
> > errors happen.  Looking into this more deeply, it turns out this approach
> > (truncating the poisoned page) may incur silent data loss for all non-readonly
> > filesystems if the page is dirty.  It may be worse for in-memory filesystems,
> > e.g. shmem/tmpfs, since the data blocks are actually gone.
> >
> > To solve this problem we could keep the poisoned dirty page in the page cache
> > and notify users on any later access, e.g. page fault, read/write, etc.  Clean
> > pages can be truncated as before, since they can be reread from disk later on.
> >
> > The consequence is that filesystems may find a poisoned page and manipulate it
> > as a healthy page, since the filesystems actually don't check whether the page
> > is poisoned in any of the relevant paths except page fault.  In general, we
> > need to make the filesystems aware of poisoned pages before we can keep
> > poisoned pages in the page cache in order to solve the data loss problem.
> >
> > To make filesystems aware of poisoned pages we should consider:
> > - The page should not be written back: clearing the dirty flag prevents
> >   writeback.
> > - The page should not be dropped (it shows up as a clean page) by drop caches
> >   or other callers: the refcount pin from hwpoison prevents invalidation
> >   (called by cache drop, inode cache shrinking, etc), but it doesn't prevent
> >   invalidation in the DIO path.
> > - The page should be able to get truncated/hole punched/unlinked: this works
> >   as is.
> > - Notify users when the page is accessed, e.g. read/write, page fault and other
> >   paths (compression, encryption, etc).
> >
> > The scope of the last one is huge, since almost all filesystems need to do it
> > once a page is returned from a page cache lookup.  There are a couple of
> > options to do it:
> >
> > 1. Check the hwpoison flag in every path: the most straightforward way.
> > 2. Return NULL for a poisoned page from page cache lookup.  Most callsites
> >    already check whether NULL is returned, so this should be the least work, I
> >    think.  But the error handling in filesystems would then just return
> >    -ENOMEM, and that error code would obviously confuse users.
> > 3. To improve #2, we could return an error pointer, e.g. ERR_PTR(-EIO), but
> >    this would involve a significant amount of code change as well, since all
> >    the paths would need to check whether the pointer is an ERR_PTR, just like
> >    option #1.
> >
> > I prototyped both #1 and #3, but it seems #3 may require more changes than #1.
> > For #3 an ERR_PTR is returned, so all the callers need to check the return
> > value, otherwise an invalid pointer may be dereferenced.  But not all callers
> > really care about the content of the page, for example partial truncate, which
> > just zeroes the truncated range in one page.  Such paths would need additional
> > modification if an ERR_PTR is returned.  And if the callers have their own way
> > to handle problematic pages, we would need to add a new FGP flag to tell the
> > FGP functions to return the pointer to the page anyway.
> >
> > It may happen very rarely, but once it happens the consequence (data
> > corruption) could be very bad and it is very hard to debug.  It seems this
> > problem was briefly discussed before, but no action was taken at that time. [2]
> >
> > As the above investigation shows, it would take a huge amount of work to solve
> > the potential data loss for all filesystems.  But it is much easier for
> > in-memory filesystems, and such filesystems actually suffer more than others
> > since even the data blocks are gone due to truncation.  So this patchset
> > starts with shmem/tmpfs by taking option #1.
>
> Thank you for the work.  I have a few comments on the TODO items...
>
> >
> > TODO:
> > * The unpoison has been broken since commit 0ed950d1f281 ("mm,hwpoison: make
> >   get_hwpoison_page() call get_any_page()"), and this patch series makes the
> >   refcount check for unpoisoning shmem pages fail.
>
> It's OK to leave unpoison unsolved for now.  I'm working on it (revising the
> v1 patch [1]), but I'm facing a race issue causing a kernel panic with a
> kernel-mode page fault, so I need to solve that first.
>
> [1] https://lore.kernel.org/linux-mm/20210614021212.223326-1-nao.horiguchi@gmail.com/

Thanks.

>
> > * Expand to other filesystems.  But I haven't heard feedback from filesystem
> >   developers yet.
>
> I think that hugetlbfs can be a good next target because it's similar to
> shmem in that it's an in-memory filesystem.

Yeah, I agree. Will look into it later. Thanks for the suggestion.

>
> Thanks,
> Naoya Horiguchi



* Re: [v4 PATCH 5/6] mm: shmem: don't truncate page if memory failure happens
  2021-10-19 17:29     ` Yang Shi
@ 2021-10-19 22:30       ` Naoya Horiguchi
  0 siblings, 0 replies; 18+ messages in thread
From: Naoya Horiguchi @ 2021-10-19 22:30 UTC (permalink / raw)
  To: Yang Shi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Peter Xu,
	Oscar Salvador, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Tue, Oct 19, 2021 at 10:29:51AM -0700, Yang Shi wrote:
> On Mon, Oct 18, 2021 at 10:52 PM Naoya Horiguchi
> <naoya.horiguchi@linux.dev> wrote:
> >
> > On Thu, Oct 14, 2021 at 12:16:14PM -0700, Yang Shi wrote:
...
> > > @@ -2466,7 +2467,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> > >                       return -EPERM;
> > >       }
> > >
> > > -     return shmem_getpage(inode, index, pagep, SGP_WRITE);
> > > +     ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> > > +
> > > +     if (*pagep && PageHWPoison(*pagep)) {
> >
> > shmem_getpage() could return with pagep == NULL, so you need to check ret first
> > to avoid a NULL pointer dereference.
> 
> Really? IIUC pagep can't be NULL.  It is a pointer to a pointer passed in
> by the caller, for example, generic_perform_write().  Of course,
> "*pagep" could be NULL.

Oh, I simply missed this. You're right. Please ignore my comment on this.

- Naoya Horiguchi



* Re: [v4 PATCH 5/6] mm: shmem: don't truncate page if memory failure happens
  2021-10-19  5:52   ` Naoya Horiguchi
  2021-10-19 17:29     ` Yang Shi
@ 2021-10-20 18:32     ` Yang Shi
  1 sibling, 0 replies; 18+ messages in thread
From: Yang Shi @ 2021-10-20 18:32 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Peter Xu,
	Oscar Salvador, Andrew Morton, Linux MM,
	Linux FS-devel Mailing List, Linux Kernel Mailing List

On Mon, Oct 18, 2021 at 10:52 PM Naoya Horiguchi
<naoya.horiguchi@linux.dev> wrote:
>
> On Thu, Oct 14, 2021 at 12:16:14PM -0700, Yang Shi wrote:
> > The current behavior of memory failure is to truncate the page cache
> > regardless of whether the page is dirty or clean.  If the page is dirty,
> > a later access will get the obsolete data from disk without any notification
> > to the users.  This may cause silent data loss.  It is even worse for shmem,
> > since shmem is an in-memory filesystem: truncating the page cache means
> > discarding the data blocks, so a later read would return all zeroes.
> >
> > The right approach is to keep the corrupted page in the page cache; any
> > later access would return an error for syscalls or SIGBUS for page faults,
> > until the file is truncated, hole punched or removed.  Regular
> > storage-backed filesystems would be more complicated, so this patch
> > is focused on shmem.  This also unblocks the support for soft
> > offlining shmem THPs.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/memory-failure.c | 10 +++++++++-
> >  mm/shmem.c          | 37 ++++++++++++++++++++++++++++++++++---
> >  mm/userfaultfd.c    |  5 +++++
> >  3 files changed, 48 insertions(+), 4 deletions(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index cdf8ccd0865f..f5eab593b2a7 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -57,6 +57,7 @@
> >  #include <linux/ratelimit.h>
> >  #include <linux/page-isolation.h>
> >  #include <linux/pagewalk.h>
> > +#include <linux/shmem_fs.h>
> >  #include "internal.h"
> >  #include "ras/ras_event.h"
> >
> > @@ -866,6 +867,7 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
> >  {
> >       int ret;
> >       struct address_space *mapping;
> > +     bool extra_pins;
> >
> >       delete_from_lru_cache(p);
> >
> > @@ -894,6 +896,12 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
> >               goto out;
> >       }
> >
> > +     /*
> > +      * The shmem page is kept in page cache instead of truncating
> > +      * so is expected to have an extra refcount after error-handling.
> > +      */
> > +     extra_pins = shmem_mapping(mapping);
> > +
> >       /*
> >        * Truncation is a bit tricky. Enable it per file system for now.
> >        *
> > @@ -903,7 +911,7 @@ static int me_pagecache_clean(struct page_state *ps, struct page *p)
> >  out:
> >       unlock_page(p);
> >
> > -     if (has_extra_refcount(ps, p, false))
> > +     if (has_extra_refcount(ps, p, extra_pins))
> >               ret = MF_FAILED;
> >
> >       return ret;
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index b5860f4a2738..69eaf65409e6 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2456,6 +2456,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> >       struct inode *inode = mapping->host;
> >       struct shmem_inode_info *info = SHMEM_I(inode);
> >       pgoff_t index = pos >> PAGE_SHIFT;
> > +     int ret = 0;
> >
> >       /* i_rwsem is held by caller */
> >       if (unlikely(info->seals & (F_SEAL_GROW |
> > @@ -2466,7 +2467,15 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> >                       return -EPERM;
> >       }
> >
> > -     return shmem_getpage(inode, index, pagep, SGP_WRITE);
> > +     ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> > +
> > +     if (*pagep && PageHWPoison(*pagep)) {
>
> shmem_getpage() could return with pagep == NULL, so you need to check ret first
> to avoid a NULL pointer dereference.
>
> > +             unlock_page(*pagep);
> > +             put_page(*pagep);
> > +             ret = -EIO;
> > +     }
> > +
> > +     return ret;
> >  }
> >
> >  static int
> > @@ -2555,6 +2564,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >                       unlock_page(page);
> >               }
> >
> > +             if (page && PageHWPoison(page)) {
> > +                     error = -EIO;
>
> Is it cleaner to add the PageHWPoison() check in the existing "if (page)" block
> just above?  Then you don't have to check "page != NULL" twice.
>
> @@ -2562,7 +2562,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>                         if (sgp == SGP_CACHE)
>                                 set_page_dirty(page);
>                         unlock_page(page);
>
> +                       if (PageHWPoison(page)) {
> +                               error = -EIO;
> +                               break;

Looking further, I see I missed a "put_page" here in the first place.  Will
fix that in the next version too.
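
(Something like the following, then -- a sketch of the fixed hunk, untested:

		if (page) {
			if (sgp == SGP_CACHE)
				set_page_dirty(page);
			unlock_page(page);

			if (PageHWPoison(page)) {
				put_page(page);
				error = -EIO;
				break;
			}
		}

The put_page() drops the reference taken by the page cache lookup before
bailing out with -EIO.)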

> +                       }
>                 }
>
>                 /*
>
>
> Thanks,
> Naoya Horiguchi



Thread overview: 18+ messages
2021-10-14 19:16 [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
2021-10-14 19:16 ` [v4 PATCH 1/6] mm: hwpoison: remove the unnecessary THP check Yang Shi
2021-10-14 19:16 ` [v4 PATCH 2/6] mm: filemap: check if THP has hwpoisoned subpage for PMD page fault Yang Shi
2021-10-19  5:50   ` Naoya Horiguchi
2021-10-19 17:13     ` Yang Shi
2021-10-14 19:16 ` [v4 PATCH 3/6] mm: filemap: coding style cleanup for filemap_map_pmd() Yang Shi
2021-10-19  5:51   ` Naoya Horiguchi
2021-10-14 19:16 ` [v4 PATCH 4/6] mm: hwpoison: refactor refcount check handling Yang Shi
2021-10-14 19:16 ` [v4 PATCH 5/6] mm: shmem: don't truncate page if memory failure happens Yang Shi
2021-10-19  5:52   ` Naoya Horiguchi
2021-10-19 17:29     ` Yang Shi
2021-10-19 22:30       ` Naoya Horiguchi
2021-10-20 18:32     ` Yang Shi
2021-10-14 19:16 ` [v4 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly Yang Shi
2021-10-15 20:28 ` [RFC v4 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Andrew Morton
2021-10-15 21:48   ` Yang Shi
2021-10-19  5:53 ` Naoya Horiguchi
2021-10-19 17:32   ` Yang Shi
