linux-kernel.vger.kernel.org archive mirror
* [RFC PATCH 0/4] Solve silent data loss caused by poisoned page cache (shmem/tmpfs)
@ 2021-09-14 18:37 Yang Shi
  2021-09-14 18:37 ` [PATCH 1/4] mm: filemap: check if any subpage is hwpoisoned for PMD page fault Yang Shi
                   ` (3 more replies)
  0 siblings, 4 replies; 15+ messages in thread
From: Yang Shi @ 2021-09-14 18:37 UTC (permalink / raw)
  To: naoya.horiguchi, hughd, kirill.shutemov, willy, osalvador, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel


When discussing the patch that splits page cache THPs in order to offline the
poisoned page, Naoya mentioned there is a bigger problem [1] that prevents
this from working: the page cache page will be truncated if uncorrectable
errors happen.  Looking into this more deeply, it turns out this approach
(truncating the poisoned page) may incur silent data loss for all
non-readonly filesystems if the page is dirty.  It is even worse for
in-memory filesystems, e.g. shmem/tmpfs, since the data blocks are actually
gone.

To solve this problem we could keep the poisoned dirty page in the page
cache and notify users on any later access, e.g. page fault, read/write,
etc.  Clean pages can be truncated as is, since they can be reread from
disk later on.

The consequence is that filesystems may find the poisoned page and
manipulate it as a healthy page, since filesystems don't actually check
whether a page is poisoned on any of the relevant paths except page fault.
In general, we need to make filesystems aware of poisoned pages before we
can keep such pages in the page cache in order to solve the data loss
problem.

To make filesystems aware of poisoned pages we should ensure:
- The page is not written back: clearing the dirty flag prevents writeback.
- The page is not dropped (it appears as a clean page) by drop_caches or
  other callers: the refcount pin from hwpoison prevents invalidation
  (called by cache drop, inode cache shrinking, etc), but it doesn't avoid
  invalidation in the DIO path.
- The page can still be truncated/hole punched/unlinked: this works as is.
- Users are notified when the page is accessed, e.g. by read/write, page
  fault and other paths (compression, encryption, etc).

The scope of the last one is huge, since almost all filesystems need to do
it once a page is returned from a page cache lookup.  There are a couple of
options to do it:

1. Check the hwpoison flag on every path; the most straightforward way.
2. Return NULL for a poisoned page from page cache lookup.  Most callsites
   check for a NULL return, so this should be the least work, I think.  But
   the error handling in filesystems would just return -ENOMEM, and that
   error code would obviously confuse users.
3. To improve on #2, we could return an error pointer, e.g. ERR_PTR(-EIO),
   but this involves a significant amount of code change as well, since all
   the paths need to check whether the pointer is an error, just like
   option #1.

I prototyped both #1 and #3, but it seems #3 may require more changes than
#1.  For #3, an ERR_PTR is returned, so all callers need to check the
return value, otherwise an invalid pointer may be dereferenced.  But not
all callers really care about the content of the page; for example, partial
truncate just zeroes the truncated range in one page.  Such paths need
additional modification if an ERR_PTR is returned.  And if callers have
their own way to handle problematic pages, we need to add a new FGP flag to
tell the FGP functions to return the pointer to the page anyway.

This may happen very rarely, but once it happens the consequence (data
corruption) can be very bad and very hard to debug.  It seems this problem
had been briefly discussed before, but no action was taken at that time. [2]

As the investigation above shows, it takes a huge amount of work to solve
the potential data loss for all filesystems.  But it is much easier for
in-memory filesystems, and such filesystems actually suffer more than
others since even the data blocks are gone due to truncation.  So this
patchset starts with shmem/tmpfs by taking option #1.

Patches #1 and #2: fix bugs in page fault and khugepaged.  Patch #2 also
                   does some preparation for the later patches.
Patch #3: keep the poisoned page in the page cache and handle such cases on
          all the paths.
Patch #4: the previous patches unblock page cache THP split, so this patch
          adds page cache THP split support.


[1] https://lore.kernel.org/linux-mm/CAHbLzkqNPBh_sK09qfr4yu4WTFOzRy+MKj+PA7iG-adzi9zGsg@mail.gmail.com/T/#m0e959283380156f1d064456af01ae51fdff91265
[2] https://lore.kernel.org/lkml/20210318183350.GT3420@casper.infradead.org/


^ permalink raw reply	[flat|nested] 15+ messages in thread

* [PATCH 1/4] mm: filemap: check if any subpage is hwpoisoned for PMD page fault
  2021-09-14 18:37 [RFC PATCH 0/4] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
@ 2021-09-14 18:37 ` Yang Shi
  2021-09-15 11:46   ` Kirill A. Shutemov
  2021-09-14 18:37 ` [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page Yang Shi
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 15+ messages in thread
From: Yang Shi @ 2021-09-14 18:37 UTC (permalink / raw)
  To: naoya.horiguchi, hughd, kirill.shutemov, willy, osalvador, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

When handling a shmem page fault, a THP with a corrupted subpage could be PMD
mapped if certain conditions are satisfied.  But the kernel is supposed to
send SIGBUS when trying to map a hwpoisoned page.

There are two paths which may do a PMD map: fault around and regular fault.

Before commit f9ce0be71d1f ("mm: Cleanup faultaround and finish_fault()
codepaths") things were even worse in the fault around path.  The THP could
be PMD mapped as long as the VMA fit, regardless of which subpage was
accessed and corrupted.  After that commit, the THP can be PMD mapped as
long as the head page is not corrupted.

In the regular fault path the THP could be PMD mapped as long as the
corrupted subpage is not the one accessed and the VMA fits.

Fix the loophole by iterating over all subpages to check for a hwpoisoned
one when doing the PMD map; if any is found, just fall back to the PTE map.
Such a THP can only be PTE mapped.  Do the check in the icache flush loop to
avoid iterating over all subpages twice; the icache flush is actually a noop
on most architectures.

Cc: <stable@vger.kernel.org>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/filemap.c | 15 +++++++++------
 mm/memory.c  | 11 ++++++++++-
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293b5d..740b7afe159a 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -3195,12 +3195,14 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 	}
 
 	if (pmd_none(*vmf->pmd) && PageTransHuge(page)) {
-	    vm_fault_t ret = do_set_pmd(vmf, page);
-	    if (!ret) {
-		    /* The page is mapped successfully, reference consumed. */
-		    unlock_page(page);
-		    return true;
-	    }
+		vm_fault_t ret = do_set_pmd(vmf, page);
+		if (ret == VM_FAULT_FALLBACK)
+			goto out;
+		if (!ret) {
+			/* The page is mapped successfully, reference consumed. */
+			unlock_page(page);
+			return true;
+		}
 	}
 
 	if (pmd_none(*vmf->pmd)) {
@@ -3220,6 +3222,7 @@ static bool filemap_map_pmd(struct vm_fault *vmf, struct page *page)
 		return true;
 	}
 
+out:
 	return false;
 }
 
diff --git a/mm/memory.c b/mm/memory.c
index 25fc46e87214..1765bf72ed16 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3920,8 +3920,17 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
 	if (unlikely(!pmd_none(*vmf->pmd)))
 		goto out;
 
-	for (i = 0; i < HPAGE_PMD_NR; i++)
+	for (i = 0; i < HPAGE_PMD_NR; i++) {
+		/*
+		 * Just backoff if any subpage of a THP is corrupted otherwise
+		 * the corrupted page may mapped by PMD silently to escape the
+		 * check.  This kind of THP just can be PTE mapped.  Access to
+		 * the corrupted subpage should trigger SIGBUS as expected.
+		 */
+		if (PageHWPoison(page + i))
+			goto out;
 		flush_icache_page(vma, page + i);
+	}
 
 	entry = mk_huge_pmd(page, vma->vm_page_prot);
 	if (write)
-- 
2.26.2



* [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page
  2021-09-14 18:37 [RFC PATCH 0/4] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
  2021-09-14 18:37 ` [PATCH 1/4] mm: filemap: check if any subpage is hwpoisoned for PMD page fault Yang Shi
@ 2021-09-14 18:37 ` Yang Shi
  2021-09-15 11:49   ` Kirill A. Shutemov
  2021-09-14 18:37 ` [PATCH 3/4] mm: shmem: don't truncate page if memory failure happens Yang Shi
  2021-09-14 18:37 ` [PATCH 4/4] mm: hwpoison: handle non-anonymous THP correctly Yang Shi
  3 siblings, 1 reply; 15+ messages in thread
From: Yang Shi @ 2021-09-14 18:37 UTC (permalink / raw)
  To: naoya.horiguchi, hughd, kirill.shutemov, willy, osalvador, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

khugepaged checks whether the page is on the LRU, but it doesn't hold the
page lock at that point, and it doesn't check again after taking the page
lock.  So it may race with others, e.g. the reclaimer, migration, etc.  All
of them isolate the page from the LRU, then lock the page, then do something
with it.

But such a page could still pass the refcount check done by khugepaged, so
the collapse would proceed.  Typically such a race is not fatal.  But if the
page has been isolated from the LRU before khugepaged got to it, it likely
means the page may not be suitable for collapse for now.

The more fatal case is that the following patch will keep the poisoned page
in the page cache for shmem, so khugepaged may collapse a poisoned page
since the refcount check could pass.  The 3 refcounts come from:
  - hwpoison
  - page cache
  - khugepaged

Since the page is not on the LRU, no refcount is added by LRU isolation.

This is definitely not expected.  Checking whether the page is on the LRU
after taking the page lock helps serialize against the hwpoison handler.

There is still a small race window between setting the hwpoison flag and
bumping the refcount in the hwpoison handler.  It could be closed by
checking the hwpoison flag in khugepaged, but this race seems unlikely to
happen in real-life workloads, so just check the LRU flag for now to avoid
over-engineering.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/khugepaged.c | 6 ++++++
 1 file changed, 6 insertions(+)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 045cc579f724..bdc161dc27dc 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1808,6 +1808,12 @@ static void collapse_file(struct mm_struct *mm,
 			goto out_unlock;
 		}
 
+		/* The hwpoisoned page is off LRU but in page cache */
+		if (!PageLRU(page)) {
+			result = SCAN_PAGE_LRU;
+			goto out_unlock;
+		}
+
 		if (isolate_lru_page(page)) {
 			result = SCAN_DEL_PAGE_LRU;
 			goto out_unlock;
-- 
2.26.2



* [PATCH 3/4] mm: shmem: don't truncate page if memory failure happens
  2021-09-14 18:37 [RFC PATCH 0/4] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
  2021-09-14 18:37 ` [PATCH 1/4] mm: filemap: check if any subpage is hwpoisoned for PMD page fault Yang Shi
  2021-09-14 18:37 ` [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page Yang Shi
@ 2021-09-14 18:37 ` Yang Shi
  2021-09-21  9:49   ` Naoya Horiguchi
  2021-09-14 18:37 ` [PATCH 4/4] mm: hwpoison: handle non-anonymous THP correctly Yang Shi
  3 siblings, 1 reply; 15+ messages in thread
From: Yang Shi @ 2021-09-14 18:37 UTC (permalink / raw)
  To: naoya.horiguchi, hughd, kirill.shutemov, willy, osalvador, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

The current behavior of memory failure is to truncate the page cache page
regardless of whether it is dirty or clean.  If the page is dirty, a later
access will get obsolete data from disk without any notification to the
user.  This may cause silent data loss.  It is even worse for shmem, since
shmem is an in-memory filesystem and truncating the page cache means
discarding the data blocks.  A later read would return all zeroes.

The right approach is to keep the corrupted page in the page cache; any
later access would return an error for syscalls or SIGBUS for page faults,
until the file is truncated, hole punched or removed.  Regular storage
backed filesystems would be more complicated, so this patch focuses on
shmem.  This also unblocks support for soft offlining shmem THPs.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/memory-failure.c |  3 ++-
 mm/shmem.c          | 25 +++++++++++++++++++++++--
 2 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 54879c339024..3e06cb9d5121 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1101,7 +1101,8 @@ static int page_action(struct page_state *ps, struct page *p,
 	result = ps->action(p, pfn);
 
 	count = page_count(p) - 1;
-	if (ps->action == me_swapcache_dirty && result == MF_DELAYED)
+	if ((ps->action == me_swapcache_dirty && result == MF_DELAYED) ||
+	    (ps->action == me_pagecache_dirty && result == MF_FAILED))
 		count--;
 	if (count > 0) {
 		pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
diff --git a/mm/shmem.c b/mm/shmem.c
index 88742953532c..ec33f4f7173d 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2456,6 +2456,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 	struct inode *inode = mapping->host;
 	struct shmem_inode_info *info = SHMEM_I(inode);
 	pgoff_t index = pos >> PAGE_SHIFT;
+	int ret = 0;
 
 	/* i_rwsem is held by caller */
 	if (unlikely(info->seals & (F_SEAL_GROW |
@@ -2466,7 +2467,19 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
 			return -EPERM;
 	}
 
-	return shmem_getpage(inode, index, pagep, SGP_WRITE);
+	ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
+
+	if (!ret) {
+		if (*pagep) {
+			if (PageHWPoison(*pagep)) {
+				unlock_page(*pagep);
+				put_page(*pagep);
+				ret = -EIO;
+			}
+		}
+	}
+
+	return ret;
 }
 
 static int
@@ -2555,6 +2568,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
 			unlock_page(page);
 		}
 
+		if (page && PageHWPoison(page)) {
+			error = -EIO;
+			break;
+		}
+
 		/*
 		 * We must evaluate after, since reads (unlike writes)
 		 * are called without i_rwsem protection against truncate
@@ -3782,7 +3800,6 @@ const struct address_space_operations shmem_aops = {
 #ifdef CONFIG_MIGRATION
 	.migratepage	= migrate_page,
 #endif
-	.error_remove_page = generic_error_remove_page,
 };
 EXPORT_SYMBOL(shmem_aops);
 
@@ -4193,6 +4210,10 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
 		page = ERR_PTR(error);
 	else
 		unlock_page(page);
+
+	if (!IS_ERR(page) && PageHWPoison(page))
+		page = NULL;
+
 	return page;
 #else
 	/*
-- 
2.26.2



* [PATCH 4/4] mm: hwpoison: handle non-anonymous THP correctly
  2021-09-14 18:37 [RFC PATCH 0/4] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
                   ` (2 preceding siblings ...)
  2021-09-14 18:37 ` [PATCH 3/4] mm: shmem: don't truncate page if memory failure happens Yang Shi
@ 2021-09-14 18:37 ` Yang Shi
  2021-09-21  9:50   ` Naoya Horiguchi
  3 siblings, 1 reply; 15+ messages in thread
From: Yang Shi @ 2021-09-14 18:37 UTC (permalink / raw)
  To: naoya.horiguchi, hughd, kirill.shutemov, willy, osalvador, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Currently hwpoison doesn't handle non-anonymous THPs, but since v4.8 THP
support for tmpfs and read-only file cache has been added.  They can be
offlined by splitting the THP, just like anonymous THPs.

Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/memory-failure.c | 21 ++++++++++++---------
 1 file changed, 12 insertions(+), 9 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3e06cb9d5121..6f72aab8ec4a 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1150,13 +1150,16 @@ static int __get_hwpoison_page(struct page *page)
 
 	if (PageTransHuge(head)) {
 		/*
-		 * Non anonymous thp exists only in allocation/free time. We
-		 * can't handle such a case correctly, so let's give it up.
-		 * This should be better than triggering BUG_ON when kernel
-		 * tries to touch the "partially handled" page.
+		 * We can't handle allocating or freeing THPs, so let's give
+		 * it up. This should be better than triggering BUG_ON when
+		 * kernel tries to touch the "partially handled" page.
+		 *
+		 * page->mapping won't be initialized until the page is added
+		 * to the rmap or the page cache.  Use it as an indicator of
+		 * whether the page is instantiated.
 		 */
-		if (!PageAnon(head)) {
-			pr_err("Memory failure: %#lx: non anonymous thp\n",
+		if (!head->mapping) {
+			pr_err("Memory failure: %#lx: non instantiated thp\n",
 				page_to_pfn(page));
 			return 0;
 		}
@@ -1415,12 +1418,12 @@ static int identify_page_state(unsigned long pfn, struct page *p,
 static int try_to_split_thp_page(struct page *page, const char *msg)
 {
 	lock_page(page);
-	if (!PageAnon(page) || unlikely(split_huge_page(page))) {
+	if (!page->mapping || unlikely(split_huge_page(page))) {
 		unsigned long pfn = page_to_pfn(page);
 
 		unlock_page(page);
-		if (!PageAnon(page))
-			pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
+		if (!page->mapping)
+			pr_info("%s: %#lx: not instantiated thp\n", msg, pfn);
 		else
 			pr_info("%s: %#lx: thp split failed\n", msg, pfn);
 		put_page(page);
-- 
2.26.2



* Re: [PATCH 1/4] mm: filemap: check if any subpage is hwpoisoned for PMD page fault
  2021-09-14 18:37 ` [PATCH 1/4] mm: filemap: check if any subpage is hwpoisoned for PMD page fault Yang Shi
@ 2021-09-15 11:46   ` Kirill A. Shutemov
  2021-09-15 17:28     ` Yang Shi
  0 siblings, 1 reply; 15+ messages in thread
From: Kirill A. Shutemov @ 2021-09-15 11:46 UTC (permalink / raw)
  To: Yang Shi
  Cc: naoya.horiguchi, hughd, kirill.shutemov, willy, osalvador, akpm,
	linux-mm, linux-fsdevel, linux-kernel

On Tue, Sep 14, 2021 at 11:37:15AM -0700, Yang Shi wrote:
> diff --git a/mm/memory.c b/mm/memory.c
> index 25fc46e87214..1765bf72ed16 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -3920,8 +3920,17 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
>  	if (unlikely(!pmd_none(*vmf->pmd)))
>  		goto out;
>  
> -	for (i = 0; i < HPAGE_PMD_NR; i++)
> +	for (i = 0; i < HPAGE_PMD_NR; i++) {
> +		/*
> +		 * Just backoff if any subpage of a THP is corrupted otherwise
> +		 * the corrupted page may mapped by PMD silently to escape the
> +		 * check.  This kind of THP just can be PTE mapped.  Access to
> +		 * the corrupted subpage should trigger SIGBUS as expected.
> +		 */
> +		if (PageHWPoison(page + i))
> +			goto out;
>  		flush_icache_page(vma, page + i);
> +	}

This is somewhat costly.

flush_icache_page() is empty on most archs, so the compiler makes the loop
go away before this change. Also page->flags for most of the pages will not
necessarily be hot.

I wonder if we should consider making PG_hwpoison cover the full compound
page. On marking a page hwpoison we try to split it and mark the relevant
base page; if the split fails, mark the full compound page.

As an alternative, we can have one more flag that indicates that the
compound page contains at least one hwpoisoned base page. We should have
enough space in the first tail page.

-- 
 Kirill A. Shutemov


* Re: [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page
  2021-09-14 18:37 ` [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page Yang Shi
@ 2021-09-15 11:49   ` Kirill A. Shutemov
  2021-09-15 17:48     ` Yang Shi
  0 siblings, 1 reply; 15+ messages in thread
From: Kirill A. Shutemov @ 2021-09-15 11:49 UTC (permalink / raw)
  To: Yang Shi
  Cc: naoya.horiguchi, hughd, kirill.shutemov, willy, osalvador, akpm,
	linux-mm, linux-fsdevel, linux-kernel

On Tue, Sep 14, 2021 at 11:37:16AM -0700, Yang Shi wrote:
> The khugepaged does check if the page is on LRU or not but it doesn't
> hold page lock.  And it doesn't check this again after holding page
> lock.  So it may race with some others, e.g. reclaimer, migration, etc.
> All of them isolates page from LRU then lock the page then do something.
> 
> But it could pass the refcount check done by khugepaged to proceed
> collapse.  Typically such race is not fatal.  But if the page has been
> isolated from LRU before khugepaged it likely means the page may be not
> suitable for collapse for now.
> 
> The other more fatal case is the following patch will keep the poisoned
> page in page cache for shmem, so khugepaged may collapse a poisoned page
> since the refcount check could pass.  3 refcounts come from:
>   - hwpoison
>   - page cache
>   - khugepaged
> 
> Since it is not on LRU so no refcount is incremented from LRU isolation.
> 
> This is definitely not expected.  Checking if it is on LRU or not after
> holding page lock could help serialize against hwpoison handler.
> 
> But there is still a small race window between setting hwpoison flag and
> bump refcount in hwpoison handler.  It could be closed by checking
> hwpoison flag in khugepaged, however this race seems unlikely to happen
> in real life workload.  So just check LRU flag for now to avoid
> over-engineering.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/khugepaged.c | 6 ++++++
>  1 file changed, 6 insertions(+)
> 
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 045cc579f724..bdc161dc27dc 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1808,6 +1808,12 @@ static void collapse_file(struct mm_struct *mm,
>  			goto out_unlock;
>  		}
>  
> +		/* The hwpoisoned page is off LRU but in page cache */
> +		if (!PageLRU(page)) {
> +			result = SCAN_PAGE_LRU;
> +			goto out_unlock;
> +		}
> +
>  		if (isolate_lru_page(page)) {

isolate_lru_page() should catch the case, no? TestClearPageLRU would fail
and we get here.

>  			result = SCAN_DEL_PAGE_LRU;
>  			goto out_unlock;
> -- 
> 2.26.2
> 
> 

-- 
 Kirill A. Shutemov


* Re: [PATCH 1/4] mm: filemap: check if any subpage is hwpoisoned for PMD page fault
  2021-09-15 11:46   ` Kirill A. Shutemov
@ 2021-09-15 17:28     ` Yang Shi
  0 siblings, 0 replies; 15+ messages in thread
From: Yang Shi @ 2021-09-15 17:28 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oscar Salvador,
	Andrew Morton, Linux MM, Linux FS-devel Mailing List,
	Linux Kernel Mailing List

On Wed, Sep 15, 2021 at 4:46 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Tue, Sep 14, 2021 at 11:37:15AM -0700, Yang Shi wrote:
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 25fc46e87214..1765bf72ed16 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -3920,8 +3920,17 @@ vm_fault_t do_set_pmd(struct vm_fault *vmf, struct page *page)
> >       if (unlikely(!pmd_none(*vmf->pmd)))
> >               goto out;
> >
> > -     for (i = 0; i < HPAGE_PMD_NR; i++)
> > +     for (i = 0; i < HPAGE_PMD_NR; i++) {
> > +             /*
> > +              * Just backoff if any subpage of a THP is corrupted otherwise
> > +              * the corrupted page may mapped by PMD silently to escape the
> > +              * check.  This kind of THP just can be PTE mapped.  Access to
> > +              * the corrupted subpage should trigger SIGBUS as expected.
> > +              */
> > +             if (PageHWPoison(page + i))
> > +                     goto out;
> >               flush_icache_page(vma, page + i);
> > +     }
>
> This is somewhat costly.
>
> flush_icache_page() is empty on most archs so compiler makes the loop go
> away before the change. Also page->flags for most of the pages will not
> necessary be hot.

Yeah, good point.

>
> I wounder if we should consider making PG_hwpoison to cover full compound
> page. On marking page hwpoison we try to split it and mark relevant base
> page, if split fails -- mark full compound page.

We need extra bits to record exactly which subpage(s) are poisoned so
that the right page can be isolated when splitting.

>
> As alternative we can have one more flag that indicates that the compound
> page contains at least one hwpoisoned base page. We should have enough
> space in the first tail page.

Yes, actually I was thinking about the same thing when debugging this
problem. I think this approach is more feasible. We could add a new flag
in the first tail page, just like doublemap, which indicates there are
poisoned subpage(s). It could be cleared when splitting.

I will try to implement this in the next version. Thanks a lot for the
suggestion.

>
> --
>  Kirill A. Shutemov


* Re: [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page
  2021-09-15 11:49   ` Kirill A. Shutemov
@ 2021-09-15 17:48     ` Yang Shi
  2021-09-15 23:00       ` Yang Shi
  0 siblings, 1 reply; 15+ messages in thread
From: Yang Shi @ 2021-09-15 17:48 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oscar Salvador,
	Andrew Morton, Linux MM, Linux FS-devel Mailing List,
	Linux Kernel Mailing List

On Wed, Sep 15, 2021 at 4:49 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
>
> On Tue, Sep 14, 2021 at 11:37:16AM -0700, Yang Shi wrote:
> > The khugepaged does check if the page is on LRU or not but it doesn't
> > hold page lock.  And it doesn't check this again after holding page
> > lock.  So it may race with some others, e.g. reclaimer, migration, etc.
> > All of them isolates page from LRU then lock the page then do something.
> >
> > But it could pass the refcount check done by khugepaged to proceed
> > collapse.  Typically such race is not fatal.  But if the page has been
> > isolated from LRU before khugepaged it likely means the page may be not
> > suitable for collapse for now.
> >
> > The other more fatal case is the following patch will keep the poisoned
> > page in page cache for shmem, so khugepaged may collapse a poisoned page
> > since the refcount check could pass.  3 refcounts come from:
> >   - hwpoison
> >   - page cache
> >   - khugepaged
> >
> > Since it is not on LRU so no refcount is incremented from LRU isolation.
> >
> > This is definitely not expected.  Checking if it is on LRU or not after
> > holding page lock could help serialize against hwpoison handler.
> >
> > But there is still a small race window between setting hwpoison flag and
> > bump refcount in hwpoison handler.  It could be closed by checking
> > hwpoison flag in khugepaged, however this race seems unlikely to happen
> > in real life workload.  So just check LRU flag for now to avoid
> > over-engineering.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/khugepaged.c | 6 ++++++
> >  1 file changed, 6 insertions(+)
> >
> > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > index 045cc579f724..bdc161dc27dc 100644
> > --- a/mm/khugepaged.c
> > +++ b/mm/khugepaged.c
> > @@ -1808,6 +1808,12 @@ static void collapse_file(struct mm_struct *mm,
> >                       goto out_unlock;
> >               }
> >
> > +             /* The hwpoisoned page is off LRU but in page cache */
> > +             if (!PageLRU(page)) {
> > +                     result = SCAN_PAGE_LRU;
> > +                     goto out_unlock;
> > +             }
> > +
> >               if (isolate_lru_page(page)) {
>
> isolate_lru_page() should catch the case, no? TestClearPageLRU would fail
> and we get here.

Hmm... you are definitely right. How could I miss this point.

It might be because I messed up the page state with some tests which do
hole punch and then reread the same index. That could drop the poisoned
page and let the collapse succeed. But I'm not sure. Anyway, I didn't
figure out how the poisoned page could be collapsed; it seems impossible.
I will drop this patch.

>
> >                       result = SCAN_DEL_PAGE_LRU;
> >                       goto out_unlock;
> > --
> > 2.26.2
> >
> >
>
> --
>  Kirill A. Shutemov


* Re: [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page
  2021-09-15 17:48     ` Yang Shi
@ 2021-09-15 23:00       ` Yang Shi
  2021-09-15 23:10         ` Yang Shi
  0 siblings, 1 reply; 15+ messages in thread
From: Yang Shi @ 2021-09-15 23:00 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oscar Salvador,
	Andrew Morton, Linux MM, Linux FS-devel Mailing List,
	Linux Kernel Mailing List

On Wed, Sep 15, 2021 at 10:48 AM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Sep 15, 2021 at 4:49 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> >
> > On Tue, Sep 14, 2021 at 11:37:16AM -0700, Yang Shi wrote:
> > > The khugepaged does check if the page is on LRU or not but it doesn't
> > > hold page lock.  And it doesn't check this again after holding page
> > > lock.  So it may race with some others, e.g. reclaimer, migration, etc.
> > > All of them isolates page from LRU then lock the page then do something.
> > >
> > > But it could pass the refcount check done by khugepaged to proceed
> > > collapse.  Typically such race is not fatal.  But if the page has been
> > > isolated from LRU before khugepaged it likely means the page may be not
> > > suitable for collapse for now.
> > >
> > > The other more fatal case is the following patch will keep the poisoned
> > > page in page cache for shmem, so khugepaged may collapse a poisoned page
> > > since the refcount check could pass.  3 refcounts come from:
> > >   - hwpoison
> > >   - page cache
> > >   - khugepaged
> > >
> > > Since it is not on LRU so no refcount is incremented from LRU isolation.
> > >
> > > This is definitely not expected.  Checking if it is on LRU or not after
> > > holding page lock could help serialize against hwpoison handler.
> > >
> > > But there is still a small race window between setting hwpoison flag and
> > > bump refcount in hwpoison handler.  It could be closed by checking
> > > hwpoison flag in khugepaged, however this race seems unlikely to happen
> > > in real life workload.  So just check LRU flag for now to avoid
> > > over-engineering.
> > >
> > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > ---
> > >  mm/khugepaged.c | 6 ++++++
> > >  1 file changed, 6 insertions(+)
> > >
> > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > index 045cc579f724..bdc161dc27dc 100644
> > > --- a/mm/khugepaged.c
> > > +++ b/mm/khugepaged.c
> > > @@ -1808,6 +1808,12 @@ static void collapse_file(struct mm_struct *mm,
> > >                       goto out_unlock;
> > >               }
> > >
> > > +             /* The hwpoisoned page is off LRU but in page cache */
> > > +             if (!PageLRU(page)) {
> > > +                     result = SCAN_PAGE_LRU;
> > > +                     goto out_unlock;
> > > +             }
> > > +
> > >               if (isolate_lru_page(page)) {
> >
> > isolate_lru_page() should catch the case, no? TestClearPageLRU would fail
> > and we get here.
>
> Hmm... you are definitely right. How could I miss this point.
>
> It might be because of I messed up the page state by some tests which
> may do hole punch then reread the same index. That could drop the
> poisoned page then collapse succeed. But I'm not sure. Anyway I didn't
> figure out how the poisoned page could be collapsed. It seems
> impossible. I will drop this patch.

I think I figured out the problem. It happens after the page cache split
patch, and only if the hwpoisoned page is not the head page. THP split
unfreezes the refcount of each tail page to 2 (restoring the refcount
held by the page cache), then decrements it to 1. The refcount pin from
hwpoison is gone, and the page is still on the LRU. Then khugepaged
locks the page before hwpoison does, and the refcount is exactly what
khugepaged expects.

The worse thing is that this problem seems applicable to anonymous pages
too. Once an anonymous THP is split by hwpoison, the pin from hwpoison
is gone as well and the refcount is 1 (from the PTE map). Then
khugepaged could collapse it into a huge page again. That may incur
data corruption.

And the poisoned page may be freed back to the buddy allocator due to
the lost refcount pin.

If the poisoned page is the head page, the code is fine since hwpoison
doesn't put the refcount for the head page after the split.

The fix is simple: just keep the refcount pin for the hwpoisoned subpage.

>
> >
> > >                       result = SCAN_DEL_PAGE_LRU;
> > >                       goto out_unlock;
> > > --
> > > 2.26.2
> > >
> > >
> >
> > --
> >  Kirill A. Shutemov

^ permalink raw reply	[flat|nested] 15+ messages in thread

* Re: [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page
  2021-09-15 23:00       ` Yang Shi
@ 2021-09-15 23:10         ` Yang Shi
  0 siblings, 0 replies; 15+ messages in thread
From: Yang Shi @ 2021-09-15 23:10 UTC (permalink / raw)
  To: Kirill A. Shutemov
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oscar Salvador,
	Andrew Morton, Linux MM, Linux FS-devel Mailing List,
	Linux Kernel Mailing List

On Wed, Sep 15, 2021 at 4:00 PM Yang Shi <shy828301@gmail.com> wrote:
>
> On Wed, Sep 15, 2021 at 10:48 AM Yang Shi <shy828301@gmail.com> wrote:
> >
> > On Wed, Sep 15, 2021 at 4:49 AM Kirill A. Shutemov <kirill@shutemov.name> wrote:
> > >
> > > On Tue, Sep 14, 2021 at 11:37:16AM -0700, Yang Shi wrote:
> > > > khugepaged does check whether the page is on the LRU, but it doesn't
> > > > hold the page lock, and it doesn't check this again after taking the
> > > > page lock.  So it may race with others, e.g. the reclaimer, migration, etc.
> > > > All of them isolate the page from the LRU, then lock the page, then do something.
> > > >
> > > > But the page could still pass the refcount check done by khugepaged
> > > > and the collapse would proceed.  Typically such a race is not fatal.
> > > > But if the page has been isolated from the LRU before khugepaged gets
> > > > to it, it likely means the page is not suitable for collapse for now.
> > > >
> > > > The other, more fatal case is that the following patch will keep the poisoned
> > > > page in the page cache for shmem, so khugepaged may collapse a poisoned page
> > > > since the refcount check could pass.  The 3 refcounts come from:
> > > >   - hwpoison
> > > >   - page cache
> > > >   - khugepaged
> > > >
> > > > Since it is not on the LRU, no refcount is incremented by LRU isolation.
> > > >
> > > > This is definitely not expected.  Checking whether it is on the LRU after
> > > > taking the page lock could help serialize against the hwpoison handler.
> > > >
> > > > But there is still a small race window between setting the hwpoison flag
> > > > and bumping the refcount in the hwpoison handler.  It could be closed by
> > > > checking the hwpoison flag in khugepaged; however, this race seems
> > > > unlikely to happen in a real-life workload.  So just check the LRU flag
> > > > for now to avoid over-engineering.
> > > >
> > > > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > > > ---
> > > >  mm/khugepaged.c | 6 ++++++
> > > >  1 file changed, 6 insertions(+)
> > > >
> > > > diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> > > > index 045cc579f724..bdc161dc27dc 100644
> > > > --- a/mm/khugepaged.c
> > > > +++ b/mm/khugepaged.c
> > > > @@ -1808,6 +1808,12 @@ static void collapse_file(struct mm_struct *mm,
> > > >                       goto out_unlock;
> > > >               }
> > > >
> > > > +             /* The hwpoisoned page is off LRU but in page cache */
> > > > +             if (!PageLRU(page)) {
> > > > +                     result = SCAN_PAGE_LRU;
> > > > +                     goto out_unlock;
> > > > +             }
> > > > +
> > > >               if (isolate_lru_page(page)) {
> > >
> > > isolate_lru_page() should catch the case, no? TestClearPageLRU would fail
> > > and we get here.
> >
> > Hmm... you are definitely right. How could I miss this point?
> >
> > It might be because I messed up the page state with some tests which
> > do a hole punch then reread the same index. That could drop the
> > poisoned page, then the collapse would succeed. But I'm not sure. Anyway,
> > I couldn't figure out how the poisoned page could be collapsed. It seems
> > impossible. I will drop this patch.
>
> I think I figured out the problem. It happens after the page cache split
> patch, and only if the hwpoisoned page is not the head page. THP split
> unfreezes the refcount of each tail page to 2 (restoring the refcount
> held by the page cache), then decrements it to 1. The refcount pin from
> hwpoison is gone, and the page is still on the LRU. Then khugepaged
> locks the page before hwpoison does, and the refcount is exactly what
> khugepaged expects.
>
> The worse thing is that this problem seems applicable to anonymous pages
> too. Once an anonymous THP is split by hwpoison, the pin from hwpoison
> is gone as well and the refcount is 1 (from the PTE map). Then
> khugepaged could collapse it into a huge page again. That may incur
> data corruption.
>
> And the poisoned page may be freed back to the buddy allocator due to
> the lost refcount pin.
>
> If the poisoned page is the head page, the code is fine since hwpoison
> doesn't put the refcount for the head page after the split.
>
> The fix is simple: just keep the refcount pin for the hwpoisoned subpage.

Err... wait... I just realized I missed the code block below:

if (subpage == page)
        continue;

It skips the subpage passed to split_huge_page(), so the refcount pin
from the caller for this subpage is kept, and hwpoison doesn't put it.
So it seems fine.

>
> >
> > >
> > > >                       result = SCAN_DEL_PAGE_LRU;
> > > >                       goto out_unlock;
> > > > --
> > > > 2.26.2
> > > >
> > > >
> > >
> > > --
> > >  Kirill A. Shutemov


* Re: [PATCH 3/4] mm: shmem: don't truncate page if memory failure happens
  2021-09-14 18:37 ` [PATCH 3/4] mm: shmem: don't truncate page if memory failure happens Yang Shi
@ 2021-09-21  9:49   ` Naoya Horiguchi
  2021-09-21 19:34     ` Yang Shi
  0 siblings, 1 reply; 15+ messages in thread
From: Naoya Horiguchi @ 2021-09-21  9:49 UTC (permalink / raw)
  To: Yang Shi
  Cc: naoya.horiguchi, hughd, kirill.shutemov, willy, osalvador, akpm,
	linux-mm, linux-fsdevel, linux-kernel

On Tue, Sep 14, 2021 at 11:37:17AM -0700, Yang Shi wrote:
> The current behavior of memory failure is to truncate the page cache
> regardless of dirty or clean.  If the page is dirty the later access
> will get the obsolete data from disk without any notification to the
> users.  This may cause silent data loss.  It is even worse for shmem
> since shmem is in-memory filesystem, truncating page cache means
> discarding data blocks.  The later read would return all zero.
> 
> The right approach is to keep the corrupted page in page cache, any
> later access would return error for syscalls or SIGBUS for page fault,
> until the file is truncated, hole punched or removed.  The regular
> storage backed filesystems would be more complicated so this patch
> is focused on shmem.  This also unblock the support for soft
> offlining shmem THP.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/memory-failure.c |  3 ++-
>  mm/shmem.c          | 25 +++++++++++++++++++++++--
>  2 files changed, 25 insertions(+), 3 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 54879c339024..3e06cb9d5121 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1101,7 +1101,8 @@ static int page_action(struct page_state *ps, struct page *p,
>  	result = ps->action(p, pfn);
>  
>  	count = page_count(p) - 1;
> -	if (ps->action == me_swapcache_dirty && result == MF_DELAYED)
> +	if ((ps->action == me_swapcache_dirty && result == MF_DELAYED) ||
> +	    (ps->action == me_pagecache_dirty && result == MF_FAILED))

This new line seems to affect the case of a dirty page cache page
on other filesystems, with the result that the "still referenced"
message is missed for some unmap-failure cases (although that's not
so critical).  So checking the filesystem type (for example with
shmem_mapping()) might be helpful?

And I think we might want to do some refactoring to pass
*ps to each ps->action() callback, then move this refcount check to
the places that need it.
I don't think we always need the refcount check, for example in the
MF_MSG_KERNEL and MF_MSG_UNKNOWN cases (because no one knows the
expected values for these cases).


>  		count--;
>  	if (count > 0) {
>  		pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
> diff --git a/mm/shmem.c b/mm/shmem.c
> index 88742953532c..ec33f4f7173d 100644
> --- a/mm/shmem.c
> +++ b/mm/shmem.c
> @@ -2456,6 +2456,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>  	struct inode *inode = mapping->host;
>  	struct shmem_inode_info *info = SHMEM_I(inode);
>  	pgoff_t index = pos >> PAGE_SHIFT;
> +	int ret = 0;
>  
>  	/* i_rwsem is held by caller */
>  	if (unlikely(info->seals & (F_SEAL_GROW |
> @@ -2466,7 +2467,19 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
>  			return -EPERM;
>  	}
>  
> -	return shmem_getpage(inode, index, pagep, SGP_WRITE);
> +	ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> +
> +	if (!ret) {

Maybe this "!ret" check is not necessary because *pagep is set
non-NULL only when ret is 0.  It could save one indent level.

> +		if (*pagep) {
> +			if (PageHWPoison(*pagep)) {
> +				unlock_page(*pagep);
> +				put_page(*pagep);
> +				ret = -EIO;
> +			}
> +		}
> +	}
> +
> +	return ret;
>  }
>  
>  static int
> @@ -2555,6 +2568,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
>  			unlock_page(page);
>  		}
>  
> +		if (page && PageHWPoison(page)) {
> +			error = -EIO;
> +			break;
> +		}
> +
>  		/*
>  		 * We must evaluate after, since reads (unlike writes)
>  		 * are called without i_rwsem protection against truncate
> @@ -3782,7 +3800,6 @@ const struct address_space_operations shmem_aops = {
>  #ifdef CONFIG_MIGRATION
>  	.migratepage	= migrate_page,
>  #endif
> -	.error_remove_page = generic_error_remove_page,

This change makes truncate_error_page() call invalidate_inode_page(),
and in my testing it fails with a "Failed to invalidate" message.
As a result, memory_failure() finally returns -EBUSY.  I'm not
sure that's expected, because this patchset changes behavior to keep
error pages in the page cache as the proper error handling.
Maybe you can avoid this by defining .error_remove_page in shmem_aops
to simply return 0.

>  };
>  EXPORT_SYMBOL(shmem_aops);
>  
> @@ -4193,6 +4210,10 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
>  		page = ERR_PTR(error);
>  	else
>  		unlock_page(page);
> +
> +	if (PageHWPoison(page))
> +		page = NULL;
> +
>  	return page;

One more comment ...

  - I guess that you added PageHWPoison() checks after some call sites
    of shmem_getpage() and shmem_getpage_gfp(), but they seem not to
    cover all of them.  For example, can mcontinue_atomic_pte() in
    mm/userfaultfd.c properly handle PageHWPoison?

I'm trying to test in more detail, but in my current understanding,
this patch looks promising to me.  Thank you for your effort.

Thanks,
Naoya Horiguchi


* Re: [PATCH 4/4] mm: hwpoison: handle non-anonymous THP correctly
  2021-09-14 18:37 ` [PATCH 4/4] mm: hwpoison: handle non-anonymous THP correctly Yang Shi
@ 2021-09-21  9:50   ` Naoya Horiguchi
  2021-09-21 19:46     ` Yang Shi
  0 siblings, 1 reply; 15+ messages in thread
From: Naoya Horiguchi @ 2021-09-21  9:50 UTC (permalink / raw)
  To: Yang Shi
  Cc: naoya.horiguchi, hughd, kirill.shutemov, willy, osalvador, akpm,
	linux-mm, linux-fsdevel, linux-kernel

On Tue, Sep 14, 2021 at 11:37:18AM -0700, Yang Shi wrote:
> Currently hwpoison doesn't handle non-anonymous THP, but since v4.8 THP
> support for tmpfs and read-only file cache has been added.  They could
> be offlined by split THP, just like anonymous THP.
> 
> Signed-off-by: Yang Shi <shy828301@gmail.com>
> ---
>  mm/memory-failure.c | 21 ++++++++++++---------
>  1 file changed, 12 insertions(+), 9 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 3e06cb9d5121..6f72aab8ec4a 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1150,13 +1150,16 @@ static int __get_hwpoison_page(struct page *page)
>  
>  	if (PageTransHuge(head)) {
>  		/*
> -		 * Non anonymous thp exists only in allocation/free time. We
> -		 * can't handle such a case correctly, so let's give it up.
> -		 * This should be better than triggering BUG_ON when kernel
> -		 * tries to touch the "partially handled" page.
> +		 * We can't handle allocating or freeing THPs, so let's give
> +		 * it up. This should be better than triggering BUG_ON when
> +		 * kernel tries to touch the "partially handled" page.
> +		 *
> +		 * page->mapping won't be initialized until the page is added
> +		 * to rmap or page cache.  Use this as an indicator for if
> +		 * this is an instantiated page.
>  		 */
> -		if (!PageAnon(head)) {
> -			pr_err("Memory failure: %#lx: non anonymous thp\n",
> +		if (!head->mapping) {
> +			pr_err("Memory failure: %#lx: non instantiated thp\n",
>  				page_to_pfn(page));
>  			return 0;
>  		}

How about cleaning up this whole "PageTransHuge()" block?  As explained in
commit 415c64c1453a ("mm/memory-failure: split thp earlier in memory error
handling"), this check was introduced to avoid a non-anonymous THP being
considered as hugetlb, with the hugetlb code then executed on it (resulting
in a crash).

With the recent improvements in __get_hwpoison_page(), this confusion never
happens (because the hugetlb check is done before this check), so this check
seems to have finished its role.

Thanks,
Naoya Horiguchi

> @@ -1415,12 +1418,12 @@ static int identify_page_state(unsigned long pfn, struct page *p,
>  static int try_to_split_thp_page(struct page *page, const char *msg)
>  {
>  	lock_page(page);
> -	if (!PageAnon(page) || unlikely(split_huge_page(page))) {
> +	if (!page->mapping || unlikely(split_huge_page(page))) {
>  		unsigned long pfn = page_to_pfn(page);
>  
>  		unlock_page(page);
> -		if (!PageAnon(page))
> -			pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
> +		if (!page->mapping)
> +			pr_info("%s: %#lx: not instantiated thp\n", msg, pfn);
>  		else
>  			pr_info("%s: %#lx: thp split failed\n", msg, pfn);
>  		put_page(page);
> -- 
> 2.26.2
> 


* Re: [PATCH 3/4] mm: shmem: don't truncate page if memory failure happens
  2021-09-21  9:49   ` Naoya Horiguchi
@ 2021-09-21 19:34     ` Yang Shi
  0 siblings, 0 replies; 15+ messages in thread
From: Yang Shi @ 2021-09-21 19:34 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oscar Salvador,
	Andrew Morton, Linux MM, Linux FS-devel Mailing List,
	Linux Kernel Mailing List

On Tue, Sep 21, 2021 at 2:49 AM Naoya Horiguchi
<naoya.horiguchi@linux.dev> wrote:
>
> On Tue, Sep 14, 2021 at 11:37:17AM -0700, Yang Shi wrote:
> > The current behavior of memory failure is to truncate the page cache
> > regardless of dirty or clean.  If the page is dirty the later access
> > will get the obsolete data from disk without any notification to the
> > users.  This may cause silent data loss.  It is even worse for shmem
> > since shmem is in-memory filesystem, truncating page cache means
> > discarding data blocks.  The later read would return all zero.
> >
> > The right approach is to keep the corrupted page in page cache, any
> > later access would return error for syscalls or SIGBUS for page fault,
> > until the file is truncated, hole punched or removed.  The regular
> > storage backed filesystems would be more complicated so this patch
> > is focused on shmem.  This also unblock the support for soft
> > offlining shmem THP.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/memory-failure.c |  3 ++-
> >  mm/shmem.c          | 25 +++++++++++++++++++++++--
> >  2 files changed, 25 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 54879c339024..3e06cb9d5121 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1101,7 +1101,8 @@ static int page_action(struct page_state *ps, struct page *p,
> >       result = ps->action(p, pfn);
> >
> >       count = page_count(p) - 1;
> > -     if (ps->action == me_swapcache_dirty && result == MF_DELAYED)
> > +     if ((ps->action == me_swapcache_dirty && result == MF_DELAYED) ||
> > +         (ps->action == me_pagecache_dirty && result == MF_FAILED))
>
> This new line seems to affect the case of a dirty page cache page
> on other filesystems, with the result that the "still referenced"
> message is missed for some unmap-failure cases (although that's not
> so critical).  So checking the filesystem type (for example with
> shmem_mapping()) might be helpful?
>
> And I think we might want to do some refactoring to pass
> *ps to each ps->action() callback, then move this refcount check to
> the places that need it.
> I don't think we always need the refcount check, for example in the
> MF_MSG_KERNEL and MF_MSG_UNKNOWN cases (because no one knows the
> expected values for these cases).

Yeah, that seems to make sense to me. How about doing the below (totally untested):

static inline bool check_refcount(struct page_state *ps, struct page *page,
                                  unsigned long pfn, bool dec)
{
    int count = page_count(page) - 1;

    if (dec || shmem_mapping(page->mapping))
        count -= 1;

    if (count > 0) {
         pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
                       pfn, action_page_types[ps->type], count);
         return false;
    }

    return true;
}

Then call this in the me_* functions that need it and return the right
value based on its result. I think me_swapcache_dirty() is the only place
that needs to pass true for the dec parameter.

>
>
> >               count--;
> >       if (count > 0) {
> >               pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
> > diff --git a/mm/shmem.c b/mm/shmem.c
> > index 88742953532c..ec33f4f7173d 100644
> > --- a/mm/shmem.c
> > +++ b/mm/shmem.c
> > @@ -2456,6 +2456,7 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> >       struct inode *inode = mapping->host;
> >       struct shmem_inode_info *info = SHMEM_I(inode);
> >       pgoff_t index = pos >> PAGE_SHIFT;
> > +     int ret = 0;
> >
> >       /* i_rwsem is held by caller */
> >       if (unlikely(info->seals & (F_SEAL_GROW |
> > @@ -2466,7 +2467,19 @@ shmem_write_begin(struct file *file, struct address_space *mapping,
> >                       return -EPERM;
> >       }
> >
> > -     return shmem_getpage(inode, index, pagep, SGP_WRITE);
> > +     ret = shmem_getpage(inode, index, pagep, SGP_WRITE);
> > +
> > +     if (!ret) {
>
> Maybe this "!ret" check is not necessary because *pagep is set
> non-NULL only when ret is 0.  It could save one indent level.

Yes, sure.

>
> > +             if (*pagep) {
> > +                     if (PageHWPoison(*pagep)) {
> > +                             unlock_page(*pagep);
> > +                             put_page(*pagep);
> > +                             ret = -EIO;
> > +                     }
> > +             }
> > +     }
> > +
> > +     return ret;
> >  }
> >
> >  static int
> > @@ -2555,6 +2568,11 @@ static ssize_t shmem_file_read_iter(struct kiocb *iocb, struct iov_iter *to)
> >                       unlock_page(page);
> >               }
> >
> > +             if (page && PageHWPoison(page)) {
> > +                     error = -EIO;
> > +                     break;
> > +             }
> > +
> >               /*
> >                * We must evaluate after, since reads (unlike writes)
> >                * are called without i_rwsem protection against truncate
> > @@ -3782,7 +3800,6 @@ const struct address_space_operations shmem_aops = {
> >  #ifdef CONFIG_MIGRATION
> >       .migratepage    = migrate_page,
> >  #endif
> > -     .error_remove_page = generic_error_remove_page,
>
> This change makes truncate_error_page() calls invalidate_inode_page(),
> and in my testing it fails with "Failed to invalidate" message.
> So as a result memory_failure() finally returns with -EBUSY. I'm not
> sure it's expected because this patchset changes to keep error pages
> in page cache as a proper error handling.
> Maybe you can avoid this by defining .error_remove_page in shmem_aops
> which simply returns 0.

Yes, the "Failed to invalidate" message seems confusing. I agree a
shmem-specific callback is better.
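A minimal sketch of that suggestion (the callback name
shmem_error_remove_page is an assumption):

```c
/*
 * Unlike generic_error_remove_page(), do not truncate the page:
 * returning 0 keeps the hwpoisoned page in the page cache, so later
 * accesses see PageHWPoison and fail with -EIO/SIGBUS instead of
 * silently reading stale or zeroed data.
 */
static int shmem_error_remove_page(struct address_space *mapping,
				   struct page *page)
{
	return 0;
}
```

with `.error_remove_page = shmem_error_remove_page,` wired into shmem_aops
in place of the removed generic_error_remove_page.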

>
> >  };
> >  EXPORT_SYMBOL(shmem_aops);
> >
> > @@ -4193,6 +4210,10 @@ struct page *shmem_read_mapping_page_gfp(struct address_space *mapping,
> >               page = ERR_PTR(error);
> >       else
> >               unlock_page(page);
> > +
> > +     if (PageHWPoison(page))
> > +             page = NULL;
> > +
> >       return page;
>
> One more comment ...
>
>   - I guess that you add PageHWPoison() checks after some call sites
>     of shmem_getpage() and shmem_getpage_gfp(), but seems not cover all.
>     For example, mcontinue_atomic_pte() in mm/userfaultfd.c can properly
>     handle PageHWPoison?

No, I didn't touch anything outside shmem.c. I could add this in the
next version.

BTW, I just found another problem with the change in
shmem_read_mapping_page_gfp(): it should return ERR_PTR(-EIO) instead
of NULL, since the callers may not handle NULL. Will fix that in the
next version too.

>
> I'm trying to test in more detail, but in my current understanding,
> this patch looks promising to me.  Thank you for your effort.

Thanks a lot for taking the time to do the review.

>
> Thanks,
> Naoya Horiguchi


* Re: [PATCH 4/4] mm: hwpoison: handle non-anonymous THP correctly
  2021-09-21  9:50   ` Naoya Horiguchi
@ 2021-09-21 19:46     ` Yang Shi
  0 siblings, 0 replies; 15+ messages in thread
From: Yang Shi @ 2021-09-21 19:46 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Hugh Dickins, Kirill A. Shutemov, Matthew Wilcox, Oscar Salvador,
	Andrew Morton, Linux MM, Linux FS-devel Mailing List,
	Linux Kernel Mailing List

On Tue, Sep 21, 2021 at 2:50 AM Naoya Horiguchi
<naoya.horiguchi@linux.dev> wrote:
>
> On Tue, Sep 14, 2021 at 11:37:18AM -0700, Yang Shi wrote:
> > Currently hwpoison doesn't handle non-anonymous THP, but since v4.8 THP
> > support for tmpfs and read-only file cache has been added.  They could
> > be offlined by split THP, just like anonymous THP.
> >
> > Signed-off-by: Yang Shi <shy828301@gmail.com>
> > ---
> >  mm/memory-failure.c | 21 ++++++++++++---------
> >  1 file changed, 12 insertions(+), 9 deletions(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 3e06cb9d5121..6f72aab8ec4a 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1150,13 +1150,16 @@ static int __get_hwpoison_page(struct page *page)
> >
> >       if (PageTransHuge(head)) {
> >               /*
> > -              * Non anonymous thp exists only in allocation/free time. We
> > -              * can't handle such a case correctly, so let's give it up.
> > -              * This should be better than triggering BUG_ON when kernel
> > -              * tries to touch the "partially handled" page.
> > +              * We can't handle allocating or freeing THPs, so let's give
> > +              * it up. This should be better than triggering BUG_ON when
> > +              * kernel tries to touch the "partially handled" page.
> > +              *
> > +              * page->mapping won't be initialized until the page is added
> > +              * to rmap or page cache.  Use this as an indicator for if
> > +              * this is an instantiated page.
> >                */
> > -             if (!PageAnon(head)) {
> > -                     pr_err("Memory failure: %#lx: non anonymous thp\n",
> > +             if (!head->mapping) {
> > +                     pr_err("Memory failure: %#lx: non instantiated thp\n",
> >                               page_to_pfn(page));
> >                       return 0;
> >               }
>
> How about cleaning up this whole "PageTransHuge()" block?  As explained in
> commit 415c64c1453a ("mm/memory-failure: split thp earlier in memory error
> handling"), this check was introduced to avoid a non-anonymous THP being
> considered as hugetlb, with the hugetlb code then executed on it (resulting
> in a crash).
>
> With the recent improvements in __get_hwpoison_page(), this confusion never
> happens (because the hugetlb check is done before this check), so this check
> seems to have finished its role.

I see. IIUC the !PageAnon check was used to prevent mistreating the
THP as a hugetlb page. But that was actually solved by splitting the THP
earlier. If so, this check definitely could go away, since the worst
case is a split failure. Will fix it in the next version.
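If the check does go away, try_to_split_thp_page() could collapse to
something like the following (a sketch under that assumption):

```c
static int try_to_split_thp_page(struct page *page, const char *msg)
{
	lock_page(page);
	if (unlikely(split_huge_page(page))) {
		unsigned long pfn = page_to_pfn(page);

		unlock_page(page);
		pr_info("%s: %#lx: thp split failed\n", msg, pfn);
		put_page(page);
		return -EBUSY;
	}
	unlock_page(page);

	return 0;
}
```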

>
> Thanks,
> Naoya Horiguchi
>
> > @@ -1415,12 +1418,12 @@ static int identify_page_state(unsigned long pfn, struct page *p,
> >  static int try_to_split_thp_page(struct page *page, const char *msg)
> >  {
> >       lock_page(page);
> > -     if (!PageAnon(page) || unlikely(split_huge_page(page))) {
> > +     if (!page->mapping || unlikely(split_huge_page(page))) {
> >               unsigned long pfn = page_to_pfn(page);
> >
> >               unlock_page(page);
> > -             if (!PageAnon(page))
> > -                     pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
> > +             if (!page->mapping)
> > +                     pr_info("%s: %#lx: not instantiated thp\n", msg, pfn);
> >               else
> >                       pr_info("%s: %#lx: thp split failed\n", msg, pfn);
> >               put_page(page);
> > --
> > 2.26.2
> >


end of thread, other threads:[~2021-09-21 19:46 UTC | newest]

Thread overview: 15+ messages
2021-09-14 18:37 [RFC PATCH 0/4] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
2021-09-14 18:37 ` [PATCH 1/4] mm: filemap: check if any subpage is hwpoisoned for PMD page fault Yang Shi
2021-09-15 11:46   ` Kirill A. Shutemov
2021-09-15 17:28     ` Yang Shi
2021-09-14 18:37 ` [PATCH 2/4] mm: khugepaged: check if file page is on LRU after locking page Yang Shi
2021-09-15 11:49   ` Kirill A. Shutemov
2021-09-15 17:48     ` Yang Shi
2021-09-15 23:00       ` Yang Shi
2021-09-15 23:10         ` Yang Shi
2021-09-14 18:37 ` [PATCH 3/4] mm: shmem: don't truncate page if memory failure happens Yang Shi
2021-09-21  9:49   ` Naoya Horiguchi
2021-09-21 19:34     ` Yang Shi
2021-09-14 18:37 ` [PATCH 4/4] mm: hwpoison: handle non-anonymous THP correctly Yang Shi
2021-09-21  9:50   ` Naoya Horiguchi
2021-09-21 19:46     ` Yang Shi
