stable.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
* hwpoison, shmem: fix data lost issue for 5.15.y
@ 2022-11-14 13:14 Naoya Horiguchi
  2022-11-14 16:11 ` Greg KH
  0 siblings, 1 reply; 13+ messages in thread
From: Naoya Horiguchi @ 2022-11-14 13:14 UTC (permalink / raw)
  To: stable
  Cc: Yang Shi, James Houghton, Mike Kravetz, Muchun Song, Miaohe Lin,
	Andrew Morton, Naoya Horiguchi

Hi,

I'd like to request the follow commits to be backported to 5.15.y.

- dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
- 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
- a76054266661 ("mm: shmem: don't truncate page if memory failure happens")

These patches fixed a data lost issue by preventing shmem pagecache from
being removed by memory error.  These were not tagged for stable originally,
but that's revisited recently.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: hwpoison, shmem: fix data lost issue for 5.15.y
  2022-11-14 13:14 hwpoison, shmem: fix data lost issue for 5.15.y Naoya Horiguchi
@ 2022-11-14 16:11 ` Greg KH
  2022-11-14 22:39   ` Naoya Horiguchi
  0 siblings, 1 reply; 13+ messages in thread
From: Greg KH @ 2022-11-14 16:11 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: stable, Yang Shi, James Houghton, Mike Kravetz, Muchun Song,
	Miaohe Lin, Andrew Morton, Naoya Horiguchi

On Mon, Nov 14, 2022 at 10:14:03PM +0900, Naoya Horiguchi wrote:
> Hi,
> 
> I'd like to request the follow commits to be backported to 5.15.y.
> 
> - dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> - 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> - a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> 
> These patches fixed a data lost issue by preventing shmem pagecache from
> being removed by memory error.  These were not tagged for stable originally,
> but that's revisited recently.

And have you tested that these all apply properly (and in which order?)
and work correctly?

thanks,

greg k-h

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: hwpoison, shmem: fix data lost issue for 5.15.y
  2022-11-14 16:11 ` Greg KH
@ 2022-11-14 22:39   ` Naoya Horiguchi
  2022-11-14 22:53     ` Mike Kravetz
  2022-11-16  9:16     ` Greg KH
  0 siblings, 2 replies; 13+ messages in thread
From: Naoya Horiguchi @ 2022-11-14 22:39 UTC (permalink / raw)
  To: Greg KH
  Cc: stable, Yang Shi, James Houghton, Mike Kravetz, Muchun Song,
	Miaohe Lin, Andrew Morton, Naoya Horiguchi

On Mon, Nov 14, 2022 at 05:11:35PM +0100, Greg KH wrote:
> On Mon, Nov 14, 2022 at 10:14:03PM +0900, Naoya Horiguchi wrote:
> > Hi,
> > 
> > I'd like to request the follow commits to be backported to 5.15.y.
> > 
> > - dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> > - 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> > - a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> > 
> > These patches fixed a data lost issue by preventing shmem pagecache from
> > being removed by memory error.  These were not tagged for stable originally,
> > but that's revisited recently.
> 
> And have you tested that these all apply properly (and in which order?)

Yes, I've checked that these cleanly apply (without any change) on
5.15.78 in the above order (i.e. dd0f23 is first, 496645 comes next,
then a76054).

> and work correctly?

Yes, I ran related testcases in my test suite, and their status changed
FAIL to PASS with these patches.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: hwpoison, shmem: fix data lost issue for 5.15.y
  2022-11-14 22:39   ` Naoya Horiguchi
@ 2022-11-14 22:53     ` Mike Kravetz
  2022-11-15  1:16       ` HORIGUCHI NAOYA(堀口 直也)
  2022-11-16  9:16     ` Greg KH
  1 sibling, 1 reply; 13+ messages in thread
From: Mike Kravetz @ 2022-11-14 22:53 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: Greg KH, stable, Yang Shi, James Houghton, Muchun Song,
	Miaohe Lin, Andrew Morton, Naoya Horiguchi

On 11/15/22 07:39, Naoya Horiguchi wrote:
> On Mon, Nov 14, 2022 at 05:11:35PM +0100, Greg KH wrote:
> > On Mon, Nov 14, 2022 at 10:14:03PM +0900, Naoya Horiguchi wrote:
> > > Hi,
> > > 
> > > I'd like to request the follow commits to be backported to 5.15.y.
> > > 
> > > - dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> > > - 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> > > - a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> > > 
> > > These patches fixed a data lost issue by preventing shmem pagecache from
> > > being removed by memory error.  These were not tagged for stable originally,
> > > but that's revisited recently.
> > 
> > And have you tested that these all apply properly (and in which order?)
> 
> Yes, I've checked that these cleanly apply (without any change) on
> 5.15.78 in the above order (i.e. dd0f23 is first, 496645 comes next,
> then a76054).
> 
> > and work correctly?
> 
> Yes, I ran related testcases in my test suite, and their status changed
> FAIL to PASS with these patches.

Hi Naoya,

Just curious if you have plans to do backports for earlier releases?

If not, I can start that effort.  We have seen data loss/corruption because of
this on a 4.14 based release.   So, I would go at least that far back.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: hwpoison, shmem: fix data lost issue for 5.15.y
  2022-11-14 22:53     ` Mike Kravetz
@ 2022-11-15  1:16       ` HORIGUCHI NAOYA(堀口 直也)
  2022-11-15  1:30         ` Mike Kravetz
  2022-11-23  1:05         ` Mike Kravetz
  0 siblings, 2 replies; 13+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-11-15  1:16 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Naoya Horiguchi, Greg KH, stable, Yang Shi, James Houghton,
	Muchun Song, Miaohe Lin, Andrew Morton

On Mon, Nov 14, 2022 at 02:53:51PM -0800, Mike Kravetz wrote:
> On 11/15/22 07:39, Naoya Horiguchi wrote:
> > On Mon, Nov 14, 2022 at 05:11:35PM +0100, Greg KH wrote:
> > > On Mon, Nov 14, 2022 at 10:14:03PM +0900, Naoya Horiguchi wrote:
> > > > Hi,
> > > > 
> > > > I'd like to request the follow commits to be backported to 5.15.y.
> > > > 
> > > > - dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> > > > - 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> > > > - a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> > > > 
> > > > These patches fixed a data lost issue by preventing shmem pagecache from
> > > > being removed by memory error.  These were not tagged for stable originally,
> > > > but that's revisited recently.
> > > 
> > > And have you tested that these all apply properly (and in which order?)
> > 
> > Yes, I've checked that these cleanly apply (without any change) on
> > 5.15.78 in the above order (i.e. dd0f23 is first, 496645 comes next,
> > then a76054).
> > 
> > > and work correctly?
> > 
> > Yes, I ran related testcases in my test suite, and their status changed
> > FAIL to PASS with these patches.
> 
> Hi Naoya,
> 
> Just curious if you have plans to do backports for earlier releases?

I didn't have a clear plan.  I just thought that we should backport to
earlier kernels if someone want and the patches are applicable easily
enough and well-tested.

> 
> If not, I can start that effort.  We have seen data loss/corruption because of
> this on a 4.14 based release.   So, I would go at least that far back.

Thank you for raising hand, that's really helpful.

Maybe dd0f230a0a80 ("[PATCH] hugetlbfs: don't delete error page from
pagecbache") should be considered to backport together, because it's
the similar issue and reported (a while ago) to fail to backport.
dd0f230a0a80 does not apply cleanly on top of 5.15.78 + the above 3 patches.
So I need check more and will update my current proposal for 5.15.y.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: hwpoison, shmem: fix data lost issue for 5.15.y
  2022-11-15  1:16       ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-11-15  1:30         ` Mike Kravetz
  2022-11-15  6:39           ` Naoya Horiguchi
  2022-11-23  1:05         ` Mike Kravetz
  1 sibling, 1 reply; 13+ messages in thread
From: Mike Kravetz @ 2022-11-15  1:30 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, Greg KH, stable, Yang Shi, James Houghton,
	Muchun Song, Miaohe Lin, Andrew Morton

On 11/15/22 01:16, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Nov 14, 2022 at 02:53:51PM -0800, Mike Kravetz wrote:
> > On 11/15/22 07:39, Naoya Horiguchi wrote:
> > > On Mon, Nov 14, 2022 at 05:11:35PM +0100, Greg KH wrote:
> > > > On Mon, Nov 14, 2022 at 10:14:03PM +0900, Naoya Horiguchi wrote:
> > > > > Hi,
> > > > > 
> > > > > I'd like to request the follow commits to be backported to 5.15.y.
> > > > > 
> > > > > - dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> > > > > - 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> > > > > - a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> > > > > 
> > > > > These patches fixed a data lost issue by preventing shmem pagecache from
> > > > > being removed by memory error.  These were not tagged for stable originally,
> > > > > but that's revisited recently.
> > > > 
> > > > And have you tested that these all apply properly (and in which order?)
> > > 
> > > Yes, I've checked that these cleanly apply (without any change) on
> > > 5.15.78 in the above order (i.e. dd0f23 is first, 496645 comes next,
> > > then a76054).
> > > 
> > > > and work correctly?
> > > 
> > > Yes, I ran related testcases in my test suite, and their status changed
> > > FAIL to PASS with these patches.
> > 
> > Hi Naoya,
> > 
> > Just curious if you have plans to do backports for earlier releases?
> 
> I didn't have a clear plan.  I just thought that we should backport to
> earlier kernels if someone want and the patches are applicable easily
> enough and well-tested.
> 
> > 
> > If not, I can start that effort.  We have seen data loss/corruption because of
> > this on a 4.14 based release.   So, I would go at least that far back.
> 
> Thank you for raising hand, that's really helpful.
> 
> Maybe dd0f230a0a80 ("[PATCH] hugetlbfs: don't delete error page from
> pagecbache") should be considered to backport together, because it's
> the similar issue and reported (a while ago) to fail to backport.

Since dd0f230a0a80 was marked for backports, Greg's automation flags it as
FAILED due to conflicts in earlier releases.  I am not sure if James has
a plan to do backports for dd0f230a0a80.  Again, this is also something I
would help with due to real customer issues.
-- 
Mike Kravetz

> dd0f230a0a80 does not apply cleanly on top of 5.15.78 + the above 3 patches.
> So I need check more and will update my current proposal for 5.15.y.
> 
> Thanks,
> Naoya Horiguchi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: hwpoison, shmem: fix data lost issue for 5.15.y
  2022-11-15  1:30         ` Mike Kravetz
@ 2022-11-15  6:39           ` Naoya Horiguchi
  2022-11-16  3:48             ` Mike Kravetz
  0 siblings, 1 reply; 13+ messages in thread
From: Naoya Horiguchi @ 2022-11-15  6:39 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Greg KH, stable, Yang Shi, James Houghton, Muchun Song,
	Miaohe Lin, Andrew Morton

On Mon, Nov 14, 2022 at 05:30:29PM -0800, Mike Kravetz wrote:
> On 11/15/22 01:16, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Mon, Nov 14, 2022 at 02:53:51PM -0800, Mike Kravetz wrote:
> > > On 11/15/22 07:39, Naoya Horiguchi wrote:
> > > > On Mon, Nov 14, 2022 at 05:11:35PM +0100, Greg KH wrote:
> > > > > On Mon, Nov 14, 2022 at 10:14:03PM +0900, Naoya Horiguchi wrote:
> > > > > > Hi,
> > > > > > 
> > > > > > I'd like to request the follow commits to be backported to 5.15.y.
> > > > > > 
> > > > > > - dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> > > > > > - 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> > > > > > - a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> > > > > > 
> > > > > > These patches fixed a data lost issue by preventing shmem pagecache from
> > > > > > being removed by memory error.  These were not tagged for stable originally,
> > > > > > but that's revisited recently.
> > > > > 
> > > > > And have you tested that these all apply properly (and in which order?)
> > > > 
> > > > Yes, I've checked that these cleanly apply (without any change) on
> > > > 5.15.78 in the above order (i.e. dd0f23 is first, 496645 comes next,
> > > > then a76054).
> > > > 
> > > > > and work correctly?
> > > > 
> > > > Yes, I ran related testcases in my test suite, and their status changed
> > > > FAIL to PASS with these patches.
> > > 
> > > Hi Naoya,
> > > 
> > > Just curious if you have plans to do backports for earlier releases?
> > 
> > I didn't have a clear plan.  I just thought that we should backport to
> > earlier kernels if someone want and the patches are applicable easily
> > enough and well-tested.
> > 
> > > 
> > > If not, I can start that effort.  We have seen data loss/corruption because of
> > > this on a 4.14 based release.   So, I would go at least that far back.
> > 
> > Thank you for raising hand, that's really helpful.
> > 
> > Maybe dd0f230a0a80 ("[PATCH] hugetlbfs: don't delete error page from

# I meant 8625147cafaa, sorry if the wrong commit ID confused you.

I tested with 8625147cafaa too, and it made hugetlb-related testcases
passed.

> > pagecbache") should be considered to backport together, because it's
> > the similar issue and reported (a while ago) to fail to backport.
> 
> Since dd0f230a0a80 was marked for backports, Greg's automation flags it as
> FAILED due to conflicts in earlier releases.  I am not sure if James has
> a plan to do backports for dd0f230a0a80.  Again, this is also something I
> would help with due to real customer issues.

OK, so I do want 8625147cafaa to be merged to stable.
Another reason why I addressed 8625147cafaa here is that this patch depends
on the code handling "extra_pins" (given by dd0f230a0a80) and it's simpler
to handle together.

We need to slightly modify 8625147cafaa to apply to 5.15.y.  So in summary,
my updated suggestion for 5.15.y is like below:

- [1/4] cherry-pick dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
- [2/4] cherry-pick 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
- [3/4] cherry-pick a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
- [4/4] apply the following patch (as a modified version of 8625147cafaa)

Thanks,
Naoya Horiguchi
---
From bdd88699d8ba6907e0815608482f8c0e2170982d Mon Sep 17 00:00:00 2001
From: James Houghton <jthoughton@google.com>
Date: Tue, 15 Nov 2022 15:20:51 +0900
Subject: [PATCH] hugetlbfs: don't delete error page from pagecache

commit 8625147cafaa9ba74713d682f5185eb62cb2aedb upstream.

This change is very similar to the change that was made for shmem [1], and
it solves the same problem but for HugeTLBFS instead.

Currently, when poison is found in a HugeTLB page, the page is removed
from the page cache.  That means that attempting to map or read that
hugepage in the future will result in a new hugepage being allocated
instead of notifying the user that the page was poisoned.  As [1] states,
this is effectively memory corruption.

The fix is to leave the page in the page cache.  If the user attempts to
use a poisoned HugeTLB page with a syscall, the syscall will fail with
EIO, the same error code that shmem uses.  For attempts to map the page,
the thread will get a BUS_MCEERR_AR SIGBUS.

[1]: commit a76054266661 ("mm: shmem: don't truncate page if memory failure happens")

Although there's a conflict in hugetlbfs_error_remove_page() when
backporting to stable tree, the resolution is almost trivial (just
removing conflicting lines in the function).

Link: https://lkml.kernel.org/r/20221018200125.848471-1-jthoughton@google.com
Signed-off-by: James Houghton <jthoughton@google.com>
Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Tested-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Reviewed-by: Yang Shi <shy828301@gmail.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: James Houghton <jthoughton@google.com>
Cc: Miaohe Lin <linmiaohe@huawei.com>
Cc: Muchun Song <songmuchun@bytedance.com>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---
 fs/hugetlbfs/inode.c | 13 ++++++-------
 mm/hugetlb.c         |  4 ++++
 mm/memory-failure.c  |  5 ++++-
 3 files changed, 14 insertions(+), 8 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index d74a49b188c2..be8deec29ebe 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -361,6 +361,12 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to)
 		} else {
 			unlock_page(page);
 
+			if (PageHWPoison(page)) {
+				put_page(page);
+				retval = -EIO;
+				break;
+			}
+
 			/*
 			 * We have the page, copy it to user space buffer.
 			 */
@@ -984,13 +990,6 @@ static int hugetlbfs_migrate_page(struct address_space *mapping,
 static int hugetlbfs_error_remove_page(struct address_space *mapping,
 				struct page *page)
 {
-	struct inode *inode = mapping->host;
-	pgoff_t index = page->index;
-
-	remove_huge_page(page);
-	if (unlikely(hugetlb_unreserve_pages(inode, index, index + 1, 1)))
-		hugetlb_fix_reserve_counts(inode);
-
 	return 0;
 }
 
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index dbb63ec3b5fa..e7bd42f23667 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5350,6 +5350,10 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
 	spin_lock(ptl);
 
+	ret = -EIO;
+	if (PageHWPoison(page))
+		goto out_release_unlock;
+
 	/*
 	 * Recheck the i_size after holding PT lock to make sure not
 	 * to leave any page mapped (as page_mapped()) beyond the end
diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 85b1a77e3a99..2ad0f4580091 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1040,6 +1040,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 	int res;
 	struct page *hpage = compound_head(p);
 	struct address_space *mapping;
+	bool extra_pins = false;
 
 	if (!PageHuge(hpage))
 		return MF_DELAYED;
@@ -1047,6 +1048,8 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 	mapping = page_mapping(hpage);
 	if (mapping) {
 		res = truncate_error_page(hpage, page_to_pfn(p), mapping);
+		/* The page is kept in page cache. */
+		extra_pins = true;
 		unlock_page(hpage);
 	} else {
 		res = MF_FAILED;
@@ -1064,7 +1067,7 @@ static int me_huge_page(struct page_state *ps, struct page *p)
 		}
 	}
 
-	if (has_extra_refcount(ps, p, false))
+	if (has_extra_refcount(ps, p, extra_pins))
 		res = MF_FAILED;
 
 	return res;
-- 
2.37.3.518.g79f2338b37


^ permalink raw reply related	[flat|nested] 13+ messages in thread

* Re: hwpoison, shmem: fix data lost issue for 5.15.y
  2022-11-15  6:39           ` Naoya Horiguchi
@ 2022-11-16  3:48             ` Mike Kravetz
  2022-11-16 23:58               ` Naoya Horiguchi
  0 siblings, 1 reply; 13+ messages in thread
From: Mike Kravetz @ 2022-11-16  3:48 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Greg KH, stable, Yang Shi, James Houghton, Muchun Song,
	Miaohe Lin, Andrew Morton

On 11/15/22 15:39, Naoya Horiguchi wrote:
> On Mon, Nov 14, 2022 at 05:30:29PM -0800, Mike Kravetz wrote:
> > On 11/15/22 01:16, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > On Mon, Nov 14, 2022 at 02:53:51PM -0800, Mike Kravetz wrote:
> > > > On 11/15/22 07:39, Naoya Horiguchi wrote:
> > > > > On Mon, Nov 14, 2022 at 05:11:35PM +0100, Greg KH wrote:
> > > > > > On Mon, Nov 14, 2022 at 10:14:03PM +0900, Naoya Horiguchi wrote:
> > > > > > > Hi,
> > > > > > > 
> > > > > > > I'd like to request the follow commits to be backported to 5.15.y.
> > > > > > > 
> > > > > > > - dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> > > > > > > - 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> > > > > > > - a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> > > > > > > 
> > > > > > > These patches fixed a data lost issue by preventing shmem pagecache from
> > > > > > > being removed by memory error.  These were not tagged for stable originally,
> > > > > > > but that's revisited recently.
> > > > > > 
> > > > > > And have you tested that these all apply properly (and in which order?)
> > > > > 
> > > > > Yes, I've checked that these cleanly apply (without any change) on
> > > > > 5.15.78 in the above order (i.e. dd0f23 is first, 496645 comes next,
> > > > > then a76054).
> > > > > 
> > > > > > and work correctly?
> > > > > 
> > > > > Yes, I ran related testcases in my test suite, and their status changed
> > > > > FAIL to PASS with these patches.
> > > > 
> > > > Hi Naoya,
> > > > 
> > > > Just curious if you have plans to do backports for earlier releases?
> > > 
> > > I didn't have a clear plan.  I just thought that we should backport to
> > > earlier kernels if someone want and the patches are applicable easily
> > > enough and well-tested.
> > > 
> > > > 
> > > > If not, I can start that effort.  We have seen data loss/corruption because of
> > > > this on a 4.14 based release.   So, I would go at least that far back.
> > > 
> > > Thank you for raising hand, that's really helpful.
> > > 
> > > Maybe dd0f230a0a80 ("[PATCH] hugetlbfs: don't delete error page from
> 
> # I meant 8625147cafaa, sorry if the wrong commit ID confused you.
> 
> I tested with 8625147cafaa too, and it made hugetlb-related testcases
> passed.
<snip>
> We need to slightly modify 8625147cafaa to apply to 5.15.y.  So in summary,
> my updated suggestion for 5.15.y is like below:
> 
> - [1/4] cherry-pick dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> - [2/4] cherry-pick 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> - [3/4] cherry-pick a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> - [4/4] apply the following patch (as a modified version of 8625147cafaa)

Hi Naoya,

Just curious, do you have automated tests for this?  I wanted test backports
to each stable release.  I could manually test, but that would be a bit
involved and was hoping you had something automated.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: hwpoison, shmem: fix data lost issue for 5.15.y
  2022-11-14 22:39   ` Naoya Horiguchi
  2022-11-14 22:53     ` Mike Kravetz
@ 2022-11-16  9:16     ` Greg KH
  1 sibling, 0 replies; 13+ messages in thread
From: Greg KH @ 2022-11-16  9:16 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: stable, Yang Shi, James Houghton, Mike Kravetz, Muchun Song,
	Miaohe Lin, Andrew Morton, Naoya Horiguchi

On Tue, Nov 15, 2022 at 07:39:00AM +0900, Naoya Horiguchi wrote:
> On Mon, Nov 14, 2022 at 05:11:35PM +0100, Greg KH wrote:
> > On Mon, Nov 14, 2022 at 10:14:03PM +0900, Naoya Horiguchi wrote:
> > > Hi,
> > > 
> > > I'd like to request the follow commits to be backported to 5.15.y.
> > > 
> > > - dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> > > - 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> > > - a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> > > 
> > > These patches fixed a data lost issue by preventing shmem pagecache from
> > > being removed by memory error.  These were not tagged for stable originally,
> > > but that's revisited recently.
> > 
> > And have you tested that these all apply properly (and in which order?)
> 
> Yes, I've checked that these cleanly apply (without any change) on
> 5.15.78 in the above order (i.e. dd0f23 is first, 496645 comes next,
> then a76054).
> 
> > and work correctly?
> 
> Yes, I ran related testcases in my test suite, and their status changed
> FAIL to PASS with these patches.

Great, all now queued up, thanks.

greg k-h

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: hwpoison, shmem: fix data lost issue for 5.15.y
  2022-11-16  3:48             ` Mike Kravetz
@ 2022-11-16 23:58               ` Naoya Horiguchi
  0 siblings, 0 replies; 13+ messages in thread
From: Naoya Horiguchi @ 2022-11-16 23:58 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Greg KH, stable, Yang Shi, James Houghton, Muchun Song,
	Miaohe Lin, Andrew Morton

On Tue, Nov 15, 2022 at 07:48:42PM -0800, Mike Kravetz wrote:
> On 11/15/22 15:39, Naoya Horiguchi wrote:
> > On Mon, Nov 14, 2022 at 05:30:29PM -0800, Mike Kravetz wrote:
> > > On 11/15/22 01:16, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > > On Mon, Nov 14, 2022 at 02:53:51PM -0800, Mike Kravetz wrote:
> > > > > On 11/15/22 07:39, Naoya Horiguchi wrote:
> > > > > > On Mon, Nov 14, 2022 at 05:11:35PM +0100, Greg KH wrote:
> > > > > > > On Mon, Nov 14, 2022 at 10:14:03PM +0900, Naoya Horiguchi wrote:
> > > > > > > > Hi,
> > > > > > > > 
> > > > > > > > I'd like to request the follow commits to be backported to 5.15.y.
> > > > > > > > 
> > > > > > > > - dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> > > > > > > > - 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> > > > > > > > - a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> > > > > > > > 
> > > > > > > > These patches fixed a data lost issue by preventing shmem pagecache from
> > > > > > > > being removed by memory error.  These were not tagged for stable originally,
> > > > > > > > but that's revisited recently.
> > > > > > > 
> > > > > > > And have you tested that these all apply properly (and in which order?)
> > > > > > 
> > > > > > Yes, I've checked that these cleanly apply (without any change) on
> > > > > > 5.15.78 in the above order (i.e. dd0f23 is first, 496645 comes next,
> > > > > > then a76054).
> > > > > > 
> > > > > > > and work correctly?
> > > > > > 
> > > > > > Yes, I ran related testcases in my test suite, and their status changed
> > > > > > FAIL to PASS with these patches.
> > > > > 
> > > > > Hi Naoya,
> > > > > 
> > > > > Just curious if you have plans to do backports for earlier releases?
> > > > 
> > > > I didn't have a clear plan.  I just thought that we should backport to
> > > > earlier kernels if someone want and the patches are applicable easily
> > > > enough and well-tested.
> > > > 
> > > > > 
> > > > > If not, I can start that effort.  We have seen data loss/corruption because of
> > > > > this on a 4.14 based release.   So, I would go at least that far back.
> > > > 
> > > > Thank you for raising hand, that's really helpful.
> > > > 
> > > > Maybe dd0f230a0a80 ("[PATCH] hugetlbfs: don't delete error page from
> > 
> > # I meant 8625147cafaa, sorry if the wrong commit ID confused you.
> > 
> > I tested with 8625147cafaa too, and it made hugetlb-related testcases
> > passed.
> <snip>
> > We need to slightly modify 8625147cafaa to apply to 5.15.y.  So in summary,
> > my updated suggestion for 5.15.y is like below:
> > 
> > - [1/4] cherry-pick dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> > - [2/4] cherry-pick 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> > - [3/4] cherry-pick a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> > - [4/4] apply the following patch (as a modified version of 8625147cafaa)
> 
> Hi Naoya,
> 
> Just curious, do you have automated tests for this?  I wanted test backports
> to each stable release.  I could manually test, but that would be a bit
> involved and was hoping you had something automated.

Yes, related testcases are available on https://github.com/nhoriguchi/mm_regression.
You can run them by the following steps:

  $ git clone https://github.com/nhoriguchi/mm_regression.
  $ cd mm_regression

  # Check that your testing server meets the prerequisite
  # https://github.com/nhoriguchi/mm_regression#prerequisite    

  $ make
  ...
  # Compiler might show errors but that's OK because all
  # files are not needed to run relevant testcases.

  $ bash run.sh prepare debug

  # List the testcases in work/debug/recipelist like below:

  $ cat work/debug/recipelist
  mm/hwpoison/shmem_link/link-hard.auto3
  mm/hwpoison/shmem_link/link-sym.auto3
  mm/hwpoison/shmem_rw/thp-always.auto3
  mm/hwpoison/shmem_rw/thp-never.auto3

  $ bash run.sh project run

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: hwpoison, shmem: fix data lost issue for 5.15.y
  2022-11-15  1:16       ` HORIGUCHI NAOYA(堀口 直也)
  2022-11-15  1:30         ` Mike Kravetz
@ 2022-11-23  1:05         ` Mike Kravetz
  2022-11-23 18:54           ` Yang Shi
  1 sibling, 1 reply; 13+ messages in thread
From: Mike Kravetz @ 2022-11-23  1:05 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, Greg KH, stable, Yang Shi, James Houghton,
	Muchun Song, Miaohe Lin, Andrew Morton

On 11/15/22 01:16, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Nov 14, 2022 at 02:53:51PM -0800, Mike Kravetz wrote:
> > On 11/15/22 07:39, Naoya Horiguchi wrote:
> > > On Mon, Nov 14, 2022 at 05:11:35PM +0100, Greg KH wrote:
> > > > On Mon, Nov 14, 2022 at 10:14:03PM +0900, Naoya Horiguchi wrote:
> > > > > Hi,
> > > > > 
> > > > > I'd like to request the follow commits to be backported to 5.15.y.
> > > > > 
> > > > > - dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> > > > > - 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> > > > > - a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> > > > > 
> > > > > These patches fixed a data lost issue by preventing shmem pagecache from
> > > > > being removed by memory error.  These were not tagged for stable originally,
> > > > > but that's revisited recently.
> > > > 
> > > > And have you tested that these all apply properly (and in which order?)
> > > 
> > > Yes, I've checked that these cleanly apply (without any change) on
> > > 5.15.78 in the above order (i.e. dd0f23 is first, 496645 comes next,
> > > then a76054).
> > > 
> > > > and work correctly?
> > > 
> > > Yes, I ran related testcases in my test suite, and their status changed
> > > FAIL to PASS with these patches.
> > 
> > Hi Naoya,
> > 
> > Just curious if you have plans to do backports for earlier releases?
> 
> I didn't have a clear plan.  I just thought that we should backport to
> earlier kernels if someone want and the patches are applicable easily
> enough and well-tested.
> 
> > 
> > If not, I can start that effort.  We have seen data loss/corruption because of
> > this on a 4.14 based release.   So, I would go at least that far back.
> 
> Thank you for raising hand, that's really helpful.
> 
> Maybe dd0f230a0a80 ("[PATCH] hugetlbfs: don't delete error page from
> pagecbache") should be considered to backport together, because it's
> the similar issue and reported (a while ago) to fail to backport.
> dd0f230a0a80 does not apply cleanly on top of 5.15.78 + the above 3 patches.
> So I need check more and will update my current proposal for 5.15.y.

When working with 5.10.y, I noticed that commit eac96c3efdb5 ("mm: filemap:
check if THP has hwpoisoned subpage for PMD page fault") as well as the
prereq commit c7cb42e94473 ("mm: hwpoison: remove the unnecessary THP check")
were not backported to 5.10.y.  Without those patches, THP testing will
fail.

Naoya and Yang Shi, does that sound right?

I have backports for those as well but want to check if you think
anything else is needed.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: hwpoison, shmem: fix data lost issue for 5.15.y
  2022-11-23  1:05         ` Mike Kravetz
@ 2022-11-23 18:54           ` Yang Shi
  2022-11-24  3:03             ` Naoya Horiguchi
  0 siblings, 1 reply; 13+ messages in thread
From: Yang Shi @ 2022-11-23 18:54 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Naoya Horiguchi, Greg KH, stable, James Houghton, Muchun Song,
	Miaohe Lin, Andrew Morton

On Tue, Nov 22, 2022 at 5:05 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
>
> On 11/15/22 01:16, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Mon, Nov 14, 2022 at 02:53:51PM -0800, Mike Kravetz wrote:
> > > On 11/15/22 07:39, Naoya Horiguchi wrote:
> > > > On Mon, Nov 14, 2022 at 05:11:35PM +0100, Greg KH wrote:
> > > > > On Mon, Nov 14, 2022 at 10:14:03PM +0900, Naoya Horiguchi wrote:
> > > > > > Hi,
> > > > > >
> > > > > > I'd like to request the follow commits to be backported to 5.15.y.
> > > > > >
> > > > > > - dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> > > > > > - 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> > > > > > - a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> > > > > >
> > > > > > These patches fixed a data lost issue by preventing shmem pagecache from
> > > > > > being removed by memory error.  These were not tagged for stable originally,
> > > > > > but that's revisited recently.
> > > > >
> > > > > And have you tested that these all apply properly (and in which order?)
> > > >
> > > > Yes, I've checked that these cleanly apply (without any change) on
> > > > 5.15.78 in the above order (i.e. dd0f23 is first, 496645 comes next,
> > > > then a76054).
> > > >
> > > > > and work correctly?
> > > >
> > > > Yes, I ran related testcases in my test suite, and their status changed
> > > > FAIL to PASS with these patches.
> > >
> > > Hi Naoya,
> > >
> > > Just curious if you have plans to do backports for earlier releases?
> >
> > I didn't have a clear plan.  I just thought that we should backport to
> > earlier kernels if someone want and the patches are applicable easily
> > enough and well-tested.
> >
> > >
> > > If not, I can start that effort.  We have seen data loss/corruption because of
> > > this on a 4.14 based release.   So, I would go at least that far back.
> >
> > Thank you for raising hand, that's really helpful.
> >
> > Maybe dd0f230a0a80 ("[PATCH] hugetlbfs: don't delete error page from
> > pagecbache") should be considered to backport together, because it's
> > the similar issue and reported (a while ago) to fail to backport.
> > dd0f230a0a80 does not apply cleanly on top of 5.15.78 + the above 3 patches.
> > So I need check more and will update my current proposal for 5.15.y.
>
> When working with 5.10.y, I noticed that commit eac96c3efdb5 ("mm: filemap:
> check if THP has hwpoisoned subpage for PMD page fault") as well as the
> prereq commit c7cb42e94473 ("mm: hwpoison: remove the unnecessary THP check")
> were not backported to 5.10.y.  Without those patches, THP testing will
> fail.
>
> Naoya and Yang Shi, does that sound right?

Yes, since the hwpoisoned THP will be kept in page cache so the page
fault may happen on it again, without that commit the page fault won't
return -EHWPOISON if I remember correctly.

>
> I have backports for those as well but want to check if you think
> anything else is needed.

Thanks for backporting them. No more fix is needed AFAICT.

> --
> Mike Kravetz

^ permalink raw reply	[flat|nested] 13+ messages in thread

* Re: hwpoison, shmem: fix data lost issue for 5.15.y
  2022-11-23 18:54           ` Yang Shi
@ 2022-11-24  3:03             ` Naoya Horiguchi
  0 siblings, 0 replies; 13+ messages in thread
From: Naoya Horiguchi @ 2022-11-24  3:03 UTC (permalink / raw)
  To: Yang Shi
  Cc: Mike Kravetz, HORIGUCHI NAOYA(堀口 直也),
	Greg KH, stable, James Houghton, Muchun Song, Miaohe Lin,
	Andrew Morton

On Wed, Nov 23, 2022 at 10:54:15AM -0800, Yang Shi wrote:
> On Tue, Nov 22, 2022 at 5:05 PM Mike Kravetz <mike.kravetz@oracle.com> wrote:
> >
> > On 11/15/22 01:16, HORIGUCHI NAOYA(堀口 直也) wrote:
> > > On Mon, Nov 14, 2022 at 02:53:51PM -0800, Mike Kravetz wrote:
> > > > On 11/15/22 07:39, Naoya Horiguchi wrote:
> > > > > On Mon, Nov 14, 2022 at 05:11:35PM +0100, Greg KH wrote:
> > > > > > On Mon, Nov 14, 2022 at 10:14:03PM +0900, Naoya Horiguchi wrote:
> > > > > > > Hi,
> > > > > > >
> > > > > > > I'd like to request the follow commits to be backported to 5.15.y.
> > > > > > >
> > > > > > > - dd0f230a0a80 ("mm: hwpoison: refactor refcount check handling")
> > > > > > > - 4966455d9100 ("mm: hwpoison: handle non-anonymous THP correctly")
> > > > > > > - a76054266661 ("mm: shmem: don't truncate page if memory failure happens")
> > > > > > >
> > > > > > > These patches fixed a data lost issue by preventing shmem pagecache from
> > > > > > > being removed by memory error.  These were not tagged for stable originally,
> > > > > > > but that's revisited recently.
> > > > > >
> > > > > > And have you tested that these all apply properly (and in which order?)
> > > > >
> > > > > Yes, I've checked that these cleanly apply (without any change) on
> > > > > 5.15.78 in the above order (i.e. dd0f23 is first, 496645 comes next,
> > > > > then a76054).
> > > > >
> > > > > > and work correctly?
> > > > >
> > > > > Yes, I ran related testcases in my test suite, and their status changed
> > > > > FAIL to PASS with these patches.
> > > >
> > > > Hi Naoya,
> > > >
> > > > Just curious if you have plans to do backports for earlier releases?
> > >
> > > I didn't have a clear plan.  I just thought that we should backport to
> > > earlier kernels if someone want and the patches are applicable easily
> > > enough and well-tested.
> > >
> > > >
> > > > If not, I can start that effort.  We have seen data loss/corruption because of
> > > > this on a 4.14 based release.   So, I would go at least that far back.
> > >
> > > Thank you for raising hand, that's really helpful.
> > >
> > > Maybe dd0f230a0a80 ("[PATCH] hugetlbfs: don't delete error page from
> > > pagecbache") should be considered to backport together, because it's
> > > the similar issue and reported (a while ago) to fail to backport.
> > > dd0f230a0a80 does not apply cleanly on top of 5.15.78 + the above 3 patches.
> > > So I need check more and will update my current proposal for 5.15.y.
> >
> > When working with 5.10.y, I noticed that commit eac96c3efdb5 ("mm: filemap:
> > check if THP has hwpoisoned subpage for PMD page fault") as well as the
> > prereq commit c7cb42e94473 ("mm: hwpoison: remove the unnecessary THP check")
> > were not backported to 5.10.y.  Without those patches, THP testing will
> > fail.
> >
> > Naoya and Yang Shi, does that sound right?
>
> Yes, since the hwpoisoned THP will be kept in page cache so the page
> fault may happen on it again, without that commit the page fault won't
> return -EHWPOISON if I remember correctly.
>
> >
> > I have backports for those as well but want to check if you think
> > anything else is needed.
>
> Thanks for backporting them. No more fix is needed AFAICT.

I agree with Yang.  There seems no other commit related to current
pagecache problem but not backported yet.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 13+ messages in thread

end of thread, other threads:[~2022-11-24  3:03 UTC | newest]

Thread overview: 13+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-11-14 13:14 hwpoison, shmem: fix data lost issue for 5.15.y Naoya Horiguchi
2022-11-14 16:11 ` Greg KH
2022-11-14 22:39   ` Naoya Horiguchi
2022-11-14 22:53     ` Mike Kravetz
2022-11-15  1:16       ` HORIGUCHI NAOYA(堀口 直也)
2022-11-15  1:30         ` Mike Kravetz
2022-11-15  6:39           ` Naoya Horiguchi
2022-11-16  3:48             ` Mike Kravetz
2022-11-16 23:58               ` Naoya Horiguchi
2022-11-23  1:05         ` Mike Kravetz
2022-11-23 18:54           ` Yang Shi
2022-11-24  3:03             ` Naoya Horiguchi
2022-11-16  9:16     ` Greg KH

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).