linux-kernel.vger.kernel.org archive mirror
* [PATCH 0/4] A few fixup patches for memory failure
@ 2022-02-28 14:02 Miaohe Lin
  2022-02-28 14:02 ` [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again Miaohe Lin
                   ` (3 more replies)
  0 siblings, 4 replies; 27+ messages in thread
From: Miaohe Lin @ 2022-02-28 14:02 UTC (permalink / raw)
  To: akpm, naoya.horiguchi; +Cc: linux-mm, linux-kernel, linmiaohe

Hi everyone,
This series contains a few patches to fix the race with a page changing
compound pages, a potential VM_BUG_ON_PAGE, and so on. More details can
be found in the respective changelogs. Thanks!

Miaohe Lin (4):
  mm/memory-failure.c: fix race with changing page compound again
  mm/memory-failure.c: fix wrong user reference report
  mm/memory-failure.c: avoid calling invalidate_inode_page() with
    unexpected pages
  mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in
    split_huge_page_to_list

 mm/memory-failure.c | 33 +++++++++++++++++++++++++++------
 1 file changed, 27 insertions(+), 6 deletions(-)

-- 
2.23.0



* [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again
  2022-02-28 14:02 [PATCH 0/4] A few fixup patches for memory failure Miaohe Lin
@ 2022-02-28 14:02 ` Miaohe Lin
  2022-03-04  8:26   ` HORIGUCHI NAOYA(堀口 直也)
  2022-02-28 14:02 ` [PATCH 2/4] mm/memory-failure.c: fix wrong user reference report Miaohe Lin
                   ` (2 subsequent siblings)
  3 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2022-02-28 14:02 UTC (permalink / raw)
  To: akpm, naoya.horiguchi; +Cc: linux-mm, linux-kernel, linmiaohe

There is a race window: after we get the compound_head, the hugetlb page
could be freed to buddy, or even changed into another compound page, just
before we try to grab the hwpoison page. If this happens, just bail out.

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memory-failure.c | 11 +++++++++++
 1 file changed, 11 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5444a8ef4867..0d7c58340a98 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1534,6 +1534,17 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
 	}
 
 	lock_page(head);
+
+	/*
+	 * The page could have changed compound pages due to the race
+	 * window. If this happens, just bail out.
+	 */
+	if (!PageHuge(p) || compound_head(p) != head) {
+		action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
+		res = -EBUSY;
+		goto out;
+	}
+
 	page_flags = head->flags;
 
 	if (hwpoison_filter(p)) {
-- 
2.23.0



* [PATCH 2/4] mm/memory-failure.c: fix wrong user reference report
  2022-02-28 14:02 [PATCH 0/4] A few fixup patches for memory failure Miaohe Lin
  2022-02-28 14:02 ` [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again Miaohe Lin
@ 2022-02-28 14:02 ` Miaohe Lin
  2022-03-04  8:27   ` HORIGUCHI NAOYA(堀口 直也)
  2022-02-28 14:02 ` [PATCH 3/4] mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pages Miaohe Lin
  2022-02-28 14:02 ` [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list Miaohe Lin
  3 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2022-02-28 14:02 UTC (permalink / raw)
  To: akpm, naoya.horiguchi; +Cc: linux-mm, linux-kernel, linmiaohe

A dirty swapcache page still resides in the swap cache after it's
hwpoisoned, so there is always one extra refcount held by the swap cache.
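
For context, the check this patch feeds works roughly as follows (a
paraphrased sketch of has_extra_refcount(), not the verbatim kernel
source):

  static bool has_extra_refcount(struct page_state *ps, struct page *p,
				 bool extra_pins)
  {
	int count = page_count(p) - 1;	/* memory_failure()'s own ref */

	if (extra_pins)
		count -= 1;	/* one expected pin, here: swap cache */

	if (count > 0) {
		pr_err("Memory failure: %#lx: %s still referenced by %d users\n",
		       page_to_pfn(p), action_page_types[ps->type], count);
		return true;
	}
	return false;
  }

Passing extra_pins unconditionally tells the check to always expect the
swap cache reference on top of memory_failure()'s own one.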

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memory-failure.c | 6 +-----
 1 file changed, 1 insertion(+), 5 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 0d7c58340a98..5f9503573263 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -984,7 +984,6 @@ static int me_pagecache_dirty(struct page_state *ps, struct page *p)
 static int me_swapcache_dirty(struct page_state *ps, struct page *p)
 {
 	int ret;
-	bool extra_pins = false;
 
 	ClearPageDirty(p);
 	/* Trigger EIO in shmem: */
@@ -993,10 +992,7 @@ static int me_swapcache_dirty(struct page_state *ps, struct page *p)
 	ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
 	unlock_page(p);
 
-	if (ret == MF_DELAYED)
-		extra_pins = true;
-
-	if (has_extra_refcount(ps, p, extra_pins))
+	if (has_extra_refcount(ps, p, true))
 		ret = MF_FAILED;
 
 	return ret;
-- 
2.23.0



* [PATCH 3/4] mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pages
  2022-02-28 14:02 [PATCH 0/4] A few fixup patches for memory failure Miaohe Lin
  2022-02-28 14:02 ` [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again Miaohe Lin
  2022-02-28 14:02 ` [PATCH 2/4] mm/memory-failure.c: fix wrong user reference report Miaohe Lin
@ 2022-02-28 14:02 ` Miaohe Lin
  2022-02-28 14:02 ` [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list Miaohe Lin
  3 siblings, 0 replies; 27+ messages in thread
From: Miaohe Lin @ 2022-02-28 14:02 UTC (permalink / raw)
  To: akpm, naoya.horiguchi; +Cc: linux-mm, linux-kernel, linmiaohe

Since commit 042c4f32323b ("mm/truncate: Inline invalidate_complete_page()
into its one caller"), invalidate_inode_page() can invalidate pages in the
swap cache, because the check of page->mapping != mapping was removed. But
invalidate_inode_page() is not expected to deal with pages in the swap
cache. Non-lru movable pages can reach here too, and they are not page
cache pages either. Skip these pages by checking PageSwapCache and PageLRU.
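
A minimal sketch of the guard that commit 042c4f32323b dropped
(paraphrased from the old invalidate_complete_page(); not verbatim). A
swap cache page's page->mapping does not match the inode's mapping, so
it used to fail this test and was never invalidated:

  static int invalidate_complete_page(struct address_space *mapping,
				      struct page *page)
  {
	/* Old guard: swap cache (and foreign) pages bailed out here. */
	if (page->mapping != mapping)
		return 0;

	if (page_has_private(page) && !try_to_release_page(page, 0))
		return 0;

	return remove_mapping(mapping, page);
  }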

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memory-failure.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5f9503573263..23bfd809dc8c 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -2185,7 +2185,7 @@ static int __soft_offline_page(struct page *page)
 		return 0;
 	}
 
-	if (!PageHuge(page))
+	if (!PageHuge(page) && PageLRU(page) && !PageSwapCache(page))
 		/*
 		 * Try to invalidate first. This should work for
 		 * non dirty unmapped page cache pages.
-- 
2.23.0



* [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list
  2022-02-28 14:02 [PATCH 0/4] A few fixup patches for memory failure Miaohe Lin
                   ` (2 preceding siblings ...)
  2022-02-28 14:02 ` [PATCH 3/4] mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pages Miaohe Lin
@ 2022-02-28 14:02 ` Miaohe Lin
  2022-03-04  8:28   ` HORIGUCHI NAOYA(堀口 直也)
  3 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2022-02-28 14:02 UTC (permalink / raw)
  To: akpm, naoya.horiguchi; +Cc: linux-mm, linux-kernel, linmiaohe

The huge zero page could reach here, and if we ever try to split it,
VM_BUG_ON_PAGE will be triggered in split_huge_page_to_list(). Also,
non-lru compound movable pages could be mistaken for transhuge pages.
Skip these pages by checking PageLRU, because neither the huge zero page
nor non-lru compound movable pages are LRU pages.
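
For reference, the assertion at stake looks roughly like this
(paraphrased from the top of split_huge_page_to_list(); not verbatim):

  int split_huge_page_to_list(struct page *page, struct list_head *list)
  {
	struct page *head = compound_head(page);

	/* The huge zero page must never be split. */
	VM_BUG_ON_PAGE(is_huge_zero_page(head), head);
	VM_BUG_ON_PAGE(!PageLocked(head), head);
	/* ... */
  }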

Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
---
 mm/memory-failure.c | 14 ++++++++++++++
 1 file changed, 14 insertions(+)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 23bfd809dc8c..ac6492e36978 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1792,6 +1792,20 @@ int memory_failure(unsigned long pfn, int flags)
 	}
 
 	if (PageTransHuge(hpage)) {
+		/*
+		 * Non-lru compound movable pages could be mistaken for
+		 * transhuge pages. Also the huge zero page could reach
+		 * here, and if we ever try to split it, VM_BUG_ON_PAGE
+		 * will be triggered in split_huge_page_to_list(). Skip
+		 * these pages by checking PageLRU: neither the huge
+		 * zero page nor non-lru compound movable pages are LRU.
+		 */
+		if (!PageLRU(hpage)) {
+			put_page(p);
+			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
+			res = -EBUSY;
+			goto unlock_mutex;
+		}
 		/*
 		 * The flag must be set after the refcount is bumped
 		 * otherwise it may race with THP split.
-- 
2.23.0



* Re: [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again
  2022-02-28 14:02 ` [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again Miaohe Lin
@ 2022-03-04  8:26   ` HORIGUCHI NAOYA(堀口 直也)
  2022-03-04 19:32     ` Mike Kravetz
  0 siblings, 1 reply; 27+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-03-04  8:26 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: akpm, linux-mm, linux-kernel

On Mon, Feb 28, 2022 at 10:02:42PM +0800, Miaohe Lin wrote:
> There is a race window where we got the compound_head, the hugetlb page
> could be freed to buddy, or even changed to another compound page just
> before we try to get hwpoison page. If this happens, just bail out.

I think that when some hugetlb page is about to change into another type/size
of compound page, it has to go through the buddy allocator, because hugetlb
pages are maintained in a separate memory allocator and never change into
another normal state directly.  memory_failure_hugetlb() takes a refcount
before lock_page(), so the hugetlb page seems not to change between
get_hwpoison_page() and lock_page(). So is this new check really necessary?

Thanks,
Naoya Horiguchi

> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/memory-failure.c | 11 +++++++++++
>  1 file changed, 11 insertions(+)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5444a8ef4867..0d7c58340a98 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1534,6 +1534,17 @@ static int memory_failure_hugetlb(unsigned long pfn, int flags)
>  	}
>  
>  	lock_page(head);
> +
> +	/**
> +	 * The page could have changed compound pages due to race window.
> +	 * If this happens just bail out.
> +	 */
> +	if (!PageHuge(p) || compound_head(p) != head) {
> +		action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
> +		res = -EBUSY;
> +		goto out;
> +	}
> +
>  	page_flags = head->flags;
>  
>  	if (hwpoison_filter(p)) {
> -- 
> 2.23.0


* Re: [PATCH 2/4] mm/memory-failure.c: fix wrong user reference report
  2022-02-28 14:02 ` [PATCH 2/4] mm/memory-failure.c: fix wrong user reference report Miaohe Lin
@ 2022-03-04  8:27   ` HORIGUCHI NAOYA(堀口 直也)
  2022-03-07 11:26     ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-03-04  8:27 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: akpm, linux-mm, linux-kernel

On Mon, Feb 28, 2022 at 10:02:43PM +0800, Miaohe Lin wrote:
> The dirty swapcache page is still residing in the swap cache after it's
> hwpoisoned. So there is always one extra refcount for swap cache.

The diff seems fine at a glance, but let me ask a few questions to
understand the issue more.

- Is the behavior described above an effect of the recent change on shmem
  where a dirty pagecache page is pinned on hwpoison (commit a76054266661
  ("mm: shmem: don't truncate page if memory failure happens"))? Or do
  older kernels behave the same?

- Is the behavior true for normal anonymous pages (not shmem pages)?

I'm trying to test hwpoison hitting a dirty swapcache page, but in my
testing memory_failure() fails with a "hwpoison: unhandlable page"
warning at get_any_page(). So I'm still not sure that me_pagecache_dirty()
fixes any visible problem.

Thanks,
Naoya Horiguchi

> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/memory-failure.c | 6 +-----
>  1 file changed, 1 insertion(+), 5 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 0d7c58340a98..5f9503573263 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -984,7 +984,6 @@ static int me_pagecache_dirty(struct page_state *ps, struct page *p)
>  static int me_swapcache_dirty(struct page_state *ps, struct page *p)
>  {
>  	int ret;
> -	bool extra_pins = false;
>  
>  	ClearPageDirty(p);
>  	/* Trigger EIO in shmem: */
> @@ -993,10 +992,7 @@ static int me_swapcache_dirty(struct page_state *ps, struct page *p)
>  	ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
>  	unlock_page(p);
>  
> -	if (ret == MF_DELAYED)
> -		extra_pins = true;
> -
> -	if (has_extra_refcount(ps, p, extra_pins))
> +	if (has_extra_refcount(ps, p, true))
>  		ret = MF_FAILED;
>  
>  	return ret;
> -- 
> 2.23.0


* Re: [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list
  2022-02-28 14:02 ` [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list Miaohe Lin
@ 2022-03-04  8:28   ` HORIGUCHI NAOYA(堀口 直也)
  2022-03-07  7:07     ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-03-04  8:28 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: akpm, linux-mm, linux-kernel

On Mon, Feb 28, 2022 at 10:02:45PM +0800, Miaohe Lin wrote:
> The huge zero page could reach here and if we ever try to split it, the
> VM_BUG_ON_PAGE will be triggered in split_huge_page_to_list(). Also the
> non-lru compound movable pages could be taken for transhuge pages. Skip
> these pages by checking PageLRU because huge zero page isn't lru page as
> non-lru compound movable pages.

It seems that memory_failure() also fails at get_any_page() with "hwpoison:
unhandlable page" message.

  [16478.203474] page:00000000b6acdbd1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1810b4
  [16478.206612] flags: 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
  [16478.209411] raw: 0057ffffc0801000 fffff11bc6042d08 fffff11bc6042d08 0000000000000000
  [16478.211921] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
  [16478.214473] page dumped because: hwpoison: unhandlable page
  [16478.216386] Memory failure: 0x1810b4: recovery action for unknown page: Ignored

We can't handle errors on huge (or normal) zero page, so the current
behavior seems to me more suitable than "unsplit thp".

Or if you have some producer to reach the following path with huge zero
page, could you share it?

Thanks,
Naoya Horiguchi

> 
> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> ---
>  mm/memory-failure.c | 14 ++++++++++++++
>  1 file changed, 14 insertions(+)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 23bfd809dc8c..ac6492e36978 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1792,6 +1792,20 @@ int memory_failure(unsigned long pfn, int flags)
>  	}
>  
>  	if (PageTransHuge(hpage)) {
> +		/*
> +		 * The non-lru compound movable pages could be taken for
> +		 * transhuge pages. Also huge zero page could reach here
> +		 * and if we ever try to split it, the VM_BUG_ON_PAGE will
> +		 * be triggered in split_huge_page_to_list(). Skip these
> +		 * pages by checking PageLRU because huge zero page isn't
> +		 * lru page as non-lru compound movable pages.
> +		 */
> +		if (!PageLRU(hpage)) {
> +			put_page(p);
> +			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> +			res = -EBUSY;
> +			goto unlock_mutex;
> +		}
>  		/*
>  		 * The flag must be set after the refcount is bumped
>  		 * otherwise it may race with THP split.
> -- 
> 2.23.0


* Re: [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again
  2022-03-04  8:26   ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-03-04 19:32     ` Mike Kravetz
  2022-03-07  3:44       ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Kravetz @ 2022-03-04 19:32 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也), Miaohe Lin
  Cc: akpm, linux-mm, linux-kernel

On 3/4/22 00:26, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Feb 28, 2022 at 10:02:42PM +0800, Miaohe Lin wrote:
>> There is a race window where we got the compound_head, the hugetlb page
>> could be freed to buddy, or even changed to another compound page just
>> before we try to get hwpoison page. If this happens, just bail out.
> 
> I think that when some hugetlb page is about to change into other type/size
> of compound page, it has to go through buddy allocator because hugetlb pages
> are maintained in separate memory allocator and they never change into other
> normal state directly.  memory_failure_hugetlb() takes refcount before
> lock_page(), so the hugetlb page seems not change between get_hwpoison_page()
> and lock_page(). So it this new check really necessary?

A hugetlb page could change size without going through buddy via the new
demote functionality [1].  Only hugetlb pages on the hugetlb free list can
be demoted.  

We should not demote a page if poison is set.  However, there is no check in
the demote code.  IIUC, poison is set early in the memory error handling
process, even before taking a ref on the page.  The demote code needs to be
fixed so that poisoned pages are not demoted.  I can do that.
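
A minimal sketch of such a check, assuming it sits in the free-list scan
that picks a demotion candidate (placement and surrounding code are
assumptions, not the actual fix):

  /* Hypothetical: skip hwpoisoned pages when choosing a page to demote. */
  list_for_each_entry(page, &h->hugepage_freelists[nid], lru) {
	if (PageHWPoison(page))
		continue;	/* never demote a poisoned page */
	return demote_free_huge_page(h, page);
  }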

With this change in place, I think Naoya's statement that hugetlb pages
cannot change state is correct, and this patch is not necessary.

Does that sound reasonable?

[1] https://lore.kernel.org/linux-mm/20211007181918.136982-1-mike.kravetz@oracle.com/
-- 
Mike Kravetz


* Re: [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again
  2022-03-04 19:32     ` Mike Kravetz
@ 2022-03-07  3:44       ` Miaohe Lin
  2022-03-07  7:01         ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2022-03-07  3:44 UTC (permalink / raw)
  To: Mike Kravetz, HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, linux-mm, linux-kernel

On 2022/3/5 3:32, Mike Kravetz wrote:
> On 3/4/22 00:26, HORIGUCHI NAOYA(堀口 直也) wrote:
>> On Mon, Feb 28, 2022 at 10:02:42PM +0800, Miaohe Lin wrote:
>>> There is a race window where we got the compound_head, the hugetlb page
>>> could be freed to buddy, or even changed to another compound page just
>>> before we try to get hwpoison page. If this happens, just bail out.
>>
>> I think that when some hugetlb page is about to change into other type/size
>> of compound page, it has to go through buddy allocator because hugetlb pages
>> are maintained in separate memory allocator and they never change into other
>> normal state directly.  memory_failure_hugetlb() takes refcount before
>> lock_page(), so the hugetlb page seems not change between get_hwpoison_page()
>> and lock_page(). So it this new check really necessary?
> 
> A hugetlb page could change size without going through buddy via the new
> demote functionality [1].  Only hugetlb pages on the hugetlb free list can
> be demoted.  
> 
> We should not demote a page if poison is set.  However, there is no check in
> the demote code.  IIUC, poison is set early in the memory error handling
> process, even before taking ref on page.  Demote code needs to be fixed so
> that poisoned pages are not demoted.  I can do that.
> 
> With this change in place, then I think Naoya's statement that hugetlb pages
> can not change state is correct and this patch is not necessary.
> 

Sorry for my confusing commit message. What I meant to describe is the race below:
  CPU 1							CPU 2
  memory_failure_hugetlb
  struct page *head = compound_head(p);
							hugetlb page is freed to buddy, or
							even changed to another compound page
							as we haven't held the page refcnt now
  get_hwpoison_page -- page is not what we want now...

Does this make sense to both of you? Many thanks for the comments and replies! :)

> Does that sound reasonable?
> 
> [1] https://lore.kernel.org/linux-mm/20211007181918.136982-1-mike.kravetz@oracle.com/

This is really a nice feature. Thanks.

> 


* Re: [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again
  2022-03-07  3:44       ` Miaohe Lin
@ 2022-03-07  7:01         ` HORIGUCHI NAOYA(堀口 直也)
  2022-03-07 19:07           ` Mike Kravetz
  0 siblings, 1 reply; 27+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-03-07  7:01 UTC (permalink / raw)
  To: Miaohe Lin; +Cc: Mike Kravetz, akpm, linux-mm, linux-kernel

On Mon, Mar 07, 2022 at 11:44:20AM +0800, Miaohe Lin wrote:
> On 2022/3/5 3:32, Mike Kravetz wrote:
> > On 3/4/22 00:26, HORIGUCHI NAOYA(堀口 直也) wrote:
> >> On Mon, Feb 28, 2022 at 10:02:42PM +0800, Miaohe Lin wrote:
> >>> There is a race window where we got the compound_head, the hugetlb page
> >>> could be freed to buddy, or even changed to another compound page just
> >>> before we try to get hwpoison page. If this happens, just bail out.
> >>
> >> I think that when some hugetlb page is about to change into other type/size
> >> of compound page, it has to go through buddy allocator because hugetlb pages
> >> are maintained in separate memory allocator and they never change into other
> >> normal state directly.  memory_failure_hugetlb() takes refcount before
> >> lock_page(), so the hugetlb page seems not change between get_hwpoison_page()
> >> and lock_page(). So it this new check really necessary?
> > 
> > A hugetlb page could change size without going through buddy via the new
> > demote functionality [1].  Only hugetlb pages on the hugetlb free list can
> > be demoted.  
> > 
> > We should not demote a page if poison is set.  However, there is no check in
> > the demote code.  IIUC, poison is set early in the memory error handling
> > process, even before taking ref on page.  Demote code needs to be fixed so
> > that poisoned pages are not demoted.  I can do that.
> > 
> > With this change in place, then I think Naoya's statement that hugetlb pages
> > can not change state is correct and this patch is not necessary.
> > 
> 
> Sorry for my confusing commit words. What I mean to tell is indeed the below race:
>   CPU 1							CPU 2
>   memory_failure_hugetlb
>   struct page *head = compound_head(p);
> 							hugetlb page is freed to buddy, or
> 							even changed to another compound page
> 							as we haven't held the page refcnt now
>   get_hwpoison_page -- page is not what we want now...
> 
> Does this make sense for both of you? Many thanks for comment and reply! :)

Thanks for the elaboration, I agree with you (I simply overlooked this race, sorry).
Please add this to the commit log.

> +
> +       /**
> +        * The page could have changed compound pages due to race window.
> +        * If this happens just bail out.
> +        */
> +       if (!PageHuge(p) || compound_head(p) != head) {
> +               action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
> +               res = -EBUSY;
> +               goto out;
> +       }

Let me make one comment on the diff. The result code MF_MSG_DIFFERENT_COMPOUND
might not fit when PageHuge is false in the check (because the page is no
longer a compound page).  Maybe you could invent another result code, or change
MF_MSG_DIFFERENT_COMPOUND (for example) to MF_MSG_DIFFERENT_PAGE_SIZE?
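
For reference, inventing a result code is a small, two-part change
(sketch only; the new name is hypothetical):

  /* include/linux/mm.h: extend the action type enum (sketch) */
  enum mf_action_page_type {
	/* ... */
	MF_MSG_DIFFERENT_PAGE_SIZE,	/* hypothetical new code */
	/* ... */
  };

  /* mm/memory-failure.c: add its string to action_page_types[] (sketch) */
  [MF_MSG_DIFFERENT_PAGE_SIZE] = "page size changed under us",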

> 
> > Does that sound reasonable?
> > 
> > [1] https://lore.kernel.org/linux-mm/20211007181918.136982-1-mike.kravetz@oracle.com/
> 
> This is really a nice feature. Thanks.

Yes. Hugepage demotion after entering memory_failure_hugetlb() could
cause the race, so we need the check after taking the page lock.

Thanks,
Naoya Horiguchi


* Re: [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list
  2022-03-04  8:28   ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-03-07  7:07     ` Miaohe Lin
  2022-03-07 19:53       ` Yang Shi
  0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2022-03-07  7:07 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, linux-mm, linux-kernel

On 2022/3/4 16:28, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Feb 28, 2022 at 10:02:45PM +0800, Miaohe Lin wrote:
>> The huge zero page could reach here and if we ever try to split it, the
>> VM_BUG_ON_PAGE will be triggered in split_huge_page_to_list(). Also the
>> non-lru compound movable pages could be taken for transhuge pages. Skip
>> these pages by checking PageLRU because huge zero page isn't lru page as
>> non-lru compound movable pages.
> 
> It seems that memory_failure() also fails at get_any_page() with "hwpoison:
> unhandlable page" message.
> 
>   [16478.203474] page:00000000b6acdbd1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1810b4
>   [16478.206612] flags: 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
>   [16478.209411] raw: 0057ffffc0801000 fffff11bc6042d08 fffff11bc6042d08 0000000000000000
>   [16478.211921] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>   [16478.214473] page dumped because: hwpoison: unhandlable page
>   [16478.216386] Memory failure: 0x1810b4: recovery action for unknown page: Ignored
> 
> We can't handle errors on huge (or normal) zero page, so the current

Sorry for the confusing commit log again. I should have had a coffee before making this patch.
A huge or normal zero page will fail at get_any_page() because it's neither HWPoisonHandlable
nor PageHuge.

> behavior seems to me more suitable than "unsplit thp".
> 
> Or if you have some producer to reach the following path with huge zero
> page, could you share it?
> 

What I mean is that a non-lru movable compound page can reach here unexpectedly, because __PageMovable(page)
is handlable now. So get_any_page() can succeed in grabbing the page refcount. And since it's a compound page,
it will go through split_huge_page_to_list(), because PageTransHuge() checks PageHead(page), which can also
be true for such a compound page. But this type of page is unexpected for split_huge_page_to_list().
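
For reference, PageTransHuge() is just a head-page test (paraphrased
from include/linux/huge_mm.h), which is why any compound head passes it:

  static inline int PageTransHuge(struct page *page)
  {
	VM_BUG_ON_PAGE(PageTail(page), page);
	return PageHead(page);	/* true for any compound head, THP or not */
  }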

Does this make sense to you? Thanks, Naoya.

> Thanks,
> Naoya Horiguchi
> 
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>  mm/memory-failure.c | 14 ++++++++++++++
>>  1 file changed, 14 insertions(+)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 23bfd809dc8c..ac6492e36978 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1792,6 +1792,20 @@ int memory_failure(unsigned long pfn, int flags)
>>  	}
>>  
>>  	if (PageTransHuge(hpage)) {
>> +		/*
>> +		 * The non-lru compound movable pages could be taken for
>> +		 * transhuge pages. Also huge zero page could reach here
>> +		 * and if we ever try to split it, the VM_BUG_ON_PAGE will
>> +		 * be triggered in split_huge_page_to_list(). Skip these
>> +		 * pages by checking PageLRU because huge zero page isn't
>> +		 * lru page as non-lru compound movable pages.
>> +		 */
>> +		if (!PageLRU(hpage)) {
>> +			put_page(p);
>> +			action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
>> +			res = -EBUSY;
>> +			goto unlock_mutex;
>> +		}
>>  		/*
>>  		 * The flag must be set after the refcount is bumped
>>  		 * otherwise it may race with THP split.
>> -- 
>> 2.23.0



* Re: [PATCH 2/4] mm/memory-failure.c: fix wrong user reference report
  2022-03-04  8:27   ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-03-07 11:26     ` Miaohe Lin
  2022-03-07 20:14       ` Yang Shi
  0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2022-03-07 11:26 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: akpm, linux-mm, linux-kernel

On 2022/3/4 16:27, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Feb 28, 2022 at 10:02:43PM +0800, Miaohe Lin wrote:
>> The dirty swapcache page is still residing in the swap cache after it's
>> hwpoisoned. So there is always one extra refcount for swap cache.
> 
> The diff seems fine at a glance, but let me have a few question to
> understand the issue more.
> 
> - Is the behavior described above the effect of recent change on shmem where
>   dirty pagecache is pinned on hwpoison (commit a76054266661 ("mm: shmem:
>   don't truncate page if memory failure happens"). Or the older kernels
>   behave as the same?
> 
> - Is the behavior true for normal anonymous pages (not shmem pages)?
> 

The behavior described above concerns the swapcache, not the pagecache. So it should be
irrelevant to the recent change on shmem.

What I try to fix here is that me_swapcache_dirty() holds one extra pin via the swap cache
regardless of the return value of delete_from_lru_cache(). We should try to report a more
accurate extra refcount for debugging purposes.

> I'm trying to test hwpoison hitting the dirty swapcache, but it seems that
> in my testing memory_faliure() fails with "hwpoison: unhandlable page"

Maybe memory_failure() is racing with page reclaim where the page is isolated?

> warning at get_any_page().  So I'm still not sure that me_pagecache_dirty()
> fixes any visible problem.

IIUC, me_pagecache_dirty() can't do much unless the corresponding address_space supports
error_remove_page, which can truncate the dirty pagecache page. But that may cause silent
data loss. It's better to keep the page in the pagecache until the file is truncated, hole
punched or removed, as commit a76054266661 pointed out.
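
(A paraphrased sketch of that dependency from the pagecache handling
path; not the verbatim source:)

  /*
   * With ->error_remove_page the poisoned page can be truncated from
   * the file; without it, invalidating a clean copy is the only safe
   * option, since truncating dirty data behind the application's back
   * means silent data loss.
   */
  if (mapping->a_ops->error_remove_page) {
	if (!mapping->a_ops->error_remove_page(mapping, p))
		ret = MF_RECOVERED;
  } else {
	if (invalidate_inode_page(p))
		ret = MF_RECOVERED;
  }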

Thanks.

> Thanks,
> Naoya Horiguchi
> 
>>
>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>> ---
>>  mm/memory-failure.c | 6 +-----
>>  1 file changed, 1 insertion(+), 5 deletions(-)
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 0d7c58340a98..5f9503573263 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -984,7 +984,6 @@ static int me_pagecache_dirty(struct page_state *ps, struct page *p)
>>  static int me_swapcache_dirty(struct page_state *ps, struct page *p)
>>  {
>>  	int ret;
>> -	bool extra_pins = false;
>>  
>>  	ClearPageDirty(p);
>>  	/* Trigger EIO in shmem: */
>> @@ -993,10 +992,7 @@ static int me_swapcache_dirty(struct page_state *ps, struct page *p)
>>  	ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
>>  	unlock_page(p);
>>  
>> -	if (ret == MF_DELAYED)
>> -		extra_pins = true;
>> -
>> -	if (has_extra_refcount(ps, p, extra_pins))
>> +	if (has_extra_refcount(ps, p, true))
>>  		ret = MF_FAILED;
>>  
>>  	return ret;
>> -- 
>> 2.23.0



* Re: [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again
  2022-03-07  7:01         ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-03-07 19:07           ` Mike Kravetz
  2022-03-08  6:56             ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 27+ messages in thread
From: Mike Kravetz @ 2022-03-07 19:07 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也), Miaohe Lin
  Cc: akpm, linux-mm, linux-kernel

On 3/6/22 23:01, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Mar 07, 2022 at 11:44:20AM +0800, Miaohe Lin wrote:
>> On 2022/3/5 3:32, Mike Kravetz wrote:
>>> On 3/4/22 00:26, HORIGUCHI NAOYA(堀口 直也) wrote:
>>>> On Mon, Feb 28, 2022 at 10:02:42PM +0800, Miaohe Lin wrote:
>>>>> There is a race window where we got the compound_head, the hugetlb page
>>>>> could be freed to buddy, or even changed to another compound page just
>>>>> before we try to get hwpoison page. If this happens, just bail out.
>>>>
>>>> I think that when some hugetlb page is about to change into other type/size
>>>> of compound page, it has to go through buddy allocator because hugetlb pages
>>>> are maintained in separate memory allocator and they never change into other
>>>> normal state directly.  memory_failure_hugetlb() takes refcount before
>>>> lock_page(), so the hugetlb page seems not change between get_hwpoison_page()
>>>> and lock_page(). So it this new check really necessary?
>>>
>>> A hugetlb page could change size without going through buddy via the new
>>> demote functionality [1].  Only hugetlb pages on the hugetlb free list can
>>> be demoted.  
>>>
>>> We should not demote a page if poison is set.  However, there is no check in
>>> the demote code.  IIUC, poison is set early in the memory error handling
>>> process, even before taking ref on page.  Demote code needs to be fixed so
>>> that poisoned pages are not demoted.  I can do that.
>>>
>>> With this change in place, then I think Naoya's statement that hugetlb pages
>>> can not change state is correct and this patch is not necessary.
>>>
>>
>> Sorry for my confusing commit words. What I mean to tell is indeed the below race:
>>   CPU 1							CPU 2
>>   memory_failure_hugetlb
>>   struct page *head = compound_head(p);
>> 							hugetlb page is freed to buddy, or
>> 							even changed to another compound page
>> 							as we haven't held the page refcnt now
>>   get_hwpoison_page -- page is not what we want now...
>>
>> Does this make sense for both of you? Many thanks for comment and reply! :)
> 
> Thanks for elaboration, I agree with you (I simply overlooked this race, sorry).

Yes, thank you.

> And please add this in the commit log.
> 
>> +
>> +       /**
>> +        * The page could have changed compound pages due to race window.
>> +        * If this happens just bail out.
>> +        */
>> +       if (!PageHuge(p) || compound_head(p) != head) {
>> +               action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
>> +               res = -EBUSY;
>> +               goto out;
>> +       }
> 
> Let me have one comment on the diff. The result code MF_MSG_DIFFERENT_COMPOUND
> might not fit when PageHuge is false in the check (because it's no longer a
> compound page).  Maybe you may invent another result code, or changes
> MF_MSG_DIFFERENT_COMPOUND (for example) to MF_MSG_DIFFERENT_PAGE_SIZE?
> 

Suppose we do encounter this race.  Also, suppose p != head.
At the beginning of memory_failure_hugetlb, we do:

struct page *head = compound_head(p);
...
if (TestSetPageHWPoison(head))

So, it could be that we set Poison in the 'head' page but the error was really
in another page.  Is that correct?

Now with the race, head is not a huge page and the pages could even be on
buddy.  Does this mean we could have poison set on the wrong page in buddy?
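
(For reference, the ordering at the top of memory_failure_hugetlb() is
roughly the following, paraphrased, which is what makes that scenario
possible:)

  struct page *p = pfn_to_page(pfn);
  struct page *head = compound_head(p);	/* sampled with no pin held */

  if (TestSetPageHWPoison(head)) {	/* poison is set here ... */
	/* ... */
  }
  /* ... */
  if (!get_hwpoison_page(p, flags)) {	/* ... the ref is only taken here */
	/* ... */
  }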

-- 
Mike Kravetz


* Re: [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list
  2022-03-07  7:07     ` Miaohe Lin
@ 2022-03-07 19:53       ` Yang Shi
  2022-03-08 12:36         ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Yang Shi @ 2022-03-07 19:53 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: HORIGUCHI NAOYA(堀口 直也),
	akpm, linux-mm, linux-kernel

On Sun, Mar 6, 2022 at 11:07 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2022/3/4 16:28, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Mon, Feb 28, 2022 at 10:02:45PM +0800, Miaohe Lin wrote:
> >> The huge zero page could reach here and if we ever try to split it, the
> >> VM_BUG_ON_PAGE will be triggered in split_huge_page_to_list(). Also the
> >> non-lru compound movable pages could be taken for transhuge pages. Skip
> >> these pages by checking PageLRU because huge zero page isn't lru page as
> >> non-lru compound movable pages.
> >
> > It seems that memory_failure() also fails at get_any_page() with "hwpoison:
> > unhandlable page" message.
> >
> >   [16478.203474] page:00000000b6acdbd1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1810b4
> >   [16478.206612] flags: 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
> >   [16478.209411] raw: 0057ffffc0801000 fffff11bc6042d08 fffff11bc6042d08 0000000000000000
> >   [16478.211921] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> >   [16478.214473] page dumped because: hwpoison: unhandlable page
> >   [16478.216386] Memory failure: 0x1810b4: recovery action for unknown page: Ignored
> >
> > We can't handle errors on huge (or normal) zero page, so the current
>
> Sorry for confusing commit log again. I should have a coffee before I make this patch.
> Huge or normal zero page will fail at get_any_page because they're neither HWPoisonHandlable
> nor PageHuge.
>
> > behavior seems to me more suitable than "unsplit thp".
> >
> > Or if you have some producer to reach the following path with huge zero
> > page, could you share it?
> >
>
> What I mean is that non-lru movable compound page can reach here unexpected because __PageMovable(page)
> is handleable now. So get_any_page could succeed to grab the page refcnt. And since it's compound page,
> it will go through the split_huge_page_to_list because PageTransHuge checks PageHead(page) which can also
> be true for compound page. But this type of pages is unexpected for split_huge_page_to_list.

Can we really handle non-LRU movable pages in memory failure
(uncorrectable errors)? Typically they are balloon, zsmalloc, etc.
Assuming we run into a base (4K) non-LRU movable page, we could reach
as far as identify_page_state(), and it should not fall into any
category except me_unknown. So it seems we could simply make it
unhandlable.

But it should be handlable for soft-offline since it could be migrated.
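
(For reference, identify_page_state() dispatches on page flags roughly
like this, paraphrased, so a non-LRU movable page matches nothing
before the terminating catch-all entry:)

  struct page_state *ps;

  /*
   * The first error_states[] entry whose flag mask matches wins; the
   * table's terminator has mask == 0 and maps to me_unknown(), which
   * is reported as "recovery action for unknown page: Ignored".
   */
  for (ps = error_states;; ps++)
	if ((p->flags & ps->mask) == ps->res)
		break;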


> Does this make sense for you? Thanks Naoya.
>
> > Thanks,
> > Naoya Horiguchi
> >
> >>
> >> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >> ---
> >>  mm/memory-failure.c | 14 ++++++++++++++
> >>  1 file changed, 14 insertions(+)
> >>
> >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >> index 23bfd809dc8c..ac6492e36978 100644
> >> --- a/mm/memory-failure.c
> >> +++ b/mm/memory-failure.c
> >> @@ -1792,6 +1792,20 @@ int memory_failure(unsigned long pfn, int flags)
> >>      }
> >>
> >>      if (PageTransHuge(hpage)) {
> >> +            /*
> >> +             * The non-lru compound movable pages could be taken for
> >> +             * transhuge pages. Also huge zero page could reach here
> >> +             * and if we ever try to split it, the VM_BUG_ON_PAGE will
> >> +             * be triggered in split_huge_page_to_list(). Skip these
> >> +             * pages by checking PageLRU because huge zero page isn't
> >> +             * lru page as non-lru compound movable pages.
> >> +             */
> >> +            if (!PageLRU(hpage)) {
> >> +                    put_page(p);
> >> +                    action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> >> +                    res = -EBUSY;
> >> +                    goto unlock_mutex;
> >> +            }
> >>              /*
> >>               * The flag must be set after the refcount is bumped
> >>               * otherwise it may race with THP split.
> >> --
> >> 2.23.0
>
>


* Re: [PATCH 2/4] mm/memory-failure.c: fix wrong user reference report
  2022-03-07 11:26     ` Miaohe Lin
@ 2022-03-07 20:14       ` Yang Shi
  2022-03-08 13:11         ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Yang Shi @ 2022-03-07 20:14 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: HORIGUCHI NAOYA(堀口 直也),
	akpm, linux-mm, linux-kernel

On Mon, Mar 7, 2022 at 3:26 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2022/3/4 16:27, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Mon, Feb 28, 2022 at 10:02:43PM +0800, Miaohe Lin wrote:
> >> The dirty swapcache page is still residing in the swap cache after it's
> >> hwpoisoned. So there is always one extra refcount for swap cache.
> >
> > The diff seems fine at a glance, but let me have a few question to
> > understand the issue more.
> >
> > - Is the behavior described above the effect of recent change on shmem where
> >   dirty pagecache is pinned on hwpoison (commit a76054266661 ("mm: shmem:
> >   don't truncate page if memory failure happens"). Or the older kernels
> >   behave as the same?
> >
> > - Is the behavior true for normal anonymous pages (not shmem pages)?
> >
>
> The behavior described above is aimed at swapcache not pagecache. So it should be
> irrelevant with the recent change on shmem.
>
> What I try to fix here is that me_swapcache_dirty holds one extra pin via SwapCache
> regardless of the return value of delete_from_lru_cache. We should try to report more
> accurate extra refcount for debugging purpose.

I think you misunderstood the code. delete_from_lru_cache()
returning 0 means the page was on the LRU and has now been isolated
from it successfully. Returning -EIO means the page was not on the
LRU, so it should have at least one extra pin on it.

So MF_DELAYED means there is no pin other than hwpoison and
swapcache, which is expected; MF_FAILED means there might be extra
pins.

has_extra_refcount() raises an error only when there is an *unexpected* refcount.
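
For readers following along, a paraphrased sketch of
delete_from_lru_cache() (not the verbatim source):

  static int delete_from_lru_cache(struct page *p)
  {
	if (!isolate_lru_page(p)) {
		/* Was on the LRU: clear state so the buddy system
		 * won't complain if the page is later freed. */
		ClearPageActive(p);
		ClearPageUnevictable(p);
		/* Drop the reference taken by isolate_lru_page(). */
		put_page(p);
		return 0;	/* caller maps this to MF_DELAYED */
	}
	return -EIO;		/* not on the LRU: extra pin likely */
  }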

>
> > I'm trying to test hwpoison hitting the dirty swapcache, but it seems that
> > in my testing memory_faliure() fails with "hwpoison: unhandlable page"
>
> Maybe memory_faliure is racing with page reclaim where page is isolated?
>
> > warning at get_any_page().  So I'm still not sure that me_pagecache_dirty()
> > fixes any visible problem.
>
> IIUC, me_pagecache_dirty can't do much except for the corresponding address_space supporting
> error_remove_page which can truncate the dirty pagecache page. But this may cause silent data
> loss. It's better to keep the page stay in the pagecache until the file is truncated, hole
> punched or removed as commit a76054266661 pointed out.
>
> Thanks.
>
> > Thanks,
> > Naoya Horiguchi
> >
> >>
> >> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >> ---
> >>  mm/memory-failure.c | 6 +-----
> >>  1 file changed, 1 insertion(+), 5 deletions(-)
> >>
> >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >> index 0d7c58340a98..5f9503573263 100644
> >> --- a/mm/memory-failure.c
> >> +++ b/mm/memory-failure.c
> >> @@ -984,7 +984,6 @@ static int me_pagecache_dirty(struct page_state *ps, struct page *p)
> >>  static int me_swapcache_dirty(struct page_state *ps, struct page *p)
> >>  {
> >>      int ret;
> >> -    bool extra_pins = false;
> >>
> >>      ClearPageDirty(p);
> >>      /* Trigger EIO in shmem: */
> >> @@ -993,10 +992,7 @@ static int me_swapcache_dirty(struct page_state *ps, struct page *p)
> >>      ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
> >>      unlock_page(p);
> >>
> >> -    if (ret == MF_DELAYED)
> >> -            extra_pins = true;
> >> -
> >> -    if (has_extra_refcount(ps, p, extra_pins))
> >> +    if (has_extra_refcount(ps, p, true))
> >>              ret = MF_FAILED;
> >>
> >>      return ret;
> >> --
> >> 2.23.0
>
>


* Re: [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again
  2022-03-07 19:07           ` Mike Kravetz
@ 2022-03-08  6:56             ` HORIGUCHI NAOYA(堀口 直也)
  2022-03-08 11:28               ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-03-08  6:56 UTC (permalink / raw)
  To: Mike Kravetz; +Cc: Miaohe Lin, akpm, linux-mm, linux-kernel

On Mon, Mar 07, 2022 at 11:07:32AM -0800, Mike Kravetz wrote:
...
> > 
> >> +
> >> +       /**
> >> +        * The page could have changed compound pages due to race window.
> >> +        * If this happens just bail out.
> >> +        */
> >> +       if (!PageHuge(p) || compound_head(p) != head) {
> >> +               action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
> >> +               res = -EBUSY;
> >> +               goto out;
> >> +       }
> > 
> > Let me have one comment on the diff. The result code MF_MSG_DIFFERENT_COMPOUND
> > might not fit when PageHuge is false in the check (because it's no longer a
> > compound page).  Maybe you may invent another result code, or changes
> > MF_MSG_DIFFERENT_COMPOUND (for example) to MF_MSG_DIFFERENT_PAGE_SIZE?
> > 
> 
> Suppose we do encounter this race.  Also, suppose p != head.
> At the beginning of memory_failure_hugetlb, we do:
> 
> struct page *head = compound_head(p);
> ...
> if (TestSetPageHWPoison(head))
> 
> So, it could be that we set Poison in the 'head' page but the error was really
> in another page.  Is that correct?
> 
> Now with the race, head is not a huge page and the pages could even be on
> buddy.  Does this mean we could have poison set on the wrong page in buddy?

Correct. The race might be rare, but this needs a fix.
I think that setting PageHWPoison first (before taking the refcount and page lock)
is the root of all the related problems.  This behavior came from the original
concept in hwpoison that preventing consumption of corrupted data is the first
priority.  But now I think that makes no sense if it leads to this kind of bug.

I'll try to write a patch for this (fixing only memory_failure_hugetlb() first;
the generic path should be fixed later).
Thank you for pointing this out.
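
A rough sketch of the reordering (hypothetical; not the actual patch):

  /*
   * Hypothetical ordering for memory_failure_hugetlb(): pin and lock
   * first, verify the page is still the expected hugetlb head under
   * the lock, and only then mark it hwpoisoned.
   */
  if (get_hwpoison_page(p, flags) <= 0)
	return -EBUSY;
  head = compound_head(p);
  lock_page(head);
  if (!PageHuge(p) || compound_head(p) != head) {
	unlock_page(head);
	put_page(p);
	return -EBUSY;
  }
  if (TestSetPageHWPoison(head)) {
	/* already poisoned: unwind and return */
  }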

- Naoya Horiguchi


* Re: [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again
  2022-03-08  6:56             ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-03-08 11:28               ` Miaohe Lin
  0 siblings, 0 replies; 27+ messages in thread
From: Miaohe Lin @ 2022-03-08 11:28 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也), Mike Kravetz
  Cc: akpm, linux-mm, linux-kernel

On 2022/3/8 14:56, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Mon, Mar 07, 2022 at 11:07:32AM -0800, Mike Kravetz wrote:
> ...
>>>
>>>> +
>>>> +       /**
>>>> +        * The page could have changed compound pages due to race window.
>>>> +        * If this happens just bail out.
>>>> +        */
>>>> +       if (!PageHuge(p) || compound_head(p) != head) {
>>>> +               action_result(pfn, MF_MSG_DIFFERENT_COMPOUND, MF_IGNORED);
>>>> +               res = -EBUSY;
>>>> +               goto out;
>>>> +       }
>>>
>>> Let me have one comment on the diff. The result code MF_MSG_DIFFERENT_COMPOUND
>>> might not fit when PageHuge is false in the check (because it's no longer a
>>> compound page).  Maybe you may invent another result code, or changes
>>> MF_MSG_DIFFERENT_COMPOUND (for example) to MF_MSG_DIFFERENT_PAGE_SIZE?
>>>
>>
>> Suppose we do encounter this race.  Also, suppose p != head.
>> At the beginning of memory_failure_hugetlb, we do:
>>
>> struct page *head = compound_head(p);
>> ...
>> if (TestSetPageHWPoison(head))
>>
>> So, it could be that we set Poison in the 'head' page but the error was really
>> in another page.  Is that correct?
>>
>> Now with the race, head is not a huge page and the pages could even be on
>> buddy.  Does this mean we could have poison set on the wrong page in buddy?
> 
> Correct, the race might be rare, but this needs a fix.
> I think that setting PageHWPoison first (before taking refcount and page lock)
> is the root of all related problems.  This behavior came from the original
> concept in hwpoison that preventing consumption of corrupted data is the first
> priority.  But now I think that this makes no sense if we have this kind of bugs.
> 
> I'll try to write a patch for this (I only fix memory_failure_hugetlb() first,
> but generic path should be fixed later).
> Thank you for pointing out.

Many thanks to both of you for doing this. :)

> 
> - Naoya Horiguchi
> 



* Re: [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list
  2022-03-07 19:53       ` Yang Shi
@ 2022-03-08 12:36         ` Miaohe Lin
  2022-03-08 18:47           ` Yang Shi
  0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2022-03-08 12:36 UTC (permalink / raw)
  To: Yang Shi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	akpm, linux-mm, linux-kernel

On 2022/3/8 3:53, Yang Shi wrote:
> On Sun, Mar 6, 2022 at 11:07 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2022/3/4 16:28, HORIGUCHI NAOYA(堀口 直也) wrote:
>>> On Mon, Feb 28, 2022 at 10:02:45PM +0800, Miaohe Lin wrote:
>>>> The huge zero page could reach here and if we ever try to split it, the
>>>> VM_BUG_ON_PAGE will be triggered in split_huge_page_to_list(). Also the
>>>> non-lru compound movable pages could be taken for transhuge pages. Skip
>>>> these pages by checking PageLRU because huge zero page isn't lru page as
>>>> non-lru compound movable pages.
>>>
>>> It seems that memory_failure() also fails at get_any_page() with "hwpoison:
>>> unhandlable page" message.
>>>
>>>   [16478.203474] page:00000000b6acdbd1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1810b4
>>>   [16478.206612] flags: 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
>>>   [16478.209411] raw: 0057ffffc0801000 fffff11bc6042d08 fffff11bc6042d08 0000000000000000
>>>   [16478.211921] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>   [16478.214473] page dumped because: hwpoison: unhandlable page
>>>   [16478.216386] Memory failure: 0x1810b4: recovery action for unknown page: Ignored
>>>
>>> We can't handle errors on huge (or normal) zero page, so the current
>>
>> Sorry for confusing commit log again. I should have a coffee before I make this patch.
>> Huge or normal zero page will fail at get_any_page because they're neither HWPoisonHandlable
>> nor PageHuge.
>>
>>> behavior seems to me more suitable than "unsplit thp".
>>>
>>> Or if you have some producer to reach the following path with huge zero
>>> page, could you share it?
>>>
>>
>> What I mean is that non-lru movable compound page can reach here unexpected because __PageMovable(page)
>> is handleable now. So get_any_page could succeed to grab the page refcnt. And since it's compound page,
>> it will go through the split_huge_page_to_list because PageTransHuge checks PageHead(page) which can also
>> be true for compound page. But this type of pages is unexpected for split_huge_page_to_list.
> 
> Can we really handle non-LRU movable pages in memory failure
> (uncorrectable errors)? Typically they are balloon, zsmalloc, etc.
> Assuming we run into a base (4K) non-LRU movable page, we could reach
> as far as identify_page_state(), it should not fall into any category
> except me_unknown. So it seems we could just simply make it
> unhandlable.

There is this comment in memory_failure():
	/*
	 * We ignore non-LRU pages for good reasons.
	 * - PG_locked is only well defined for LRU pages and a few others
	 * - to avoid races with __SetPageLocked()
	 * - to avoid races with __SetPageSlab*() (and more non-atomic ops)
	 * The check (unnecessarily) ignores LRU pages being isolated and
	 * walked by the page reclaim code, however that's not a big loss.
	 */

So we cannot handle non-LRU movable pages.

Do you mean something like below?

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5444a8ef4867..d80dbe0f20b6 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1784,6 +1784,13 @@ int memory_failure(unsigned long pfn, int flags)
                }
        }

+       if (__PageMovable(hpage)) {
+               put_page(p);
+               action_result(pfn, MF_MSG_MOVABLE_PAGE, MF_IGNORED);
+               res = -EBUSY;
+               goto unlock_mutex;
+       }
+
        if (PageTransHuge(hpage)) {
                /*
                 * The flag must be set after the refcount is bumped


i.e., simply make non-LRU movable pages unhandlable?

> 
> But it should be handlable for soft-offline since it could be migrated.
> 

Yes, non-LRU movable pages can be simply migrated.

Many thanks.

> 
>> Does this make sense for you? Thanks Naoya.
>>
>>> Thanks,
>>> Naoya Horiguchi
>>>
>>>>
>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>> ---
>>>>  mm/memory-failure.c | 14 ++++++++++++++
>>>>  1 file changed, 14 insertions(+)
>>>>
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index 23bfd809dc8c..ac6492e36978 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -1792,6 +1792,20 @@ int memory_failure(unsigned long pfn, int flags)
>>>>      }
>>>>
>>>>      if (PageTransHuge(hpage)) {
>>>> +            /*
>>>> +             * The non-lru compound movable pages could be taken for
>>>> +             * transhuge pages. Also huge zero page could reach here
>>>> +             * and if we ever try to split it, the VM_BUG_ON_PAGE will
>>>> +             * be triggered in split_huge_page_to_list(). Skip these
>>>> +             * pages by checking PageLRU because huge zero page isn't
>>>> +             * lru page as non-lru compound movable pages.
>>>> +             */
>>>> +            if (!PageLRU(hpage)) {
>>>> +                    put_page(p);
>>>> +                    action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
>>>> +                    res = -EBUSY;
>>>> +                    goto unlock_mutex;
>>>> +            }
>>>>              /*
>>>>               * The flag must be set after the refcount is bumped
>>>>               * otherwise it may race with THP split.
>>>> --
>>>> 2.23.0
>>
>>
> .
> 



* Re: [PATCH 2/4] mm/memory-failure.c: fix wrong user reference report
  2022-03-07 20:14       ` Yang Shi
@ 2022-03-08 13:11         ` Miaohe Lin
  2022-03-08 18:51           ` Yang Shi
  0 siblings, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2022-03-08 13:11 UTC (permalink / raw)
  To: Yang Shi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	akpm, linux-mm, linux-kernel

On 2022/3/8 4:14, Yang Shi wrote:
> On Mon, Mar 7, 2022 at 3:26 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2022/3/4 16:27, HORIGUCHI NAOYA(堀口 直也) wrote:
>>> On Mon, Feb 28, 2022 at 10:02:43PM +0800, Miaohe Lin wrote:
>>>> The dirty swapcache page is still residing in the swap cache after it's
>>>> hwpoisoned. So there is always one extra refcount for swap cache.
>>>
>>> The diff seems fine at a glance, but let me have a few question to
>>> understand the issue more.
>>>
>>> - Is the behavior described above the effect of recent change on shmem where
>>>   dirty pagecache is pinned on hwpoison (commit a76054266661 ("mm: shmem:
>>>   don't truncate page if memory failure happens"). Or the older kernels
>>>   behave as the same?
>>>
>>> - Is the behavior true for normal anonymous pages (not shmem pages)?
>>>
>>
>> The behavior described above is aimed at swapcache not pagecache. So it should be
>> irrelevant with the recent change on shmem.
>>
>> What I try to fix here is that me_swapcache_dirty holds one extra pin via SwapCache
>> regardless of the return value of delete_from_lru_cache. We should try to report more
>> accurate extra refcount for debugging purpose.
> 
> I think you misunderstood the code. The delete_from_lru_cache()
> returning 0 means the page was on LRU and isolated from LRU
> successfully now. Returning -EIO means the page was not on LRU, so it
> should have at least an extra pin on it.
> 
> So MF_DELAYED means there is no other pin other than hwpoison and
> swapcache which is expected, MF_FAILED means there might be extra
> pins.
> 
> The has_extra_refcount() raised error then there is *unexpected* refcount.

Many thanks for your explanation. It seems you're right. If the page is held on
the lru_pvecs when we try delete_from_lru_cache(), and it is drained to the lru
list afterwards (so its refcount might be 2 now), then we might get the following
complaint if extra_pins is always true:
	"Memory failure: ... still referenced by 0 users\n"

But it seems the original code cannot report the correct reason either, because
on retry the page could be deleted from the lru cache and we would succeed.

Anyway, many thanks for pointing this out.

> 
>>
>>> I'm trying to test hwpoison hitting the dirty swapcache, but it seems that
>>> in my testing memory_faliure() fails with "hwpoison: unhandlable page"
>>
>> Maybe memory_faliure is racing with page reclaim where page is isolated?
>>
>>> warning at get_any_page().  So I'm still not sure that me_pagecache_dirty()
>>> fixes any visible problem.
>>
>> IIUC, me_pagecache_dirty can't do much except for the corresponding address_space supporting
>> error_remove_page which can truncate the dirty pagecache page. But this may cause silent data
>> loss. It's better to keep the page stay in the pagecache until the file is truncated, hole
>> punched or removed as commit a76054266661 pointed out.
>>
>> Thanks.
>>
>>> Thanks,
>>> Naoya Horiguchi
>>>
>>>>
>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>> ---
>>>>  mm/memory-failure.c | 6 +-----
>>>>  1 file changed, 1 insertion(+), 5 deletions(-)
>>>>
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index 0d7c58340a98..5f9503573263 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -984,7 +984,6 @@ static int me_pagecache_dirty(struct page_state *ps, struct page *p)
>>>>  static int me_swapcache_dirty(struct page_state *ps, struct page *p)
>>>>  {
>>>>      int ret;
>>>> -    bool extra_pins = false;
>>>>
>>>>      ClearPageDirty(p);
>>>>      /* Trigger EIO in shmem: */
>>>> @@ -993,10 +992,7 @@ static int me_swapcache_dirty(struct page_state *ps, struct page *p)
>>>>      ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
>>>>      unlock_page(p);
>>>>
>>>> -    if (ret == MF_DELAYED)
>>>> -            extra_pins = true;
>>>> -
>>>> -    if (has_extra_refcount(ps, p, extra_pins))
>>>> +    if (has_extra_refcount(ps, p, true))
>>>>              ret = MF_FAILED;
>>>>
>>>>      return ret;
>>>> --
>>>> 2.23.0
>>
>>
> .
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list
  2022-03-08 12:36         ` Miaohe Lin
@ 2022-03-08 18:47           ` Yang Shi
  2022-03-09  8:45             ` Miaohe Lin
  2022-03-10 11:46             ` Miaohe Lin
  0 siblings, 2 replies; 27+ messages in thread
From: Yang Shi @ 2022-03-08 18:47 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: HORIGUCHI NAOYA(堀口 直也),
	akpm, linux-mm, linux-kernel

On Tue, Mar 8, 2022 at 4:36 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2022/3/8 3:53, Yang Shi wrote:
> > On Sun, Mar 6, 2022 at 11:07 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
> >>
> >> On 2022/3/4 16:28, HORIGUCHI NAOYA(堀口 直也) wrote:
> >>> On Mon, Feb 28, 2022 at 10:02:45PM +0800, Miaohe Lin wrote:
> >>>> The huge zero page could reach here, and if we ever try to split it, the
> >>>> VM_BUG_ON_PAGE will be triggered in split_huge_page_to_list(). Also,
> >>>> non-lru compound movable pages could be mistaken for transhuge pages.
> >>>> Skip these pages by checking PageLRU, because neither the huge zero page
> >>>> nor non-lru compound movable pages are lru pages.
> >>>
> >>> It seems that memory_failure() also fails at get_any_page() with "hwpoison:
> >>> unhandlable page" message.
> >>>
> >>>   [16478.203474] page:00000000b6acdbd1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1810b4
> >>>   [16478.206612] flags: 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
> >>>   [16478.209411] raw: 0057ffffc0801000 fffff11bc6042d08 fffff11bc6042d08 0000000000000000
> >>>   [16478.211921] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> >>>   [16478.214473] page dumped because: hwpoison: unhandlable page
> >>>   [16478.216386] Memory failure: 0x1810b4: recovery action for unknown page: Ignored
> >>>
> >>> We can't handle errors on huge (or normal) zero page, so the current
> >>
> >> Sorry for the confusing commit log again. I should have had a coffee before
> >> making this patch. Huge or normal zero pages will fail at get_any_page
> >> because they're neither HWPoisonHandlable nor PageHuge.
> >>
> >>> behavior seems to me more suitable than "unsplit thp".
> >>>
> >>> Or if you have some reproducer that reaches the following path with the
> >>> huge zero page, could you share it?
> >>>
> >>
> >> What I mean is that a non-lru movable compound page can reach here
> >> unexpectedly, because __PageMovable(page) is handlable now. So get_any_page
> >> can succeed in grabbing the page refcount. And since it's a compound page,
> >> it will go through split_huge_page_to_list, because PageTransHuge checks
> >> PageHead(page), which can also be true for such a compound page. But this
> >> type of page is unexpected for split_huge_page_to_list.
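
Right. For context, PageTransHuge() only checks that the page is a compound
head (quoting include/linux/huge_mm.h as of v5.17, from memory), so any
compound head page, movable or not, passes it:

static inline int PageTransHuge(struct page *page)
{
	VM_BUG_ON_PAGE(PageTail(page), page);
	return PageHead(page);
}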
> >
> > Can we really handle non-LRU movable pages in memory failure
> > (uncorrectable errors)? Typically they are balloon, zsmalloc, etc.
> > Assuming we run into a base (4K) non-LRU movable page, we could reach
> > as far as identify_page_state(); it should not fall into any category
> > except me_unknown. So it seems we could just simply make it
> > unhandlable.
>
> There is the comment from memory_failure:
>         /*
>          * We ignore non-LRU pages for good reasons.
>          * - PG_locked is only well defined for LRU pages and a few others
>          * - to avoid races with __SetPageLocked()
>          * - to avoid races with __SetPageSlab*() (and more non-atomic ops)
>          * The check (unnecessarily) ignores LRU pages being isolated and
>          * walked by the page reclaim code, however that's not a big loss.
>          */
>
> So we could not handle non-LRU movable pages.
>
> Do you mean something like the below?
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5444a8ef4867..d80dbe0f20b6 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1784,6 +1784,13 @@ int memory_failure(unsigned long pfn, int flags)
>                 }
>         }
>
> +       if (__PageMovable(hpage)) {
> +               put_page(p);
> +               action_result(pfn, MF_MSG_MOVABLE_PAGE, MF_IGNORED);
> +               res = -EBUSY;
> +               goto unlock_mutex;
> +       }
> +
>         if (PageTransHuge(hpage)) {
>                 /*
>                  * The flag must be set after the refcount is bumped
>
>
> i.e. simply make non-LRU movable pages unhandlable?

I'd prefer this personally. Something like the below (compile test only):

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 5444a8ef4867..789e40909ade 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1176,12 +1176,18 @@ void ClearPageHWPoisonTakenOff(struct page *page)
  * does not return true for hugetlb or device memory pages, so it's assumed
  * to be called only in the context where we never have such pages.
  */
-static inline bool HWPoisonHandlable(struct page *page)
+static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
 {
-	return PageLRU(page) || __PageMovable(page) || is_free_buddy_page(page);
+	bool movable = false;
+
+	/* Soft offline could migrate non-LRU movable pages */
+	if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page))
+		movable = true;
+
+	return movable || PageLRU(page) || is_free_buddy_page(page);
 }
 
-static int __get_hwpoison_page(struct page *page)
+static int __get_hwpoison_page(struct page *page, unsigned long flags)
 {
 	struct page *head = compound_head(page);
 	int ret = 0;
@@ -1196,7 +1202,7 @@ static int __get_hwpoison_page(struct page *page)
 	 * for any unsupported type of page in order to reduce the risk of
 	 * unexpected races caused by taking a page refcount.
 	 */
-	if (!HWPoisonHandlable(head))
+	if (!HWPoisonHandlable(head, flags))
 		return -EBUSY;
 
 	if (get_page_unless_zero(head)) {
@@ -1221,7 +1227,7 @@ static int get_any_page(struct page *p, unsigned long flags)
 
 try_again:
 	if (!count_increased) {
-		ret = __get_hwpoison_page(p);
+		ret = __get_hwpoison_page(p, flags);
 		if (!ret) {
 			if (page_count(p)) {
 				/* We raced with an allocation, retry. */
@@ -1249,7 +1255,7 @@ static int get_any_page(struct page *p, unsigned long flags)
 		}
 	}
 
-	if (PageHuge(p) || HWPoisonHandlable(p)) {
+	if (PageHuge(p) || HWPoisonHandlable(p, flags)) {
 		ret = 1;
 	} else {
 		/*

>
> >
> > But it should be handlable for soft-offline since it could be migrated.
> >
>
> Yes, non-LRU movable pages can be simply migrated.
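
Right. For reference, that migration works through the movable page hooks in
address_space_operations (v5.17 signatures, quoted from memory; see
Documentation/vm/page_migration.rst for the contract):

	bool (*isolate_page)(struct page *, isolate_mode_t);
	int (*migratepage)(struct address_space *, struct page *,
			struct page *, enum migrate_mode);
	void (*putback_page)(struct page *);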
>
> Many thanks.
>
> >
> >> Does this make sense to you? Thanks, Naoya.
> >>
> >>> Thanks,
> >>> Naoya Horiguchi
> >>>
> >>>>
> >>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >>>> ---
> >>>>  mm/memory-failure.c | 14 ++++++++++++++
> >>>>  1 file changed, 14 insertions(+)
> >>>>
> >>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>>> index 23bfd809dc8c..ac6492e36978 100644
> >>>> --- a/mm/memory-failure.c
> >>>> +++ b/mm/memory-failure.c
> >>>> @@ -1792,6 +1792,20 @@ int memory_failure(unsigned long pfn, int flags)
> >>>>      }
> >>>>
> >>>>      if (PageTransHuge(hpage)) {
> >>>> +            /*
> >>>> +             * The non-lru compound movable pages could be taken for
> >>>> +             * transhuge pages. Also huge zero page could reach here
> >>>> +             * and if we ever try to split it, the VM_BUG_ON_PAGE will
> >>>> +             * be triggered in split_huge_page_to_list(). Skip these
> >>>> +             * pages by checking PageLRU because huge zero page isn't
> >>>> +             * lru page as non-lru compound movable pages.
> >>>> +             */
> >>>> +            if (!PageLRU(hpage)) {
> >>>> +                    put_page(p);
> >>>> +                    action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> >>>> +                    res = -EBUSY;
> >>>> +                    goto unlock_mutex;
> >>>> +            }
> >>>>              /*
> >>>>               * The flag must be set after the refcount is bumped
> >>>>               * otherwise it may race with THP split.
> >>>> --
> >>>> 2.23.0
> >>
> >>
> > .
> >
>

^ permalink raw reply related	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/4] mm/memory-failure.c: fix wrong user reference report
  2022-03-08 13:11         ` Miaohe Lin
@ 2022-03-08 18:51           ` Yang Shi
  2022-03-09  8:30             ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Yang Shi @ 2022-03-08 18:51 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: HORIGUCHI NAOYA(堀口 直也),
	akpm, linux-mm, linux-kernel

On Tue, Mar 8, 2022 at 5:11 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2022/3/8 4:14, Yang Shi wrote:
> > On Mon, Mar 7, 2022 at 3:26 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
> >>
> >> On 2022/3/4 16:27, HORIGUCHI NAOYA(堀口 直也) wrote:
> >>> On Mon, Feb 28, 2022 at 10:02:43PM +0800, Miaohe Lin wrote:
> >>>> The dirty swapcache page is still residing in the swap cache after it's
> >>>> hwpoisoned. So there is always one extra refcount for swap cache.
> >>>
> >>> The diff seems fine at a glance, but let me ask a few questions to
> >>> understand the issue better.
> >>>
> >>> - Is the behavior described above the effect of the recent change on shmem
> >>>   where dirty pagecache is pinned on hwpoison (commit a76054266661 ("mm:
> >>>   shmem: don't truncate page if memory failure happens"))? Or do older
> >>>   kernels behave the same?
> >>>
> >>> - Is the behavior true for normal anonymous pages (not shmem pages)?
> >>>
> >>
> >> The behavior described above concerns the swapcache, not the pagecache, so
> >> it should be unrelated to the recent change on shmem.
> >>
> >> What I'm trying to fix here is that me_swapcache_dirty holds one extra pin
> >> via the swap cache regardless of the return value of delete_from_lru_cache.
> >> We should report the extra refcount more accurately for debugging purposes.
> >
> > I think you misunderstood the code. delete_from_lru_cache() returning 0
> > means the page was on the LRU and has now been isolated from it
> > successfully. Returning -EIO means the page was not on the LRU, so it
> > should have at least one extra pin on it.
> >
> > So MF_DELAYED means there is no pin other than the hwpoison and
> > swapcache ones, which is expected; MF_FAILED means there might be extra
> > pins.
> >
> > has_extra_refcount() raises an error when there is an *unexpected* refcount.
>
> Many thanks for your explanation. It seems you're right. If the page is held
> on the lru_pvecs when we try to do delete_from_lru_cache, and after that it's
> drained to the lru list (so its refcount might be 2 now), then we might get
> the following complaint if extra_pins is always true:
> 	"Memory failure: ... still referenced by 0 users\n"
>
> But it seems the original code cannot report the correct reason either,
> because on a retry the page could be deleted from the lru cache and we would
> succeed.

Retry is OK, but it seems like overkill to me, IMHO.

>
> Anyway, many thanks for pointing this out.
>
> >
> >>
> >>> I'm trying to test hwpoison hitting the dirty swapcache, but it seems that
> >>> in my testing memory_failure() fails with "hwpoison: unhandlable page"
> >>
> >> Maybe memory_failure is racing with page reclaim where the page is isolated?
> >>
> >>> warning at get_any_page().  So I'm still not sure that me_pagecache_dirty()
> >>> fixes any visible problem.
> >>
> >> IIUC, me_pagecache_dirty can't do much unless the corresponding
> >> address_space supports error_remove_page, which can truncate the dirty
> >> pagecache page. But that may cause silent data loss. It's better to keep
> >> the page in the pagecache until the file is truncated, hole punched, or
> >> removed, as commit a76054266661 pointed out.
> >>
> >> Thanks.
> >>
> >>>> Thanks,
> >>> Naoya Horiguchi
> >>>
> >>>>
> >>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >>>> ---
> >>>>  mm/memory-failure.c | 6 +-----
> >>>>  1 file changed, 1 insertion(+), 5 deletions(-)
> >>>>
> >>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>>> index 0d7c58340a98..5f9503573263 100644
> >>>> --- a/mm/memory-failure.c
> >>>> +++ b/mm/memory-failure.c
> >>>> @@ -984,7 +984,6 @@ static int me_pagecache_dirty(struct page_state *ps, struct page *p)
> >>>>  static int me_swapcache_dirty(struct page_state *ps, struct page *p)
> >>>>  {
> >>>>      int ret;
> >>>> -    bool extra_pins = false;
> >>>>
> >>>>      ClearPageDirty(p);
> >>>>      /* Trigger EIO in shmem: */
> >>>> @@ -993,10 +992,7 @@ static int me_swapcache_dirty(struct page_state *ps, struct page *p)
> >>>>      ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
> >>>>      unlock_page(p);
> >>>>
> >>>> -    if (ret == MF_DELAYED)
> >>>> -            extra_pins = true;
> >>>> -
> >>>> -    if (has_extra_refcount(ps, p, extra_pins))
> >>>> +    if (has_extra_refcount(ps, p, true))
> >>>>              ret = MF_FAILED;
> >>>>
> >>>>      return ret;
> >>>> --
> >>>> 2.23.0
> >>
> >>
> > .
> >
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 2/4] mm/memory-failure.c: fix wrong user reference report
  2022-03-08 18:51           ` Yang Shi
@ 2022-03-09  8:30             ` Miaohe Lin
  0 siblings, 0 replies; 27+ messages in thread
From: Miaohe Lin @ 2022-03-09  8:30 UTC (permalink / raw)
  To: Yang Shi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	akpm, linux-mm, linux-kernel

On 2022/3/9 2:51, Yang Shi wrote:
> On Tue, Mar 8, 2022 at 5:11 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2022/3/8 4:14, Yang Shi wrote:
>>> On Mon, Mar 7, 2022 at 3:26 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>
>>>> On 2022/3/4 16:27, HORIGUCHI NAOYA(堀口 直也) wrote:
>>>>> On Mon, Feb 28, 2022 at 10:02:43PM +0800, Miaohe Lin wrote:
>>>>>> The dirty swapcache page is still residing in the swap cache after it's
>>>>>> hwpoisoned. So there is always one extra refcount for swap cache.
>>>>>
>>>>> The diff seems fine at a glance, but let me ask a few questions to
>>>>> understand the issue better.
>>>>>
>>>>> - Is the behavior described above the effect of the recent change on shmem
>>>>>   where dirty pagecache is pinned on hwpoison (commit a76054266661 ("mm:
>>>>>   shmem: don't truncate page if memory failure happens"))? Or do older
>>>>>   kernels behave the same?
>>>>>
>>>>> - Is the behavior true for normal anonymous pages (not shmem pages)?
>>>>>
>>>>
>>>> The behavior described above concerns the swapcache, not the pagecache, so
>>>> it should be unrelated to the recent change on shmem.
>>>>
>>>> What I'm trying to fix here is that me_swapcache_dirty holds one extra pin
>>>> via the swap cache regardless of the return value of delete_from_lru_cache.
>>>> We should report the extra refcount more accurately for debugging purposes.
>>>
>>> I think you misunderstood the code. delete_from_lru_cache() returning 0
>>> means the page was on the LRU and has now been isolated from it
>>> successfully. Returning -EIO means the page was not on the LRU, so it
>>> should have at least one extra pin on it.
>>>
>>> So MF_DELAYED means there is no pin other than the hwpoison and
>>> swapcache ones, which is expected; MF_FAILED means there might be extra
>>> pins.
>>>
>>> has_extra_refcount() raises an error when there is an *unexpected* refcount.
>>
>> Many thanks for your explanation. It seems you're right. If the page is held
>> on the lru_pvecs when we try to do delete_from_lru_cache, and after that it's
>> drained to the lru list (so its refcount might be 2 now), then we might get
>> the following complaint if extra_pins is always true:
>> 	"Memory failure: ... still referenced by 0 users\n"
>>
>> But it seems the original code cannot report the correct reason either,
>> because on a retry the page could be deleted from the lru cache and we would
>> succeed.
> 
> Retry is OK, but it seems like overkill to me, IMHO.
> 

Anyway, it seems I misunderstood the code, so I will drop this patch. Thanks for the comments.
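
For the record, my understanding of delete_from_lru_cache(), sketched from
v5.17 with my own comments (details such as the memcg uncharge are omitted,
so double-check against the real source):

static int delete_from_lru_cache(struct page *p)
{
	if (!isolate_lru_page(p)) {
		/*
		 * Clear sensible page flags, so that the buddy system
		 * won't complain when the page is freed later.
		 */
		ClearPageActive(p);
		ClearPageUnevictable(p);

		/* Drop the refcount elevated by isolate_lru_page(). */
		put_page(p);
		return 0;	/* was on the LRU, isolated now */
	}
	return -EIO;		/* not on the LRU, e.g. still on a lru_pvec */
}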

>>
>> Anyway, many thanks for pointing this out.
>>
>>>
>>>>
>>>>> I'm trying to test hwpoison hitting the dirty swapcache, but it seems that
>>>>> in my testing memory_failure() fails with "hwpoison: unhandlable page"
>>>>
>>>> Maybe memory_failure is racing with page reclaim where the page is isolated?
>>>>
>>>>> warning at get_any_page().  So I'm still not sure that me_pagecache_dirty()
>>>>> fixes any visible problem.
>>>>
>>>> IIUC, me_pagecache_dirty can't do much unless the corresponding
>>>> address_space supports error_remove_page, which can truncate the dirty
>>>> pagecache page. But that may cause silent data loss. It's better to keep
>>>> the page in the pagecache until the file is truncated, hole punched, or
>>>> removed, as commit a76054266661 pointed out.
>>>>
>>>> Thanks.
>>>>
>>>>>> Thanks,
>>>>> Naoya Horiguchi
>>>>>
>>>>>>
>>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>>> ---
>>>>>>  mm/memory-failure.c | 6 +-----
>>>>>>  1 file changed, 1 insertion(+), 5 deletions(-)
>>>>>>
>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>>> index 0d7c58340a98..5f9503573263 100644
>>>>>> --- a/mm/memory-failure.c
>>>>>> +++ b/mm/memory-failure.c
>>>>>> @@ -984,7 +984,6 @@ static int me_pagecache_dirty(struct page_state *ps, struct page *p)
>>>>>>  static int me_swapcache_dirty(struct page_state *ps, struct page *p)
>>>>>>  {
>>>>>>      int ret;
>>>>>> -    bool extra_pins = false;
>>>>>>
>>>>>>      ClearPageDirty(p);
>>>>>>      /* Trigger EIO in shmem: */
>>>>>> @@ -993,10 +992,7 @@ static int me_swapcache_dirty(struct page_state *ps, struct page *p)
>>>>>>      ret = delete_from_lru_cache(p) ? MF_FAILED : MF_DELAYED;
>>>>>>      unlock_page(p);
>>>>>>
>>>>>> -    if (ret == MF_DELAYED)
>>>>>> -            extra_pins = true;
>>>>>> -
>>>>>> -    if (has_extra_refcount(ps, p, extra_pins))
>>>>>> +    if (has_extra_refcount(ps, p, true))
>>>>>>              ret = MF_FAILED;
>>>>>>
>>>>>>      return ret;
>>>>>> --
>>>>>> 2.23.0
>>>>
>>>>
>>> .
>>>
>>
> .
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list
  2022-03-08 18:47           ` Yang Shi
@ 2022-03-09  8:45             ` Miaohe Lin
  2022-03-10 11:46             ` Miaohe Lin
  1 sibling, 0 replies; 27+ messages in thread
From: Miaohe Lin @ 2022-03-09  8:45 UTC (permalink / raw)
  To: Yang Shi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	akpm, linux-mm, linux-kernel

On 2022/3/9 2:47, Yang Shi wrote:
> On Tue, Mar 8, 2022 at 4:36 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2022/3/8 3:53, Yang Shi wrote:
>>> On Sun, Mar 6, 2022 at 11:07 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>
>>>> On 2022/3/4 16:28, HORIGUCHI NAOYA(堀口 直也) wrote:
>>>>> On Mon, Feb 28, 2022 at 10:02:45PM +0800, Miaohe Lin wrote:
>>>>>> The huge zero page could reach here, and if we ever try to split it, the
>>>>>> VM_BUG_ON_PAGE will be triggered in split_huge_page_to_list(). Also,
>>>>>> non-lru compound movable pages could be mistaken for transhuge pages.
>>>>>> Skip these pages by checking PageLRU, because neither the huge zero page
>>>>>> nor non-lru compound movable pages are lru pages.
>>>>>
>>>>> It seems that memory_failure() also fails at get_any_page() with "hwpoison:
>>>>> unhandlable page" message.
>>>>>
>>>>>   [16478.203474] page:00000000b6acdbd1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1810b4
>>>>>   [16478.206612] flags: 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
>>>>>   [16478.209411] raw: 0057ffffc0801000 fffff11bc6042d08 fffff11bc6042d08 0000000000000000
>>>>>   [16478.211921] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>>>   [16478.214473] page dumped because: hwpoison: unhandlable page
>>>>>   [16478.216386] Memory failure: 0x1810b4: recovery action for unknown page: Ignored
>>>>>
>>>>> We can't handle errors on huge (or normal) zero page, so the current
>>>>
>>>> Sorry for the confusing commit log again. I should have had a coffee before
>>>> making this patch. Huge or normal zero pages will fail at get_any_page
>>>> because they're neither HWPoisonHandlable nor PageHuge.
>>>>
>>>>> behavior seems to me more suitable than "unsplit thp".
>>>>>
>>>>> Or if you have some reproducer that reaches the following path with the
>>>>> huge zero page, could you share it?
>>>>>
>>>>
>>>> What I mean is that a non-lru movable compound page can reach here
>>>> unexpectedly, because __PageMovable(page) is handlable now. So get_any_page
>>>> can succeed in grabbing the page refcount. And since it's a compound page,
>>>> it will go through split_huge_page_to_list, because PageTransHuge checks
>>>> PageHead(page), which can also be true for such a compound page. But this
>>>> type of page is unexpected for split_huge_page_to_list.
>>>
>>> Can we really handle non-LRU movable pages in memory failure
>>> (uncorrectable errors)? Typically they are balloon, zsmalloc, etc.
>>> Assuming we run into a base (4K) non-LRU movable page, we could reach
>>> as far as identify_page_state(); it should not fall into any category
>>> except me_unknown. So it seems we could just simply make it
>>> unhandlable.
>>
>> There is the comment from memory_failure:
>>         /*
>>          * We ignore non-LRU pages for good reasons.
>>          * - PG_locked is only well defined for LRU pages and a few others
>>          * - to avoid races with __SetPageLocked()
>>          * - to avoid races with __SetPageSlab*() (and more non-atomic ops)
>>          * The check (unnecessarily) ignores LRU pages being isolated and
>>          * walked by the page reclaim code, however that's not a big loss.
>>          */
>>
>> So we could not handle non-LRU movable pages.
>>
>> Do you mean something like the below?
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 5444a8ef4867..d80dbe0f20b6 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1784,6 +1784,13 @@ int memory_failure(unsigned long pfn, int flags)
>>                 }
>>         }
>>
>> +       if (__PageMovable(hpage)) {
>> +               put_page(p);
>> +               action_result(pfn, MF_MSG_MOVABLE_PAGE, MF_IGNORED);
>> +               res = -EBUSY;
>> +               goto unlock_mutex;
>> +       }
>> +
>>         if (PageTransHuge(hpage)) {
>>                 /*
>>                  * The flag must be set after the refcount is bumped
>>
>>
>> i.e. simply make non-LRU movable pages unhandlable?
> 

The below change looks good to me, except that we need to ensure the caller
passes in the MF_SOFT_OFFLINE flag when doing soft-offline now. This is not
the case for madvise_inject_error, but it's trivial to change; something like
the sketch below.
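
A hypothetical one-liner, untested; the exact spot depends on where the
soft-offline path builds its mf_flags (get_hwpoison_page() and
MF_SOFT_OFFLINE are the existing names):

	/* in the soft-offline entry point, before taking the page ref: */
	ret = get_hwpoison_page(page, flags | MF_SOFT_OFFLINE);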

Will try to do this in V2.
Many thanks.

> I'd prefer this personally. Something like the below (compile test only):
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5444a8ef4867..789e40909ade 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1176,12 +1176,18 @@ void ClearPageHWPoisonTakenOff(struct page *page)
>   * does not return true for hugetlb or device memory pages, so it's assumed
>   * to be called only in the context where we never have such pages.
>   */
> -static inline bool HWPoisonHandlable(struct page *page)
> +static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
>  {
> -	return PageLRU(page) || __PageMovable(page) || is_free_buddy_page(page);
> +	bool movable = false;
> +
> +	/* Soft offline could migrate non-LRU movable pages */
> +	if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page))
> +		movable = true;
> +
> +	return movable || PageLRU(page) || is_free_buddy_page(page);
>  }
> 
> -static int __get_hwpoison_page(struct page *page)
> +static int __get_hwpoison_page(struct page *page, unsigned long flags)
>  {
>  	struct page *head = compound_head(page);
>  	int ret = 0;
> @@ -1196,7 +1202,7 @@ static int __get_hwpoison_page(struct page *page)
>  	 * for any unsupported type of page in order to reduce the risk of
>  	 * unexpected races caused by taking a page refcount.
>  	 */
> -	if (!HWPoisonHandlable(head))
> +	if (!HWPoisonHandlable(head, flags))
>  		return -EBUSY;
> 
>  	if (get_page_unless_zero(head)) {
> @@ -1221,7 +1227,7 @@ static int get_any_page(struct page *p, unsigned long flags)
> 
>  try_again:
>  	if (!count_increased) {
> -		ret = __get_hwpoison_page(p);
> +		ret = __get_hwpoison_page(p, flags);
>  		if (!ret) {
>  			if (page_count(p)) {
>  				/* We raced with an allocation, retry. */
> @@ -1249,7 +1255,7 @@ static int get_any_page(struct page *p, unsigned long flags)
>  		}
>  	}
> 
> -	if (PageHuge(p) || HWPoisonHandlable(p)) {
> +	if (PageHuge(p) || HWPoisonHandlable(p, flags)) {
>  		ret = 1;
>  	} else {
>  		/*
> 
>>
>>>
>>> But it should be handlable for soft-offline since it could be migrated.
>>>
>>
>> Yes, non-LRU movable pages can be simply migrated.
>>
>> Many thanks.
>>
>>>
>>>> Does this make sense to you? Thanks, Naoya.
>>>>
>>>>> Thanks,
>>>>> Naoya Horiguchi
>>>>>
>>>>>>
>>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>>> ---
>>>>>>  mm/memory-failure.c | 14 ++++++++++++++
>>>>>>  1 file changed, 14 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>>> index 23bfd809dc8c..ac6492e36978 100644
>>>>>> --- a/mm/memory-failure.c
>>>>>> +++ b/mm/memory-failure.c
>>>>>> @@ -1792,6 +1792,20 @@ int memory_failure(unsigned long pfn, int flags)
>>>>>>      }
>>>>>>
>>>>>>      if (PageTransHuge(hpage)) {
>>>>>> +            /*
>>>>>> +             * The non-lru compound movable pages could be taken for
>>>>>> +             * transhuge pages. Also huge zero page could reach here
>>>>>> +             * and if we ever try to split it, the VM_BUG_ON_PAGE will
>>>>>> +             * be triggered in split_huge_page_to_list(). Skip these
>>>>>> +             * pages by checking PageLRU because huge zero page isn't
>>>>>> +             * lru page as non-lru compound movable pages.
>>>>>> +             */
>>>>>> +            if (!PageLRU(hpage)) {
>>>>>> +                    put_page(p);
>>>>>> +                    action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
>>>>>> +                    res = -EBUSY;
>>>>>> +                    goto unlock_mutex;
>>>>>> +            }
>>>>>>              /*
>>>>>>               * The flag must be set after the refcount is bumped
>>>>>>               * otherwise it may race with THP split.
>>>>>> --
>>>>>> 2.23.0
>>>>
>>>>
>>> .
>>>
>>
> .
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list
  2022-03-08 18:47           ` Yang Shi
  2022-03-09  8:45             ` Miaohe Lin
@ 2022-03-10 11:46             ` Miaohe Lin
  2022-03-10 19:32               ` Yang Shi
  1 sibling, 1 reply; 27+ messages in thread
From: Miaohe Lin @ 2022-03-10 11:46 UTC (permalink / raw)
  To: Yang Shi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	akpm, linux-mm, linux-kernel

On 2022/3/9 2:47, Yang Shi wrote:
> On Tue, Mar 8, 2022 at 4:36 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2022/3/8 3:53, Yang Shi wrote:
>>> On Sun, Mar 6, 2022 at 11:07 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>
>>>> On 2022/3/4 16:28, HORIGUCHI NAOYA(堀口 直也) wrote:
>>>>> On Mon, Feb 28, 2022 at 10:02:45PM +0800, Miaohe Lin wrote:
>>>>>> The huge zero page could reach here, and if we ever try to split it, the
>>>>>> VM_BUG_ON_PAGE will be triggered in split_huge_page_to_list(). Also,
>>>>>> non-lru compound movable pages could be mistaken for transhuge pages.
>>>>>> Skip these pages by checking PageLRU, because neither the huge zero page
>>>>>> nor non-lru compound movable pages are lru pages.
>>>>>
>>>>> It seems that memory_failure() also fails at get_any_page() with "hwpoison:
>>>>> unhandlable page" message.
>>>>>
>>>>>   [16478.203474] page:00000000b6acdbd1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1810b4
>>>>>   [16478.206612] flags: 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
>>>>>   [16478.209411] raw: 0057ffffc0801000 fffff11bc6042d08 fffff11bc6042d08 0000000000000000
>>>>>   [16478.211921] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>>>   [16478.214473] page dumped because: hwpoison: unhandlable page
>>>>>   [16478.216386] Memory failure: 0x1810b4: recovery action for unknown page: Ignored
>>>>>
>>>>> We can't handle errors on huge (or normal) zero page, so the current
>>>>
>>>> Sorry for the confusing commit log again. I should have had a coffee before
>>>> making this patch. Huge or normal zero pages will fail at get_any_page
>>>> because they're neither HWPoisonHandlable nor PageHuge.
>>>>
>>>>> behavior seems to me more suitable than "unsplit thp".
>>>>>
>>>>> Or if you have some reproducer that reaches the following path with the
>>>>> huge zero page, could you share it?
>>>>>
>>>>
>>>> What I mean is that a non-lru movable compound page can reach here
>>>> unexpectedly, because __PageMovable(page) is handlable now. So get_any_page
>>>> can succeed in grabbing the page refcount. And since it's a compound page,
>>>> it will go through split_huge_page_to_list, because PageTransHuge checks
>>>> PageHead(page), which can also be true for such a compound page. But this
>>>> type of page is unexpected for split_huge_page_to_list.
>>>
>>> Can we really handle non-LRU movable pages in memory failure
>>> (uncorrectable errors)? Typically they are balloon, zsmalloc, etc.
>>> Assuming we run into a base (4K) non-LRU movable page, we could reach
>>> as far as identify_page_state(); it should not fall into any category
>>> except me_unknown. So it seems we could just simply make it
>>> unhandlable.
>>
>> There is the comment from memory_failure:
>>         /*
>>          * We ignore non-LRU pages for good reasons.
>>          * - PG_locked is only well defined for LRU pages and a few others
>>          * - to avoid races with __SetPageLocked()
>>          * - to avoid races with __SetPageSlab*() (and more non-atomic ops)
>>          * The check (unnecessarily) ignores LRU pages being isolated and
>>          * walked by the page reclaim code, however that's not a big loss.
>>          */
>>
>> So we could not handle non-LRU movable pages.
>>
>> Do you mean something like the below?
>>
>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>> index 5444a8ef4867..d80dbe0f20b6 100644
>> --- a/mm/memory-failure.c
>> +++ b/mm/memory-failure.c
>> @@ -1784,6 +1784,13 @@ int memory_failure(unsigned long pfn, int flags)
>>                 }
>>         }
>>
>> +       if (__PageMovable(hpage)) {
>> +               put_page(p);
>> +               action_result(pfn, MF_MSG_MOVABLE_PAGE, MF_IGNORED);
>> +               res = -EBUSY;
>> +               goto unlock_mutex;
>> +       }
>> +
>>         if (PageTransHuge(hpage)) {
>>                 /*
>>                  * The flag must be set after the refcount is bumped
>>
>>
>> i.e. simply make non-LRU movable pages unhandlable?
> 

I thought about the below code more carefully and found that it would make
hwpoison_filter unable to handle non-LRU movable pages: they now return early
and thus never reach hwpoison_filter. This results in behavior inconsistent
with the previous one. So I think the original fix in this patch is more
suitable. What do you think?
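
For context, my rough reading of the relevant ordering inside
memory_failure() (v5.17, simplified to the steps that matter here):

	/*
	 *	get_hwpoison_page()	<- an early bail-out for unhandlable
	 *				   pages happens around here
	 *	PageTransHuge() check	<- split_huge_page() path
	 *	lock_page()
	 *	hwpoison_filter()	<- never reached by pages that
	 *				   bailed out above
	 */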

Thanks.

> I'd prefer this personally. Something like the below (compile test only):
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 5444a8ef4867..789e40909ade 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1176,12 +1176,18 @@ void ClearPageHWPoisonTakenOff(struct page *page)
>   * does not return true for hugetlb or device memory pages, so it's assumed
>   * to be called only in the context where we never have such pages.
>   */
> -static inline bool HWPoisonHandlable(struct page *page)
> +static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
>  {
> -	return PageLRU(page) || __PageMovable(page) || is_free_buddy_page(page);
> +	bool movable = false;
> +
> +	/* Soft offline could migrate non-LRU movable pages */
> +	if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page))
> +		movable = true;
> +
> +	return movable || PageLRU(page) || is_free_buddy_page(page);
>  }
> 
> -static int __get_hwpoison_page(struct page *page)
> +static int __get_hwpoison_page(struct page *page, unsigned long flags)
>  {
>  	struct page *head = compound_head(page);
>  	int ret = 0;
> @@ -1196,7 +1202,7 @@ static int __get_hwpoison_page(struct page *page)
>  	 * for any unsupported type of page in order to reduce the risk of
>  	 * unexpected races caused by taking a page refcount.
>  	 */
> -	if (!HWPoisonHandlable(head))
> +	if (!HWPoisonHandlable(head, flags))
>  		return -EBUSY;
> 
>  	if (get_page_unless_zero(head)) {
> @@ -1221,7 +1227,7 @@ static int get_any_page(struct page *p, unsigned long flags)
> 
>  try_again:
>  	if (!count_increased) {
> -		ret = __get_hwpoison_page(p);
> +		ret = __get_hwpoison_page(p, flags);
>  		if (!ret) {
>  			if (page_count(p)) {
>  				/* We raced with an allocation, retry. */
> @@ -1249,7 +1255,7 @@ static int get_any_page(struct page *p, unsigned long flags)
>  		}
>  	}
> 
> -	if (PageHuge(p) || HWPoisonHandlable(p)) {
> +	if (PageHuge(p) || HWPoisonHandlable(p, flags)) {
>  		ret = 1;
>  	} else {
>  		/*
> 
>>
>>>
>>> But it should be handlable for soft-offline since it could be migrated.
>>>
>>
>> Yes, non-LRU movable pages can be simply migrated.
>>
>> Many thanks.
>>
>>>
>>>> Does this make sense to you? Thanks, Naoya.
>>>>
>>>>> Thanks,
>>>>> Naoya Horiguchi
>>>>>
>>>>>>
>>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>>> ---
>>>>>>  mm/memory-failure.c | 14 ++++++++++++++
>>>>>>  1 file changed, 14 insertions(+)
>>>>>>
>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>>> index 23bfd809dc8c..ac6492e36978 100644
>>>>>> --- a/mm/memory-failure.c
>>>>>> +++ b/mm/memory-failure.c
>>>>>> @@ -1792,6 +1792,20 @@ int memory_failure(unsigned long pfn, int flags)
>>>>>>      }
>>>>>>
>>>>>>      if (PageTransHuge(hpage)) {
>>>>>> +            /*
>>>>>> +             * The non-lru compound movable pages could be taken for
>>>>>> +             * transhuge pages. Also huge zero page could reach here
>>>>>> +             * and if we ever try to split it, the VM_BUG_ON_PAGE will
>>>>>> +             * be triggered in split_huge_page_to_list(). Skip these
>>>>>> +             * pages by checking PageLRU because huge zero page isn't
>>>>>> +             * lru page as non-lru compound movable pages.
>>>>>> +             */
>>>>>> +            if (!PageLRU(hpage)) {
>>>>>> +                    put_page(p);
>>>>>> +                    action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
>>>>>> +                    res = -EBUSY;
>>>>>> +                    goto unlock_mutex;
>>>>>> +            }
>>>>>>              /*
>>>>>>               * The flag must be set after the refcount is bumped
>>>>>>               * otherwise it may race with THP split.
>>>>>> --
>>>>>> 2.23.0
>>>>
>>>>
>>> .
>>>
>>
> .
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list
  2022-03-10 11:46             ` Miaohe Lin
@ 2022-03-10 19:32               ` Yang Shi
  2022-03-11  1:54                 ` Miaohe Lin
  0 siblings, 1 reply; 27+ messages in thread
From: Yang Shi @ 2022-03-10 19:32 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: HORIGUCHI NAOYA(堀口 直也),
	akpm, linux-mm, linux-kernel

On Thu, Mar 10, 2022 at 3:46 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>
> On 2022/3/9 2:47, Yang Shi wrote:
> > On Tue, Mar 8, 2022 at 4:36 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
> >>
> >> On 2022/3/8 3:53, Yang Shi wrote:
> >>> On Sun, Mar 6, 2022 at 11:07 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
> >>>>
> >>>> On 2022/3/4 16:28, HORIGUCHI NAOYA(堀口 直也) wrote:
> >>>>> On Mon, Feb 28, 2022 at 10:02:45PM +0800, Miaohe Lin wrote:
> >>>>>> The huge zero page could reach here, and if we ever try to split it, the
> >>>>>> VM_BUG_ON_PAGE will be triggered in split_huge_page_to_list(). Also,
> >>>>>> non-lru compound movable pages could be mistaken for transhuge pages.
> >>>>>> Skip these pages by checking PageLRU, because neither the huge zero page
> >>>>>> nor non-lru compound movable pages are lru pages.
> >>>>>
> >>>>> It seems that memory_failure() also fails at get_any_page() with "hwpoison:
> >>>>> unhandlable page" message.
> >>>>>
> >>>>>   [16478.203474] page:00000000b6acdbd1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1810b4
> >>>>>   [16478.206612] flags: 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
> >>>>>   [16478.209411] raw: 0057ffffc0801000 fffff11bc6042d08 fffff11bc6042d08 0000000000000000
> >>>>>   [16478.211921] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> >>>>>   [16478.214473] page dumped because: hwpoison: unhandlable page
> >>>>>   [16478.216386] Memory failure: 0x1810b4: recovery action for unknown page: Ignored
> >>>>>
> >>>>> We can't handle errors on huge (or normal) zero page, so the current
> >>>>
> >>>> Sorry for the confusing commit log again. I should have had a coffee before
> >>>> making this patch. Huge or normal zero pages will fail at get_any_page
> >>>> because they're neither HWPoisonHandlable nor PageHuge.
> >>>>
> >>>>> behavior seems to me more suitable than "unsplit thp".
> >>>>>
> >>>>> Or if you have some reproducer that reaches the following path with the
> >>>>> huge zero page, could you share it?
> >>>>>
> >>>>
> >>>> What I mean is that a non-lru movable compound page can reach here
> >>>> unexpectedly, because __PageMovable(page) is handlable now. So get_any_page
> >>>> can succeed in grabbing the page refcount. And since it's a compound page,
> >>>> it will go through split_huge_page_to_list, because PageTransHuge checks
> >>>> PageHead(page), which can also be true for such a compound page. But this
> >>>> type of page is unexpected for split_huge_page_to_list.
> >>>
> >>> Can we really handle non-LRU movable pages in memory failure
> >>> (uncorrectable errors)? Typically they are balloon, zsmalloc, etc.
> >>> Assuming we run into a base (4K) non-LRU movable page, we could reach
> >>> as far as identify_page_state(); it should not fall into any category
> >>> except me_unknown. So it seems we could just simply make it
> >>> unhandlable.
> >>
> >> There is the comment from memory_failure:
> >>         /*
> >>          * We ignore non-LRU pages for good reasons.
> >>          * - PG_locked is only well defined for LRU pages and a few others
> >>          * - to avoid races with __SetPageLocked()
> >>          * - to avoid races with __SetPageSlab*() (and more non-atomic ops)
> >>          * The check (unnecessarily) ignores LRU pages being isolated and
> >>          * walked by the page reclaim code, however that's not a big loss.
> >>          */
> >>
> >> So we could not handle non-LRU movable pages.
> >>
> >> Do you mean something like the below?
> >>
> >> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >> index 5444a8ef4867..d80dbe0f20b6 100644
> >> --- a/mm/memory-failure.c
> >> +++ b/mm/memory-failure.c
> >> @@ -1784,6 +1784,13 @@ int memory_failure(unsigned long pfn, int flags)
> >>                 }
> >>         }
> >>
> >> +       if (__PageMovable(hpage)) {
> >> +               put_page(p);
> >> +               action_result(pfn, MF_MSG_MOVABLE_PAGE, MF_IGNORED);
> >> +               res = -EBUSY;
> >> +               goto unlock_mutex;
> >> +       }
> >> +
> >>         if (PageTransHuge(hpage)) {
> >>                 /*
> >>                  * The flag must be set after the refcount is bumped
> >>
> >>
> >> i.e. simply make non-LRU movable pages unhandlable?
> >
>
> I thought about the below code more carefully and found that it would make
> hwpoison_filter unable to handle non-LRU movable pages: they now return early
> and thus never reach hwpoison_filter. This results in behavior inconsistent
> with the previous one. So I think the original fix in this patch is more
> suitable. What do you think?

I'm not familiar with hwpoison_filter(); it seems like a test helper for
error injection. I don't think hwpoison_filter() is used to filter
unhandlable pages (for example, slab pages), IIUC. So the non-LRU movable
pages should be treated the same. If so, the old behavior was simply
wrong.
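
From a quick look, it is just a debugfs-driven predicate, roughly like this
(quoted from v5.17 mm/memory-failure.c from memory, so double-check the
details; the knobs live under debugfs hwpoison/corrupt-filter-*):

int hwpoison_filter(struct page *p)
{
	if (!hwpoison_filter_enable)
		return 0;

	if (hwpoison_filter_dev(p))
		return -EINVAL;

	if (hwpoison_filter_flags(p))
		return -EINVAL;

	if (hwpoison_filter_task(p))
		return -EINVAL;

	return 0;
}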

>
> Thanks.
>
> > I'd prefer this personally. Something like the below (compile test only):
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index 5444a8ef4867..789e40909ade 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -1176,12 +1176,18 @@ void ClearPageHWPoisonTakenOff(struct page *page)
> >   * does not return true for hugetlb or device memory pages, so it's assumed
> >   * to be called only in the context where we never have such pages.
> >   */
> > -static inline bool HWPoisonHandlable(struct page *page)
> > +static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
> >  {
> > -	return PageLRU(page) || __PageMovable(page) || is_free_buddy_page(page);
> > +	bool movable = false;
> > +
> > +	/* Soft offline could migrate non-LRU movable pages */
> > +	if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page))
> > +		movable = true;
> > +
> > +	return movable || PageLRU(page) || is_free_buddy_page(page);
> >  }
> > 
> > -static int __get_hwpoison_page(struct page *page)
> > +static int __get_hwpoison_page(struct page *page, unsigned long flags)
> >  {
> >  	struct page *head = compound_head(page);
> >  	int ret = 0;
> > @@ -1196,7 +1202,7 @@ static int __get_hwpoison_page(struct page *page)
> >  	 * for any unsupported type of page in order to reduce the risk of
> >  	 * unexpected races caused by taking a page refcount.
> >  	 */
> > -	if (!HWPoisonHandlable(head))
> > +	if (!HWPoisonHandlable(head, flags))
> >  		return -EBUSY;
> > 
> >  	if (get_page_unless_zero(head)) {
> > @@ -1221,7 +1227,7 @@ static int get_any_page(struct page *p, unsigned long flags)
> > 
> >  try_again:
> >  	if (!count_increased) {
> > -		ret = __get_hwpoison_page(p);
> > +		ret = __get_hwpoison_page(p, flags);
> >  		if (!ret) {
> >  			if (page_count(p)) {
> >  				/* We raced with an allocation, retry. */
> > @@ -1249,7 +1255,7 @@ static int get_any_page(struct page *p, unsigned long flags)
> >  		}
> >  	}
> > 
> > -	if (PageHuge(p) || HWPoisonHandlable(p)) {
> > +	if (PageHuge(p) || HWPoisonHandlable(p, flags)) {
> >  		ret = 1;
> >  	} else {
> >  		/*
> >
> >>
> >>>
> >>> But it should be handlable for soft-offline since it could be migrated.
> >>>
> >>
> >> Yes, non-LRU movable pages can be simply migrated.
> >>
> >> Many thanks.
> >>
> >>>
> >>>> Does this make sense to you? Thanks, Naoya.
> >>>>
> >>>>> Thanks,
> >>>>> Naoya Horiguchi
> >>>>>
> >>>>>>
> >>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
> >>>>>> ---
> >>>>>>  mm/memory-failure.c | 14 ++++++++++++++
> >>>>>>  1 file changed, 14 insertions(+)
> >>>>>>
> >>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>>>>> index 23bfd809dc8c..ac6492e36978 100644
> >>>>>> --- a/mm/memory-failure.c
> >>>>>> +++ b/mm/memory-failure.c
> >>>>>> @@ -1792,6 +1792,20 @@ int memory_failure(unsigned long pfn, int flags)
> >>>>>>      }
> >>>>>>
> >>>>>>      if (PageTransHuge(hpage)) {
> >>>>>> +            /*
> >>>>>> +             * The non-lru compound movable pages could be taken for
> >>>>>> +             * transhuge pages. Also huge zero page could reach here
> >>>>>> +             * and if we ever try to split it, the VM_BUG_ON_PAGE will
> >>>>>> +             * be triggered in split_huge_page_to_list(). Skip these
> >>>>>> +             * pages by checking PageLRU because huge zero page isn't
> >>>>>> +             * lru page as non-lru compound movable pages.
> >>>>>> +             */
> >>>>>> +            if (!PageLRU(hpage)) {
> >>>>>> +                    put_page(p);
> >>>>>> +                    action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
> >>>>>> +                    res = -EBUSY;
> >>>>>> +                    goto unlock_mutex;
> >>>>>> +            }
> >>>>>>              /*
> >>>>>>               * The flag must be set after the refcount is bumped
> >>>>>>               * otherwise it may race with THP split.
> >>>>>> --
> >>>>>> 2.23.0
> >>>>
> >>>>
> >>> .
> >>>
> >>
> > .
> >
>
>

^ permalink raw reply	[flat|nested] 27+ messages in thread

* Re: [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list
  2022-03-10 19:32               ` Yang Shi
@ 2022-03-11  1:54                 ` Miaohe Lin
  0 siblings, 0 replies; 27+ messages in thread
From: Miaohe Lin @ 2022-03-11  1:54 UTC (permalink / raw)
  To: Yang Shi
  Cc: HORIGUCHI NAOYA(堀口 直也),
	akpm, linux-mm, linux-kernel

On 2022/3/11 3:32, Yang Shi wrote:
> On Thu, Mar 10, 2022 at 3:46 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>
>> On 2022/3/9 2:47, Yang Shi wrote:
>>> On Tue, Mar 8, 2022 at 4:36 AM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>
>>>> On 2022/3/8 3:53, Yang Shi wrote:
>>>>> On Sun, Mar 6, 2022 at 11:07 PM Miaohe Lin <linmiaohe@huawei.com> wrote:
>>>>>>
>>>>>> On 2022/3/4 16:28, HORIGUCHI NAOYA(堀口 直也) wrote:
>>>>>>> On Mon, Feb 28, 2022 at 10:02:45PM +0800, Miaohe Lin wrote:
>>>>>>>> The huge zero page could reach here, and if we ever try to split it, the
>>>>>>>> VM_BUG_ON_PAGE will be triggered in split_huge_page_to_list(). Also,
>>>>>>>> non-lru compound movable pages could be mistaken for transhuge pages.
>>>>>>>> Skip these pages by checking PageLRU, because neither the huge zero page
>>>>>>>> nor non-lru compound movable pages are lru pages.
>>>>>>>
>>>>>>> It seems that memory_failure() also fails at get_any_page() with "hwpoison:
>>>>>>> unhandlable page" message.
>>>>>>>
>>>>>>>   [16478.203474] page:00000000b6acdbd1 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x1810b4
>>>>>>>   [16478.206612] flags: 0x57ffffc0801000(reserved|hwpoison|node=1|zone=2|lastcpupid=0x1fffff)
>>>>>>>   [16478.209411] raw: 0057ffffc0801000 fffff11bc6042d08 fffff11bc6042d08 0000000000000000
>>>>>>>   [16478.211921] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>>>>>   [16478.214473] page dumped because: hwpoison: unhandlable page
>>>>>>>   [16478.216386] Memory failure: 0x1810b4: recovery action for unknown page: Ignored
>>>>>>>
>>>>>>> We can't handle errors on huge (or normal) zero page, so the current
>>>>>>
>>>>>> Sorry for the confusing commit log again. I should have had a coffee before
>>>>>> making this patch. Huge or normal zero pages will fail at get_any_page
>>>>>> because they're neither HWPoisonHandlable nor PageHuge.
>>>>>>
>>>>>>> behavior seems to me more suitable than "unsplit thp".
>>>>>>>
>>>>>>> Or if you have some reproducer that reaches the following path with the
>>>>>>> huge zero page, could you share it?
>>>>>>>
>>>>>>
>>>>>> What I mean is that a non-lru movable compound page can reach here
>>>>>> unexpectedly, because __PageMovable(page) is handlable now. So get_any_page
>>>>>> can succeed in grabbing the page refcount. And since it's a compound page,
>>>>>> it will go through split_huge_page_to_list, because PageTransHuge checks
>>>>>> PageHead(page), which can also be true for such a compound page. But this
>>>>>> type of page is unexpected for split_huge_page_to_list.
>>>>>
>>>>> Can we really handle non-LRU movable pages in memory failure
>>>>> (uncorrectable errors)? Typically they are balloon, zsmalloc, etc.
>>>>> Assuming we run into a base (4K) non-LRU movable page, we could reach
>>>>> as far as identify_page_state(); it should not fall into any category
>>>>> except me_unknown. So it seems we could just simply make it
>>>>> unhandlable.
>>>>
>>>> There is the comment from memory_failure:
>>>>         /*
>>>>          * We ignore non-LRU pages for good reasons.
>>>>          * - PG_locked is only well defined for LRU pages and a few others
>>>>          * - to avoid races with __SetPageLocked()
>>>>          * - to avoid races with __SetPageSlab*() (and more non-atomic ops)
>>>>          * The check (unnecessarily) ignores LRU pages being isolated and
>>>>          * walked by the page reclaim code, however that's not a big loss.
>>>>          */
>>>>
>>>> So we could not handle non-LRU movable pages.
>>>>
>>>> Do you mean something like the below?
>>>>
>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>> index 5444a8ef4867..d80dbe0f20b6 100644
>>>> --- a/mm/memory-failure.c
>>>> +++ b/mm/memory-failure.c
>>>> @@ -1784,6 +1784,13 @@ int memory_failure(unsigned long pfn, int flags)
>>>>                 }
>>>>         }
>>>>
>>>> +       if (__PageMovable(hpage)) {
>>>> +               put_page(p);
>>>> +               action_result(pfn, MF_MSG_MOVABLE_PAGE, MF_IGNORED);
>>>> +               res = -EBUSY;
>>>> +               goto unlock_mutex;
>>>> +       }
>>>> +
>>>>         if (PageTransHuge(hpage)) {
>>>>                 /*
>>>>                  * The flag must be set after the refcount is bumped
>>>>
>>>>
>>>> i.e. simply make non-LRU movable pages unhandlable?
>>>
>>
>> I thought about the below code more carefully and found that it would make
>> hwpoison_filter unable to handle non-LRU movable pages: they now return early
>> and thus never reach hwpoison_filter. This results in behavior inconsistent
>> with the previous one. So I think the original fix in this patch is more
>> suitable. What do you think?
> 
> I'm not familiar with hwpoison_filter(); it seems like a test helper for
> error injection. I don't think hwpoison_filter() is used to filter
> unhandlable pages (for example, slab pages), IIUC. So the non-LRU movable
> pages should be treated the same. If so, the old behavior was simply
> wrong.

I think you're right. hwpoison_filter should only filter errors that are handlable in the first place.

Thanks.

> 
>>
>> Thanks.
>>
>>> I'd prefer this personally. Something like the below (compile test only):
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index 5444a8ef4867..789e40909ade 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -1176,12 +1176,18 @@ void ClearPageHWPoisonTakenOff(struct page *page)
>>>   * does not return true for hugetlb or device memory pages, so it's assumed
>>>   * to be called only in the context where we never have such pages.
>>>   */
>>> -static inline bool HWPoisonHandlable(struct page *page)
>>> +static inline bool HWPoisonHandlable(struct page *page, unsigned long flags)
>>>  {
>>> -	return PageLRU(page) || __PageMovable(page) || is_free_buddy_page(page);
>>> +	bool movable = false;
>>> +
>>> +	/* Soft offline could migrate non-LRU movable pages */
>>> +	if ((flags & MF_SOFT_OFFLINE) && __PageMovable(page))
>>> +		movable = true;
>>> +
>>> +	return movable || PageLRU(page) || is_free_buddy_page(page);
>>>  }
>>> 
>>> -static int __get_hwpoison_page(struct page *page)
>>> +static int __get_hwpoison_page(struct page *page, unsigned long flags)
>>>  {
>>>  	struct page *head = compound_head(page);
>>>  	int ret = 0;
>>> @@ -1196,7 +1202,7 @@ static int __get_hwpoison_page(struct page *page)
>>>  	 * for any unsupported type of page in order to reduce the risk of
>>>  	 * unexpected races caused by taking a page refcount.
>>>  	 */
>>> -	if (!HWPoisonHandlable(head))
>>> +	if (!HWPoisonHandlable(head, flags))
>>>  		return -EBUSY;
>>> 
>>>  	if (get_page_unless_zero(head)) {
>>> @@ -1221,7 +1227,7 @@ static int get_any_page(struct page *p, unsigned long flags)
>>> 
>>>  try_again:
>>>  	if (!count_increased) {
>>> -		ret = __get_hwpoison_page(p);
>>> +		ret = __get_hwpoison_page(p, flags);
>>>  		if (!ret) {
>>>  			if (page_count(p)) {
>>>  				/* We raced with an allocation, retry. */
>>> @@ -1249,7 +1255,7 @@ static int get_any_page(struct page *p, unsigned long flags)
>>>  		}
>>>  	}
>>> 
>>> -	if (PageHuge(p) || HWPoisonHandlable(p)) {
>>> +	if (PageHuge(p) || HWPoisonHandlable(p, flags)) {
>>>  		ret = 1;
>>>  	} else {
>>>  		/*
>>>
>>>>
>>>>>
>>>>> But it should be handlable for soft-offline since it could be migrated.
>>>>>
>>>>
>>>> Yes, non-LRU movable pages can be simply migrated.
>>>>
>>>> Many thanks.
>>>>
>>>>>
>>>>>> Does this make sense for you? Thanks Naoya.
>>>>>>
>>>>>>> Thanks,
>>>>>>> Naoya Horiguchi
>>>>>>>
>>>>>>>>
>>>>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>>>>> ---
>>>>>>>>  mm/memory-failure.c | 14 ++++++++++++++
>>>>>>>>  1 file changed, 14 insertions(+)
>>>>>>>>
>>>>>>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>>>>>>> index 23bfd809dc8c..ac6492e36978 100644
>>>>>>>> --- a/mm/memory-failure.c
>>>>>>>> +++ b/mm/memory-failure.c
>>>>>>>> @@ -1792,6 +1792,20 @@ int memory_failure(unsigned long pfn, int flags)
>>>>>>>>      }
>>>>>>>>
>>>>>>>>      if (PageTransHuge(hpage)) {
>>>>>>>> +            /*
>>>>>>>> +             * The non-lru compound movable pages could be taken for
>>>>>>>> +             * transhuge pages. Also huge zero page could reach here
>>>>>>>> +             * and if we ever try to split it, the VM_BUG_ON_PAGE will
>>>>>>>> +             * be triggered in split_huge_page_to_list(). Skip these
>>>>>>>> +             * pages by checking PageLRU because huge zero page isn't
>>>>>>>> +             * lru page as non-lru compound movable pages.
>>>>>>>> +             */
>>>>>>>> +            if (!PageLRU(hpage)) {
>>>>>>>> +                    put_page(p);
>>>>>>>> +                    action_result(pfn, MF_MSG_UNSPLIT_THP, MF_IGNORED);
>>>>>>>> +                    res = -EBUSY;
>>>>>>>> +                    goto unlock_mutex;
>>>>>>>> +            }
>>>>>>>>              /*
>>>>>>>>               * The flag must be set after the refcount is bumped
>>>>>>>>               * otherwise it may race with THP split.
>>>>>>>> --
>>>>>>>> 2.23.0
>>>>>>
>>>>>>
>>>>> .
>>>>>
>>>>
>>> .
>>>
>>
>>
> .
> 


^ permalink raw reply	[flat|nested] 27+ messages in thread

end of thread, other threads:[~2022-03-11  1:55 UTC | newest]

Thread overview: 27+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-02-28 14:02 [PATCH 0/4] A few fixup patches for memory failure Miaohe Lin
2022-02-28 14:02 ` [PATCH 1/4] mm/memory-failure.c: fix race with changing page compound again Miaohe Lin
2022-03-04  8:26   ` HORIGUCHI NAOYA(堀口 直也)
2022-03-04 19:32     ` Mike Kravetz
2022-03-07  3:44       ` Miaohe Lin
2022-03-07  7:01         ` HORIGUCHI NAOYA(堀口 直也)
2022-03-07 19:07           ` Mike Kravetz
2022-03-08  6:56             ` HORIGUCHI NAOYA(堀口 直也)
2022-03-08 11:28               ` Miaohe Lin
2022-02-28 14:02 ` [PATCH 2/4] mm/memory-failure.c: fix wrong user reference report Miaohe Lin
2022-03-04  8:27   ` HORIGUCHI NAOYA(堀口 直也)
2022-03-07 11:26     ` Miaohe Lin
2022-03-07 20:14       ` Yang Shi
2022-03-08 13:11         ` Miaohe Lin
2022-03-08 18:51           ` Yang Shi
2022-03-09  8:30             ` Miaohe Lin
2022-02-28 14:02 ` [PATCH 3/4] mm/memory-failure.c: avoid calling invalidate_inode_page() with unexpected pages Miaohe Lin
2022-02-28 14:02 ` [PATCH 4/4] mm/memory-failure.c: fix potential VM_BUG_ON_PAGE in split_huge_page_to_list Miaohe Lin
2022-03-04  8:28   ` HORIGUCHI NAOYA(堀口 直也)
2022-03-07  7:07     ` Miaohe Lin
2022-03-07 19:53       ` Yang Shi
2022-03-08 12:36         ` Miaohe Lin
2022-03-08 18:47           ` Yang Shi
2022-03-09  8:45             ` Miaohe Lin
2022-03-10 11:46             ` Miaohe Lin
2022-03-10 19:32               ` Yang Shi
2022-03-11  1:54                 ` Miaohe Lin
