All of lore.kernel.org
 help / color / mirror / Atom feed
* Re: [v5 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly
@ 2021-11-01 19:38 Jue Wang
  2021-11-01 20:11 ` Yang Shi
  0 siblings, 1 reply; 7+ messages in thread
From: Jue Wang @ 2021-11-01 19:38 UTC (permalink / raw)
  To: Yang Shi, Hugh Dickins
  Cc: Andrew Morton, kirill.shutemov, linux-fsdevel, LKML, Linux MM,
	HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, Peter Xu, Matthew Wilcox

A related bug but whose fix may belong to a separate series:

split_huge_page fails when invoked concurrently on the same THP page.

It's possible that multiple memory errors on the same THP get consumed
by multiple threads and come down to split_huge_page path easily.

Thanks,
-Jue

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [v5 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly
  2021-11-01 19:38 [v5 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly Jue Wang
@ 2021-11-01 20:11 ` Yang Shi
  2021-11-01 21:57   ` Jue Wang
  2021-11-02  3:49   ` Matthew Wilcox
  0 siblings, 2 replies; 7+ messages in thread
From: Yang Shi @ 2021-11-01 20:11 UTC (permalink / raw)
  To: Jue Wang
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov,
	Linux FS-devel Mailing List, LKML, Linux MM,
	HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, Peter Xu, Matthew Wilcox

On Mon, Nov 1, 2021 at 12:38 PM Jue Wang <juew@google.com> wrote:
>
> A related bug but whose fix may belong to a separate series:
>
> split_huge_page fails when invoked concurrently on the same THP page.
>
> It's possible that multiple memory errors on the same THP get consumed
> by multiple threads and come down to split_huge_page path easily.

Yeah, I think it should be a known problem since the very beginning.
The THP split requires to pin the page and does check if the refcount
is expected or not and freezes the refcount if it is expected. So if
two concurrent paths try to split the same THP, one will fail due to
the pin from the other path, but the other one will succeed.

I don't think of a better way to remediate it other than retrying from
the very start off the top of my head. We can't simply check if it is
still a THP or not since THP split will just move the refcount pin to
the poisoned subpage so the retry path will lose the refcount for its
poisoned subpage.

Did you run into this problem on any real production environment? Or
it is just a artificial test case? I'm wondering if the extra
complexity is worth or not.

>
> Thanks,
> -Jue

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [v5 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly
  2021-11-01 20:11 ` Yang Shi
@ 2021-11-01 21:57   ` Jue Wang
  2021-11-02  3:10     ` HORIGUCHI NAOYA(堀口 直也)
  2021-11-02  3:49   ` Matthew Wilcox
  1 sibling, 1 reply; 7+ messages in thread
From: Jue Wang @ 2021-11-01 21:57 UTC (permalink / raw)
  To: Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov,
	Linux FS-devel Mailing List, LKML, Linux MM,
	HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, Peter Xu, Matthew Wilcox

[-- Attachment #1: Type: text/plain, Size: 1716 bytes --]

On Mon, Nov 1, 2021 at 1:11 PM Yang Shi <shy828301@gmail.com> wrote:

> On Mon, Nov 1, 2021 at 12:38 PM Jue Wang <juew@google.com> wrote:
> >
> > A related bug but whose fix may belong to a separate series:
> >
> > split_huge_page fails when invoked concurrently on the same THP page.
> >
> > It's possible that multiple memory errors on the same THP get consumed
> > by multiple threads and come down to split_huge_page path easily.
>
> Yeah, I think it should be a known problem since the very beginning.
> The THP split requires to pin the page and does check if the refcount
> is expected or not and freezes the refcount if it is expected. So if
> two concurrent paths try to split the same THP, one will fail due to
> the pin from the other path, but the other one will succeed.
>

The failed thread will result in a -EBUSY from memory_failure and
SIGBUS sent to the process without context (address, BUS_MCEERR_AR).

This is undesirable for applications who intend to recover from memory
errors.

One possible fix is to recognize such cases and signal properly from
memory_failure.

>
> I don't think of a better way to remediate it other than retrying from
> the very start off the top of my head. We can't simply check if it is
> still a THP or not since THP split will just move the refcount pin to
> the poisoned subpage so the retry path will lose the refcount for its
> poisoned subpage.
>
> Did you run into this problem on any real production environment? Or
> it is just a artificial test case? I'm wondering if the extra
> complexity is worth or not.
>

This can be easily reproduced in artificial test cases.

I'd not surprised if production environment hits this bug.

>
> >
> > Thanks,
> > -Jue
>

[-- Attachment #2: Type: text/html, Size: 2631 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* RE: [v5 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly
  2021-11-01 21:57   ` Jue Wang
@ 2021-11-02  3:10     ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 7+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2021-11-02  3:10 UTC (permalink / raw)
  To: Jue Wang, Yang Shi
  Cc: Hugh Dickins, Andrew Morton, Kirill A. Shutemov,
	Linux FS-devel Mailing List, LKML, Linux MM, Oscar Salvador,
	Peter Xu, Matthew Wilcox

[-- Attachment #1: Type: text/plain, Size: 2105 bytes --]


> On Mon, Nov 1, 2021 at 1:11 PM Yang Shi <shy828301@gmail.com <mailto:shy828301@gmail.com> > wrote:
>       On Mon, Nov 1, 2021 at 12:38 PM Jue Wang <juew@google.com <mailto:juew@google.com> > wrote:
>       >
>       > A related bug but whose fix may belong to a separate series:
>       >
>       > split_huge_page fails when invoked concurrently on the same THP page.
>       >
>       > It's possible that multiple memory errors on the same THP get consumed
>       > by multiple threads and come down to split_huge_page path easily.
>
>       Yeah, I think it should be a known problem since the very beginning.
>       The THP split requires to pin the page and does check if the refcount
>       is expected or not and freezes the refcount if it is expected. So if
>       two concurrent paths try to split the same THP, one will fail due to
>       the pin from the other path, but the other one will succeed.
>
> The failed thread will result in a -EBUSY from memory_failure and
> SIGBUS sent to the process without context (address, BUS_MCEERR_AR).
>
> This is undesirable for applications who intend to recover from memory
> errors.
>
> One possible fix is to recognize such cases and signal properly from
> memory_failure.

Thanks for sharing about the issue.
kill_accessing_process() might be helpful to send SIGBUS with address info
to the process who lost the race.

Thanks,
Naoya Horiguchi

>
>       I don't think of a better way to remediate it other than retrying from
>       the very start off the top of my head. We can't simply check if it is
>       still a THP or not since THP split will just move the refcount pin to
>       the poisoned subpage so the retry path will lose the refcount for its
>       poisoned subpage.
>
>       Did you run into this problem on any real production environment? Or
>       it is just a artificial test case? I'm wondering if the extra
>       complexity is worth or not.
>
> This can be easily reproduced in artificial test cases.
>
> I'd not surprised if production environment hits this bug.

[-- Attachment #2: Type: text/html, Size: 11128 bytes --]

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [v5 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly
  2021-11-01 20:11 ` Yang Shi
  2021-11-01 21:57   ` Jue Wang
@ 2021-11-02  3:49   ` Matthew Wilcox
  2021-11-02 16:41     ` Yang Shi
  1 sibling, 1 reply; 7+ messages in thread
From: Matthew Wilcox @ 2021-11-02  3:49 UTC (permalink / raw)
  To: Yang Shi
  Cc: Jue Wang, Hugh Dickins, Andrew Morton, Kirill A. Shutemov,
	Linux FS-devel Mailing List, LKML, Linux MM,
	HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, Peter Xu

On Mon, Nov 01, 2021 at 01:11:33PM -0700, Yang Shi wrote:
> On Mon, Nov 1, 2021 at 12:38 PM Jue Wang <juew@google.com> wrote:
> >
> > A related bug but whose fix may belong to a separate series:
> >
> > split_huge_page fails when invoked concurrently on the same THP page.
> >
> > It's possible that multiple memory errors on the same THP get consumed
> > by multiple threads and come down to split_huge_page path easily.
> 
> Yeah, I think it should be a known problem since the very beginning.
> The THP split requires to pin the page and does check if the refcount
> is expected or not and freezes the refcount if it is expected. So if
> two concurrent paths try to split the same THP, one will fail due to
> the pin from the other path, but the other one will succeed.

No, they can both fail, if the timing is just right, because they each
have a refcount, so neither will succeed.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* Re: [v5 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly
  2021-11-02  3:49   ` Matthew Wilcox
@ 2021-11-02 16:41     ` Yang Shi
  0 siblings, 0 replies; 7+ messages in thread
From: Yang Shi @ 2021-11-02 16:41 UTC (permalink / raw)
  To: Matthew Wilcox
  Cc: Jue Wang, Hugh Dickins, Andrew Morton, Kirill A. Shutemov,
	Linux FS-devel Mailing List, LKML, Linux MM,
	HORIGUCHI NAOYA(堀口 直也),
	Oscar Salvador, Peter Xu

On Mon, Nov 1, 2021 at 8:50 PM Matthew Wilcox <willy@infradead.org> wrote:
>
> On Mon, Nov 01, 2021 at 01:11:33PM -0700, Yang Shi wrote:
> > On Mon, Nov 1, 2021 at 12:38 PM Jue Wang <juew@google.com> wrote:
> > >
> > > A related bug but whose fix may belong to a separate series:
> > >
> > > split_huge_page fails when invoked concurrently on the same THP page.
> > >
> > > It's possible that multiple memory errors on the same THP get consumed
> > > by multiple threads and come down to split_huge_page path easily.
> >
> > Yeah, I think it should be a known problem since the very beginning.
> > The THP split requires to pin the page and does check if the refcount
> > is expected or not and freezes the refcount if it is expected. So if
> > two concurrent paths try to split the same THP, one will fail due to
> > the pin from the other path, but the other one will succeed.
>
> No, they can both fail, if the timing is just right, because they each
> have a refcount, so neither will succeed.

Oh, yes, if there is race between unlock_page and put_page.

^ permalink raw reply	[flat|nested] 7+ messages in thread

* [v5 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly
  2021-10-20 21:07 [v5 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
@ 2021-10-20 21:07 ` Yang Shi
  0 siblings, 0 replies; 7+ messages in thread
From: Yang Shi @ 2021-10-20 21:07 UTC (permalink / raw)
  To: naoya.horiguchi, hughd, kirill.shutemov, willy, peterx, osalvador, akpm
  Cc: shy828301, linux-mm, linux-fsdevel, linux-kernel

Currently hwpoison doesn't handle non-anonymous THP, but since v4.8 THP
support for tmpfs and read-only file cache has been added.  They could
be offlined by split THP, just like anonymous THP.

Acked-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Signed-off-by: Yang Shi <shy828301@gmail.com>
---
 mm/memory-failure.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 3603a3acf7b3..bd697c64e973 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1443,14 +1443,11 @@ static int identify_page_state(unsigned long pfn, struct page *p,
 static int try_to_split_thp_page(struct page *page, const char *msg)
 {
 	lock_page(page);
-	if (!PageAnon(page) || unlikely(split_huge_page(page))) {
+	if (unlikely(split_huge_page(page))) {
 		unsigned long pfn = page_to_pfn(page);
 
 		unlock_page(page);
-		if (!PageAnon(page))
-			pr_info("%s: %#lx: non anonymous thp\n", msg, pfn);
-		else
-			pr_info("%s: %#lx: thp split failed\n", msg, pfn);
+		pr_info("%s: %#lx: thp split failed\n", msg, pfn);
 		put_page(page);
 		return -EBUSY;
 	}
-- 
2.26.2


^ permalink raw reply related	[flat|nested] 7+ messages in thread

end of thread, other threads:[~2021-11-02 16:42 UTC | newest]

Thread overview: 7+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2021-11-01 19:38 [v5 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly Jue Wang
2021-11-01 20:11 ` Yang Shi
2021-11-01 21:57   ` Jue Wang
2021-11-02  3:10     ` HORIGUCHI NAOYA(堀口 直也)
2021-11-02  3:49   ` Matthew Wilcox
2021-11-02 16:41     ` Yang Shi
  -- strict thread matches above, loose matches on Subject: below --
2021-10-20 21:07 [v5 PATCH 0/6] Solve silent data loss caused by poisoned page cache (shmem/tmpfs) Yang Shi
2021-10-20 21:07 ` [v5 PATCH 6/6] mm: hwpoison: handle non-anonymous THP correctly Yang Shi

This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.