Re: [PATCH v1] mm,hwpoison: set PG_hwpoison for busy hugetlb pages

From: "HORIGUCHI NAOYA(堀口　直也)" <naoya.horiguchi@nec.com>
To: Miaohe Lin <linmiaohe@huawei.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>,
	Naoya Horiguchi <naoya.horiguchi@linux.dev>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>,
	Andrew Morton <akpm@linux-foundation.org>,
	zhenwei pi <pizhenwei@bytedance.com>,
	"linux-kernel@vger.kernel.org" <linux-kernel@vger.kernel.org>
Subject: Re: [PATCH v1] mm,hwpoison: set PG_hwpoison for busy hugetlb pages
Date: Thu, 2 Jun 2022 06:12:05 +0000	[thread overview]
Message-ID: <20220602061203.GB1172281@hori.linux.bs1.fc.nec.co.jp> (raw)
In-Reply-To: <7395dbe7-7be6-6ef7-7728-a118471caa5a@huawei.com>

On Thu, May 12, 2022 at 07:18:51PM +0800, Miaohe Lin wrote:
> On 2022/5/12 12:32, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, May 11, 2022 at 11:35:55AM -0700, Mike Kravetz wrote:
> >> On 5/11/22 08:19, Naoya Horiguchi wrote:
> >>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> >>>
> >>> If memory_failure() fails to grab page refcount on a hugetlb page
> >>> because it's busy, it returns without setting PG_hwpoison on it.
> >>> This not only loses a chance of error containment, but breaks the rule
> >>> that action_result() should be called only when memory_failure() do
> >>> any of handling work (even if that's just setting PG_hwpoison).
> >>> This inconsistency could harm code maintainability.
> >>>
> >>> So set PG_hwpoison and call hugetlb_set_page_hwpoison() for such a case.
> >>>
> >>> Fixes: 405ce051236c ("mm/hwpoison: fix race between hugetlb free/demotion and memory_failure_hugetlb()")
> >>> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> >>> ---
> >>>  include/linux/mm.h  | 1 +
> >>>  mm/memory-failure.c | 8 ++++----
> >>>  2 files changed, 5 insertions(+), 4 deletions(-)
> >>>
> >>> diff --git a/include/linux/mm.h b/include/linux/mm.h
> >>> index d446e834a3e5..04de0c3e4f9f 100644
> >>> --- a/include/linux/mm.h
> >>> +++ b/include/linux/mm.h
> >>> @@ -3187,6 +3187,7 @@ enum mf_flags {
> >>>  	MF_MUST_KILL = 1 << 2,
> >>>  	MF_SOFT_OFFLINE = 1 << 3,
> >>>  	MF_UNPOISON = 1 << 4,
> >>> +	MF_NO_RETRY = 1 << 5,
> >>>  };
> >>>  extern int memory_failure(unsigned long pfn, int flags);
> >>>  extern void memory_failure_queue(unsigned long pfn, int flags);
> >>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> >>> index 6a28d020a4da..e3269b991016 100644
> >>> --- a/mm/memory-failure.c
> >>> +++ b/mm/memory-failure.c
> >>> @@ -1526,7 +1526,8 @@ int __get_huge_page_for_hwpoison(unsigned long pfn, int flags)
> >>>  			count_increased = true;
> >>>  	} else {
> >>>  		ret = -EBUSY;
> >>> -		goto out;
> >>> +		if (!(flags & MF_NO_RETRY))
> >>> +			goto out;
> >>>  	}
> >>
> >> Hi Naoya,
> >>
> >> We are in the else block because !HPageFreed() and !HPageMigratable().
> >> IIUC, this likely means the page is isolated.  One common reason for isolation
> >> is migration.  So, the page could be isolated and on a list for migration.
> > 
> > Yes, and I also detected this issue by testing race between hugepage allocation
> > and memory_failure(). 
> > 
> >>
> >> I took a quick look at the hugetlb migration code and did not see any checks
> >> for PageHWPoison after a hugetlb page is isolated.  I could have missed
> >> something?  If there are no checks, we will read the PageHWPoison page
> >> in kernel mode while copying to the migration target.
> > 
> > Yes, that could happen.  This patch does not affect ongoing hugepage migration.
> > But after the migration source hugepage is freed, the PG_hwpoison should work
> > to prevent reusing.
> > 
> >>
> >> Is this an issue?  Is is something we need to be concerned with?  Memory
> >> errors can happen at any time, and gracefully handling them is best effort.
> > 
> > Right, so doing nothing for this case could be OK if doing something causes
> > some issues or makes code too complicated.  The motivation of this patch is
> > that now I think memory_failure() should do something (at least setting
> > PG_hwpoison) unless the page is already hwpoisoned or rejected by
> > hwpoison_filter(), because of the effect after free as mentioned above.
> > 
> > This is also expected in other case too. For example, slab is a unhandlable
> > type of page, but we do set PG_hwpoison.  This flag should not affect any of
> > ongoing slab-related process, but that's OK because it becomes effective
> > after the slab page is freed.
> > 
> > So this patch is intended to align to the behavior.  Allowing hugepage
> > migration to do something good using PG_hwpoison seems to me an unsolved
> > separate issue.
> 
> I tend to agree with Naoya. And could we try to do it better? IMHO, we could do a
> get_page_unless_zero here to ensure that hugetlb page migration will fail due to
> this extra page reference and thus preventing the page content from being accessed.
> Does this work? Or am I miss something?

Sorry for my missing to answering the question,

Taking page refcount to prevent page migration could work.  One concern is
how we can distinguish hugepages under migration and those under allocation
from buddy.  Maybe this was also mentioned in discussion over 
https://github.com/torvalds/linux/commit/405ce051236cc65b30bbfe490b28ce60ae6aed85
, there's a small window where an allocating compound page is refcounted and
hugepage (having deconstructor COMPOUND_PAGE_DTOR), and not protected by
hugetlb_lock, so simply get_page_unless_zero() might not work (might break
allocation code).
If we have more reliable indicator to ensure that a hugepage is under migration,
that would be helpful.

Thanks,
Naoya Horiguchi