From mboxrd@z Thu Jan  1 00:00:00 1970
Return-Path: <owner-linux-mm@kvack.org>
X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on
	aws-us-west-2-korg-lkml-1.web.codeaurora.org
Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17])
	by smtp.lore.kernel.org (Postfix) with ESMTP id AEA6EC433EF
	for <linux-mm@archiver.kernel.org>; Tue, 26 Apr 2022 11:41:58 +0000 (UTC)
Received: by kanga.kvack.org (Postfix)
	id 2E55F6B0073; Tue, 26 Apr 2022 07:41:58 -0400 (EDT)
Received: by kanga.kvack.org (Postfix, from userid 40)
	id 26D4F6B0074; Tue, 26 Apr 2022 07:41:58 -0400 (EDT)
X-Delivered-To: int-list-linux-mm@kvack.org
Received: by kanga.kvack.org (Postfix, from userid 63042)
	id 0E7D36B0075; Tue, 26 Apr 2022 07:41:58 -0400 (EDT)
X-Delivered-To: linux-mm@kvack.org
Received: from relay.hostedemail.com (relay.hostedemail.com [64.99.140.28])
	by kanga.kvack.org (Postfix) with ESMTP id EE1056B0073
	for <linux-mm@kvack.org>; Tue, 26 Apr 2022 07:41:57 -0400 (EDT)
Received: from smtpin05.hostedemail.com (a10.router.float.18 [10.200.18.1])
	by unirelay13.hostedemail.com (Postfix) with ESMTP id CBD8160FF2
	for <linux-mm@kvack.org>; Tue, 26 Apr 2022 11:41:57 +0000 (UTC)
X-FDA: 79398841074.05.F53EF18
Received: from szxga03-in.huawei.com (szxga03-in.huawei.com [45.249.212.189])
	by imf16.hostedemail.com (Postfix) with ESMTP id 469BF18005D
	for <linux-mm@kvack.org>; Tue, 26 Apr 2022 11:41:53 +0000 (UTC)
Received: from canpemm500002.china.huawei.com (unknown [172.30.72.55])
	by szxga03-in.huawei.com (SkyGuard) with ESMTP id 4KnfxZ01fvzCs2d;
	Tue, 26 Apr 2022 19:37:21 +0800 (CST)
Received: from [10.174.177.76] (10.174.177.76) by
 canpemm500002.china.huawei.com (7.192.104.244) with Microsoft SMTP Server
 (version=TLS1_2, cipher=TLS_ECDHE_RSA_WITH_AES_128_GCM_SHA256) id
 15.1.2375.24; Tue, 26 Apr 2022 19:41:53 +0800
Subject: Re: [PATCH] mm/memory_hotplug: avoid consuming corrupted data when
 offline pages
To: =?UTF-8?B?SE9SSUdVQ0hJIE5BT1lBKOWggOWPoyDnm7TkuZ8p?=
	<naoya.horiguchi@nec.com>
CC: Mike Kravetz <mike.kravetz@oracle.com>, David Hildenbrand
	<david@redhat.com>, "akpm@linux-foundation.org" <akpm@linux-foundation.org>,
	"linux-mm@kvack.org" <linux-mm@kvack.org>, "linux-kernel@vger.kernel.org"
	<linux-kernel@vger.kernel.org>, Oscar Salvador <osalvador@suse.de>
References: <20220421135129.19767-1-linmiaohe@huawei.com>
 <8fae51af-b12f-4232-1330-54f7b0943907@redhat.com>
 <bb011aaf-a45c-c0a8-5e5f-211900d17f19@oracle.com>
 <5b3ba7f6-64c3-4979-1791-2c8fc49e37fb@huawei.com>
 <20220425135555.GA3767079@hori.linux.bs1.fc.nec.co.jp>
From: Miaohe Lin <linmiaohe@huawei.com>
Message-ID: <3613e166-41a9-62b2-81a3-c775b531a9c7@huawei.com>
Date: Tue, 26 Apr 2022 19:41:52 +0800
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:78.0) Gecko/20100101
 Thunderbird/78.6.0
MIME-Version: 1.0
In-Reply-To: <20220425135555.GA3767079@hori.linux.bs1.fc.nec.co.jp>
Content-Type: text/plain; charset="utf-8"
Content-Language: en-US
Content-Transfer-Encoding: 8bit
X-Originating-IP: [10.174.177.76]
X-ClientProxiedBy: dggems702-chm.china.huawei.com (10.3.19.179) To
 canpemm500002.china.huawei.com (7.192.104.244)
X-CFilter-Loop: Reflected
X-Rspamd-Queue-Id: 469BF18005D
X-Stat-Signature: cs8qoh3pbhot3cyowfdssjrqc1g6re1x
X-Rspam-User: 
Authentication-Results: imf16.hostedemail.com;
	dkim=none;
	dmarc=pass (policy=quarantine) header.from=huawei.com;
	spf=pass (imf16.hostedemail.com: domain of linmiaohe@huawei.com designates 45.249.212.189 as permitted sender) smtp.mailfrom=linmiaohe@huawei.com
X-Rspamd-Server: rspam09
X-HE-Tag: 1650973313-629909
X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4
Sender: owner-linux-mm@kvack.org
Precedence: bulk
X-Loop: owner-majordomo@kvack.org
List-ID: <linux-mm.kvack.org>

On 2022/4/25 21:55, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Fri, Apr 22, 2022 at 02:53:48PM +0800, Miaohe Lin wrote:
>> On 2022/4/22 6:04, Mike Kravetz wrote:
>>> On 4/21/22 07:20, David Hildenbrand wrote:
>>>> On 21.04.22 15:51, Miaohe Lin wrote:
>>>>> When trying to offline pages, HWPoisoned hugepage is migrated without
>>>>> checking PageHWPoison first. So corrupted data could be consumed. Fix
>>>>> it by deferring isolate_huge_page until PageHWPoison is handled.
>>>>>
>>>>
>>>> CCing Oscar, Mike and Naoya
>>>>
>>>>> Signed-off-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>> ---
>>>>>  mm/memory_hotplug.c | 11 +++++++----
>>>>>  1 file changed, 7 insertions(+), 4 deletions(-)
>>>>>
>>>>> diff --git a/mm/memory_hotplug.c b/mm/memory_hotplug.c
>>>>> index 4c6065e5d274..093f85ec5c5c 100644
>>>>> --- a/mm/memory_hotplug.c
>>>>> +++ b/mm/memory_hotplug.c
>>>>> @@ -1600,11 +1600,9 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>>>>>  		folio = page_folio(page);
>>>>>  		head = &folio->page;
>>>>>  
>>>>> -		if (PageHuge(page)) {
>>>>> +		if (PageHuge(page))
>>>>>  			pfn = page_to_pfn(head) + compound_nr(head) - 1;
>>>>> -			isolate_huge_page(head, &source);
>>>>> -			continue;
>>>>> -		} else if (PageTransHuge(page))
>>>>> +		else if (PageTransHuge(page))
>>>>>  			pfn = page_to_pfn(head) + thp_nr_pages(page) - 1;
>>>>>  
>>>>>  		/*
>>>>> @@ -1622,6 +1620,11 @@ do_migrate_range(unsigned long start_pfn, unsigned long end_pfn)
>>>>>  			continue;
>>>>>  		}
>>>>>  
>>>>> +		if (PageHuge(page)) {
>>>>> +			isolate_huge_page(head, &source);
>>>>> +			continue;
>>>>> +		}
>>>>> +
>>>>>  		if (!get_page_unless_zero(page))
>>>>>  			continue;
>>>>>  		/*
>>>>
>>>> The problem statement makes sense to me but I am not sure about the
>>>> details if we run into the "PageHWPoison" path with a huge page. I have
>>>> the gut feeling that we have to do more for huge pages in the
>>>> PageHWPoison() path, because we might be dealing with a free huge page
>>>> after unmap succeeds. I might be wrong.
>>>>
>>>
>>> Thinking about memory errors always makes my head hurt :)
>>
>> Me too. ;)
>>
>>>
>>> What 'state' could a poisoned hugetlb page be in here?
>>> - Mapped into a process address space?
>>> - On the hugetlb free lists?
> 
> There seems at least 2 cases where a hwpoisoned hugetlb page keeps undissolved.
>   (1) Hwpoisoned hugepage in free hugepage list
>   (2) Hwpoisoned hugepage belonging to a hugetlbfs file
> 
>>>
>>> IIUC, when poisoning a hugetlb page we try to dissolve those that are
>>> free and unmap those which are mapped.  So, this means those operations
>>> must have failed for some reason.  Is that correct?
> 
> Yes, (1) is made by failure in error handling, but (2) can be made even
> when memory error on in-use hugepage is successfully handled.

Could you please explain 2 more detailed? IIUC, in-use hugepage belonging to a hugetlbfs
file will be removed from pagecache via hugetlbfs_error_remove_page. So when successfully
handled, hugepage->mapping will be NULL and thus not belonging to a hugetlbfs file. Or am
I miss something?

> 
>>
>> I think you're right.
>>
>> BTW: Now that PageHWPoison is set under hugetlb_lock and HPageFreed is also checked
>> under that lock, it should be more likely to handle the hugetlb page on the free lists
>> successfully because the possible race with hugetlb page allocation or demotion can be
>> prevented by PageHWPoison flag. So it is more likely to be a mapped hugetlb page here.
>> But that doesn't matter.
>>
>>>
>>> Now, IF the above is true this implies there is a poisoned page somewhere
>>> within the hugetlb page.  But, poison is only marked in the head page.
>>> So, we do not really know where within the page the actual error exists.
>>> Is my reasoning still correct?
>>>
>>
>> IIRC, Naoya once proposed to store the poisoned page info into subpage field.
> 
> Yes, it's important to keep the offset of raw page.  I have a "storing raw
> error info" patch for this, but it's still immature to submit.
> 
>>
>>> If my reasoning is correct, then I am not sure what we can do here.
>>> The code to handle poison is:
>>>
>>>                  if (PageHWPoison(page)) {
>>>                         if (WARN_ON(folio_test_lru(folio)))
>>>                                 folio_isolate_lru(folio);
>>>                         if (folio_mapped(folio))
>>>                                 try_to_unmap(folio, TTU_IGNORE_MLOCK);
>>>                         continue;
>>>                 }
>>>
>>> My first observation is that if a hugetlb page is passed to try_to_unmap
>>> as above we may BUG.  This is because we need to hold i_mmap_rwsem in
>>> write mode because of the possibility of calling huge_pmd_unshare.  :(
>>
>> I'm sorry. I missed that case. :(
> 
> OK, so the above check should not be called for hugetlb pages.
> 
>>
>>>
>>> I 'think' try_to_unmap could succeed on a poisoned hugetlb page once we
>>> add correct locking.  So, the page would be unmapped.  I do not think anyone
>>> is holding a ref, so the last unmap should put the hugetlb page on the
>>> free list.  Correct?  We will not migrate the page, but ...
> 
> That seems depends on whether the hugepage is anonymous or file.
> If we consider anonymous hugepage, the above statement is correct.
> 
> As for what we can do, for (1) we can simply skip the hwpoisoned hugepage
> in do_migrate_range() (i.e. no need to call isolate_huge_page()), because in
> the outer while-loop in offline_pages() we call dissolve_free_huge_pages()
> so free hugepage can be handled here.  The above "storing raw error info"
> patch will also enable dissolve_free_huge_pages() to handle hwpoisoned free
> hugepage (by moving PG_hwpoison flag to the raw error page), so I assume
> to have the patch.

Yes, I think (1) will be fixed this way.

> 
> As for (2), I think that making such hugepages unmovable (by clearing
> HPageMigratable) could work.  With that change, scan_movable_pages() detects
> unmovable pages so do_migrate_range is not called for the hugepages.
> Moreover, scan_movable_pages() should return -EBUSY if an unmovable page is found,
> so offline_pages() fails without trying to touch the hwpoisoned hugepage.

Yes, it seems work but this will lead to the memory offline fails. Could this be too overkill?

> Anyway the reported problem (try to migrate hwpoisoned hugepage) would to be solved.
> 
> # Maybe we might optionally try to free the hwpoisoned hugepage by removing
> # the affected hugetlbfs file, but it's not unclear to me taht that's acceptable
> # (a file is gone when doing memory hotremove?).
> 
>>
>> Might hugetlb page still be in the pagecache? But yes, the last unmap could put
>> the hugetlb page on the free list.
>>
>>>
>>> After the call to do_migrate_range() in offline_pages, we will call
>>> dissolve_free_huge_pages.  For each hugetlb page, dissolve_free_huge_pages
>>> will call dissolve_free_huge_page likely passing in the 'head' page.
>>> When dissolve_free_huge_page is called for poisoned hugetlb pages from
>>> the memory error handling code, it passes in the sub page which contains
>>> the memory error.  Before freeing the hugetlb page to buddy, there is this
>>> code:
>>>
>>>                         /*
>>>                          * Move PageHWPoison flag from head page to the raw
>>>                          * error page, which makes any subpages rather than
>>>                          * the error page reusable.
>>>                          */
>>>                         if (PageHWPoison(head) && page != head) {
>>>                                 SetPageHWPoison(page);
>>>                                 ClearPageHWPoison(head);
>>>                         }
>>>                         update_and_free_page(h, head, false)
>>>
>>> As previously mentioned, outside the memory error handling code we do
>>> not know what page within the hugetlb page contains poison.  So, this
>>> will likely result with a page with errors on the free list and an OK
>>> page marked with error.
>>
>> Yes, this is possible. It's really a pity. Could we just reject to dissolve the
>> hugetlb page and keep the hugetlb page around? This will waste some healthy subpages
>> but can do the right things?
> 
> Could you let me try writing patches in my approach?
> I'll try to submit the fix with the "storing raw error info" patch.

Sure, I will wait for that. Many thanks for your hard work! :)

> 
>>
>>>
>>> In other places, we 'bail' if we encounter a hugetlb page with poison.
>>> It would be unfortunate to prevent memory offline if the range contains
>>> a hugetlb page with poison.  After all, offlining a section with poison
>>> make sense.
>>>
>>
>> IIUC, a (hugetlb) page with poison could be offlined anytime even if it's still in the
>> pagecache, swapcache, i.e. still be referenced by somewhere, because memory offline will
>> just offline the poisoned page while ignoring the page refcnt totally. I can't figure
>> out a easy way to fix it yet. :(
> 
> Yes, that's finally expected behavior, but unlike normal pagecache we need
> pay extra attention for hugetlbfs file because it's not storage-backed.
> 
> Thank you for valuable comments, everyone.

BTW: There is another possible problem about hwpoison. Think about the below scene:

offline_pages
do {
  scan_movable_pages
    __PageMovable /* Page is also hwpoisoned. */
      goto found; /* So ret is always 0. */
  do_migrate_range
     if (PageHWPoison(page)) /* PageMovable hwpoisoned page is not handled. */
        continue;
 } while (!ret); /* We will meet that page again and again. */

So we might busy-loop until a signal is pending. Or am I miss something? And I think this could be fixed by
deferring set hwpoisoned flag until page refcnt is successfully incremented for normal page in memory_failure()
which is already in your to-do list. So this possible issue could be left alone now?

Thanks!

> 
> - Naoya Horiguchi
>