Subject: Re: [PATCH] mm/memory-failure.c: bail out early if huge zero page
To: Naoya Horiguchi <naoya.horiguchi@linux.dev>
CC: <akpm@linux-foundation.org>, Linux-MM <linux-mm@kvack.org>, Naoya
 Horiguchi <naoya.horiguchi@nec.com>, Xu Yu <xuyu@linux.alibaba.com>, Oscar
 Salvador <osalvador@suse.de>
References: <49273e6688d7571756603dac996692a15f245d58.1649603963.git.xuyu@linux.alibaba.com>
 <8f06b79b-aeff-3479-a3cc-c0a649dc770b@huawei.com>
 <20220412090907.GA350357@u2004>
From: Miaohe Lin <linmiaohe@huawei.com>
Message-ID: <3eacf09c-e0fe-edde-d81d-ba372c2bad72@huawei.com>
Date: Tue, 12 Apr 2022 19:08:45 +0800

On 2022/4/12 17:09, Naoya Horiguchi wrote:
> On Mon, Apr 11, 2022 at 10:18:26AM +0800, Miaohe Lin wrote:
>> On 2022/4/10 23:22, Xu Yu wrote:
>>> The kernel panics when a memory failure is injected into the global
>>> huge_zero_page with CONFIG_DEBUG_VM enabled, as follows.
>>>
>>> [    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
>>> [    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
>>> [    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
>>> [    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
>>> [    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
>>> [    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
>>> [    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
>>> [    5.589398] ------------[ cut here ]------------
>>> [    5.589952] kernel BUG at mm/huge_memory.c:2499!
>>> [    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>> [    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
>>> [    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
>>> [    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
>>> [    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
>>> [    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
>>> [    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
>>> [    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
>>> [    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
>>> [    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
>>> [    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
>>> [    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
>>> [    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
>>> [    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [    5.604806] Call Trace:
>>> [    5.605101]  <TASK>
>>> [    5.605357]  ? __irq_work_queue_local+0x39/0x70
>>> [    5.605904]  try_to_split_thp_page+0x3a/0x130
>>> [    5.606430]  memory_failure+0x128/0x800
>>> [    5.606888]  madvise_inject_error.cold+0x8b/0xa1
>>> [    5.607444]  __x64_sys_madvise+0x54/0x60
>>> [    5.607915]  do_syscall_64+0x35/0x80
>>> [    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [    5.608949] RIP: 0033:0x7fc3754f8bf9
>>> [    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
>>> [    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
>>> [    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
>>> [    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
>>> [    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
>>> [    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
>>> [    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
>>> [    5.616626]  </TASK>
>>>
>>
>> Thanks for the report and the patch!
>>
>> I remember that Naoya and I discussed that try_to_split_thp_page() in memory_failure() might
>> come across a non-lru movable compound page or huge_zero_page. We fixed the non-lru movable
>> compound page case but concluded that huge_zero_page won't reach here due to the HWPoisonHandlable()
>> check. But we missed the MF_COUNT_INCREASED case, where HWPoisonHandlable() is skipped.
>>
>>> In fact, huge_zero_page is currently unhandlable in both soft offline
>>> and memory failure injection.  With CONFIG_DEBUG_VM disabled,
>>> huge_zero_page is rejected either by the HWPoisonHandlable() check in
>>> get_any_page() or by the page mapping check in split_huge_page_to_list().
>>>
>>> This patch makes huge_zero_page bail out early in madvise_inject_error(),
>>> so the panic above no longer happens.
>>
>> It seems this issue can only happen in the madvise_inject_error() case, because
>> MF_COUNT_INCREASED is only set there. So this fix should do the right thing. But I
>> don't know whether bailing out early for huge_zero_page is suitable.
>>
>> Hi Naoya, what do you think?
> 
> Thank you for reporting.
> 
> ...
> 
>>> @@ -1087,12 +1087,21 @@ static int madvise_inject_error(int behavior,
>>>  			return ret;
>>>  		pfn = page_to_pfn(page);
>>>  
>>> +		head = compound_head(page);
>>> +		if (unlikely(is_huge_zero_page(head))) {
>>> +			pr_warn("Unhandlable attempt to %s pfn %#lx at process virtual address %#lx\n",
>>> +				behavior == MADV_SOFT_OFFLINE ? "soft offline" :
>>> +								"inject memory failure for",
>>> +				pfn, start);
>>> +			return -EINVAL;
>>> +		}
> 
> This check concerns a detail of error handling, so I think it is better
> done in memory_failure().  Memory errors on the huge zero page are a
> real scenario, so it seems better to me to make this case injectable rather
> than return -EINVAL.
> 
> How about checking is_huge_zero_page() before try_to_split_thp_page()?
> The result should be consistent with what the other callers of
> memory_failure(), like the MCE handler and hard_offline_page_store(), get.

Yes, they should be the same, except that HWPoisonHandlable() isn't called for callers like the
MCE handler and hard_offline_page_store().

> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 9b76222ee237..771fb4fc626c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1852,6 +1852,12 @@ int memory_failure(unsigned long pfn, int flags)
>  	}
>  
>  	if (PageTransHuge(hpage)) {
> +		if (is_huge_zero_page(hpage)) {
> +			action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
> +			res = -EBUSY;
> +			goto unlock_mutex;
> +		}
> +

It seems that huge_zero_page could be handled simply by zapping the corresponding page table
entries, without losing any user data. Should we also try to handle this kind of page, or just
bail out because it's rare?

Thanks!
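
To make the control flow under discussion concrete, here is a small userspace
model of it. It is only a sketch, not kernel code: every identifier below is
hypothetical and merely stands in for the role of the kernel check named in
its comment. It assumes, as discussed earlier in this thread, that the
HWPoisonHandlable()-style check rejects huge_zero_page and that the check is
skipped when the caller has already taken a reference (MF_COUNT_INCREASED).

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace model only. None of these names exist in the kernel; they
 * mirror the roles of the checks mentioned in this thread.
 */
enum model_result {
	MODEL_REJECTED_BY_HANDLABLE_CHECK, /* stopped by the HWPoisonHandlable()-like check */
	MODEL_IGNORED_BEFORE_SPLIT,        /* stopped by the proposed is_huge_zero_page() test */
	MODEL_CONTINUES,                   /* goes on; for a THP this means the split attempt */
};

struct model_page {
	bool handlable;  /* what a HWPoisonHandlable()-like check would return */
	bool trans_huge; /* stands in for PageTransHuge(hpage) */
	bool huge_zero;  /* stands in for is_huge_zero_page(hpage) */
};

static enum model_result model_memory_failure(const struct model_page *page,
					      bool count_increased,
					      bool with_proposed_check)
{
	/* The handlable check only runs if the caller did not pre-take a reference. */
	if (!count_increased && !page->handlable)
		return MODEL_REJECTED_BY_HANDLABLE_CHECK;

	/* Proposed early exit, placed before the split attempt. */
	if (page->trans_huge && with_proposed_check && page->huge_zero)
		return MODEL_IGNORED_BEFORE_SPLIT;

	return MODEL_CONTINUES;
}
```

With a page modelled as { .handlable = false, .trans_huge = true,
.huge_zero = true }, the model only returns MODEL_CONTINUES when
count_increased is true and with_proposed_check is false. That corresponds to
the madvise injection path hitting the VM_BUG_ON_PAGE() in the report above.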

>  		/*
>  		 * The flag must be set after the refcount is bumped
>  		 * otherwise it may race with THP split.
> 
> 
> Thanks,
> Naoya Horiguchi