From: Miaohe Lin <linmiaohe@huawei.com>
To: Naoya Horiguchi <naoya.horiguchi@linux.dev>
Cc: <akpm@linux-foundation.org>, Linux-MM <linux-mm@kvack.org>,
	Naoya Horiguchi <naoya.horiguchi@nec.com>,
	Xu Yu <xuyu@linux.alibaba.com>,
	Oscar Salvador <osalvador@suse.de>
Subject: Re: [PATCH] mm/memory-failure.c: bail out early if huge zero page
Date: Tue, 12 Apr 2022 19:08:45 +0800	[thread overview]
Message-ID: <3eacf09c-e0fe-edde-d81d-ba372c2bad72@huawei.com> (raw)
In-Reply-To: <20220412090907.GA350357@u2004>

On 2022/4/12 17:09, Naoya Horiguchi wrote:
> On Mon, Apr 11, 2022 at 10:18:26AM +0800, Miaohe Lin wrote:
>> On 2022/4/10 23:22, Xu Yu wrote:
>>> The kernel panics when injecting a memory failure for the global huge_zero_page
>>> with CONFIG_DEBUG_VM enabled, as follows.
>>>
>>> [    5.582720] Injecting memory failure for pfn 0x109ff9 at process virtual address 0x20ff9000
>>> [    5.583786] page:00000000fb053fc3 refcount:2 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x109e00
>>> [    5.584900] head:00000000fb053fc3 order:9 compound_mapcount:0 compound_pincount:0
>>> [    5.585796] flags: 0x17fffc000010001(locked|head|node=0|zone=2|lastcpupid=0x1ffff)
>>> [    5.586712] raw: 017fffc000010001 0000000000000000 dead000000000122 0000000000000000
>>> [    5.587640] raw: 0000000000000000 0000000000000000 00000002ffffffff 0000000000000000
>>> [    5.588565] page dumped because: VM_BUG_ON_PAGE(is_huge_zero_page(head))
>>> [    5.589398] ------------[ cut here ]------------
>>> [    5.589952] kernel BUG at mm/huge_memory.c:2499!
>>> [    5.590516] invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>> [    5.591120] CPU: 6 PID: 553 Comm: split_bug Not tainted 5.18.0-rc1+ #11
>>> [    5.591904] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 3288b3c 04/01/2014
>>> [    5.592817] RIP: 0010:split_huge_page_to_list+0x66a/0x880
>>> [    5.593469] Code: 84 9b fb ff ff 48 8b 7c 24 08 31 f6 e8 9f 5d 2a 00 b8 b8 02 00 00 e9 e8 fb ff ff 48 c7 c6 e8 47 3c 82 4c b
>>> [    5.595806] RSP: 0018:ffffc90000dcbdf8 EFLAGS: 00010246
>>> [    5.596434] RAX: 000000000000003c RBX: 0000000000000001 RCX: 0000000000000000
>>> [    5.597322] RDX: 0000000000000000 RSI: ffffffff823e4c4f RDI: 00000000ffffffff
>>> [    5.598162] RBP: ffff88843fffdb40 R08: 0000000000000000 R09: 00000000fffeffff
>>> [    5.598999] R10: ffffc90000dcbc48 R11: ffffffff82d68448 R12: ffffea0004278000
>>> [    5.599849] R13: ffffffff823c6203 R14: 0000000000109ff9 R15: ffffea000427fe40
>>> [    5.600693] FS:  00007fc375a26740(0000) GS:ffff88842fd80000(0000) knlGS:0000000000000000
>>> [    5.601640] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [    5.602304] CR2: 00007fc3757c9290 CR3: 0000000102174006 CR4: 00000000003706e0
>>> [    5.603139] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [    5.603977] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [    5.604806] Call Trace:
>>> [    5.605101]  <TASK>
>>> [    5.605357]  ? __irq_work_queue_local+0x39/0x70
>>> [    5.605904]  try_to_split_thp_page+0x3a/0x130
>>> [    5.606430]  memory_failure+0x128/0x800
>>> [    5.606888]  madvise_inject_error.cold+0x8b/0xa1
>>> [    5.607444]  __x64_sys_madvise+0x54/0x60
>>> [    5.607915]  do_syscall_64+0x35/0x80
>>> [    5.608347]  entry_SYSCALL_64_after_hwframe+0x44/0xae
>>> [    5.608949] RIP: 0033:0x7fc3754f8bf9
>>> [    5.609374] Code: 01 00 48 81 c4 80 00 00 00 e9 f1 fe ff ff 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 8
>>> [    5.611554] RSP: 002b:00007ffeda93a1d8 EFLAGS: 00000217 ORIG_RAX: 000000000000001c
>>> [    5.612441] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007fc3754f8bf9
>>> [    5.613269] RDX: 0000000000000064 RSI: 0000000000003000 RDI: 0000000020ff9000
>>> [    5.614108] RBP: 00007ffeda93a200 R08: 0000000000000000 R09: 0000000000000000
>>> [    5.614946] R10: 00000000ffffffff R11: 0000000000000217 R12: 0000000000400490
>>> [    5.615787] R13: 00007ffeda93a2e0 R14: 0000000000000000 R15: 0000000000000000
>>> [    5.616626]  </TASK>
>>>
>>
>> Thanks for the report and the patch!
>>
>> I remember Naoya and I discussed that try_to_split_thp_page() in memory_failure() might
>> come across a non-lru movable compound page or the huge_zero_page. We fixed the non-lru
>> movable compound page case but concluded that huge_zero_page can't reach here due to the
>> HWPoisonHandlable() check. But we missed the MF_COUNT_INCREASED case, where
>> HWPoisonHandlable() is skipped.
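
For reference, the difference between the two paths looks roughly like below. This is
only a simplified sketch of the v5.18-rc1 control flow as I read it, not the literal
mm/memory-failure.c source:

/* Simplified sketch -- not the literal kernel code. */
int memory_failure(unsigned long pfn, int flags)
{
	struct page *p = pfn_to_page(pfn);
	struct page *hpage = compound_head(p);

	if (!(flags & MF_COUNT_INCREASED)) {
		/*
		 * The MCE handler and hard_offline_page_store() take this
		 * branch: get_hwpoison_page() -> get_any_page() rejects the
		 * huge zero page via the HWPoisonHandlable() check.
		 */
		if (get_hwpoison_page(p, flags) <= 0)
			return -EBUSY;
	}

	if (PageTransHuge(hpage)) {
		/*
		 * madvise_inject_error() already holds a reference and passes
		 * MF_COUNT_INCREASED, so the branch above is skipped and the
		 * huge zero page reaches try_to_split_thp_page(), where
		 * split_huge_page_to_list() hits
		 * VM_BUG_ON_PAGE(is_huge_zero_page(head)) with CONFIG_DEBUG_VM.
		 */
		if (try_to_split_thp_page(p, "Memory Failure"))
			return -EBUSY;
	}
	/* ... */
	return 0;
}
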
>>
>>> In fact, huge_zero_page is currently unhandlable in either soft offline or
>>> memory failure injection.  With CONFIG_DEBUG_VM disabled, huge_zero_page is
>>> bailed out either when checking HWPoisonHandlable() in get_any_page() or when
>>> checking the page mapping in split_huge_page_to_list().
>>>
>>> This patch makes huge_zero_page bail out early in madvise_inject_error(), so
>>> the panic above won't happen again.
>>
>> It seems this issue is expected to happen only in the madvise_inject_error() case, because
>> MF_COUNT_INCREASED is only set there. So this fix should do the right thing. But I don't
>> know whether bailing out early for huge_zero_page is suitable.
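
FWIW, a minimal reproducer along these lines might look like below. This is my
reconstruction rather than the original "split_bug" test; it needs root,
CONFIG_MEMORY_FAILURE and CONFIG_DEBUG_VM, and THP with use_zero_page enabled:

#include <stdio.h>
#include <sys/mman.h>

#ifndef MADV_HWPOISON
#define MADV_HWPOISON	100	/* matches advice 0x64 seen in the register dump */
#endif
#ifndef MADV_HUGEPAGE
#define MADV_HUGEPAGE	14
#endif

#define SZ_2M	(2UL << 20)

int main(void)
{
	size_t len = 4 * SZ_2M;
	char *map, *aligned;

	map = mmap(NULL, len, PROT_READ | PROT_WRITE,
		   MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Align to a 2MB boundary so the fault can install a PMD mapping. */
	aligned = (char *)(((unsigned long)map + SZ_2M - 1) & ~(SZ_2M - 1));
	if (madvise(aligned, SZ_2M, MADV_HUGEPAGE))
		perror("madvise(MADV_HUGEPAGE)");

	/* A read fault on the untouched region maps the global huge zero page. */
	printf("first byte: %d\n", aligned[0]);

	/* Inject the error; this goes through madvise_inject_error() with
	 * MF_COUNT_INCREASED and reaches try_to_split_thp_page(). */
	if (madvise(aligned, 4096, MADV_HWPOISON))
		perror("madvise(MADV_HWPOISON)");

	return 0;
}
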
>>
>> Hi Naoya, what do you think?
> 
> Thank you for reporting.
> 
> ...
> 
>>> @@ -1087,12 +1087,21 @@ static int madvise_inject_error(int behavior,
>>>  			return ret;
>>>  		pfn = page_to_pfn(page);
>>>  
>>> +		head = compound_head(page);
>>> +		if (unlikely(is_huge_zero_page(head))) {
>>> +			pr_warn("Unhandlable attempt to %s pfn %#lx at process virtual address %#lx\n",
>>> +				behavior == MADV_SOFT_OFFLINE ? "soft offline" :
>>> +								"inject memory failure for",
>>> +				pfn, start);
>>> +			return -EINVAL;
>>> +		}
> 
> This check is about a detail of the error handling, so I feel it is desirable to
> do this in memory_failure().  And memory errors on the huge zero page are a real
> scenario, so it seems better to me to make this case injectable rather than to
> return -EINVAL.
> 
> How about checking is_huge_zero_page() before try_to_split_thp_page()?
> The result should then be consistent with what other memory_failure() callers,
> like the MCE handler and hard_offline_page_store(), get.

Yes, they should be the same, except that HWPoisonHandlable() isn't called here, unlike for
callers such as the MCE handler and hard_offline_page_store().

> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 9b76222ee237..771fb4fc626c 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1852,6 +1852,12 @@ int memory_failure(unsigned long pfn, int flags)
>  	}
>  
>  	if (PageTransHuge(hpage)) {
> +		if (is_huge_zero_page(hpage)) {
> +			action_result(pfn, MF_MSG_KERNEL_HIGH_ORDER, MF_IGNORED);
> +			res = -EBUSY;
> +			goto unlock_mutex;
> +		}
> +

It seems that huge_zero_page could be handled simply by zapping the corresponding page table
entries without losing any user data. Should we also try to handle this kind of page? Or should
we just bail out, since this case is rare?

Thanks!

>  		/*
>  		 * The flag must be set after the refcount is bumped
>  		 * otherwise it may race with THP split.
> 
> 
> Thanks,
> Naoya Horiguchi
> .
> 


