* [BUG report] kernel NULL pointer dereference in split_huge_page with offlined memory block
@ 2022-09-07 10:08 Naoya Horiguchi
2022-09-07 10:23 ` David Hildenbrand
0 siblings, 1 reply; 19+ messages in thread
From: Naoya Horiguchi @ 2022-09-07 10:08 UTC (permalink / raw)
To: linux-mm
Cc: Andrew Morton, David Hildenbrand, Muchun Song, Miaohe Lin,
Matthew Wilcox, Michal Hocko, Yang Shi, Naoya Horiguchi
Hi MM folks,
When I'm testing memory hotremove with various settings, I found the following
NULL-pointer dereference. It reproduces easily with the folloing steps:
$ echo offline > /sys/devices/system/memory/memoryN/state
$ echo 1 > /sys/kernel/debug/split_huge_pages
I don't check in which commit this was introduced yet (at least v6.0-rc1,
v6.0-rc4 and mm-everything-2022-09-05-23-30 are affected), but I expect that
someone might have clear idea about this, so let me share first.
Thanks,
Naoya Horiguchi
---
[ 309.947421] BUG: kernel NULL pointer dereference, address: 0000000000000032
[ 309.949600] #PF: supervisor read access in kernel mode
[ 309.951220] #PF: error_code(0x0000) - not-present page
[ 309.952819] PGD 0 P4D 0
[ 309.953649] Oops: 0000 [#1] PREEMPT SMP PTI
[ 309.954999] CPU: 1 PID: 846 Comm: bash Tainted: G E N 6.0.0-rc1-v6.0-rc1-220815-2254-000-rc1+ #62
[ 309.958170] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
[ 309.960759] RIP: 0010:split_huge_pages_write.part.0+0x40c/0xe70
[ 309.962684] Code: 00 00 00 4d 8b ae 90 00 00 00 49 01 dd 4c 39 eb 72 47 eb c8 48 8b 41 08 a8 01 0f 85 57 08 00 00 0f 1f 44 00 00 0f 1f 44 00 00 <41> 8b 47 34 85 c0 0f 84 1c 09 00 00 f0 41 ff 4f 34 0f 94 c0 0f 1f
[ 309.968381] RSP: 0018:ffffb4d201d6bbd0 EFLAGS: 00010202
[ 309.970067] RAX: ffffffffffffffff RBX: 0000000000230000 RCX: ffffd6fac8c00000
[ 309.972262] RDX: 00000000000003ff RSI: 0000000000000014 RDI: ffffd6fac4fff300
[ 309.974475] RBP: ffffb4d201d6bc12 R08: 0000000000000054 R09: ffffd6fac46b7f88
[ 309.976725] R10: 00000000ffffffff R11: ffffff8000000000 R12: 0000000000001454
[ 309.978980] R13: 0000000000248000 R14: ffff93ce3ffd5d80 R15: fffffffffffffffe
[ 309.981267] FS: 00007fe2cd337740(0000) GS:ffff93ce3bc80000(0000) knlGS:0000000000000000
[ 309.983842] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 309.985672] CR2: 0000000000000032 CR3: 00000001018fc005 CR4: 0000000000170ee0
[ 309.987909] Call Trace:
[ 309.988794] <TASK>
[ 309.989461] ? _raw_spin_lock+0x13/0x40
[ 309.990578] ? __mark_inode_dirty+0x113/0x390
[ 309.991933] ? terminate_walk+0x90/0x100
[ 309.993186] ? path_openat+0x440/0x1070
[ 309.994421] ? do_filp_open+0x9f/0x130
[ 309.995610] full_proxy_write+0x53/0x80
[ 309.996820] vfs_write+0xb7/0x3a0
[ 309.997902] ? _raw_spin_unlock+0x15/0x30
[ 309.999190] ksys_write+0x4f/0xd0
[ 310.000249] do_syscall_64+0x3b/0x90
[ 310.001418] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 310.002938] RIP: 0033:0x7fe2cd1018b7
[ 310.004143] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[ 310.009871] RSP: 002b:00007ffc625f63f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[ 310.012060] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe2cd1018b7
[ 310.014250] RDX: 0000000000000002 RSI: 000055c1a80afc50 RDI: 0000000000000001
[ 310.016533] RBP: 000055c1a80afc50 R08: 0000000000000000 R09: 00007fe2cd1b64e0
[ 310.018782] R10: 00007fe2cd1b63e0 R11: 0000000000000246 R12: 0000000000000002
[ 310.021086] R13: 00007fe2cd1fb5a0 R14: 0000000000000002 R15: 00007fe2cd1fb7a0
[ 310.023169] </TASK>
[ 310.023844] Modules linked in: nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ip_set(E) rfkill(E) nf_tables(E) nfnetlink(E) qrtr(E) sunrpc(E) 9p(E) fscache(E) netfs(E) intel_rapl_msr(E) intel_rapl_common(E) kvm_intel(E) kvm(E) irqbypass(E) virtio_balloon(E) rapl(E) 9pnet_virtio(E) i2c_piix4(E) 9pnet(E) joydev(E) pcspkr(E) fuse(E) zram(E) ip_tables(E) xfs(E) crc32c_intel(E) serio_raw(E) virtio_blk(E) e1000(E) ata_generic(E) pata_acpi(E) floppy(E) qemu_fw_cfg(E)
[ 310.040426] CR2: 0000000000000032
[ 310.041715] ---[ end trace 0000000000000000 ]---
[ 310.043196] RIP: 0010:split_huge_pages_write.part.0+0x40c/0xe70
[ 310.044953] Code: 00 00 00 4d 8b ae 90 00 00 00 49 01 dd 4c 39 eb 72 47 eb c8 48 8b 41 08 a8 01 0f 85 57 08 00 00 0f 1f 44 00 00 0f 1f 44 00 00 <41> 8b 47 34 85 c0 0f 84 1c 09 00 00 f0 41 ff 4f 34 0f 94 c0 0f 1f
[ 310.050051] RSP: 0018:ffffb4d201d6bbd0 EFLAGS: 00010202
[ 310.051593] RAX: ffffffffffffffff RBX: 0000000000230000 RCX: ffffd6fac8c00000
[ 310.053664] RDX: 00000000000003ff RSI: 0000000000000014 RDI: ffffd6fac4fff300
[ 310.056165] RBP: ffffb4d201d6bc12 R08: 0000000000000054 R09: ffffd6fac46b7f88
[ 310.059144] R10: 00000000ffffffff R11: ffffff8000000000 R12: 0000000000001454
[ 310.062033] R13: 0000000000248000 R14: ffff93ce3ffd5d80 R15: fffffffffffffffe
[ 310.069111] FS: 00007fe2cd337740(0000) GS:ffff93ce3bc80000(0000) knlGS:0000000000000000
[ 310.077141] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 310.079988] CR2: 0000000000000032 CR3: 00000001018fc005 CR4: 0000000000170ee0
[ 310.083292] Kernel panic - not syncing: Fatal exception
[ 310.086117] Kernel Offset: 0x1a000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[ 310.090607] Rebooting in 2 seconds..
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [BUG report] kernel NULL pointer dereference in split_huge_page with offlined memory block
2022-09-07 10:08 [BUG report] kernel NULL pointer dereference in split_huge_page with offlined memory block Naoya Horiguchi
@ 2022-09-07 10:23 ` David Hildenbrand
2022-09-07 10:26 ` David Hildenbrand
0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2022-09-07 10:23 UTC (permalink / raw)
To: Naoya Horiguchi, linux-mm
Cc: Andrew Morton, Muchun Song, Miaohe Lin, Matthew Wilcox,
Michal Hocko, Yang Shi, Naoya Horiguchi
On 07.09.22 12:08, Naoya Horiguchi wrote:
> Hi MM folks,
Hi,
>
> When I'm testing memory hotremove with various settings, I found the following
> NULL-pointer dereference. It reproduces easily with the folloing steps:
>
> $ echo offline > /sys/devices/system/memory/memoryN/state
> $ echo 1 > /sys/kernel/debug/split_huge_pages
>
That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
[526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
[526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
[526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
[526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
[526045.840498] page dumped because: unmovable page
[526056.362715] page:000000007d7ab72e is uninitialized and poisoned
[526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[526056.374837] ------------[ cut here ]------------
[526056.379544] kernel BUG at include/linux/mm.h:1248!
[526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
[526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
[526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
[526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
[526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
[526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
[526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
[526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
[526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
[526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
[526056.471733] FS: 00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
[526056.479906] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
[526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[526056.507396] PKRU: 55555554
[526056.510196] Call Trace:
[526056.512734] <TASK>
[526056.514929] ? simple_setattr+0x40/0x60
[526056.518859] ? vfs_mkobj+0x1b0/0x1c0
[526056.522529] ? path_openat+0xb2e/0x1360
[526056.526456] ? do_filp_open+0xa1/0x130
[526056.530296] full_proxy_write+0x50/0x80
[526056.534229] vfs_write+0xd7/0x3e0
[526056.537635] ? fpregs_assert_state_consistent+0x22/0x50
[526056.542955] ? exit_to_user_mode_prepare+0x3c/0x1d0
[526056.547929] ? syscall_exit_to_user_mode+0x17/0x40
[526056.552808] ksys_write+0x53/0xd0
[526056.556215] do_syscall_64+0x58/0x80
[526056.559879] ? exit_to_user_mode_prepare+0x3c/0x1d0
[526056.564846] ? syscall_exit_to_user_mode+0x17/0x40
[526056.569726] ? do_syscall_64+0x67/0x80
[526056.573563] ? do_syscall_64+0x67/0x80
[526056.577404] ? do_syscall_64+0x67/0x80
[526056.581244] ? syscall_exit_to_user_mode+0x17/0x40
[526056.586120] ? do_syscall_64+0x67/0x80
[526056.589961] ? do_syscall_64+0x67/0x80
[526056.593801] ? do_syscall_64+0x67/0x80
[526056.597641] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[526056.602779] RIP: 0033:0x7fe35ab01c17
[526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
[526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
[526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
[526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
[526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
[526056.669028] </TASK>
Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [BUG report] kernel NULL pointer dereference in split_huge_page with offlined memory block
2022-09-07 10:23 ` David Hildenbrand
@ 2022-09-07 10:26 ` David Hildenbrand
2022-09-07 12:11 ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2022-09-07 10:26 UTC (permalink / raw)
To: Naoya Horiguchi, linux-mm
Cc: Andrew Morton, Muchun Song, Miaohe Lin, Matthew Wilcox,
Michal Hocko, Yang Shi, Naoya Horiguchi
On 07.09.22 12:23, David Hildenbrand wrote:
> On 07.09.22 12:08, Naoya Horiguchi wrote:
>> Hi MM folks,
>
> Hi,
>
>>
>> When I'm testing memory hotremove with various settings, I found the following
>> NULL-pointer dereference. It reproduces easily with the folloing steps:
>>
>> $ echo offline > /sys/devices/system/memory/memoryN/state
>> $ echo 1 > /sys/kernel/debug/split_huge_pages
>>
>
> That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
>
> I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
>
> [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> [526045.840498] page dumped because: unmovable page
> [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
> [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> [526056.374837] ------------[ cut here ]------------
> [526056.379544] kernel BUG at include/linux/mm.h:1248!
> [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
> [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
> [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
> [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
> [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
> [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
> [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
> [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
> [526056.471733] FS: 00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
> [526056.479906] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
> [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [526056.507396] PKRU: 55555554
> [526056.510196] Call Trace:
> [526056.512734] <TASK>
> [526056.514929] ? simple_setattr+0x40/0x60
> [526056.518859] ? vfs_mkobj+0x1b0/0x1c0
> [526056.522529] ? path_openat+0xb2e/0x1360
> [526056.526456] ? do_filp_open+0xa1/0x130
> [526056.530296] full_proxy_write+0x50/0x80
> [526056.534229] vfs_write+0xd7/0x3e0
> [526056.537635] ? fpregs_assert_state_consistent+0x22/0x50
> [526056.542955] ? exit_to_user_mode_prepare+0x3c/0x1d0
> [526056.547929] ? syscall_exit_to_user_mode+0x17/0x40
> [526056.552808] ksys_write+0x53/0xd0
> [526056.556215] do_syscall_64+0x58/0x80
> [526056.559879] ? exit_to_user_mode_prepare+0x3c/0x1d0
> [526056.564846] ? syscall_exit_to_user_mode+0x17/0x40
> [526056.569726] ? do_syscall_64+0x67/0x80
> [526056.573563] ? do_syscall_64+0x67/0x80
> [526056.577404] ? do_syscall_64+0x67/0x80
> [526056.581244] ? syscall_exit_to_user_mode+0x17/0x40
> [526056.586120] ? do_syscall_64+0x67/0x80
> [526056.589961] ? do_syscall_64+0x67/0x80
> [526056.593801] ? do_syscall_64+0x67/0x80
> [526056.597641] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [526056.602779] RIP: 0033:0x7fe35ab01c17
> [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
> [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
> [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
> [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
> [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
> [526056.669028] </TASK>
>
>
> Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
>
> Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
>
And indeed, something like the following might do the trick:
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e9414ee57c5b..f42bb51e023a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
max_zone_pfn = zone_end_pfn(zone);
for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
int nr_pages;
- if (!pfn_valid(pfn))
- continue;
- page = pfn_to_page(pfn);
- if (!get_page_unless_zero(page))
+ page = pfn_to_online_page(pfn);
+ if (!page || !get_page_unless_zero(page))
continue;
if (zone != page_zone(page))
--
Thanks,
David / dhildenb
^ permalink raw reply related [flat|nested] 19+ messages in thread
* [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-07 10:26 ` David Hildenbrand
@ 2022-09-07 12:11 ` Naoya Horiguchi
2022-09-07 12:39 ` David Hildenbrand
` (5 more replies)
0 siblings, 6 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2022-09-07 12:11 UTC (permalink / raw)
To: David Hildenbrand
Cc: linux-mm, Andrew Morton, Muchun Song, Miaohe Lin, Matthew Wilcox,
Michal Hocko, Yang Shi, Naoya Horiguchi
On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote:
> On 07.09.22 12:23, David Hildenbrand wrote:
> > On 07.09.22 12:08, Naoya Horiguchi wrote:
> > > Hi MM folks,
> >
> > Hi,
> >
> > >
> > > When I'm testing memory hotremove with various settings, I found the following
> > > NULL-pointer dereference. It reproduces easily with the folloing steps:
> > >
> > > $ echo offline > /sys/devices/system/memory/memoryN/state
> > > $ echo 1 > /sys/kernel/debug/split_huge_pages
> > >
> >
> > That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
> >
> > I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
> >
> > [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> > [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> > [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> > [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > [526045.840498] page dumped because: unmovable page
> > [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
> > [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > [526056.374837] ------------[ cut here ]------------
> > [526056.379544] kernel BUG at include/linux/mm.h:1248!
> > [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> > [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
> > [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
> > [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
> > [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
> > [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
> > [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
> > [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
> > [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
> > [526056.471733] FS: 00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
> > [526056.479906] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
> > [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [526056.507396] PKRU: 55555554
> > [526056.510196] Call Trace:
> > [526056.512734] <TASK>
> > [526056.514929] ? simple_setattr+0x40/0x60
> > [526056.518859] ? vfs_mkobj+0x1b0/0x1c0
> > [526056.522529] ? path_openat+0xb2e/0x1360
> > [526056.526456] ? do_filp_open+0xa1/0x130
> > [526056.530296] full_proxy_write+0x50/0x80
> > [526056.534229] vfs_write+0xd7/0x3e0
> > [526056.537635] ? fpregs_assert_state_consistent+0x22/0x50
> > [526056.542955] ? exit_to_user_mode_prepare+0x3c/0x1d0
> > [526056.547929] ? syscall_exit_to_user_mode+0x17/0x40
> > [526056.552808] ksys_write+0x53/0xd0
> > [526056.556215] do_syscall_64+0x58/0x80
> > [526056.559879] ? exit_to_user_mode_prepare+0x3c/0x1d0
> > [526056.564846] ? syscall_exit_to_user_mode+0x17/0x40
> > [526056.569726] ? do_syscall_64+0x67/0x80
> > [526056.573563] ? do_syscall_64+0x67/0x80
> > [526056.577404] ? do_syscall_64+0x67/0x80
> > [526056.581244] ? syscall_exit_to_user_mode+0x17/0x40
> > [526056.586120] ? do_syscall_64+0x67/0x80
> > [526056.589961] ? do_syscall_64+0x67/0x80
> > [526056.593801] ? do_syscall_64+0x67/0x80
> > [526056.597641] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [526056.602779] RIP: 0033:0x7fe35ab01c17
> > [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> > [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
> > [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
> > [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
> > [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
> > [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
> > [526056.669028] </TASK>
> >
> >
> > Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
> >
> > Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
> >
>
> And indeed, something like the following might do the trick:
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e9414ee57c5b..f42bb51e023a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
> max_zone_pfn = zone_end_pfn(zone);
> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> int nr_pages;
> - if (!pfn_valid(pfn))
> - continue;
> - page = pfn_to_page(pfn);
> - if (!get_page_unless_zero(page))
> + page = pfn_to_online_page(pfn);
> + if (!page || !get_page_unless_zero(page))
> continue;
> if (zone != page_zone(page))
>
Thank you very much, I confirmed the fix. So I try to write a patch like below.
I borrowed your debug message because it has more helpful info than mine.
If you have any comment or suggestion in the description, please let me know.
Thanks,
Naoya Horiguchi
---
From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Wed, 7 Sep 2022 20:58:44 +0900
Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
split_huge_pages_all()
NULL pointer dereference is triggered when calling thp split via debugfs
on the system with offlined memory blocks. With debug option enabled,
the following kernel messages are printed out:
page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
page dumped because: unmovable page
page:000000007d7ab72e is uninitialized and poisoned
page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:1248!
invalid opcode: 0000 [#1] PREEMPT SMP PTI
CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
...
RIP: 0010:split_huge_pages_write+0xcf4/0xe30
This shows that page_to_nid() in page_zone() is unexpectedly called for an
offlined memmap.
Use pfn_to_online_page() to get struct page in PFN walker.
Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
---
mm/huge_memory.c | 7 ++-----
1 file changed, 2 insertions(+), 5 deletions(-)
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5fa2ba86dae4..03149a8f46f9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
max_zone_pfn = zone_end_pfn(zone);
for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
int nr_pages;
- if (!pfn_valid(pfn))
- continue;
-
- page = pfn_to_page(pfn);
- if (!get_page_unless_zero(page))
+ page = pfn_to_online_page(pfn);
+ if (!page || !get_page_unless_zero(page))
continue;
if (zone != page_zone(page))
--
2.26.1.8815.g6a475b71f8
^ permalink raw reply related [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-07 12:11 ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
@ 2022-09-07 12:39 ` David Hildenbrand
2022-09-08 2:39 ` HORIGUCHI NAOYA(堀口 直也)
2022-09-07 17:10 ` Yang Shi
` (4 subsequent siblings)
5 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2022-09-07 12:39 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: linux-mm, Andrew Morton, Muchun Song, Miaohe Lin, Matthew Wilcox,
Michal Hocko, Yang Shi, Naoya Horiguchi
On 07.09.22 14:11, Naoya Horiguchi wrote:
> On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote:
>> On 07.09.22 12:23, David Hildenbrand wrote:
>>> On 07.09.22 12:08, Naoya Horiguchi wrote:
>>>> Hi MM folks,
>>>
>>> Hi,
>>>
>>>>
>>>> When I'm testing memory hotremove with various settings, I found the following
>>>> NULL-pointer dereference. It reproduces easily with the folloing steps:
>>>>
>>>> $ echo offline > /sys/devices/system/memory/memoryN/state
>>>> $ echo 1 > /sys/kernel/debug/split_huge_pages
>>>>
>>>
>>> That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
>>>
>>> I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
>>>
>>> [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>> [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>> [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>> [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>> [526045.840498] page dumped because: unmovable page
>>> [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
>>> [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>> [526056.374837] ------------[ cut here ]------------
>>> [526056.379544] kernel BUG at include/linux/mm.h:1248!
>>> [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>> [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
>>> [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
>>> [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>> [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
>>> [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
>>> [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
>>> [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
>>> [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
>>> [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
>>> [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
>>> [526056.471733] FS: 00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
>>> [526056.479906] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
>>> [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [526056.507396] PKRU: 55555554
>>> [526056.510196] Call Trace:
>>> [526056.512734] <TASK>
>>> [526056.514929] ? simple_setattr+0x40/0x60
>>> [526056.518859] ? vfs_mkobj+0x1b0/0x1c0
>>> [526056.522529] ? path_openat+0xb2e/0x1360
>>> [526056.526456] ? do_filp_open+0xa1/0x130
>>> [526056.530296] full_proxy_write+0x50/0x80
>>> [526056.534229] vfs_write+0xd7/0x3e0
>>> [526056.537635] ? fpregs_assert_state_consistent+0x22/0x50
>>> [526056.542955] ? exit_to_user_mode_prepare+0x3c/0x1d0
>>> [526056.547929] ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.552808] ksys_write+0x53/0xd0
>>> [526056.556215] do_syscall_64+0x58/0x80
>>> [526056.559879] ? exit_to_user_mode_prepare+0x3c/0x1d0
>>> [526056.564846] ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.569726] ? do_syscall_64+0x67/0x80
>>> [526056.573563] ? do_syscall_64+0x67/0x80
>>> [526056.577404] ? do_syscall_64+0x67/0x80
>>> [526056.581244] ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.586120] ? do_syscall_64+0x67/0x80
>>> [526056.589961] ? do_syscall_64+0x67/0x80
>>> [526056.593801] ? do_syscall_64+0x67/0x80
>>> [526056.597641] entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>> [526056.602779] RIP: 0033:0x7fe35ab01c17
>>> [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>>> [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>> [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
>>> [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
>>> [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
>>> [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
>>> [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
>>> [526056.669028] </TASK>
>>>
>>>
>>> Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
>>>
>>> Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
>>>
>>
>> And indeed, something like the following might do the trick:
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e9414ee57c5b..f42bb51e023a 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
>> max_zone_pfn = zone_end_pfn(zone);
>> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>> int nr_pages;
>> - if (!pfn_valid(pfn))
>> - continue;
>> - page = pfn_to_page(pfn);
>> - if (!get_page_unless_zero(page))
>> + page = pfn_to_online_page(pfn);
>> + if (!page || !get_page_unless_zero(page))
>> continue;
>> if (zone != page_zone(page))
>>
>
> Thank you very much, I confirmed the fix. So I try to write a patch like below.
> I borrowed your debug message because it has more helpful info than mine.
> If you have any comment or suggestion in the description, please let me know.
>
> Thanks,
> Naoya Horiguchi
> ---
> From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> split_huge_pages_all()
>
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks. With debug option enabled,
> the following kernel messages are printed out:
>
> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> page dumped because: unmovable page
> page:000000007d7ab72e is uninitialized and poisoned
> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> ------------[ cut here ]------------
> kernel BUG at include/linux/mm.h:1248!
> invalid opcode: 0000 [#1] PREEMPT SMP PTI
> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> ...
> RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
>
> Use pfn_to_online_page() to get struct page in PFN walker.
>
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>
Feel free to use
Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>
instead
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>
> ---
> mm/huge_memory.c | 7 ++-----
> 1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5fa2ba86dae4..03149a8f46f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
> max_zone_pfn = zone_end_pfn(zone);
> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> int nr_pages;
Add an empty line here please, while at it.
> - if (!pfn_valid(pfn))
> - continue;
> -
> - page = pfn_to_page(pfn);
> - if (!get_page_unless_zero(page))
> + page = pfn_to_online_page(pfn);
> + if (!page || !get_page_unless_zero(page))
> continue;
>
> if (zone != page_zone(page))
Can you send the patch "officially" as well? Thanks
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-07 12:11 ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
2022-09-07 12:39 ` David Hildenbrand
@ 2022-09-07 17:10 ` Yang Shi
2022-09-07 17:32 ` Michal Hocko
` (3 subsequent siblings)
5 siblings, 0 replies; 19+ messages in thread
From: Yang Shi @ 2022-09-07 17:10 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: David Hildenbrand, linux-mm, Andrew Morton, Muchun Song,
Miaohe Lin, Matthew Wilcox, Michal Hocko, Naoya Horiguchi
On Wed, Sep 7, 2022 at 5:12 AM Naoya Horiguchi
<naoya.horiguchi@linux.dev> wrote:
>
> On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote:
> > On 07.09.22 12:23, David Hildenbrand wrote:
> > > On 07.09.22 12:08, Naoya Horiguchi wrote:
> > > > Hi MM folks,
> > >
> > > Hi,
> > >
> > > >
> > > > When I'm testing memory hotremove with various settings, I found the following
> > > > NULL-pointer dereference. It reproduces easily with the folloing steps:
> > > >
> > > > $ echo offline > /sys/devices/system/memory/memoryN/state
> > > > $ echo 1 > /sys/kernel/debug/split_huge_pages
> > > >
> > >
> > > That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
> > >
> > > I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
> > >
> > > [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> > > [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> > > [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> > > [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > > [526045.840498] page dumped because: unmovable page
> > > [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
> > > [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > > [526056.374837] ------------[ cut here ]------------
> > > [526056.379544] kernel BUG at include/linux/mm.h:1248!
> > > [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > > [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> > > [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
> > > [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > > [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
> > > [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
> > > [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
> > > [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
> > > [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
> > > [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
> > > [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
> > > [526056.471733] FS: 00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
> > > [526056.479906] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
> > > [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > [526056.507396] PKRU: 55555554
> > > [526056.510196] Call Trace:
> > > [526056.512734] <TASK>
> > > [526056.514929] ? simple_setattr+0x40/0x60
> > > [526056.518859] ? vfs_mkobj+0x1b0/0x1c0
> > > [526056.522529] ? path_openat+0xb2e/0x1360
> > > [526056.526456] ? do_filp_open+0xa1/0x130
> > > [526056.530296] full_proxy_write+0x50/0x80
> > > [526056.534229] vfs_write+0xd7/0x3e0
> > > [526056.537635] ? fpregs_assert_state_consistent+0x22/0x50
> > > [526056.542955] ? exit_to_user_mode_prepare+0x3c/0x1d0
> > > [526056.547929] ? syscall_exit_to_user_mode+0x17/0x40
> > > [526056.552808] ksys_write+0x53/0xd0
> > > [526056.556215] do_syscall_64+0x58/0x80
> > > [526056.559879] ? exit_to_user_mode_prepare+0x3c/0x1d0
> > > [526056.564846] ? syscall_exit_to_user_mode+0x17/0x40
> > > [526056.569726] ? do_syscall_64+0x67/0x80
> > > [526056.573563] ? do_syscall_64+0x67/0x80
> > > [526056.577404] ? do_syscall_64+0x67/0x80
> > > [526056.581244] ? syscall_exit_to_user_mode+0x17/0x40
> > > [526056.586120] ? do_syscall_64+0x67/0x80
> > > [526056.589961] ? do_syscall_64+0x67/0x80
> > > [526056.593801] ? do_syscall_64+0x67/0x80
> > > [526056.597641] entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > [526056.602779] RIP: 0033:0x7fe35ab01c17
> > > [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> > > [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > > [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
> > > [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
> > > [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
> > > [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
> > > [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
> > > [526056.669028] </TASK>
> > >
> > >
> > > Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
> > >
> > > Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
> > >
> >
> > And indeed, something like the following might do the trick:
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e9414ee57c5b..f42bb51e023a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
> > max_zone_pfn = zone_end_pfn(zone);
> > for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> > int nr_pages;
> > - if (!pfn_valid(pfn))
> > - continue;
> > - page = pfn_to_page(pfn);
> > - if (!get_page_unless_zero(page))
> > + page = pfn_to_online_page(pfn);
> > + if (!page || !get_page_unless_zero(page))
> > continue;
> > if (zone != page_zone(page))
> >
>
> Thank you very much, I confirmed the fix. So I try to write a patch like below.
> I borrowed your debug message because it has more helpful info than mine.
> If you have any comment or suggestion in the description, please let me know.
>
> Thanks,
> Naoya Horiguchi
> ---
> From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> split_huge_pages_all()
>
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks. With debug option enabled,
> the following kernel messages are printed out:
>
> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> page dumped because: unmovable page
> page:000000007d7ab72e is uninitialized and poisoned
> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> ------------[ cut here ]------------
> kernel BUG at include/linux/mm.h:1248!
> invalid opcode: 0000 [#1] PREEMPT SMP PTI
> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> ...
> RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
>
> Use pfn_to_online_page() to get struct page in PFN walker.
>
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>
Thanks, Naoya and David. Looks good to me. Reviewed-by: Yang Shi
<shy828301@gmail.com>
> ---
> mm/huge_memory.c | 7 ++-----
> 1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5fa2ba86dae4..03149a8f46f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
> max_zone_pfn = zone_end_pfn(zone);
> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> int nr_pages;
> - if (!pfn_valid(pfn))
> - continue;
> -
> - page = pfn_to_page(pfn);
> - if (!get_page_unless_zero(page))
> + page = pfn_to_online_page(pfn);
> + if (!page || !get_page_unless_zero(page))
> continue;
>
> if (zone != page_zone(page))
> --
> 2.26.1.8815.g6a475b71f8
>
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-07 12:11 ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
2022-09-07 12:39 ` David Hildenbrand
2022-09-07 17:10 ` Yang Shi
@ 2022-09-07 17:32 ` Michal Hocko
2022-09-07 20:57 ` Andrew Morton
` (2 subsequent siblings)
5 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2022-09-07 17:32 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: David Hildenbrand, linux-mm, Andrew Morton, Muchun Song,
Miaohe Lin, Matthew Wilcox, Yang Shi, Naoya Horiguchi
On Wed 07-09-22 21:11:57, Naoya Horiguchi wrote:
[...]
> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> split_huge_pages_all()
>
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks. With debug option enabled,
> the following kernel messages are printed out:
>
> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> page dumped because: unmovable page
> page:000000007d7ab72e is uninitialized and poisoned
> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> ------------[ cut here ]------------
> kernel BUG at include/linux/mm.h:1248!
> invalid opcode: 0000 [#1] PREEMPT SMP PTI
> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> ...
> RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
>
> Use pfn_to_online_page() to get struct page in PFN walker.
>
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>
With changes proposed by David
Acked-by: Michal Hocko <mhocko@suse.com>
Thanks!
> ---
> mm/huge_memory.c | 7 ++-----
> 1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5fa2ba86dae4..03149a8f46f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
> max_zone_pfn = zone_end_pfn(zone);
> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> int nr_pages;
> - if (!pfn_valid(pfn))
> - continue;
> -
> - page = pfn_to_page(pfn);
> - if (!get_page_unless_zero(page))
> + page = pfn_to_online_page(pfn);
> + if (!page || !get_page_unless_zero(page))
> continue;
>
> if (zone != page_zone(page))
> --
> 2.26.1.8815.g6a475b71f8
>
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-07 12:11 ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
` (2 preceding siblings ...)
2022-09-07 17:32 ` Michal Hocko
@ 2022-09-07 20:57 ` Andrew Morton
2022-09-08 2:47 ` HORIGUCHI NAOYA(堀口 直也)
2022-09-08 2:19 ` Miaohe Lin
2022-09-08 3:28 ` Oscar Salvador
5 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2022-09-07 20:57 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: David Hildenbrand, linux-mm, Muchun Song, Miaohe Lin,
Matthew Wilcox, Michal Hocko, Yang Shi, Naoya Horiguchi
On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote:
>
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks. With debug option enabled,
> the following kernel messages are printed out:
>
> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> page dumped because: unmovable page
> page:000000007d7ab72e is uninitialized and poisoned
> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> ------------[ cut here ]------------
> kernel BUG at include/linux/mm.h:1248!
> invalid opcode: 0000 [#1] PREEMPT SMP PTI
> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> ...
> RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
>
> Use pfn_to_online_page() to get struct page in PFN walker.
>
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
January of 2016! Or was this a long-term bug which was recently
exposed by some other change?
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-07 12:11 ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
` (3 preceding siblings ...)
2022-09-07 20:57 ` Andrew Morton
@ 2022-09-08 2:19 ` Miaohe Lin
2022-09-08 3:06 ` HORIGUCHI NAOYA(堀口 直也)
2022-09-08 3:28 ` Oscar Salvador
5 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2022-09-08 2:19 UTC (permalink / raw)
To: Naoya Horiguchi, David Hildenbrand
Cc: linux-mm, Andrew Morton, Muchun Song, Matthew Wilcox,
Michal Hocko, Yang Shi, Naoya Horiguchi
On 2022/9/7 20:11, Naoya Horiguchi wrote:
> On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote:
>> On 07.09.22 12:23, David Hildenbrand wrote:
>>> On 07.09.22 12:08, Naoya Horiguchi wrote:
>>>> Hi MM folks,
>>>
>>> Hi,
>>>
>>>>
>>>> When I'm testing memory hotremove with various settings, I found the following
>>>> NULL-pointer dereference. It reproduces easily with the folloing steps:
>>>>
>>>> $ echo offline > /sys/devices/system/memory/memoryN/state
>>>> $ echo 1 > /sys/kernel/debug/split_huge_pages
>>>>
>>>
>>> That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
>>>
>>> I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
>>>
>>> [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>> [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>> [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>> [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>> [526045.840498] page dumped because: unmovable page
>>> [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
>>> [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>> [526056.374837] ------------[ cut here ]------------
>>> [526056.379544] kernel BUG at include/linux/mm.h:1248!
>>> [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>> [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
>>> [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
>>> [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>> [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
>>> [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
>>> [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
>>> [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
>>> [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
>>> [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
>>> [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
>>> [526056.471733] FS: 00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
>>> [526056.479906] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
>>> [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [526056.507396] PKRU: 55555554
>>> [526056.510196] Call Trace:
>>> [526056.512734] <TASK>
>>> [526056.514929] ? simple_setattr+0x40/0x60
>>> [526056.518859] ? vfs_mkobj+0x1b0/0x1c0
>>> [526056.522529] ? path_openat+0xb2e/0x1360
>>> [526056.526456] ? do_filp_open+0xa1/0x130
>>> [526056.530296] full_proxy_write+0x50/0x80
>>> [526056.534229] vfs_write+0xd7/0x3e0
>>> [526056.537635] ? fpregs_assert_state_consistent+0x22/0x50
>>> [526056.542955] ? exit_to_user_mode_prepare+0x3c/0x1d0
>>> [526056.547929] ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.552808] ksys_write+0x53/0xd0
>>> [526056.556215] do_syscall_64+0x58/0x80
>>> [526056.559879] ? exit_to_user_mode_prepare+0x3c/0x1d0
>>> [526056.564846] ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.569726] ? do_syscall_64+0x67/0x80
>>> [526056.573563] ? do_syscall_64+0x67/0x80
>>> [526056.577404] ? do_syscall_64+0x67/0x80
>>> [526056.581244] ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.586120] ? do_syscall_64+0x67/0x80
>>> [526056.589961] ? do_syscall_64+0x67/0x80
>>> [526056.593801] ? do_syscall_64+0x67/0x80
>>> [526056.597641] entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>> [526056.602779] RIP: 0033:0x7fe35ab01c17
>>> [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>>> [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>> [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
>>> [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
>>> [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
>>> [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
>>> [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
>>> [526056.669028] </TASK>
>>>
>>>
>>> Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
>>>
>>> Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
>>>
>>
>> And indeed, something like the following might do the trick:
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e9414ee57c5b..f42bb51e023a 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
>> max_zone_pfn = zone_end_pfn(zone);
>> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>> int nr_pages;
>> - if (!pfn_valid(pfn))
>> - continue;
>> - page = pfn_to_page(pfn);
>> - if (!get_page_unless_zero(page))
>> + page = pfn_to_online_page(pfn);
>> + if (!page || !get_page_unless_zero(page))
>> continue;
>> if (zone != page_zone(page))
>>
>
> Thank you very much, I confirmed the fix. So I try to write a patch like below.
> I borrowed your debug message because it has more helpful info than mine.
> If you have any comment or suggestion in the description, please let me know.
>
> Thanks,
> Naoya Horiguchi
> ---
>>From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> split_huge_pages_all()
>
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks. With debug option enabled,
> the following kernel messages are printed out:
>
> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> page dumped because: unmovable page
> page:000000007d7ab72e is uninitialized and poisoned
> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> ------------[ cut here ]------------
> kernel BUG at include/linux/mm.h:1248!
> invalid opcode: 0000 [#1] PREEMPT SMP PTI
> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> ...
> RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
>
> Use pfn_to_online_page() to get struct page in PFN walker.
With changes proposed by David, this patch looks good to me.
Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
*pfn is operated directly* while it's not protected against memory hotremove.
Thanks,
Miaohe Lin
>
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>
> ---
> mm/huge_memory.c | 7 ++-----
> 1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5fa2ba86dae4..03149a8f46f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
> max_zone_pfn = zone_end_pfn(zone);
> for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> int nr_pages;
> - if (!pfn_valid(pfn))
> - continue;
> -
> - page = pfn_to_page(pfn);
> - if (!get_page_unless_zero(page))
> + page = pfn_to_online_page(pfn);
> + if (!page || !get_page_unless_zero(page))
> continue;
>
> if (zone != page_zone(page))
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-07 12:39 ` David Hildenbrand
@ 2022-09-08 2:39 ` HORIGUCHI NAOYA(堀口 直也)
0 siblings, 0 replies; 19+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-09-08 2:39 UTC (permalink / raw)
To: David Hildenbrand
Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Muchun Song,
Miaohe Lin, Matthew Wilcox, Michal Hocko, Yang Shi
On Wed, Sep 07, 2022 at 02:39:02PM +0200, David Hildenbrand wrote:
> On 07.09.22 14:11, Naoya Horiguchi wrote:
...
> > ---
> > From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Date: Wed, 7 Sep 2022 20:58:44 +0900
> > Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> > split_huge_pages_all()
> >
> > NULL pointer dereference is triggered when calling thp split via debugfs
> > on the system with offlined memory blocks. With debug option enabled,
> > the following kernel messages are printed out:
> >
> > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > page dumped because: unmovable page
> > page:000000007d7ab72e is uninitialized and poisoned
> > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > ------------[ cut here ]------------
> > kernel BUG at include/linux/mm.h:1248!
> > invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> > ...
> > RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> >
> > This shows that page_to_nid() in page_zone() is unexpectedly called for an
> > offlined memmap.
> >
> > Use pfn_to_online_page() to get struct page in PFN walker.
> >
> > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> > Suggested-by: David Hildenbrand <david@redhat.com>
>
> Feel free to use
>
> Co-developed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>
Sure, I'll add these.
> instead
>
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> > mm/huge_memory.c | 7 ++-----
> > 1 file changed, 2 insertions(+), 5 deletions(-)
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 5fa2ba86dae4..03149a8f46f9 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
> > max_zone_pfn = zone_end_pfn(zone);
> > for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> > int nr_pages;
>
> Add an empty line here please, while at it.
Added it.
> > - if (!pfn_valid(pfn))
> > - continue;
> > -
> > - page = pfn_to_page(pfn);
> > - if (!get_page_unless_zero(page))
> > + page = pfn_to_online_page(pfn);
> > + if (!page || !get_page_unless_zero(page))
> > continue;
> > if (zone != page_zone(page))
>
> Can you send the patch "officially" as well? Thanks
Yes, I'll send this later in the normal patch submission manner.
Thanks,
Naoya Horiguchi
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-07 20:57 ` Andrew Morton
@ 2022-09-08 2:47 ` HORIGUCHI NAOYA(堀口 直也)
2022-09-08 6:14 ` David Hildenbrand
0 siblings, 1 reply; 19+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-09-08 2:47 UTC (permalink / raw)
To: Andrew Morton
Cc: Naoya Horiguchi, David Hildenbrand, linux-mm, Muchun Song,
Miaohe Lin, Matthew Wilcox, Michal Hocko, Yang Shi
On Wed, Sep 07, 2022 at 01:57:45PM -0700, Andrew Morton wrote:
> On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote:
>
> >
> > NULL pointer dereference is triggered when calling thp split via debugfs
> > on the system with offlined memory blocks. With debug option enabled,
> > the following kernel messages are printed out:
> >
> > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > page dumped because: unmovable page
> > page:000000007d7ab72e is uninitialized and poisoned
> > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > ------------[ cut here ]------------
> > kernel BUG at include/linux/mm.h:1248!
> > invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> > ...
> > RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> >
> > This shows that page_to_nid() in page_zone() is unexpectedly called for an
> > offlined memmap.
> >
> > Use pfn_to_online_page() to get struct page in PFN walker.
> >
> > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
>
> January of 2016! Or was this a long-term bug which was recently
> exposed by some other change?
Maybe yes, I confirmed that this issue reproduces in kernel 5.10 (released
on Dec 13, 2020, which might not be so "recent"), but does not reproduce in
kernel 5.4. So I think that marking "5.10+" for stable is fine.
Thanks,
Naoya Horiguchi
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-08 2:19 ` Miaohe Lin
@ 2022-09-08 3:06 ` HORIGUCHI NAOYA(堀口 直也)
2022-09-08 3:25 ` Miaohe Lin
0 siblings, 1 reply; 19+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-09-08 3:06 UTC (permalink / raw)
To: Miaohe Lin
Cc: Naoya Horiguchi, David Hildenbrand, linux-mm, Andrew Morton,
Muchun Song, Matthew Wilcox, Michal Hocko, Yang Shi
On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
> On 2022/9/7 20:11, Naoya Horiguchi wrote:
...
> >>From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Date: Wed, 7 Sep 2022 20:58:44 +0900
> > Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> > split_huge_pages_all()
> >
> > NULL pointer dereference is triggered when calling thp split via debugfs
> > on the system with offlined memory blocks. With debug option enabled,
> > the following kernel messages are printed out:
> >
> > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > page dumped because: unmovable page
> > page:000000007d7ab72e is uninitialized and poisoned
> > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > ------------[ cut here ]------------
> > kernel BUG at include/linux/mm.h:1248!
> > invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> > ...
> > RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> >
> > This shows that page_to_nid() in page_zone() is unexpectedly called for an
> > offlined memmap.
> >
> > Use pfn_to_online_page() to get struct page in PFN walker.
>
> With changes proposed by David, this patch looks good to me.
>
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
Thank you.
>
> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
> *pfn is operated directly* while it's not protected against memory hotremove.
I had the similar concern, but there seems many place doing PFN walk,
so checking them one-by-one that offlined memory can be walked over
requires much effort.
Thanks,
Naoya Horiguchi
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-08 3:06 ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-09-08 3:25 ` Miaohe Lin
2022-09-08 7:07 ` Michal Hocko
0 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2022-09-08 3:25 UTC (permalink / raw)
To: HORIGUCHI NAOYA(堀口 直也)
Cc: Naoya Horiguchi, David Hildenbrand, linux-mm, Andrew Morton,
Muchun Song, Matthew Wilcox, Michal Hocko, Yang Shi
On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
>> On 2022/9/7 20:11, Naoya Horiguchi wrote:
> ...
>>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>> Date: Wed, 7 Sep 2022 20:58:44 +0900
>>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>>> split_huge_pages_all()
>>>
>>> NULL pointer dereference is triggered when calling thp split via debugfs
>>> on the system with offlined memory blocks. With debug option enabled,
>>> the following kernel messages are printed out:
>>>
>>> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>> page dumped because: unmovable page
>>> page:000000007d7ab72e is uninitialized and poisoned
>>> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>> ------------[ cut here ]------------
>>> kernel BUG at include/linux/mm.h:1248!
>>> invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
>>> ...
>>> RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>>
>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
>>> offlined memmap.
>>>
>>> Use pfn_to_online_page() to get struct page in PFN walker.
>>
>> With changes proposed by David, this patch looks good to me.
>>
>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
>
> Thank you.
>
>>
>> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
>> *pfn is operated directly* while it's not protected against memory hotremove.
>
> I had the similar concern, but there seems many place doing PFN walk,
> so checking them one-by-one that offlined memory can be walked over
> requires much effort.
Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;)
Thanks,
Miaohe Lin
>
> Thanks,
> Naoya Horiguchi
>
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-07 12:11 ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
` (4 preceding siblings ...)
2022-09-08 2:19 ` Miaohe Lin
@ 2022-09-08 3:28 ` Oscar Salvador
5 siblings, 0 replies; 19+ messages in thread
From: Oscar Salvador @ 2022-09-08 3:28 UTC (permalink / raw)
To: Naoya Horiguchi
Cc: David Hildenbrand, linux-mm, Andrew Morton, Muchun Song,
Miaohe Lin, Matthew Wilcox, Michal Hocko, Yang Shi,
Naoya Horiguchi
On Wed, Sep 07, 2022 at 09:11:57PM +0900, Naoya Horiguchi wrote:
> From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> split_huge_pages_all()
>
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks. With debug option enabled,
> the following kernel messages are printed out:
>
> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> page dumped because: unmovable page
> page:000000007d7ab72e is uninitialized and poisoned
> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> ------------[ cut here ]------------
> kernel BUG at include/linux/mm.h:1248!
> invalid opcode: 0000 [#1] PREEMPT SMP PTI
> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> ...
> RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
>
> Use pfn_to_online_page() to get struct page in PFN walker.
>
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
--
Oscar Salvador
SUSE Labs
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-08 2:47 ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-09-08 6:14 ` David Hildenbrand
2022-09-08 6:31 ` HORIGUCHI NAOYA(堀口 直也)
0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2022-09-08 6:14 UTC (permalink / raw)
To: HORIGUCHI NAOYA(堀口 直也), Andrew Morton
Cc: Naoya Horiguchi, linux-mm, Muchun Song, Miaohe Lin,
Matthew Wilcox, Michal Hocko, Yang Shi
On 08.09.22 04:47, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Sep 07, 2022 at 01:57:45PM -0700, Andrew Morton wrote:
>> On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote:
>>
>>>
>>> NULL pointer dereference is triggered when calling thp split via debugfs
>>> on the system with offlined memory blocks. With debug option enabled,
>>> the following kernel messages are printed out:
>>>
>>> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>> page dumped because: unmovable page
>>> page:000000007d7ab72e is uninitialized and poisoned
>>> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>> ------------[ cut here ]------------
>>> kernel BUG at include/linux/mm.h:1248!
>>> invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
>>> ...
>>> RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>>
>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
>>> offlined memmap.
>>>
>>> Use pfn_to_online_page() to get struct page in PFN walker.
>>>
>>> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
>>
>> January of 2016! Or was this a long-term bug which was recently
>> exposed by some other change?
>
> Maybe yes, I confirmed that this issue reproduces in kernel 5.10 (released
> on Dec 13, 2020, which might not be so "recent"), but does not reproduce in
> kernel 5.4. So I think that marking "5.10+" for stable is fine.
I fixed plenty of such issues in the past. For example:
aad5f69bc161 ("fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c")
This one just wasn't found so far because it's triggered via
a debug interface.
The correct fixes tag would be:
Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
Exposed "recently" -- 2018 ;)
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-08 6:14 ` David Hildenbrand
@ 2022-09-08 6:31 ` HORIGUCHI NAOYA(堀口 直也)
0 siblings, 0 replies; 19+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-09-08 6:31 UTC (permalink / raw)
To: David Hildenbrand
Cc: Andrew Morton, Naoya Horiguchi, linux-mm, Muchun Song,
Miaohe Lin, Matthew Wilcox, Michal Hocko, Yang Shi
On Thu, Sep 08, 2022 at 08:14:57AM +0200, David Hildenbrand wrote:
> On 08.09.22 04:47, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Sep 07, 2022 at 01:57:45PM -0700, Andrew Morton wrote:
> > > On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote:
> > >
> > > >
> > > > NULL pointer dereference is triggered when calling thp split via debugfs
> > > > on the system with offlined memory blocks. With debug option enabled,
> > > > the following kernel messages are printed out:
> > > >
> > > > page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> > > > flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> > > > raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> > > > raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > > > page dumped because: unmovable page
> > > > page:000000007d7ab72e is uninitialized and poisoned
> > > > page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > > > ------------[ cut here ]------------
> > > > kernel BUG at include/linux/mm.h:1248!
> > > > invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > > > CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> > > > ...
> > > > RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > > >
> > > > This shows that page_to_nid() in page_zone() is unexpectedly called for an
> > > > offlined memmap.
> > > >
> > > > Use pfn_to_online_page() to get struct page in PFN walker.
> > > >
> > > > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> > >
> > > January of 2016! Or was this a long-term bug which was recently
> > > exposed by some other change?
> >
> > Maybe yes, I confirmed that this issue reproduces in kernel 5.10 (released
> > on Dec 13, 2020, which might not be so "recent"), but does not reproduce in
> > kernel 5.4. So I think that marking "5.10+" for stable is fine.
>
>
> I fixed plenty of such issues in the past. For example:
>
> aad5f69bc161 ("fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c")
>
> This one just wasn't found so far because it's triggered via
> a debug interface.
>
> The correct fixes tag would be:
>
> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online") [visible after d0dc12e86b319]
>
> Exposed "recently" -- 2018 ;)
Great, thank you for pinpointing the commit.
- Naoya Horiguchi
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-08 3:25 ` Miaohe Lin
@ 2022-09-08 7:07 ` Michal Hocko
2022-09-09 0:27 ` Miaohe Lin
0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2022-09-08 7:07 UTC (permalink / raw)
To: Miaohe Lin
Cc: HORIGUCHI NAOYA(堀口 直也),
Naoya Horiguchi, David Hildenbrand, linux-mm, Andrew Morton,
Muchun Song, Matthew Wilcox, Yang Shi
On Thu 08-09-22 11:25:54, Miaohe Lin wrote:
> On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
> >> On 2022/9/7 20:11, Naoya Horiguchi wrote:
> > ...
> >>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> >>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> >>> Date: Wed, 7 Sep 2022 20:58:44 +0900
> >>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> >>> split_huge_pages_all()
> >>>
> >>> NULL pointer dereference is triggered when calling thp split via debugfs
> >>> on the system with offlined memory blocks. With debug option enabled,
> >>> the following kernel messages are printed out:
> >>>
> >>> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> >>> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> >>> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> >>> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> >>> page dumped because: unmovable page
> >>> page:000000007d7ab72e is uninitialized and poisoned
> >>> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >>> ------------[ cut here ]------------
> >>> kernel BUG at include/linux/mm.h:1248!
> >>> invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >>> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
> >>> ...
> >>> RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> >>>
> >>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> >>> offlined memmap.
> >>>
> >>> Use pfn_to_online_page() to get struct page in PFN walker.
> >>
> >> With changes proposed by David, this patch looks good to me.
> >>
> >> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> >
> > Thank you.
> >
> >>
> >> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
> >> *pfn is operated directly* while it's not protected against memory hotremove.
> >
> > I had the similar concern, but there seems many place doing PFN walk,
> > so checking them one-by-one that offlined memory can be walked over
> > requires much effort.
>
> Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;)
Most of those whic are directly triggerable should be taken care of. It
would be still good to go through `git grep -w pfn_to_page' and evaluate
all callers. Still more than 400 callsites so not a trivial task.
--
Michal Hocko
SUSE Labs
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-08 7:07 ` Michal Hocko
@ 2022-09-09 0:27 ` Miaohe Lin
2022-09-09 9:03 ` David Hildenbrand
0 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2022-09-09 0:27 UTC (permalink / raw)
To: Michal Hocko
Cc: HORIGUCHI NAOYA(堀口 直也),
Naoya Horiguchi, David Hildenbrand, linux-mm, Andrew Morton,
Muchun Song, Matthew Wilcox, Yang Shi
On 2022/9/8 15:07, Michal Hocko wrote:
> On Thu 08-09-22 11:25:54, Miaohe Lin wrote:
>> On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote:
>>> On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
>>>> On 2022/9/7 20:11, Naoya Horiguchi wrote:
>>> ...
>>>>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
>>>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>>>> Date: Wed, 7 Sep 2022 20:58:44 +0900
>>>>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>>>>> split_huge_pages_all()
>>>>>
>>>>> NULL pointer dereference is triggered when calling thp split via debugfs
>>>>> on the system with offlined memory blocks. With debug option enabled,
>>>>> the following kernel messages are printed out:
>>>>>
>>>>> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>>>> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>>>> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>>>> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>>> page dumped because: unmovable page
>>>>> page:000000007d7ab72e is uninitialized and poisoned
>>>>> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>>>> ------------[ cut here ]------------
>>>>> kernel BUG at include/linux/mm.h:1248!
>>>>> invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>>>> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
>>>>> ...
>>>>> RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>>>>
>>>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
>>>>> offlined memmap.
>>>>>
>>>>> Use pfn_to_online_page() to get struct page in PFN walker.
>>>>
>>>> With changes proposed by David, this patch looks good to me.
>>>>
>>>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
>>>
>>> Thank you.
>>>
>>>>
>>>> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
>>>> *pfn is operated directly* while it's not protected against memory hotremove.
>>>
>>> I had the similar concern, but there seems many place doing PFN walk,
>>> so checking them one-by-one that offlined memory can be walked over
>>> requires much effort.
>>
>> Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;)
>
> Most of those whic are directly triggerable should be taken care of. It
> would be still good to go through `git grep -w pfn_to_page' and evaluate
> all callers. Still more than 400 callsites so not a trivial task.
Agree. Even if pfn_to_online_page() is used, it might still races with the memory hotremove if
there's no way to guard against it, e.g. holding an extra page refcnt. So code audit should also
apply to pfn_to_online_page().
Thanks,
Miaohe Lin
^ permalink raw reply [flat|nested] 19+ messages in thread
* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
2022-09-09 0:27 ` Miaohe Lin
@ 2022-09-09 9:03 ` David Hildenbrand
0 siblings, 0 replies; 19+ messages in thread
From: David Hildenbrand @ 2022-09-09 9:03 UTC (permalink / raw)
To: Miaohe Lin, Michal Hocko
Cc: HORIGUCHI NAOYA(堀口 直也),
Naoya Horiguchi, linux-mm, Andrew Morton, Muchun Song,
Matthew Wilcox, Yang Shi
On 09.09.22 02:27, Miaohe Lin wrote:
> On 2022/9/8 15:07, Michal Hocko wrote:
>> On Thu 08-09-22 11:25:54, Miaohe Lin wrote:
>>> On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote:
>>>> On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
>>>>> On 2022/9/7 20:11, Naoya Horiguchi wrote:
>>>> ...
>>>>>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
>>>>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>>>>> Date: Wed, 7 Sep 2022 20:58:44 +0900
>>>>>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>>>>>> split_huge_pages_all()
>>>>>>
>>>>>> NULL pointer dereference is triggered when calling thp split via debugfs
>>>>>> on the system with offlined memory blocks. With debug option enabled,
>>>>>> the following kernel messages are printed out:
>>>>>>
>>>>>> page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>>>>> flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>>>>> raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>>>>> raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>>>> page dumped because: unmovable page
>>>>>> page:000000007d7ab72e is uninitialized and poisoned
>>>>>> page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>>>>> ------------[ cut here ]------------
>>>>>> kernel BUG at include/linux/mm.h:1248!
>>>>>> invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>>>>> CPU: 16 PID: 20964 Comm: bash Tainted: G I 6.0.0-rc3-foll-numa+ #41
>>>>>> ...
>>>>>> RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>>>>>
>>>>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
>>>>>> offlined memmap.
>>>>>>
>>>>>> Use pfn_to_online_page() to get struct page in PFN walker.
>>>>>
>>>>> With changes proposed by David, this patch looks good to me.
>>>>>
>>>>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>
>>>> Thank you.
>>>>
>>>>>
>>>>> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
>>>>> *pfn is operated directly* while it's not protected against memory hotremove.
>>>>
>>>> I had the similar concern, but there seems many place doing PFN walk,
>>>> so checking them one-by-one that offlined memory can be walked over
>>>> requires much effort.
>>>
>>> Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;)
>>
>> Most of those whic are directly triggerable should be taken care of. It
>> would be still good to go through `git grep -w pfn_to_page' and evaluate
>> all callers. Still more than 400 callsites so not a trivial task.
>
> Agree. Even if pfn_to_online_page() is used, it might still races with the memory hotremove if
> there's no way to guard against it, e.g. holding an extra page refcnt. So code audit should also
> apply to pfn_to_online_page().
Racing with offlining+unplug a long known problem and so far we decided
to ignore it, because it never appeared in practice.
There once were ideas of using RCU as protection.
--
Thanks,
David / dhildenb
^ permalink raw reply [flat|nested] 19+ messages in thread
end of thread, other threads:[~2022-09-09 9:03 UTC | newest]
Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-07 10:08 [BUG report] kernel NULL pointer dereference in split_huge_page with offlined memory block Naoya Horiguchi
2022-09-07 10:23 ` David Hildenbrand
2022-09-07 10:26 ` David Hildenbrand
2022-09-07 12:11 ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
2022-09-07 12:39 ` David Hildenbrand
2022-09-08 2:39 ` HORIGUCHI NAOYA(堀口 直也)
2022-09-07 17:10 ` Yang Shi
2022-09-07 17:32 ` Michal Hocko
2022-09-07 20:57 ` Andrew Morton
2022-09-08 2:47 ` HORIGUCHI NAOYA(堀口 直也)
2022-09-08 6:14 ` David Hildenbrand
2022-09-08 6:31 ` HORIGUCHI NAOYA(堀口 直也)
2022-09-08 2:19 ` Miaohe Lin
2022-09-08 3:06 ` HORIGUCHI NAOYA(堀口 直也)
2022-09-08 3:25 ` Miaohe Lin
2022-09-08 7:07 ` Michal Hocko
2022-09-09 0:27 ` Miaohe Lin
2022-09-09 9:03 ` David Hildenbrand
2022-09-08 3:28 ` Oscar Salvador
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).