linux-mm.kvack.org archive mirror
 help / color / mirror / Atom feed
* [BUG report] kernel NULL pointer dereference in split_huge_page with offlined memory block
@ 2022-09-07 10:08 Naoya Horiguchi
  2022-09-07 10:23 ` David Hildenbrand
  0 siblings, 1 reply; 19+ messages in thread
From: Naoya Horiguchi @ 2022-09-07 10:08 UTC (permalink / raw)
  To: linux-mm
  Cc: Andrew Morton, David Hildenbrand, Muchun Song, Miaohe Lin,
	Matthew Wilcox, Michal Hocko, Yang Shi, Naoya Horiguchi

Hi MM folks,

When I'm testing memory hotremove with various settings, I found the following
NULL-pointer dereference.  It reproduces easily with the folloing steps:

  $ echo offline > /sys/devices/system/memory/memoryN/state
  $ echo 1 > /sys/kernel/debug/split_huge_pages

I don't check in which commit this was introduced yet (at least v6.0-rc1,
v6.0-rc4 and mm-everything-2022-09-05-23-30 are affected), but I expect that
someone might have clear idea about this, so let me share first.

Thanks,
Naoya Horiguchi
---

  [  309.947421] BUG: kernel NULL pointer dereference, address: 0000000000000032
  [  309.949600] #PF: supervisor read access in kernel mode
  [  309.951220] #PF: error_code(0x0000) - not-present page
  [  309.952819] PGD 0 P4D 0
  [  309.953649] Oops: 0000 [#1] PREEMPT SMP PTI
  [  309.954999] CPU: 1 PID: 846 Comm: bash Tainted: G            E    N 6.0.0-rc1-v6.0-rc1-220815-2254-000-rc1+ #62
  [  309.958170] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1.fc35 04/01/2014
  [  309.960759] RIP: 0010:split_huge_pages_write.part.0+0x40c/0xe70
  [  309.962684] Code: 00 00 00 4d 8b ae 90 00 00 00 49 01 dd 4c 39 eb 72 47 eb c8 48 8b 41 08 a8 01 0f 85 57 08 00 00 0f 1f 44 00 00 0f 1f 44 00 00 <41> 8b 47 34 85 c0 0f 84 1c 09 00 00 f0 41 ff 4f 34 0f 94 c0 0f 1f
  [  309.968381] RSP: 0018:ffffb4d201d6bbd0 EFLAGS: 00010202
  [  309.970067] RAX: ffffffffffffffff RBX: 0000000000230000 RCX: ffffd6fac8c00000
  [  309.972262] RDX: 00000000000003ff RSI: 0000000000000014 RDI: ffffd6fac4fff300
  [  309.974475] RBP: ffffb4d201d6bc12 R08: 0000000000000054 R09: ffffd6fac46b7f88
  [  309.976725] R10: 00000000ffffffff R11: ffffff8000000000 R12: 0000000000001454
  [  309.978980] R13: 0000000000248000 R14: ffff93ce3ffd5d80 R15: fffffffffffffffe
  [  309.981267] FS:  00007fe2cd337740(0000) GS:ffff93ce3bc80000(0000) knlGS:0000000000000000
  [  309.983842] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  309.985672] CR2: 0000000000000032 CR3: 00000001018fc005 CR4: 0000000000170ee0
  [  309.987909] Call Trace:
  [  309.988794]  <TASK>
  [  309.989461]  ? _raw_spin_lock+0x13/0x40
  [  309.990578]  ? __mark_inode_dirty+0x113/0x390
  [  309.991933]  ? terminate_walk+0x90/0x100
  [  309.993186]  ? path_openat+0x440/0x1070
  [  309.994421]  ? do_filp_open+0x9f/0x130
  [  309.995610]  full_proxy_write+0x53/0x80
  [  309.996820]  vfs_write+0xb7/0x3a0
  [  309.997902]  ? _raw_spin_unlock+0x15/0x30
  [  309.999190]  ksys_write+0x4f/0xd0
  [  310.000249]  do_syscall_64+0x3b/0x90
  [  310.001418]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
  [  310.002938] RIP: 0033:0x7fe2cd1018b7
  [  310.004143] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
  [  310.009871] RSP: 002b:00007ffc625f63f8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
  [  310.012060] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe2cd1018b7
  [  310.014250] RDX: 0000000000000002 RSI: 000055c1a80afc50 RDI: 0000000000000001
  [  310.016533] RBP: 000055c1a80afc50 R08: 0000000000000000 R09: 00007fe2cd1b64e0
  [  310.018782] R10: 00007fe2cd1b63e0 R11: 0000000000000246 R12: 0000000000000002
  [  310.021086] R13: 00007fe2cd1fb5a0 R14: 0000000000000002 R15: 00007fe2cd1fb7a0
  [  310.023169]  </TASK>
  [  310.023844] Modules linked in: nft_fib_inet(E) nft_fib_ipv4(E) nft_fib_ipv6(E) nft_fib(E) nft_reject_inet(E) nf_reject_ipv4(E) nf_reject_ipv6(E) nft_reject(E) nft_ct(E) nft_chain_nat(E) nf_nat(E) nf_conntrack(E) nf_defrag_ipv6(E) nf_defrag_ipv4(E) ip_set(E) rfkill(E) nf_tables(E) nfnetlink(E) qrtr(E) sunrpc(E) 9p(E) fscache(E) netfs(E) intel_rapl_msr(E) intel_rapl_common(E) kvm_intel(E) kvm(E) irqbypass(E) virtio_balloon(E) rapl(E) 9pnet_virtio(E) i2c_piix4(E) 9pnet(E) joydev(E) pcspkr(E) fuse(E) zram(E) ip_tables(E) xfs(E) crc32c_intel(E) serio_raw(E) virtio_blk(E) e1000(E) ata_generic(E) pata_acpi(E) floppy(E) qemu_fw_cfg(E)
  [  310.040426] CR2: 0000000000000032
  [  310.041715] ---[ end trace 0000000000000000 ]---
  [  310.043196] RIP: 0010:split_huge_pages_write.part.0+0x40c/0xe70
  [  310.044953] Code: 00 00 00 4d 8b ae 90 00 00 00 49 01 dd 4c 39 eb 72 47 eb c8 48 8b 41 08 a8 01 0f 85 57 08 00 00 0f 1f 44 00 00 0f 1f 44 00 00 <41> 8b 47 34 85 c0 0f 84 1c 09 00 00 f0 41 ff 4f 34 0f 94 c0 0f 1f
  [  310.050051] RSP: 0018:ffffb4d201d6bbd0 EFLAGS: 00010202
  [  310.051593] RAX: ffffffffffffffff RBX: 0000000000230000 RCX: ffffd6fac8c00000
  [  310.053664] RDX: 00000000000003ff RSI: 0000000000000014 RDI: ffffd6fac4fff300
  [  310.056165] RBP: ffffb4d201d6bc12 R08: 0000000000000054 R09: ffffd6fac46b7f88
  [  310.059144] R10: 00000000ffffffff R11: ffffff8000000000 R12: 0000000000001454
  [  310.062033] R13: 0000000000248000 R14: ffff93ce3ffd5d80 R15: fffffffffffffffe
  [  310.069111] FS:  00007fe2cd337740(0000) GS:ffff93ce3bc80000(0000) knlGS:0000000000000000
  [  310.077141] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
  [  310.079988] CR2: 0000000000000032 CR3: 00000001018fc005 CR4: 0000000000170ee0
  [  310.083292] Kernel panic - not syncing: Fatal exception
  [  310.086117] Kernel Offset: 0x1a000000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
  [  310.090607] Rebooting in 2 seconds..


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG report] kernel NULL pointer dereference in split_huge_page with offlined memory block
  2022-09-07 10:08 [BUG report] kernel NULL pointer dereference in split_huge_page with offlined memory block Naoya Horiguchi
@ 2022-09-07 10:23 ` David Hildenbrand
  2022-09-07 10:26   ` David Hildenbrand
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2022-09-07 10:23 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Muchun Song, Miaohe Lin, Matthew Wilcox,
	Michal Hocko, Yang Shi, Naoya Horiguchi

On 07.09.22 12:08, Naoya Horiguchi wrote:
> Hi MM folks,

Hi,

> 
> When I'm testing memory hotremove with various settings, I found the following
> NULL-pointer dereference.  It reproduces easily with the folloing steps:
> 
>    $ echo offline > /sys/devices/system/memory/memoryN/state
>    $ echo 1 > /sys/kernel/debug/split_huge_pages
> 

That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().

I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:

[526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
[526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
[526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
[526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
[526045.840498] page dumped because: unmovable page
[526056.362715] page:000000007d7ab72e is uninitialized and poisoned
[526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
[526056.374837] ------------[ cut here ]------------
[526056.379544] kernel BUG at include/linux/mm.h:1248!
[526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
[526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
[526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
[526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
[526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
[526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
[526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
[526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
[526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
[526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
[526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
[526056.471733] FS:  00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
[526056.479906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
[526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[526056.507396] PKRU: 55555554
[526056.510196] Call Trace:
[526056.512734]  <TASK>
[526056.514929]  ? simple_setattr+0x40/0x60
[526056.518859]  ? vfs_mkobj+0x1b0/0x1c0
[526056.522529]  ? path_openat+0xb2e/0x1360
[526056.526456]  ? do_filp_open+0xa1/0x130
[526056.530296]  full_proxy_write+0x50/0x80
[526056.534229]  vfs_write+0xd7/0x3e0
[526056.537635]  ? fpregs_assert_state_consistent+0x22/0x50
[526056.542955]  ? exit_to_user_mode_prepare+0x3c/0x1d0
[526056.547929]  ? syscall_exit_to_user_mode+0x17/0x40
[526056.552808]  ksys_write+0x53/0xd0
[526056.556215]  do_syscall_64+0x58/0x80
[526056.559879]  ? exit_to_user_mode_prepare+0x3c/0x1d0
[526056.564846]  ? syscall_exit_to_user_mode+0x17/0x40
[526056.569726]  ? do_syscall_64+0x67/0x80
[526056.573563]  ? do_syscall_64+0x67/0x80
[526056.577404]  ? do_syscall_64+0x67/0x80
[526056.581244]  ? syscall_exit_to_user_mode+0x17/0x40
[526056.586120]  ? do_syscall_64+0x67/0x80
[526056.589961]  ? do_syscall_64+0x67/0x80
[526056.593801]  ? do_syscall_64+0x67/0x80
[526056.597641]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
[526056.602779] RIP: 0033:0x7fe35ab01c17
[526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
[526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
[526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
[526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
[526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
[526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
[526056.669028]  </TASK>


Looks like there is a page_to_nid() done in an offline memmap, which is wrong.

Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [BUG report] kernel NULL pointer dereference in split_huge_page with offlined memory block
  2022-09-07 10:23 ` David Hildenbrand
@ 2022-09-07 10:26   ` David Hildenbrand
  2022-09-07 12:11     ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2022-09-07 10:26 UTC (permalink / raw)
  To: Naoya Horiguchi, linux-mm
  Cc: Andrew Morton, Muchun Song, Miaohe Lin, Matthew Wilcox,
	Michal Hocko, Yang Shi, Naoya Horiguchi

On 07.09.22 12:23, David Hildenbrand wrote:
> On 07.09.22 12:08, Naoya Horiguchi wrote:
>> Hi MM folks,
> 
> Hi,
> 
>>
>> When I'm testing memory hotremove with various settings, I found the following
>> NULL-pointer dereference.  It reproduces easily with the folloing steps:
>>
>>     $ echo offline > /sys/devices/system/memory/memoryN/state
>>     $ echo 1 > /sys/kernel/debug/split_huge_pages
>>
> 
> That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
> 
> I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
> 
> [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> [526045.840498] page dumped because: unmovable page
> [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
> [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> [526056.374837] ------------[ cut here ]------------
> [526056.379544] kernel BUG at include/linux/mm.h:1248!
> [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
> [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
> [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
> [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
> [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
> [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
> [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
> [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
> [526056.471733] FS:  00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
> [526056.479906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
> [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> [526056.507396] PKRU: 55555554
> [526056.510196] Call Trace:
> [526056.512734]  <TASK>
> [526056.514929]  ? simple_setattr+0x40/0x60
> [526056.518859]  ? vfs_mkobj+0x1b0/0x1c0
> [526056.522529]  ? path_openat+0xb2e/0x1360
> [526056.526456]  ? do_filp_open+0xa1/0x130
> [526056.530296]  full_proxy_write+0x50/0x80
> [526056.534229]  vfs_write+0xd7/0x3e0
> [526056.537635]  ? fpregs_assert_state_consistent+0x22/0x50
> [526056.542955]  ? exit_to_user_mode_prepare+0x3c/0x1d0
> [526056.547929]  ? syscall_exit_to_user_mode+0x17/0x40
> [526056.552808]  ksys_write+0x53/0xd0
> [526056.556215]  do_syscall_64+0x58/0x80
> [526056.559879]  ? exit_to_user_mode_prepare+0x3c/0x1d0
> [526056.564846]  ? syscall_exit_to_user_mode+0x17/0x40
> [526056.569726]  ? do_syscall_64+0x67/0x80
> [526056.573563]  ? do_syscall_64+0x67/0x80
> [526056.577404]  ? do_syscall_64+0x67/0x80
> [526056.581244]  ? syscall_exit_to_user_mode+0x17/0x40
> [526056.586120]  ? do_syscall_64+0x67/0x80
> [526056.589961]  ? do_syscall_64+0x67/0x80
> [526056.593801]  ? do_syscall_64+0x67/0x80
> [526056.597641]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> [526056.602779] RIP: 0033:0x7fe35ab01c17
> [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
> [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
> [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
> [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
> [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
> [526056.669028]  </TASK>
> 
> 
> Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
> 
> Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
> 

And indeed, something like the following might do the trick:

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e9414ee57c5b..f42bb51e023a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
                 max_zone_pfn = zone_end_pfn(zone);
                 for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
                         int nr_pages;
-                       if (!pfn_valid(pfn))
-                               continue;
  
-                       page = pfn_to_page(pfn);
-                       if (!get_page_unless_zero(page))
+                       page = pfn_to_online_page(pfn);
+                       if (!page || !get_page_unless_zero(page))
                                 continue;
  
                         if (zone != page_zone(page))


-- 
Thanks,

David / dhildenb



^ permalink raw reply related	[flat|nested] 19+ messages in thread

* [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-07 10:26   ` David Hildenbrand
@ 2022-09-07 12:11     ` Naoya Horiguchi
  2022-09-07 12:39       ` David Hildenbrand
                         ` (5 more replies)
  0 siblings, 6 replies; 19+ messages in thread
From: Naoya Horiguchi @ 2022-09-07 12:11 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: linux-mm, Andrew Morton, Muchun Song, Miaohe Lin, Matthew Wilcox,
	Michal Hocko, Yang Shi, Naoya Horiguchi

On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote:
> On 07.09.22 12:23, David Hildenbrand wrote:
> > On 07.09.22 12:08, Naoya Horiguchi wrote:
> > > Hi MM folks,
> > 
> > Hi,
> > 
> > > 
> > > When I'm testing memory hotremove with various settings, I found the following
> > > NULL-pointer dereference.  It reproduces easily with the folloing steps:
> > > 
> > >     $ echo offline > /sys/devices/system/memory/memoryN/state
> > >     $ echo 1 > /sys/kernel/debug/split_huge_pages
> > > 
> > 
> > That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
> > 
> > I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
> > 
> > [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> > [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> > [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> > [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > [526045.840498] page dumped because: unmovable page
> > [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
> > [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > [526056.374837] ------------[ cut here ]------------
> > [526056.379544] kernel BUG at include/linux/mm.h:1248!
> > [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> > [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
> > [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
> > [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
> > [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
> > [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
> > [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
> > [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
> > [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
> > [526056.471733] FS:  00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
> > [526056.479906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
> > [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > [526056.507396] PKRU: 55555554
> > [526056.510196] Call Trace:
> > [526056.512734]  <TASK>
> > [526056.514929]  ? simple_setattr+0x40/0x60
> > [526056.518859]  ? vfs_mkobj+0x1b0/0x1c0
> > [526056.522529]  ? path_openat+0xb2e/0x1360
> > [526056.526456]  ? do_filp_open+0xa1/0x130
> > [526056.530296]  full_proxy_write+0x50/0x80
> > [526056.534229]  vfs_write+0xd7/0x3e0
> > [526056.537635]  ? fpregs_assert_state_consistent+0x22/0x50
> > [526056.542955]  ? exit_to_user_mode_prepare+0x3c/0x1d0
> > [526056.547929]  ? syscall_exit_to_user_mode+0x17/0x40
> > [526056.552808]  ksys_write+0x53/0xd0
> > [526056.556215]  do_syscall_64+0x58/0x80
> > [526056.559879]  ? exit_to_user_mode_prepare+0x3c/0x1d0
> > [526056.564846]  ? syscall_exit_to_user_mode+0x17/0x40
> > [526056.569726]  ? do_syscall_64+0x67/0x80
> > [526056.573563]  ? do_syscall_64+0x67/0x80
> > [526056.577404]  ? do_syscall_64+0x67/0x80
> > [526056.581244]  ? syscall_exit_to_user_mode+0x17/0x40
> > [526056.586120]  ? do_syscall_64+0x67/0x80
> > [526056.589961]  ? do_syscall_64+0x67/0x80
> > [526056.593801]  ? do_syscall_64+0x67/0x80
> > [526056.597641]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > [526056.602779] RIP: 0033:0x7fe35ab01c17
> > [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> > [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
> > [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
> > [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
> > [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
> > [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
> > [526056.669028]  </TASK>
> > 
> > 
> > Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
> > 
> > Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
> > 
> 
> And indeed, something like the following might do the trick:
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e9414ee57c5b..f42bb51e023a 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
>                 max_zone_pfn = zone_end_pfn(zone);
>                 for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>                         int nr_pages;
> -                       if (!pfn_valid(pfn))
> -                               continue;
> -                       page = pfn_to_page(pfn);
> -                       if (!get_page_unless_zero(page))
> +                       page = pfn_to_online_page(pfn);
> +                       if (!page || !get_page_unless_zero(page))
>                                 continue;
>                         if (zone != page_zone(page))
> 

Thank you very much, I confirmed the fix.  So I try to write a patch like below.
I borrowed your debug message because it has more helpful info than mine.
If you have any comment or suggestion in the description, please let me know.

Thanks,
Naoya Horiguchi
---
From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
From: Naoya Horiguchi <naoya.horiguchi@nec.com>
Date: Wed, 7 Sep 2022 20:58:44 +0900
Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
 split_huge_pages_all()

NULL pointer dereference is triggered when calling thp split via debugfs
on the system with offlined memory blocks.  With debug option enabled,
the following kernel messages are printed out:

  page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
  flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
  raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
  raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
  page dumped because: unmovable page
  page:000000007d7ab72e is uninitialized and poisoned
  page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
  ------------[ cut here ]------------
  kernel BUG at include/linux/mm.h:1248!
  invalid opcode: 0000 [#1] PREEMPT SMP PTI
  CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
  ...
  RIP: 0010:split_huge_pages_write+0xcf4/0xe30

This shows that page_to_nid() in page_zone() is unexpectedly called for an
offlined memmap.

Use pfn_to_online_page() to get struct page in PFN walker.

Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
Suggested-by: David Hildenbrand <david@redhat.com>
Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
Cc: <stable@vger.kernel.org>
---
 mm/huge_memory.c | 7 ++-----
 1 file changed, 2 insertions(+), 5 deletions(-)

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 5fa2ba86dae4..03149a8f46f9 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
 		max_zone_pfn = zone_end_pfn(zone);
 		for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
 			int nr_pages;
-			if (!pfn_valid(pfn))
-				continue;
-
-			page = pfn_to_page(pfn);
-			if (!get_page_unless_zero(page))
+			page = pfn_to_online_page(pfn);
+			if (!page || !get_page_unless_zero(page))
 				continue;
 
 			if (zone != page_zone(page))
-- 
2.26.1.8815.g6a475b71f8




^ permalink raw reply related	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-07 12:11     ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
@ 2022-09-07 12:39       ` David Hildenbrand
  2022-09-08  2:39         ` HORIGUCHI NAOYA(堀口 直也)
  2022-09-07 17:10       ` Yang Shi
                         ` (4 subsequent siblings)
  5 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2022-09-07 12:39 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: linux-mm, Andrew Morton, Muchun Song, Miaohe Lin, Matthew Wilcox,
	Michal Hocko, Yang Shi, Naoya Horiguchi

On 07.09.22 14:11, Naoya Horiguchi wrote:
> On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote:
>> On 07.09.22 12:23, David Hildenbrand wrote:
>>> On 07.09.22 12:08, Naoya Horiguchi wrote:
>>>> Hi MM folks,
>>>
>>> Hi,
>>>
>>>>
>>>> When I'm testing memory hotremove with various settings, I found the following
>>>> NULL-pointer dereference.  It reproduces easily with the folloing steps:
>>>>
>>>>      $ echo offline > /sys/devices/system/memory/memoryN/state
>>>>      $ echo 1 > /sys/kernel/debug/split_huge_pages
>>>>
>>>
>>> That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
>>>
>>> I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
>>>
>>> [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>> [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>> [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>> [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>> [526045.840498] page dumped because: unmovable page
>>> [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
>>> [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>> [526056.374837] ------------[ cut here ]------------
>>> [526056.379544] kernel BUG at include/linux/mm.h:1248!
>>> [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>> [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>>> [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
>>> [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>> [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
>>> [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
>>> [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
>>> [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
>>> [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
>>> [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
>>> [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
>>> [526056.471733] FS:  00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
>>> [526056.479906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
>>> [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [526056.507396] PKRU: 55555554
>>> [526056.510196] Call Trace:
>>> [526056.512734]  <TASK>
>>> [526056.514929]  ? simple_setattr+0x40/0x60
>>> [526056.518859]  ? vfs_mkobj+0x1b0/0x1c0
>>> [526056.522529]  ? path_openat+0xb2e/0x1360
>>> [526056.526456]  ? do_filp_open+0xa1/0x130
>>> [526056.530296]  full_proxy_write+0x50/0x80
>>> [526056.534229]  vfs_write+0xd7/0x3e0
>>> [526056.537635]  ? fpregs_assert_state_consistent+0x22/0x50
>>> [526056.542955]  ? exit_to_user_mode_prepare+0x3c/0x1d0
>>> [526056.547929]  ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.552808]  ksys_write+0x53/0xd0
>>> [526056.556215]  do_syscall_64+0x58/0x80
>>> [526056.559879]  ? exit_to_user_mode_prepare+0x3c/0x1d0
>>> [526056.564846]  ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.569726]  ? do_syscall_64+0x67/0x80
>>> [526056.573563]  ? do_syscall_64+0x67/0x80
>>> [526056.577404]  ? do_syscall_64+0x67/0x80
>>> [526056.581244]  ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.586120]  ? do_syscall_64+0x67/0x80
>>> [526056.589961]  ? do_syscall_64+0x67/0x80
>>> [526056.593801]  ? do_syscall_64+0x67/0x80
>>> [526056.597641]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>> [526056.602779] RIP: 0033:0x7fe35ab01c17
>>> [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>>> [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>> [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
>>> [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
>>> [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
>>> [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
>>> [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
>>> [526056.669028]  </TASK>
>>>
>>>
>>> Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
>>>
>>> Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
>>>
>>
>> And indeed, something like the following might do the trick:
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e9414ee57c5b..f42bb51e023a 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
>>                  max_zone_pfn = zone_end_pfn(zone);
>>                  for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>>                          int nr_pages;
>> -                       if (!pfn_valid(pfn))
>> -                               continue;
>> -                       page = pfn_to_page(pfn);
>> -                       if (!get_page_unless_zero(page))
>> +                       page = pfn_to_online_page(pfn);
>> +                       if (!page || !get_page_unless_zero(page))
>>                                  continue;
>>                          if (zone != page_zone(page))
>>
> 
> Thank you very much, I confirmed the fix.  So I try to write a patch like below.
> I borrowed your debug message because it has more helpful info than mine.
> If you have any comment or suggestion in the description, please let me know.
> 
> Thanks,
> Naoya Horiguchi
> ---
>  From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>   split_huge_pages_all()
> 
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks.  With debug option enabled,
> the following kernel messages are printed out:
> 
>    page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>    flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>    raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>    raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>    page dumped because: unmovable page
>    page:000000007d7ab72e is uninitialized and poisoned
>    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>    ------------[ cut here ]------------
>    kernel BUG at include/linux/mm.h:1248!
>    invalid opcode: 0000 [#1] PREEMPT SMP PTI
>    CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>    ...
>    RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> 
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
> 
> Use pfn_to_online_page() to get struct page in PFN walker.
> 
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>

Feel free to use

Co-developed-by: David Hildenbrand <david@redhat.com>
Signed-off-by: David Hildenbrand <david@redhat.com>

instead

> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>
> ---
>   mm/huge_memory.c | 7 ++-----
>   1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5fa2ba86dae4..03149a8f46f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
>   		max_zone_pfn = zone_end_pfn(zone);
>   		for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>   			int nr_pages;

Add an empty line here please, while at it.

> -			if (!pfn_valid(pfn))
> -				continue;
> -
> -			page = pfn_to_page(pfn);
> -			if (!get_page_unless_zero(page))
> +			page = pfn_to_online_page(pfn);
> +			if (!page || !get_page_unless_zero(page))
>   				continue;
>   
>   			if (zone != page_zone(page))

Can you send the patch "officially" as well? Thanks

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-07 12:11     ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
  2022-09-07 12:39       ` David Hildenbrand
@ 2022-09-07 17:10       ` Yang Shi
  2022-09-07 17:32       ` Michal Hocko
                         ` (3 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Yang Shi @ 2022-09-07 17:10 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: David Hildenbrand, linux-mm, Andrew Morton, Muchun Song,
	Miaohe Lin, Matthew Wilcox, Michal Hocko, Naoya Horiguchi

On Wed, Sep 7, 2022 at 5:12 AM Naoya Horiguchi
<naoya.horiguchi@linux.dev> wrote:
>
> On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote:
> > On 07.09.22 12:23, David Hildenbrand wrote:
> > > On 07.09.22 12:08, Naoya Horiguchi wrote:
> > > > Hi MM folks,
> > >
> > > Hi,
> > >
> > > >
> > > > When I'm testing memory hotremove with various settings, I found the following
> > > > NULL-pointer dereference.  It reproduces easily with the folloing steps:
> > > >
> > > >     $ echo offline > /sys/devices/system/memory/memoryN/state
> > > >     $ echo 1 > /sys/kernel/debug/split_huge_pages
> > > >
> > >
> > > That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
> > >
> > > I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
> > >
> > > [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> > > [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> > > [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> > > [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > > [526045.840498] page dumped because: unmovable page
> > > [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
> > > [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > > [526056.374837] ------------[ cut here ]------------
> > > [526056.379544] kernel BUG at include/linux/mm.h:1248!
> > > [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > > [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> > > [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
> > > [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > > [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
> > > [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
> > > [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
> > > [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
> > > [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
> > > [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
> > > [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
> > > [526056.471733] FS:  00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
> > > [526056.479906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> > > [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
> > > [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> > > [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> > > [526056.507396] PKRU: 55555554
> > > [526056.510196] Call Trace:
> > > [526056.512734]  <TASK>
> > > [526056.514929]  ? simple_setattr+0x40/0x60
> > > [526056.518859]  ? vfs_mkobj+0x1b0/0x1c0
> > > [526056.522529]  ? path_openat+0xb2e/0x1360
> > > [526056.526456]  ? do_filp_open+0xa1/0x130
> > > [526056.530296]  full_proxy_write+0x50/0x80
> > > [526056.534229]  vfs_write+0xd7/0x3e0
> > > [526056.537635]  ? fpregs_assert_state_consistent+0x22/0x50
> > > [526056.542955]  ? exit_to_user_mode_prepare+0x3c/0x1d0
> > > [526056.547929]  ? syscall_exit_to_user_mode+0x17/0x40
> > > [526056.552808]  ksys_write+0x53/0xd0
> > > [526056.556215]  do_syscall_64+0x58/0x80
> > > [526056.559879]  ? exit_to_user_mode_prepare+0x3c/0x1d0
> > > [526056.564846]  ? syscall_exit_to_user_mode+0x17/0x40
> > > [526056.569726]  ? do_syscall_64+0x67/0x80
> > > [526056.573563]  ? do_syscall_64+0x67/0x80
> > > [526056.577404]  ? do_syscall_64+0x67/0x80
> > > [526056.581244]  ? syscall_exit_to_user_mode+0x17/0x40
> > > [526056.586120]  ? do_syscall_64+0x67/0x80
> > > [526056.589961]  ? do_syscall_64+0x67/0x80
> > > [526056.593801]  ? do_syscall_64+0x67/0x80
> > > [526056.597641]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
> > > [526056.602779] RIP: 0033:0x7fe35ab01c17
> > > [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
> > > [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
> > > [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
> > > [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
> > > [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
> > > [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
> > > [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
> > > [526056.669028]  </TASK>
> > >
> > >
> > > Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
> > >
> > > Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
> > >
> >
> > And indeed, something like the following might do the trick:
> >
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index e9414ee57c5b..f42bb51e023a 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
> >                 max_zone_pfn = zone_end_pfn(zone);
> >                 for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> >                         int nr_pages;
> > -                       if (!pfn_valid(pfn))
> > -                               continue;
> > -                       page = pfn_to_page(pfn);
> > -                       if (!get_page_unless_zero(page))
> > +                       page = pfn_to_online_page(pfn);
> > +                       if (!page || !get_page_unless_zero(page))
> >                                 continue;
> >                         if (zone != page_zone(page))
> >
>
> Thank you very much, I confirmed the fix.  So I try to write a patch like below.
> I borrowed your debug message because it has more helpful info than mine.
> If you have any comment or suggestion in the description, please let me know.
>
> Thanks,
> Naoya Horiguchi
> ---
> From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>  split_huge_pages_all()
>
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks.  With debug option enabled,
> the following kernel messages are printed out:
>
>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>   page dumped because: unmovable page
>   page:000000007d7ab72e is uninitialized and poisoned
>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>   ------------[ cut here ]------------
>   kernel BUG at include/linux/mm.h:1248!
>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>   ...
>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
>
> Use pfn_to_online_page() to get struct page in PFN walker.
>
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>

Thanks, Naoya and David. Looks good to me. Reviewed-by: Yang Shi
<shy828301@gmail.com>

> ---
>  mm/huge_memory.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
>
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5fa2ba86dae4..03149a8f46f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
>                 max_zone_pfn = zone_end_pfn(zone);
>                 for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>                         int nr_pages;
> -                       if (!pfn_valid(pfn))
> -                               continue;
> -
> -                       page = pfn_to_page(pfn);
> -                       if (!get_page_unless_zero(page))
> +                       page = pfn_to_online_page(pfn);
> +                       if (!page || !get_page_unless_zero(page))
>                                 continue;
>
>                         if (zone != page_zone(page))
> --
> 2.26.1.8815.g6a475b71f8
>
>


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-07 12:11     ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
  2022-09-07 12:39       ` David Hildenbrand
  2022-09-07 17:10       ` Yang Shi
@ 2022-09-07 17:32       ` Michal Hocko
  2022-09-07 20:57       ` Andrew Morton
                         ` (2 subsequent siblings)
  5 siblings, 0 replies; 19+ messages in thread
From: Michal Hocko @ 2022-09-07 17:32 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: David Hildenbrand, linux-mm, Andrew Morton, Muchun Song,
	Miaohe Lin, Matthew Wilcox, Yang Shi, Naoya Horiguchi

On Wed 07-09-22 21:11:57, Naoya Horiguchi wrote:
[...]
> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>  split_huge_pages_all()
> 
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks.  With debug option enabled,
> the following kernel messages are printed out:
> 
>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>   page dumped because: unmovable page
>   page:000000007d7ab72e is uninitialized and poisoned
>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>   ------------[ cut here ]------------
>   kernel BUG at include/linux/mm.h:1248!
>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>   ...
>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> 
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
> 
> Use pfn_to_online_page() to get struct page in PFN walker.
> 
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>

With changes proposed by David
Acked-by: Michal Hocko <mhocko@suse.com>

Thanks!

> ---
>  mm/huge_memory.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5fa2ba86dae4..03149a8f46f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
>  		max_zone_pfn = zone_end_pfn(zone);
>  		for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>  			int nr_pages;
> -			if (!pfn_valid(pfn))
> -				continue;
> -
> -			page = pfn_to_page(pfn);
> -			if (!get_page_unless_zero(page))
> +			page = pfn_to_online_page(pfn);
> +			if (!page || !get_page_unless_zero(page))
>  				continue;
>  
>  			if (zone != page_zone(page))
> -- 
> 2.26.1.8815.g6a475b71f8
> 

-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-07 12:11     ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
                         ` (2 preceding siblings ...)
  2022-09-07 17:32       ` Michal Hocko
@ 2022-09-07 20:57       ` Andrew Morton
  2022-09-08  2:47         ` HORIGUCHI NAOYA(堀口 直也)
  2022-09-08  2:19       ` Miaohe Lin
  2022-09-08  3:28       ` Oscar Salvador
  5 siblings, 1 reply; 19+ messages in thread
From: Andrew Morton @ 2022-09-07 20:57 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: David Hildenbrand, linux-mm, Muchun Song, Miaohe Lin,
	Matthew Wilcox, Michal Hocko, Yang Shi, Naoya Horiguchi

On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote:

> 
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks.  With debug option enabled,
> the following kernel messages are printed out:
> 
>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>   page dumped because: unmovable page
>   page:000000007d7ab72e is uninitialized and poisoned
>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>   ------------[ cut here ]------------
>   kernel BUG at include/linux/mm.h:1248!
>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>   ...
>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> 
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
> 
> Use pfn_to_online_page() to get struct page in PFN walker.
> 
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")

January of 2016!  Or was this a long-term bug which was recently
exposed by some other change?




^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-07 12:11     ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
                         ` (3 preceding siblings ...)
  2022-09-07 20:57       ` Andrew Morton
@ 2022-09-08  2:19       ` Miaohe Lin
  2022-09-08  3:06         ` HORIGUCHI NAOYA(堀口 直也)
  2022-09-08  3:28       ` Oscar Salvador
  5 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2022-09-08  2:19 UTC (permalink / raw)
  To: Naoya Horiguchi, David Hildenbrand
  Cc: linux-mm, Andrew Morton, Muchun Song, Matthew Wilcox,
	Michal Hocko, Yang Shi, Naoya Horiguchi

On 2022/9/7 20:11, Naoya Horiguchi wrote:
> On Wed, Sep 07, 2022 at 12:26:18PM +0200, David Hildenbrand wrote:
>> On 07.09.22 12:23, David Hildenbrand wrote:
>>> On 07.09.22 12:08, Naoya Horiguchi wrote:
>>>> Hi MM folks,
>>>
>>> Hi,
>>>
>>>>
>>>> When I'm testing memory hotremove with various settings, I found the following
>>>> NULL-pointer dereference.  It reproduces easily with the folloing steps:
>>>>
>>>>     $ echo offline > /sys/devices/system/memory/memoryN/state
>>>>     $ echo 1 > /sys/kernel/debug/split_huge_pages
>>>>
>>>
>>> That's weird, I don't immediately see how both features are related here, especially because it seems to fail quite early in split_huge_pages_write().
>>>
>>> I was able to trigger it here as well, though, and I get on my kernel with debug options enabled:
>>>
>>> [526045.808737] page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>> [526045.818306] flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>> [526045.824842] raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>> [526045.832676] raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>> [526045.840498] page dumped because: unmovable page
>>> [526056.362715] page:000000007d7ab72e is uninitialized and poisoned
>>> [526056.362720] page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>> [526056.374837] ------------[ cut here ]------------
>>> [526056.379544] kernel BUG at include/linux/mm.h:1248!
>>> [526056.384429] invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>> [526056.389570] CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>>> [526056.398347] Hardware name: Dell Inc. PowerEdge R740/0923K0, BIOS 2.11.2 004/21/2021
>>> [526056.406087] RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>> [526056.411495] Code: f9 ff ff 48 c7 c6 88 8e 4e a7 48 c7 c7 f8 0b a8 a7 e8 20 1c 3d 00 e9 54 f6 ff ff 48 c7 c6 38 06 4c a7 4c 89 e7 e8 5c 4a f7 ff <0f> 0b 48 8b 7c 24 08 be 02 00 00 00 e8 bb 6c 36 00 e9 11 f5 ff ff
>>> [526056.430325] RSP: 0018:ffffae50e757fb40 EFLAGS: 00010292
>>> [526056.435639] RAX: 0000000000000034 RBX: 0000000002d00000 RCX: 0000000000000000
>>> [526056.442858] RDX: 0000000000000001 RSI: ffffffffa751a839 RDI: 00000000ffffffff
>>> [526056.450077] RBP: 0000000000013937 R08: 0000000000000000 R09: ffffae50e757fa08
>>> [526056.457296] R10: 0000000000000003 R11: ffffffffa793c768 R12: fffff00eb4000000
>>> [526056.464514] R13: ffffae50e757fb7a R14: 000fffffffffffff R15: fffff00eb4000000
>>> [526056.471733] FS:  00007fe35addf740(0000) GS:ffff8b71dc000000(0000) knlGS:0000000000000000
>>> [526056.479906] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>>> [526056.485738] CR2: 000055e87e606358 CR3: 00000004f5b46004 CR4: 00000000007706e0
>>> [526056.492957] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
>>> [526056.500176] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
>>> [526056.507396] PKRU: 55555554
>>> [526056.510196] Call Trace:
>>> [526056.512734]  <TASK>
>>> [526056.514929]  ? simple_setattr+0x40/0x60
>>> [526056.518859]  ? vfs_mkobj+0x1b0/0x1c0
>>> [526056.522529]  ? path_openat+0xb2e/0x1360
>>> [526056.526456]  ? do_filp_open+0xa1/0x130
>>> [526056.530296]  full_proxy_write+0x50/0x80
>>> [526056.534229]  vfs_write+0xd7/0x3e0
>>> [526056.537635]  ? fpregs_assert_state_consistent+0x22/0x50
>>> [526056.542955]  ? exit_to_user_mode_prepare+0x3c/0x1d0
>>> [526056.547929]  ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.552808]  ksys_write+0x53/0xd0
>>> [526056.556215]  do_syscall_64+0x58/0x80
>>> [526056.559879]  ? exit_to_user_mode_prepare+0x3c/0x1d0
>>> [526056.564846]  ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.569726]  ? do_syscall_64+0x67/0x80
>>> [526056.573563]  ? do_syscall_64+0x67/0x80
>>> [526056.577404]  ? do_syscall_64+0x67/0x80
>>> [526056.581244]  ? syscall_exit_to_user_mode+0x17/0x40
>>> [526056.586120]  ? do_syscall_64+0x67/0x80
>>> [526056.589961]  ? do_syscall_64+0x67/0x80
>>> [526056.593801]  ? do_syscall_64+0x67/0x80
>>> [526056.597641]  entry_SYSCALL_64_after_hwframe+0x63/0xcd
>>> [526056.602779] RIP: 0033:0x7fe35ab01c17
>>> [526056.606446] Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
>>> [526056.625277] RSP: 002b:00007ffc1863c8d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
>>> [526056.632929] RAX: ffffffffffffffda RBX: 0000000000000002 RCX: 00007fe35ab01c17
>>> [526056.640147] RDX: 0000000000000002 RSI: 000055e87e7009a0 RDI: 0000000000000001
>>> [526056.647366] RBP: 000055e87e7009a0 R08: 0000000000000000 R09: 0000000000000073
>>> [526056.654586] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000002
>>> [526056.661805] R13: 00007fe35abf8780 R14: 0000000000000002 R15: 00007fe35abf39e0
>>> [526056.669028]  </TASK>
>>>
>>>
>>> Looks like there is a page_to_nid() done in an offline memmap, which is wrong.
>>>
>>> Usually, this indicates that a pfn_to_online_page() is missing in a PFN walker.
>>>
>>
>> And indeed, something like the following might do the trick:
>>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e9414ee57c5b..f42bb51e023a 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -2894,11 +2894,9 @@ static void split_huge_pages_all(void)
>>                 max_zone_pfn = zone_end_pfn(zone);
>>                 for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>>                         int nr_pages;
>> -                       if (!pfn_valid(pfn))
>> -                               continue;
>> -                       page = pfn_to_page(pfn);
>> -                       if (!get_page_unless_zero(page))
>> +                       page = pfn_to_online_page(pfn);
>> +                       if (!page || !get_page_unless_zero(page))
>>                                 continue;
>>                         if (zone != page_zone(page))
>>
> 
> Thank you very much, I confirmed the fix.  So I try to write a patch like below.
> I borrowed your debug message because it has more helpful info than mine.
> If you have any comment or suggestion in the description, please let me know.
> 
> Thanks,
> Naoya Horiguchi
> ---
>>From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>  split_huge_pages_all()
> 
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks.  With debug option enabled,
> the following kernel messages are printed out:
> 
>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>   page dumped because: unmovable page
>   page:000000007d7ab72e is uninitialized and poisoned
>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>   ------------[ cut here ]------------
>   kernel BUG at include/linux/mm.h:1248!
>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>   ...
>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> 
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
> 
> Use pfn_to_online_page() to get struct page in PFN walker.

With changes proposed by David, this patch looks good to me.

Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
*pfn is operated directly* while it's not protected against memory hotremove.

Thanks,
Miaohe Lin


> 
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>
> ---
>  mm/huge_memory.c | 7 ++-----
>  1 file changed, 2 insertions(+), 5 deletions(-)
> 
> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index 5fa2ba86dae4..03149a8f46f9 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
>  		max_zone_pfn = zone_end_pfn(zone);
>  		for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
>  			int nr_pages;
> -			if (!pfn_valid(pfn))
> -				continue;
> -
> -			page = pfn_to_page(pfn);
> -			if (!get_page_unless_zero(page))
> +			page = pfn_to_online_page(pfn);
> +			if (!page || !get_page_unless_zero(page))
>  				continue;
>  
>  			if (zone != page_zone(page))
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-07 12:39       ` David Hildenbrand
@ 2022-09-08  2:39         ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 19+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-09-08  2:39 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Naoya Horiguchi, linux-mm, Andrew Morton, Muchun Song,
	Miaohe Lin, Matthew Wilcox, Michal Hocko, Yang Shi

On Wed, Sep 07, 2022 at 02:39:02PM +0200, David Hildenbrand wrote:
> On 07.09.22 14:11, Naoya Horiguchi wrote:
...
> > ---
> >  From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Date: Wed, 7 Sep 2022 20:58:44 +0900
> > Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> >   split_huge_pages_all()
> > 
> > NULL pointer dereference is triggered when calling thp split via debugfs
> > on the system with offlined memory blocks.  With debug option enabled,
> > the following kernel messages are printed out:
> > 
> >    page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> >    flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> >    raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> >    raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> >    page dumped because: unmovable page
> >    page:000000007d7ab72e is uninitialized and poisoned
> >    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >    ------------[ cut here ]------------
> >    kernel BUG at include/linux/mm.h:1248!
> >    invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >    CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> >    ...
> >    RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > 
> > This shows that page_to_nid() in page_zone() is unexpectedly called for an
> > offlined memmap.
> > 
> > Use pfn_to_online_page() to get struct page in PFN walker.
> > 
> > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> > Suggested-by: David Hildenbrand <david@redhat.com>
> 
> Feel free to use
> 
> Co-developed-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: David Hildenbrand <david@redhat.com>

Sure, I'll add these.

> instead
> 
> > Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Cc: <stable@vger.kernel.org>
> > ---
> >   mm/huge_memory.c | 7 ++-----
> >   1 file changed, 2 insertions(+), 5 deletions(-)
> > 
> > diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> > index 5fa2ba86dae4..03149a8f46f9 100644
> > --- a/mm/huge_memory.c
> > +++ b/mm/huge_memory.c
> > @@ -2894,11 +2894,8 @@ static void split_huge_pages_all(void)
> >   		max_zone_pfn = zone_end_pfn(zone);
> >   		for (pfn = zone->zone_start_pfn; pfn < max_zone_pfn; pfn++) {
> >   			int nr_pages;
> 
> Add an empty line here please, while at it.

Added it.

> > -			if (!pfn_valid(pfn))
> > -				continue;
> > -
> > -			page = pfn_to_page(pfn);
> > -			if (!get_page_unless_zero(page))
> > +			page = pfn_to_online_page(pfn);
> > +			if (!page || !get_page_unless_zero(page))
> >   				continue;
> >   			if (zone != page_zone(page))
> 
> Can you send the patch "officially" as well? Thanks

Yes, I'll send this later in the normal patch submission manner.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-07 20:57       ` Andrew Morton
@ 2022-09-08  2:47         ` HORIGUCHI NAOYA(堀口 直也)
  2022-09-08  6:14           ` David Hildenbrand
  0 siblings, 1 reply; 19+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-09-08  2:47 UTC (permalink / raw)
  To: Andrew Morton
  Cc: Naoya Horiguchi, David Hildenbrand, linux-mm, Muchun Song,
	Miaohe Lin, Matthew Wilcox, Michal Hocko, Yang Shi

On Wed, Sep 07, 2022 at 01:57:45PM -0700, Andrew Morton wrote:
> On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote:
> 
> > 
> > NULL pointer dereference is triggered when calling thp split via debugfs
> > on the system with offlined memory blocks.  With debug option enabled,
> > the following kernel messages are printed out:
> > 
> >   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> >   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> >   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> >   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> >   page dumped because: unmovable page
> >   page:000000007d7ab72e is uninitialized and poisoned
> >   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >   ------------[ cut here ]------------
> >   kernel BUG at include/linux/mm.h:1248!
> >   invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> >   ...
> >   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > 
> > This shows that page_to_nid() in page_zone() is unexpectedly called for an
> > offlined memmap.
> > 
> > Use pfn_to_online_page() to get struct page in PFN walker.
> > 
> > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> 
> January of 2016!  Or was this a long-term bug which was recently
> exposed by some other change?

Maybe yes, I confirmed that this issue reproduces in kernel 5.10 (released
on Dec 13, 2020, which might not be so "recent"), but does not reproduce in
kernel 5.4.  So I think that marking "5.10+" for stable is fine.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-08  2:19       ` Miaohe Lin
@ 2022-09-08  3:06         ` HORIGUCHI NAOYA(堀口 直也)
  2022-09-08  3:25           ` Miaohe Lin
  0 siblings, 1 reply; 19+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-09-08  3:06 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: Naoya Horiguchi, David Hildenbrand, linux-mm, Andrew Morton,
	Muchun Song, Matthew Wilcox, Michal Hocko, Yang Shi

On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
> On 2022/9/7 20:11, Naoya Horiguchi wrote:
...
> >>From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> > From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> > Date: Wed, 7 Sep 2022 20:58:44 +0900
> > Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> >  split_huge_pages_all()
> > 
> > NULL pointer dereference is triggered when calling thp split via debugfs
> > on the system with offlined memory blocks.  With debug option enabled,
> > the following kernel messages are printed out:
> > 
> >   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> >   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> >   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> >   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> >   page dumped because: unmovable page
> >   page:000000007d7ab72e is uninitialized and poisoned
> >   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >   ------------[ cut here ]------------
> >   kernel BUG at include/linux/mm.h:1248!
> >   invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> >   ...
> >   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > 
> > This shows that page_to_nid() in page_zone() is unexpectedly called for an
> > offlined memmap.
> > 
> > Use pfn_to_online_page() to get struct page in PFN walker.
> 
> With changes proposed by David, this patch looks good to me.
> 
> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>

Thank you.

> 
> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
> *pfn is operated directly* while it's not protected against memory hotremove.

I had the similar concern, but there seems many place doing PFN walk,
so checking them one-by-one that offlined memory can be walked over
requires much effort.

Thanks,
Naoya Horiguchi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-08  3:06         ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-09-08  3:25           ` Miaohe Lin
  2022-09-08  7:07             ` Michal Hocko
  0 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2022-09-08  3:25 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也)
  Cc: Naoya Horiguchi, David Hildenbrand, linux-mm, Andrew Morton,
	Muchun Song, Matthew Wilcox, Michal Hocko, Yang Shi

On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
>> On 2022/9/7 20:11, Naoya Horiguchi wrote:
> ...
>>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>> Date: Wed, 7 Sep 2022 20:58:44 +0900
>>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>>>  split_huge_pages_all()
>>>
>>> NULL pointer dereference is triggered when calling thp split via debugfs
>>> on the system with offlined memory blocks.  With debug option enabled,
>>> the following kernel messages are printed out:
>>>
>>>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>   page dumped because: unmovable page
>>>   page:000000007d7ab72e is uninitialized and poisoned
>>>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>>   ------------[ cut here ]------------
>>>   kernel BUG at include/linux/mm.h:1248!
>>>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>>>   ...
>>>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>>
>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
>>> offlined memmap.
>>>
>>> Use pfn_to_online_page() to get struct page in PFN walker.
>>
>> With changes proposed by David, this patch looks good to me.
>>
>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> 
> Thank you.
> 
>>
>> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
>> *pfn is operated directly* while it's not protected against memory hotremove.
> 
> I had the similar concern, but there seems many place doing PFN walk,
> so checking them one-by-one that offlined memory can be walked over
> requires much effort.

Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;)

Thanks,
Miaohe Lin


> 
> Thanks,
> Naoya Horiguchi
> 



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-07 12:11     ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
                         ` (4 preceding siblings ...)
  2022-09-08  2:19       ` Miaohe Lin
@ 2022-09-08  3:28       ` Oscar Salvador
  5 siblings, 0 replies; 19+ messages in thread
From: Oscar Salvador @ 2022-09-08  3:28 UTC (permalink / raw)
  To: Naoya Horiguchi
  Cc: David Hildenbrand, linux-mm, Andrew Morton, Muchun Song,
	Miaohe Lin, Matthew Wilcox, Michal Hocko, Yang Shi,
	Naoya Horiguchi

On Wed, Sep 07, 2022 at 09:11:57PM +0900, Naoya Horiguchi wrote:
> From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Date: Wed, 7 Sep 2022 20:58:44 +0900
> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>  split_huge_pages_all()
> 
> NULL pointer dereference is triggered when calling thp split via debugfs
> on the system with offlined memory blocks.  With debug option enabled,
> the following kernel messages are printed out:
> 
>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>   page dumped because: unmovable page
>   page:000000007d7ab72e is uninitialized and poisoned
>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>   ------------[ cut here ]------------
>   kernel BUG at include/linux/mm.h:1248!
>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>   ...
>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> 
> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> offlined memmap.
> 
> Use pfn_to_online_page() to get struct page in PFN walker.
> 
> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> Suggested-by: David Hildenbrand <david@redhat.com>
> Signed-off-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
> Cc: <stable@vger.kernel.org>

Reviewed-by: Oscar Salvador <osalvador@suse.de>

 

-- 
Oscar Salvador
SUSE Labs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-08  2:47         ` HORIGUCHI NAOYA(堀口 直也)
@ 2022-09-08  6:14           ` David Hildenbrand
  2022-09-08  6:31             ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 1 reply; 19+ messages in thread
From: David Hildenbrand @ 2022-09-08  6:14 UTC (permalink / raw)
  To: HORIGUCHI NAOYA(堀口 直也), Andrew Morton
  Cc: Naoya Horiguchi, linux-mm, Muchun Song, Miaohe Lin,
	Matthew Wilcox, Michal Hocko, Yang Shi

On 08.09.22 04:47, HORIGUCHI NAOYA(堀口 直也) wrote:
> On Wed, Sep 07, 2022 at 01:57:45PM -0700, Andrew Morton wrote:
>> On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote:
>>
>>>
>>> NULL pointer dereference is triggered when calling thp split via debugfs
>>> on the system with offlined memory blocks.  With debug option enabled,
>>> the following kernel messages are printed out:
>>>
>>>    page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>>    flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>>    raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>>    raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>    page dumped because: unmovable page
>>>    page:000000007d7ab72e is uninitialized and poisoned
>>>    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>>    ------------[ cut here ]------------
>>>    kernel BUG at include/linux/mm.h:1248!
>>>    invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>>    CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>>>    ...
>>>    RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>>
>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
>>> offlined memmap.
>>>
>>> Use pfn_to_online_page() to get struct page in PFN walker.
>>>
>>> Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
>>
>> January of 2016!  Or was this a long-term bug which was recently
>> exposed by some other change?
> 
> Maybe yes, I confirmed that this issue reproduces in kernel 5.10 (released
> on Dec 13, 2020, which might not be so "recent"), but does not reproduce in
> kernel 5.4.  So I think that marking "5.10+" for stable is fine.


I fixed plenty of such issues in the past. For example:

aad5f69bc161 ("fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c")

This one just wasn't found so far because it's triggered via
a debug interface.

The correct fixes tag would be:

Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")      [visible after d0dc12e86b319]

Exposed "recently" -- 2018 ;)

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-08  6:14           ` David Hildenbrand
@ 2022-09-08  6:31             ` HORIGUCHI NAOYA(堀口 直也)
  0 siblings, 0 replies; 19+ messages in thread
From: HORIGUCHI NAOYA(堀口 直也) @ 2022-09-08  6:31 UTC (permalink / raw)
  To: David Hildenbrand
  Cc: Andrew Morton, Naoya Horiguchi, linux-mm, Muchun Song,
	Miaohe Lin, Matthew Wilcox, Michal Hocko, Yang Shi

On Thu, Sep 08, 2022 at 08:14:57AM +0200, David Hildenbrand wrote:
> On 08.09.22 04:47, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Wed, Sep 07, 2022 at 01:57:45PM -0700, Andrew Morton wrote:
> > > On Wed, 7 Sep 2022 21:11:57 +0900 Naoya Horiguchi <naoya.horiguchi@linux.dev> wrote:
> > > 
> > > > 
> > > > NULL pointer dereference is triggered when calling thp split via debugfs
> > > > on the system with offlined memory blocks.  With debug option enabled,
> > > > the following kernel messages are printed out:
> > > > 
> > > >    page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> > > >    flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> > > >    raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> > > >    raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> > > >    page dumped because: unmovable page
> > > >    page:000000007d7ab72e is uninitialized and poisoned
> > > >    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> > > >    ------------[ cut here ]------------
> > > >    kernel BUG at include/linux/mm.h:1248!
> > > >    invalid opcode: 0000 [#1] PREEMPT SMP PTI
> > > >    CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> > > >    ...
> > > >    RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> > > > 
> > > > This shows that page_to_nid() in page_zone() is unexpectedly called for an
> > > > offlined memmap.
> > > > 
> > > > Use pfn_to_online_page() to get struct page in PFN walker.
> > > > 
> > > > Fixes: 49071d436b51 ("thp: add debugfs handle to split all huge pages")
> > > 
> > > January of 2016!  Or was this a long-term bug which was recently
> > > exposed by some other change?
> > 
> > Maybe yes, I confirmed that this issue reproduces in kernel 5.10 (released
> > on Dec 13, 2020, which might not be so "recent"), but does not reproduce in
> > kernel 5.4.  So I think that marking "5.10+" for stable is fine.
> 
> 
> I fixed plenty of such issues in the past. For example:
> 
> aad5f69bc161 ("fs/proc/page.c: don't access uninitialized memmaps in fs/proc/page.c")
> 
> This one just wasn't found so far because it's triggered via
> a debug interface.
> 
> The correct fixes tag would be:
> 
> Fixes: f1dd2cd13c4b ("mm, memory_hotplug: do not associate hotadded memory to zones until online")      [visible after d0dc12e86b319]
> 
> Exposed "recently" -- 2018 ;)

Great, thank you for pinpointing the commit.

- Naoya Horiguchi

^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-08  3:25           ` Miaohe Lin
@ 2022-09-08  7:07             ` Michal Hocko
  2022-09-09  0:27               ` Miaohe Lin
  0 siblings, 1 reply; 19+ messages in thread
From: Michal Hocko @ 2022-09-08  7:07 UTC (permalink / raw)
  To: Miaohe Lin
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Naoya Horiguchi, David Hildenbrand, linux-mm, Andrew Morton,
	Muchun Song, Matthew Wilcox, Yang Shi

On Thu 08-09-22 11:25:54, Miaohe Lin wrote:
> On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote:
> > On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
> >> On 2022/9/7 20:11, Naoya Horiguchi wrote:
> > ...
> >>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
> >>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
> >>> Date: Wed, 7 Sep 2022 20:58:44 +0900
> >>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
> >>>  split_huge_pages_all()
> >>>
> >>> NULL pointer dereference is triggered when calling thp split via debugfs
> >>> on the system with offlined memory blocks.  With debug option enabled,
> >>> the following kernel messages are printed out:
> >>>
> >>>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
> >>>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
> >>>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
> >>>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
> >>>   page dumped because: unmovable page
> >>>   page:000000007d7ab72e is uninitialized and poisoned
> >>>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
> >>>   ------------[ cut here ]------------
> >>>   kernel BUG at include/linux/mm.h:1248!
> >>>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
> >>>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
> >>>   ...
> >>>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
> >>>
> >>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
> >>> offlined memmap.
> >>>
> >>> Use pfn_to_online_page() to get struct page in PFN walker.
> >>
> >> With changes proposed by David, this patch looks good to me.
> >>
> >> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
> > 
> > Thank you.
> > 
> >>
> >> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
> >> *pfn is operated directly* while it's not protected against memory hotremove.
> > 
> > I had the similar concern, but there seems many place doing PFN walk,
> > so checking them one-by-one that offlined memory can be walked over
> > requires much effort.
> 
> Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;)

Most of those whic are directly triggerable should be taken care of. It
would be still good to go through `git grep -w pfn_to_page' and evaluate 
all callers. Still more than 400 callsites so not a trivial task.
-- 
Michal Hocko
SUSE Labs


^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-08  7:07             ` Michal Hocko
@ 2022-09-09  0:27               ` Miaohe Lin
  2022-09-09  9:03                 ` David Hildenbrand
  0 siblings, 1 reply; 19+ messages in thread
From: Miaohe Lin @ 2022-09-09  0:27 UTC (permalink / raw)
  To: Michal Hocko
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Naoya Horiguchi, David Hildenbrand, linux-mm, Andrew Morton,
	Muchun Song, Matthew Wilcox, Yang Shi

On 2022/9/8 15:07, Michal Hocko wrote:
> On Thu 08-09-22 11:25:54, Miaohe Lin wrote:
>> On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote:
>>> On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
>>>> On 2022/9/7 20:11, Naoya Horiguchi wrote:
>>> ...
>>>>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
>>>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>>>> Date: Wed, 7 Sep 2022 20:58:44 +0900
>>>>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>>>>>  split_huge_pages_all()
>>>>>
>>>>> NULL pointer dereference is triggered when calling thp split via debugfs
>>>>> on the system with offlined memory blocks.  With debug option enabled,
>>>>> the following kernel messages are printed out:
>>>>>
>>>>>   page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>>>>   flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>>>>   raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>>>>   raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>>>   page dumped because: unmovable page
>>>>>   page:000000007d7ab72e is uninitialized and poisoned
>>>>>   page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>>>>   ------------[ cut here ]------------
>>>>>   kernel BUG at include/linux/mm.h:1248!
>>>>>   invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>>>>   CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>>>>>   ...
>>>>>   RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>>>>
>>>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
>>>>> offlined memmap.
>>>>>
>>>>> Use pfn_to_online_page() to get struct page in PFN walker.
>>>>
>>>> With changes proposed by David, this patch looks good to me.
>>>>
>>>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
>>>
>>> Thank you.
>>>
>>>>
>>>> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
>>>> *pfn is operated directly* while it's not protected against memory hotremove.
>>>
>>> I had the similar concern, but there seems many place doing PFN walk,
>>> so checking them one-by-one that offlined memory can be walked over
>>> requires much effort.
>>
>> Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;)
> 
> Most of those whic are directly triggerable should be taken care of. It
> would be still good to go through `git grep -w pfn_to_page' and evaluate 
> all callers. Still more than 400 callsites so not a trivial task.

Agree. Even if pfn_to_online_page() is used, it might still races with the memory hotremove if
there's no way to guard against it, e.g. holding an extra page refcnt. So code audit should also
apply to pfn_to_online_page().

Thanks,
Miaohe Lin



^ permalink raw reply	[flat|nested] 19+ messages in thread

* Re: [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all()
  2022-09-09  0:27               ` Miaohe Lin
@ 2022-09-09  9:03                 ` David Hildenbrand
  0 siblings, 0 replies; 19+ messages in thread
From: David Hildenbrand @ 2022-09-09  9:03 UTC (permalink / raw)
  To: Miaohe Lin, Michal Hocko
  Cc: HORIGUCHI NAOYA(堀口 直也),
	Naoya Horiguchi, linux-mm, Andrew Morton, Muchun Song,
	Matthew Wilcox, Yang Shi

On 09.09.22 02:27, Miaohe Lin wrote:
> On 2022/9/8 15:07, Michal Hocko wrote:
>> On Thu 08-09-22 11:25:54, Miaohe Lin wrote:
>>> On 2022/9/8 11:06, HORIGUCHI NAOYA(堀口 直也) wrote:
>>>> On Thu, Sep 08, 2022 at 10:19:03AM +0800, Miaohe Lin wrote:
>>>>> On 2022/9/7 20:11, Naoya Horiguchi wrote:
>>>> ...
>>>>>> >From 8a5c284df732943065d23838090d15c94cd10395 Mon Sep 17 00:00:00 2001
>>>>>> From: Naoya Horiguchi <naoya.horiguchi@nec.com>
>>>>>> Date: Wed, 7 Sep 2022 20:58:44 +0900
>>>>>> Subject: [PATCH] mm/huge_memory: use pfn_to_online_page() in
>>>>>>   split_huge_pages_all()
>>>>>>
>>>>>> NULL pointer dereference is triggered when calling thp split via debugfs
>>>>>> on the system with offlined memory blocks.  With debug option enabled,
>>>>>> the following kernel messages are printed out:
>>>>>>
>>>>>>    page:00000000467f4890 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x121c000
>>>>>>    flags: 0x17fffc00000000(node=0|zone=2|lastcpupid=0x1ffff)
>>>>>>    raw: 0017fffc00000000 0000000000000000 dead000000000122 0000000000000000
>>>>>>    raw: 0000000000000000 0000000000000000 00000001ffffffff 0000000000000000
>>>>>>    page dumped because: unmovable page
>>>>>>    page:000000007d7ab72e is uninitialized and poisoned
>>>>>>    page dumped because: VM_BUG_ON_PAGE(PagePoisoned(p))
>>>>>>    ------------[ cut here ]------------
>>>>>>    kernel BUG at include/linux/mm.h:1248!
>>>>>>    invalid opcode: 0000 [#1] PREEMPT SMP PTI
>>>>>>    CPU: 16 PID: 20964 Comm: bash Tainted: G          I        6.0.0-rc3-foll-numa+ #41
>>>>>>    ...
>>>>>>    RIP: 0010:split_huge_pages_write+0xcf4/0xe30
>>>>>>
>>>>>> This shows that page_to_nid() in page_zone() is unexpectedly called for an
>>>>>> offlined memmap.
>>>>>>
>>>>>> Use pfn_to_online_page() to get struct page in PFN walker.
>>>>>
>>>>> With changes proposed by David, this patch looks good to me.
>>>>>
>>>>> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com>
>>>>
>>>> Thank you.
>>>>
>>>>>
>>>>> BTW: IMHO, there might be many similar code places need to take care of memory hotremove where
>>>>> *pfn is operated directly* while it's not protected against memory hotremove.
>>>>
>>>> I had the similar concern, but there seems many place doing PFN walk,
>>>> so checking them one-by-one that offlined memory can be walked over
>>>> requires much effort.
>>>
>>> Yes, that will be a heavy work. We could fix them one by one if they ever occur. ;)
>>
>> Most of those whic are directly triggerable should be taken care of. It
>> would be still good to go through `git grep -w pfn_to_page' and evaluate
>> all callers. Still more than 400 callsites so not a trivial task.
> 
> Agree. Even if pfn_to_online_page() is used, it might still races with the memory hotremove if
> there's no way to guard against it, e.g. holding an extra page refcnt. So code audit should also
> apply to pfn_to_online_page().

Racing with offlining+unplug a long known problem and so far we decided 
to ignore it, because it never appeared in practice.

There once were ideas of using RCU as protection.

-- 
Thanks,

David / dhildenb



^ permalink raw reply	[flat|nested] 19+ messages in thread

end of thread, other threads:[~2022-09-09  9:03 UTC | newest]

Thread overview: 19+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-09-07 10:08 [BUG report] kernel NULL pointer dereference in split_huge_page with offlined memory block Naoya Horiguchi
2022-09-07 10:23 ` David Hildenbrand
2022-09-07 10:26   ` David Hildenbrand
2022-09-07 12:11     ` [PATCH] mm/huge_memory: use pfn_to_online_page() in split_huge_pages_all() Naoya Horiguchi
2022-09-07 12:39       ` David Hildenbrand
2022-09-08  2:39         ` HORIGUCHI NAOYA(堀口 直也)
2022-09-07 17:10       ` Yang Shi
2022-09-07 17:32       ` Michal Hocko
2022-09-07 20:57       ` Andrew Morton
2022-09-08  2:47         ` HORIGUCHI NAOYA(堀口 直也)
2022-09-08  6:14           ` David Hildenbrand
2022-09-08  6:31             ` HORIGUCHI NAOYA(堀口 直也)
2022-09-08  2:19       ` Miaohe Lin
2022-09-08  3:06         ` HORIGUCHI NAOYA(堀口 直也)
2022-09-08  3:25           ` Miaohe Lin
2022-09-08  7:07             ` Michal Hocko
2022-09-09  0:27               ` Miaohe Lin
2022-09-09  9:03                 ` David Hildenbrand
2022-09-08  3:28       ` Oscar Salvador

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).