linux-mm.kvack.org archive mirror
* [bug] problems with migration of huge pages with v4.20-10214-ge1ef035d272e
       [not found] <1038135449.92986364.1546459244292.JavaMail.zimbra@redhat.com>
@ 2019-01-02 20:30 ` Jan Stancek
  2019-01-02 20:30   ` Jan Stancek
  2019-01-02 21:24   ` Mike Kravetz
  0 siblings, 2 replies; 10+ messages in thread
From: Jan Stancek @ 2019-01-02 20:30 UTC (permalink / raw)
  To: linux-mm, mike.kravetz, kirill.shutemov
  Cc: ltp, mhocko, Rachel Sibley, hughd, n-horiguchi, aarcange,
	aneesh.kumar, dave, prakash.sangappa, colin.king

Hi,

LTP move_pages12 [1] started failing recently.

The test maps/unmaps some anonymous private huge pages
and migrates them between 2 nodes. This now reliably
hits NULL ptr deref:

[  194.819357] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
[  194.864410] #PF error: [WRITE]
[  194.881502] PGD 22c758067 P4D 22c758067 PUD 235177067 PMD 0
[  194.913833] Oops: 0002 [#1] SMP NOPTI
[  194.935062] CPU: 0 PID: 865 Comm: move_pages12 Not tainted 4.20.0+ #1
[  194.972993] Hardware name: HP ProLiant SL335s G7/, BIOS A24 12/08/2012
[  195.005359] RIP: 0010:down_write+0x1b/0x40
[  195.028257] Code: 00 5c 01 00 48 83 c8 03 48 89 43 20 5b c3 90 0f 1f 44 00 00 53 48 89 fb e8 d2 d7 ff ff 48 89 d8 48 ba 01 00 00 00 ff ff
ff ff <f0> 48 0f c1 10 85 d2 74 05 e8 07 26 ff ff 65 48 8b 04 25 00 5c 01
[  195.121836] RSP: 0018:ffffb87e4224fd00 EFLAGS: 00010246
[  195.147097] RAX: 0000000000000030 RBX: 0000000000000030 RCX: 0000000000000000
[  195.185096] RDX: ffffffff00000001 RSI: ffffffffa69d30f0 RDI: 0000000000000030
[  195.219251] RBP: 0000000000000030 R08: ffffe7d4889d8008 R09: 0000000000000003
[  195.258291] R10: 000000000000000f R11: ffffe7d4889d8008 R12: ffffe7d4889d0008
[  195.294547] R13: ffffe7d490b78000 R14: ffffe7d4889d0000 R15: ffff8be9b2ba4580
[  195.332532] FS:  00007f1670112b80(0000) GS:ffff8be9b7a00000(0000) knlGS:0000000000000000
[  195.373888] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  195.405938] CR2: 0000000000000030 CR3: 000000023477e000 CR4: 00000000000006f0
[  195.443579] Call Trace:
[  195.456876]  migrate_pages+0x833/0xcb0
[  195.478070]  ? __ia32_compat_sys_migrate_pages+0x20/0x20
[  195.506027]  do_move_pages_to_node.isra.63.part.64+0x2a/0x50
[  195.536963]  kernel_move_pages+0x667/0x8c0
[  195.559616]  ? __handle_mm_fault+0xb95/0x1370
[  195.588765]  __x64_sys_move_pages+0x24/0x30
[  195.611439]  do_syscall_64+0x5b/0x160
[  195.631901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
[  195.657790] RIP: 0033:0x7f166f5ff959
[  195.676365] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08
0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 45 2c 00 f7 d8 64 89 01 48
[  195.772938] RSP: 002b:00007ffd8d77bb48 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
[  195.810207] RAX: ffffffffffffffda RBX: 0000000000000400 RCX: 00007f166f5ff959
[  195.847522] RDX: 0000000002303400 RSI: 0000000000000400 RDI: 0000000000000360
[  195.882327] RBP: 0000000000000400 R08: 0000000002306420 R09: 0000000000000004
[  195.920017] R10: 0000000002305410 R11: 0000000000000246 R12: 0000000002303400
[  195.958053] R13: 0000000002305410 R14: 0000000002306420 R15: 0000000000000003
[  195.997028] Modules linked in: sunrpc amd64_edac_mod ipmi_ssif edac_mce_amd kvm_amd ipmi_si igb ipmi_devintf k10temp kvm pcspkr ipmi_msghandler joydev irqbypass sp5100_tco dca hpwdt hpilo i2c_piix4 xfs libcrc32c radeon i2c_algo_bit drm_kms_helper ttm ata_generic pata_acpi drm serio_raw pata_atiixp
[  196.134162] CR2: 0000000000000030
[  196.152788] ---[ end trace 4420ea5061342d3e ]---

Suspected commit is:
  b43a99900559 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
which adds to unmap_and_move_huge_page():
+               struct address_space *mapping = page_mapping(hpage);
+
+               /*
+                * try_to_unmap could potentially call huge_pmd_unshare.
+                * Because of this, take semaphore in write mode here and
+                * set TTU_RMAP_LOCKED to let lower levels know we have
+                * taken the lock.
+                */
+               i_mmap_lock_write(mapping);

If I'm reading this right, 'mapping' will be NULL for anon mappings.
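
(For reference, the relevant logic boils down to roughly the following --
a paraphrase of page_mapping(), with the slab and swap-cache handling
omitted, not the verbatim kernel source:

struct address_space *page_mapping(struct page *page)
{
	unsigned long mapping = (unsigned long)page->mapping;

	/* anon pages have PAGE_MAPPING_ANON set and no address_space */
	if (mapping & PAGE_MAPPING_ANON)
		return NULL;
	return (struct address_space *)(mapping & ~PAGE_MAPPING_FLAGS);
}

With mapping == NULL, i_mmap_lock_write(mapping) does
down_write(&mapping->i_mmap_rwsem), i.e. down_write() on a small
NULL-plus-offset address, which is consistent with the faulting address
0000000000000030 above.)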

Running the same test with s/MAP_PRIVATE/MAP_SHARED/ leads to user-space
hanging at:

# cat /proc/23654/stack
[<0>] io_schedule+0x12/0x40
[<0>] __lock_page+0x13c/0x200
[<0>] remove_inode_hugepages+0x275/0x300
[<0>] hugetlbfs_evict_inode+0x2e/0x60
[<0>] evict+0xcb/0x190
[<0>] __dentry_kill+0xce/0x160
[<0>] dentry_kill+0x47/0x170
[<0>] dput.part.33+0xc6/0x100
[<0>] __fput+0x105/0x230
[<0>] task_work_run+0x84/0xa0
[<0>] exit_to_usermode_loop+0xd3/0xe0
[<0>] do_syscall_64+0x14d/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff

# cat /proc/23655/stack
[<0>] call_rwsem_down_read_failed+0x14/0x30
[<0>] rmap_walk_file+0x1c1/0x2f0
[<0>] remove_migration_ptes+0x6d/0x80
[<0>] migrate_pages+0x86a/0xcb0
[<0>] do_move_pages_to_node.isra.63.part.64+0x2a/0x50
[<0>] kernel_move_pages+0x667/0x8c0
[<0>] __x64_sys_move_pages+0x24/0x30
[<0>] do_syscall_64+0x5b/0x160
[<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[<0>] 0xffffffffffffffff

Regards,
Jan

[1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c
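
For convenience, the core of what the reproducer does boils down to
roughly the sketch below (this is not the LTP test itself, which forks
and migrates the parent's pages from a child in a tight loop; the sketch
assumes a 2MB default huge page size, two online NUMA nodes, enough
hugepages reserved via /proc/sys/vm/nr_hugepages, and libnuma's
move_pages() wrapper -- build with "cc repro.c -lnuma"; MPOL_MF_MOVE_ALL
needs CAP_SYS_NICE).  Switching MAP_PRIVATE to MAP_SHARED gives the
variant that hangs as described above.

#define _GNU_SOURCE
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <numaif.h>

#define HPAGE_SIZE	(2UL * 1024 * 1024)	/* assumed default huge page size */
#define LEN		(2 * HPAGE_SIZE)
#define NR_PAGES	(LEN / 4096)		/* 4kB-stride addresses, as in the test */

int main(void)
{
	void *pages[NR_PAGES];
	int nodes[NR_PAGES], status[NR_PAGES];
	unsigned long i, loop;

	for (loop = 0; loop < 100; loop++) {
		char *addr = mmap(NULL, LEN, PROT_READ | PROT_WRITE,
				  MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB,
				  -1, 0);
		if (addr == MAP_FAILED) {
			perror("mmap");
			return 1;
		}
		memset(addr, 0, LEN);		/* fault the huge pages in */

		for (i = 0; i < NR_PAGES; i++)
			pages[i] = addr + i * 4096;

		/* migrate everything to node 1, then back to node 0 */
		for (i = 0; i < NR_PAGES; i++)
			nodes[i] = 1;
		if (move_pages(0, NR_PAGES, pages, nodes, status,
			       MPOL_MF_MOVE_ALL) < 0)
			perror("move_pages");

		for (i = 0; i < NR_PAGES; i++)
			nodes[i] = 0;
		if (move_pages(0, NR_PAGES, pages, nodes, status,
			       MPOL_MF_MOVE_ALL) < 0)
			perror("move_pages");

		munmap(addr, LEN);
	}
	return 0;
}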

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug] problems with migration of huge pages with v4.20-10214-ge1ef035d272e
  2019-01-02 20:30 ` [bug] problems with migration of huge pages with v4.20-10214-ge1ef035d272e Jan Stancek
  2019-01-02 20:30   ` Jan Stancek
@ 2019-01-02 21:24   ` Mike Kravetz
  2019-01-03  1:44     ` Mike Kravetz
  1 sibling, 1 reply; 10+ messages in thread
From: Mike Kravetz @ 2019-01-02 21:24 UTC (permalink / raw)
  To: Jan Stancek, linux-mm, kirill.shutemov
  Cc: ltp, mhocko, Rachel Sibley, hughd, n-horiguchi, aarcange,
	aneesh.kumar, dave, prakash.sangappa, colin.king

On 1/2/19 12:30 PM, Jan Stancek wrote:
> Hi,
> 
> LTP move_pages12 [1] started failing recently.
> 
> The test maps/unmaps some anonymous private huge pages
> and migrates them between 2 nodes. This now reliably
> hits NULL ptr deref:
> 
> [  194.819357] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
> [  194.864410] #PF error: [WRITE]
> [  194.881502] PGD 22c758067 P4D 22c758067 PUD 235177067 PMD 0
> [  194.913833] Oops: 0002 [#1] SMP NOPTI
> [  194.935062] CPU: 0 PID: 865 Comm: move_pages12 Not tainted 4.20.0+ #1
> [  194.972993] Hardware name: HP ProLiant SL335s G7/, BIOS A24 12/08/2012
> [  195.005359] RIP: 0010:down_write+0x1b/0x40
> [  195.028257] Code: 00 5c 01 00 48 83 c8 03 48 89 43 20 5b c3 90 0f 1f 44 00 00 53 48 89 fb e8 d2 d7 ff ff 48 89 d8 48 ba 01 00 00 00 ff ff
> ff ff <f0> 48 0f c1 10 85 d2 74 05 e8 07 26 ff ff 65 48 8b 04 25 00 5c 01
> [  195.121836] RSP: 0018:ffffb87e4224fd00 EFLAGS: 00010246
> [  195.147097] RAX: 0000000000000030 RBX: 0000000000000030 RCX: 0000000000000000
> [  195.185096] RDX: ffffffff00000001 RSI: ffffffffa69d30f0 RDI: 0000000000000030
> [  195.219251] RBP: 0000000000000030 R08: ffffe7d4889d8008 R09: 0000000000000003
> [  195.258291] R10: 000000000000000f R11: ffffe7d4889d8008 R12: ffffe7d4889d0008
> [  195.294547] R13: ffffe7d490b78000 R14: ffffe7d4889d0000 R15: ffff8be9b2ba4580
> [  195.332532] FS:  00007f1670112b80(0000) GS:ffff8be9b7a00000(0000) knlGS:0000000000000000
> [  195.373888] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [  195.405938] CR2: 0000000000000030 CR3: 000000023477e000 CR4: 00000000000006f0
> [  195.443579] Call Trace:
> [  195.456876]  migrate_pages+0x833/0xcb0
> [  195.478070]  ? __ia32_compat_sys_migrate_pages+0x20/0x20
> [  195.506027]  do_move_pages_to_node.isra.63.part.64+0x2a/0x50
> [  195.536963]  kernel_move_pages+0x667/0x8c0
> [  195.559616]  ? __handle_mm_fault+0xb95/0x1370
> [  195.588765]  __x64_sys_move_pages+0x24/0x30
> [  195.611439]  do_syscall_64+0x5b/0x160
> [  195.631901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [  195.657790] RIP: 0033:0x7f166f5ff959
> [  195.676365] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08
> 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 45 2c 00 f7 d8 64 89 01 48
> [  195.772938] RSP: 002b:00007ffd8d77bb48 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
> [  195.810207] RAX: ffffffffffffffda RBX: 0000000000000400 RCX: 00007f166f5ff959
> [  195.847522] RDX: 0000000002303400 RSI: 0000000000000400 RDI: 0000000000000360
> [  195.882327] RBP: 0000000000000400 R08: 0000000002306420 R09: 0000000000000004
> [  195.920017] R10: 0000000002305410 R11: 0000000000000246 R12: 0000000002303400
> [  195.958053] R13: 0000000002305410 R14: 0000000002306420 R15: 0000000000000003
> [  195.997028] Modules linked in: sunrpc amd64_edac_mod ipmi_ssif edac_mce_amd kvm_amd ipmi_si igb ipmi_devintf k10temp kvm pcspkr ipmi_msgha
> ndler joydev irqbypass sp5100_tco dca hpwdt hpilo i2c_piix4 xfs libcrc32c radeon i2c_algo_bit drm_kms_helper ttm ata_generic pata_acpi drm se
> rio_raw pata_atiixp
> [  196.134162] CR2: 0000000000000030
> [  196.152788] ---[ end trace 4420ea5061342d3e ]---
> 
> Suspected commit is:
>   b43a99900559 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
> which adds to unmap_and_move_huge_page():
> +               struct address_space *mapping = page_mapping(hpage);
> +
> +               /*
> +                * try_to_unmap could potentially call huge_pmd_unshare.
> +                * Because of this, take semaphore in write mode here and
> +                * set TTU_RMAP_LOCKED to let lower levels know we have
> +                * taken the lock.
> +                */
> +               i_mmap_lock_write(mapping);
> 
> If I'm reading this right, 'mapping' will be NULL for anon mappings.

Not exactly.

In the anon case, mapping will point to a 'struct anon_vma *'.  But
i_mmap_lock_write expects a 'struct address_space *'.  I believe this
is the source of the NULL pointer dereference in down_write.

Not sure what is happening in the MAP_SHARED case.  My test, which does
something similar, does not have issues.

In any case, the commit is bad.  I will investigate further.
-- 
Mike Kravetz

> Running same test with s/MAP_PRIVATE/MAP_SHARED/ leads to user-space
> hanging at:
> 
> # cat /proc/23654/stack
> [<0>] io_schedule+0x12/0x40
> [<0>] __lock_page+0x13c/0x200
> [<0>] remove_inode_hugepages+0x275/0x300
> [<0>] hugetlbfs_evict_inode+0x2e/0x60
> [<0>] evict+0xcb/0x190
> [<0>] __dentry_kill+0xce/0x160
> [<0>] dentry_kill+0x47/0x170
> [<0>] dput.part.33+0xc6/0x100
> [<0>] __fput+0x105/0x230
> [<0>] task_work_run+0x84/0xa0
> [<0>] exit_to_usermode_loop+0xd3/0xe0
> [<0>] do_syscall_64+0x14d/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [<0>] 0xffffffffffffffff
> 
> # cat /proc/23655/stack
> [<0>] call_rwsem_down_read_failed+0x14/0x30
> [<0>] rmap_walk_file+0x1c1/0x2f0
> [<0>] remove_migration_ptes+0x6d/0x80
> [<0>] migrate_pages+0x86a/0xcb0
> [<0>] do_move_pages_to_node.isra.63.part.64+0x2a/0x50
> [<0>] kernel_move_pages+0x667/0x8c0
> [<0>] __x64_sys_move_pages+0x24/0x30
> [<0>] do_syscall_64+0x5b/0x160
> [<0>] entry_SYSCALL_64_after_hwframe+0x44/0xa9
> [<0>] 0xffffffffffffffff
> 
> Regards,
> Jan
> 
> [1] https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/move_pages/move_pages12.c
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug] problems with migration of huge pages with v4.20-10214-ge1ef035d272e
  2019-01-02 21:24   ` Mike Kravetz
@ 2019-01-03  1:44     ` Mike Kravetz
  2019-01-03 12:47       ` Jan Stancek
  0 siblings, 1 reply; 10+ messages in thread
From: Mike Kravetz @ 2019-01-03  1:44 UTC (permalink / raw)
  To: Jan Stancek, linux-mm, kirill.shutemov, Andrew Morton
  Cc: ltp, mhocko, Rachel Sibley, hughd, n-horiguchi, aarcange,
	aneesh.kumar, dave, prakash.sangappa, colin.king

On 1/2/19 1:24 PM, Mike Kravetz wrote:
> On 1/2/19 12:30 PM, Jan Stancek wrote:
>> Hi,
>>
>> LTP move_pages12 [1] started failing recently.
>>
>> The test maps/unmaps some anonymous private huge pages
>> and migrates them between 2 nodes. This now reliably
>> hits NULL ptr deref:
>>
>> [  194.819357] BUG: unable to handle kernel NULL pointer dereference at 0000000000000030
>> [  194.864410] #PF error: [WRITE]
>> [  194.881502] PGD 22c758067 P4D 22c758067 PUD 235177067 PMD 0
>> [  194.913833] Oops: 0002 [#1] SMP NOPTI
>> [  194.935062] CPU: 0 PID: 865 Comm: move_pages12 Not tainted 4.20.0+ #1
>> [  194.972993] Hardware name: HP ProLiant SL335s G7/, BIOS A24 12/08/2012
>> [  195.005359] RIP: 0010:down_write+0x1b/0x40
>> [  195.028257] Code: 00 5c 01 00 48 83 c8 03 48 89 43 20 5b c3 90 0f 1f 44 00 00 53 48 89 fb e8 d2 d7 ff ff 48 89 d8 48 ba 01 00 00 00 ff ff
>> ff ff <f0> 48 0f c1 10 85 d2 74 05 e8 07 26 ff ff 65 48 8b 04 25 00 5c 01
>> [  195.121836] RSP: 0018:ffffb87e4224fd00 EFLAGS: 00010246
>> [  195.147097] RAX: 0000000000000030 RBX: 0000000000000030 RCX: 0000000000000000
>> [  195.185096] RDX: ffffffff00000001 RSI: ffffffffa69d30f0 RDI: 0000000000000030
>> [  195.219251] RBP: 0000000000000030 R08: ffffe7d4889d8008 R09: 0000000000000003
>> [  195.258291] R10: 000000000000000f R11: ffffe7d4889d8008 R12: ffffe7d4889d0008
>> [  195.294547] R13: ffffe7d490b78000 R14: ffffe7d4889d0000 R15: ffff8be9b2ba4580
>> [  195.332532] FS:  00007f1670112b80(0000) GS:ffff8be9b7a00000(0000) knlGS:0000000000000000
>> [  195.373888] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
>> [  195.405938] CR2: 0000000000000030 CR3: 000000023477e000 CR4: 00000000000006f0
>> [  195.443579] Call Trace:
>> [  195.456876]  migrate_pages+0x833/0xcb0
>> [  195.478070]  ? __ia32_compat_sys_migrate_pages+0x20/0x20
>> [  195.506027]  do_move_pages_to_node.isra.63.part.64+0x2a/0x50
>> [  195.536963]  kernel_move_pages+0x667/0x8c0
>> [  195.559616]  ? __handle_mm_fault+0xb95/0x1370
>> [  195.588765]  __x64_sys_move_pages+0x24/0x30
>> [  195.611439]  do_syscall_64+0x5b/0x160
>> [  195.631901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
>> [  195.657790] RIP: 0033:0x7f166f5ff959
>> [  195.676365] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08
>> 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 45 2c 00 f7 d8 64 89 01 48
>> [  195.772938] RSP: 002b:00007ffd8d77bb48 EFLAGS: 00000246 ORIG_RAX: 0000000000000117
>> [  195.810207] RAX: ffffffffffffffda RBX: 0000000000000400 RCX: 00007f166f5ff959
>> [  195.847522] RDX: 0000000002303400 RSI: 0000000000000400 RDI: 0000000000000360
>> [  195.882327] RBP: 0000000000000400 R08: 0000000002306420 R09: 0000000000000004
>> [  195.920017] R10: 0000000002305410 R11: 0000000000000246 R12: 0000000002303400
>> [  195.958053] R13: 0000000002305410 R14: 0000000002306420 R15: 0000000000000003
>> [  195.997028] Modules linked in: sunrpc amd64_edac_mod ipmi_ssif edac_mce_amd kvm_amd ipmi_si igb ipmi_devintf k10temp kvm pcspkr ipmi_msgha
>> ndler joydev irqbypass sp5100_tco dca hpwdt hpilo i2c_piix4 xfs libcrc32c radeon i2c_algo_bit drm_kms_helper ttm ata_generic pata_acpi drm se
>> rio_raw pata_atiixp
>> [  196.134162] CR2: 0000000000000030
>> [  196.152788] ---[ end trace 4420ea5061342d3e ]---
>>
>> Suspected commit is:
>>   b43a99900559 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
>> which adds to unmap_and_move_huge_page():
>> +               struct address_space *mapping = page_mapping(hpage);
>> +
>> +               /*
>> +                * try_to_unmap could potentially call huge_pmd_unshare.
>> +                * Because of this, take semaphore in write mode here and
>> +                * set TTU_RMAP_LOCKED to let lower levels know we have
>> +                * taken the lock.
>> +                */
>> +               i_mmap_lock_write(mapping);
>>
>> If I'm reading this right, 'mapping' will be NULL for anon mappings.
> 
> Not exactly.
> 

Well, yes exactly.  Sorry, I was confusing this with something else.

That commit does cause BUGs for migration and page poisoning of anon huge
pages.  The patch was trying to take care of i_mmap_rwsem locking outside
try_to_unmap infrastructure.  This is because try_to_unmap will take the
semaphore in read mode (for file mappings) and we really need it to be
taken in write mode.

The patch below continues to take the semaphore outside try_to_unmap for
the file mapping case.  For anon mappings, the locking is done as a special
case in try_to_unmap_one.  This is something I was trying to avoid as it
is harder to follow/understand.  Any suggestions on how to restructure this
or make it more clear are welcome.

Adding Andrew on Cc as he already sent the commit causing the BUGs upstream.

From: Mike Kravetz <mike.kravetz@oracle.com>

hugetlbfs: fix migration and poisoning of anon huge pages

Expanded use of i_mmap_rwsem for pmd sharing synchronization incorrectly
used page_mapping() of anon huge pages to get to the address_space
i_mmap_rwsem.  Since page_mapping() is NULL for pages of anon mappings,
an "unable to handle kernel NULL pointer" BUG would occur with a stack
similar to:

RIP: 0010:down_write+0x1b/0x40
Call Trace:
 migrate_pages+0x81f/0xb90
 __ia32_compat_sys_migrate_pages+0x190/0x190
 do_move_pages_to_node.isra.53.part.54+0x2a/0x50
 kernel_move_pages+0x566/0x7b0
 __x64_sys_move_pages+0x24/0x30
 do_syscall_64+0x5b/0x180
 entry_SYSCALL_64_after_hwframe+0x44/0xa9

To fix, only use page_mapping() for non-anon or file pages.  For anon
pages wait until we find a vma in which the page is mapped and get the
address_space from vm_file.

Fixes: b43a99900559 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization")
Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
---
 mm/memory-failure.c | 15 +++++++++++----
 mm/migrate.c        | 34 +++++++++++++++++++++++-----------
 mm/rmap.c           | 15 ++++++++++++---
 3 files changed, 46 insertions(+), 18 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index 93558fb981fb..f229cbd0b347 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -1032,14 +1032,21 @@ static bool hwpoison_user_mappings(struct page *p, unsigned long pfn,
 		unmap_success = try_to_unmap(hpage, ttu);
 	} else if (mapping) {
 		/*
-		 * For hugetlb pages, try_to_unmap could potentially call
-		 * huge_pmd_unshare.  Because of this, take semaphore in
-		 * write mode here and set TTU_RMAP_LOCKED to indicate we
-		 * have taken the lock at this higer level.
+		 * For file mappings, take semaphore in write mode here and
+		 * set TTU_RMAP_LOCKED to let lower levels know we have taken
+		 * the lock.  This is in case lower levels call
+		 * huge_pmd_unshare.  Without this, try_to_unmap would only
+		 * take the semaphore in read mode.
 		 */
 		i_mmap_lock_write(mapping);
 		unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
 		i_mmap_unlock_write(mapping);
+	} else {
+		/*
+		 * For huge page anon mappings, try_to_unmap_one will take the
+		 * i_mmap_rwsem before calling huge_pmd_unshare if necessary.
+		 */
+		unmap_success = try_to_unmap(hpage, ttu);
 	}
 	if (!unmap_success)
 		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
diff --git a/mm/migrate.c b/mm/migrate.c
index 725edaef238a..45d7dd0c9479 100644
--- a/mm/migrate.c
+++ b/mm/migrate.c
@@ -1309,17 +1309,29 @@ static int unmap_and_move_huge_page(new_page_t get_new_page,
 	if (page_mapped(hpage)) {
 		struct address_space *mapping = page_mapping(hpage);

-		/*
-		 * try_to_unmap could potentially call huge_pmd_unshare.
-		 * Because of this, take semaphore in write mode here and
-		 * set TTU_RMAP_LOCKED to let lower levels know we have
-		 * taken the lock.
-		 */
-		i_mmap_lock_write(mapping);
-		try_to_unmap(hpage,
-			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
-			TTU_RMAP_LOCKED);
-		i_mmap_unlock_write(mapping);
+		if (mapping) {
+			/*
+			 * For file mappings, take semaphore in write mode here
+			 * and set TTU_RMAP_LOCKED to let lower levels know we
+			 * have taken the lock.  This is in case lower levels
+			 * call huge_pmd_unshare.  Without this, try_to_unmap
+			 * would only take the semaphore in read mode.
+			 */
+			i_mmap_lock_write(mapping);
+			try_to_unmap(hpage,
+				TTU_MIGRATION|TTU_IGNORE_MLOCK|
+				TTU_IGNORE_ACCESS|TTU_RMAP_LOCKED);
+			i_mmap_unlock_write(mapping);
+		} else {
+			/*
+			 * For anon mappings, try_to_unmap_one will take the
+			 * i_mmap_rwsem before calling huge_pmd_unshare if
+			 * necessary.
+			 */
+			try_to_unmap(hpage,
+				TTU_MIGRATION|TTU_IGNORE_MLOCK|
+				TTU_IGNORE_ACCESS);
+		}
 		page_was_mapped = 1;
 	}

diff --git a/mm/rmap.c b/mm/rmap.c
index c566bd552535..b267cc084f92 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1375,11 +1375,17 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 		/*
 		 * If sharing is possible, start and end will be adjusted
 		 * accordingly.
-		 *
-		 * If called for a huge page, caller must hold i_mmap_rwsem
-		 * in write mode as it is possible to call huge_pmd_unshare.
 		 */
 		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
+
+		/*
+		 * If called for a huge page file mapping, caller will hold
+		 * i_mmap_rwsem in write mode.  For anon mappings, we must
+		 * take the semaphore here.  All this is necessary because
+		 * it is possible huge_pmd_unshare will 'unshare' a pmd.
+		 */
+		if (PageAnon(page))
+			i_mmap_lock_write(vma->vm_file->f_mapping);
 	}
 	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);

@@ -1655,6 +1661,9 @@ static bool try_to_unmap_one(struct page *page, struct vm_area_struct *vma,
 	}

 	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
+	/* For anon huge pages, we must unlock. */
+	if (PageHuge(page) && PageAnon(page))
+		i_mmap_unlock_write(vma->vm_file->f_mapping);

 	return ret;
 }
-- 
2.17.2

^ permalink raw reply related	[flat|nested] 10+ messages in thread

* Re: [bug] problems with migration of huge pages with v4.20-10214-ge1ef035d272e
  2019-01-03  1:44     ` Mike Kravetz
@ 2019-01-03 12:47       ` Jan Stancek
  2019-01-03 12:47         ` Jan Stancek
  2019-01-03 17:06         ` Jan Stancek
  0 siblings, 2 replies; 10+ messages in thread
From: Jan Stancek @ 2019-01-03 12:47 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, kirill shutemov, Andrew Morton, ltp, mhocko,
	Rachel Sibley, hughd, n-horiguchi, aarcange, aneesh kumar, dave,
	prakash sangappa, colin king



----- Original Message -----
> On 1/2/19 1:24 PM, Mike Kravetz wrote:
> > On 1/2/19 12:30 PM, Jan Stancek wrote:
> >> Hi,
> >>
> >> LTP move_pages12 [1] started failing recently.
> >>
> >> The test maps/unmaps some anonymous private huge pages
> >> and migrates them between 2 nodes. This now reliably
> >> hits NULL ptr deref:
> >>
> >> [  194.819357] BUG: unable to handle kernel NULL pointer dereference at
> >> 0000000000000030
> >> [  194.864410] #PF error: [WRITE]
> >> [  194.881502] PGD 22c758067 P4D 22c758067 PUD 235177067 PMD 0
> >> [  194.913833] Oops: 0002 [#1] SMP NOPTI
> >> [  194.935062] CPU: 0 PID: 865 Comm: move_pages12 Not tainted 4.20.0+ #1
> >> [  194.972993] Hardware name: HP ProLiant SL335s G7/, BIOS A24 12/08/2012
> >> [  195.005359] RIP: 0010:down_write+0x1b/0x40
> >> [  195.028257] Code: 00 5c 01 00 48 83 c8 03 48 89 43 20 5b c3 90 0f 1f 44
> >> 00 00 53 48 89 fb e8 d2 d7 ff ff 48 89 d8 48 ba 01 00 00 00 ff ff
> >> ff ff <f0> 48 0f c1 10 85 d2 74 05 e8 07 26 ff ff 65 48 8b 04 25 00 5c 01
> >> [  195.121836] RSP: 0018:ffffb87e4224fd00 EFLAGS: 00010246
> >> [  195.147097] RAX: 0000000000000030 RBX: 0000000000000030 RCX:
> >> 0000000000000000
> >> [  195.185096] RDX: ffffffff00000001 RSI: ffffffffa69d30f0 RDI:
> >> 0000000000000030
> >> [  195.219251] RBP: 0000000000000030 R08: ffffe7d4889d8008 R09:
> >> 0000000000000003
> >> [  195.258291] R10: 000000000000000f R11: ffffe7d4889d8008 R12:
> >> ffffe7d4889d0008
> >> [  195.294547] R13: ffffe7d490b78000 R14: ffffe7d4889d0000 R15:
> >> ffff8be9b2ba4580
> >> [  195.332532] FS:  00007f1670112b80(0000) GS:ffff8be9b7a00000(0000)
> >> knlGS:0000000000000000
> >> [  195.373888] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> [  195.405938] CR2: 0000000000000030 CR3: 000000023477e000 CR4:
> >> 00000000000006f0
> >> [  195.443579] Call Trace:
> >> [  195.456876]  migrate_pages+0x833/0xcb0
> >> [  195.478070]  ? __ia32_compat_sys_migrate_pages+0x20/0x20
> >> [  195.506027]  do_move_pages_to_node.isra.63.part.64+0x2a/0x50
> >> [  195.536963]  kernel_move_pages+0x667/0x8c0
> >> [  195.559616]  ? __handle_mm_fault+0xb95/0x1370
> >> [  195.588765]  __x64_sys_move_pages+0x24/0x30
> >> [  195.611439]  do_syscall_64+0x5b/0x160
> >> [  195.631901]  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >> [  195.657790] RIP: 0033:0x7f166f5ff959
> >> [  195.676365] Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 48
> >> 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08
> >> 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 17 45 2c 00 f7 d8 64 89 01 48
> >> [  195.772938] RSP: 002b:00007ffd8d77bb48 EFLAGS: 00000246 ORIG_RAX:
> >> 0000000000000117
> >> [  195.810207] RAX: ffffffffffffffda RBX: 0000000000000400 RCX:
> >> 00007f166f5ff959
> >> [  195.847522] RDX: 0000000002303400 RSI: 0000000000000400 RDI:
> >> 0000000000000360
> >> [  195.882327] RBP: 0000000000000400 R08: 0000000002306420 R09:
> >> 0000000000000004
> >> [  195.920017] R10: 0000000002305410 R11: 0000000000000246 R12:
> >> 0000000002303400
> >> [  195.958053] R13: 0000000002305410 R14: 0000000002306420 R15:
> >> 0000000000000003
> >> [  195.997028] Modules linked in: sunrpc amd64_edac_mod ipmi_ssif
> >> edac_mce_amd kvm_amd ipmi_si igb ipmi_devintf k10temp kvm pcspkr
> >> ipmi_msgha
> >> ndler joydev irqbypass sp5100_tco dca hpwdt hpilo i2c_piix4 xfs libcrc32c
> >> radeon i2c_algo_bit drm_kms_helper ttm ata_generic pata_acpi drm se
> >> rio_raw pata_atiixp
> >> [  196.134162] CR2: 0000000000000030
> >> [  196.152788] ---[ end trace 4420ea5061342d3e ]---
> >>
> >> Suspected commit is:
> >>   b43a99900559 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
> >>   synchronization")
> >> which adds to unmap_and_move_huge_page():
> >> +               struct address_space *mapping = page_mapping(hpage);
> >> +
> >> +               /*
> >> +                * try_to_unmap could potentially call huge_pmd_unshare.
> >> +                * Because of this, take semaphore in write mode here and
> >> +                * set TTU_RMAP_LOCKED to let lower levels know we have
> >> +                * taken the lock.
> >> +                */
> >> +               i_mmap_lock_write(mapping);
> >>
> >> If I'm reading this right, 'mapping' will be NULL for anon mappings.
> > 
> > Not exactly.
> > 
> 
> Well, yes exactly.  Sorry, I was confusing this with something else.
> 
> That commit does cause BUGs for migration and page poisoning of anon huge
> pages.  The patch was trying to take care of i_mmap_rwsem locking outside
> try_to_unmap infrastructure.  This is because try_to_unmap will take the
> semaphore in read mode (for file mappings) and we really need it to be
> taken in write mode.
> 
> The patch below continues to take the semaphore outside try_to_unmap for
> the file mapping case.  For anon mappings, the locking is done as a special
> case in try_to_unmap_one.  This is something I was trying to avoid as it
> it harder to follow/understand.  Any suggestions on how to restructure this
> or make it more clear are welcome.
> 
> Adding Andrew on Cc as he already sent the commit causing the BUGs upstream.
> 
> From: Mike Kravetz <mike.kravetz@oracle.com>
> 
> hugetlbfs: fix migration and poisoning of anon huge pages
> 
> Expanded use of i_mmap_rwsem for pmd sharing synchronization incorrectly
> used page_mapping() of anon huge pages to get to address_space
> i_mmap_rwsem.  Since page_mapping() is NULL for pages of anon mappings,
> an "unable to handle kernel NULL pointer" BUG would occur with stack
> similar to:
> 
> RIP: 0010:down_write+0x1b/0x40
> Call Trace:
>  migrate_pages+0x81f/0xb90
>  __ia32_compat_sys_migrate_pages+0x190/0x190
>  do_move_pages_to_node.isra.53.part.54+0x2a/0x50
>  kernel_move_pages+0x566/0x7b0
>  __x64_sys_move_pages+0x24/0x30
>  do_syscall_64+0x5b/0x180
>  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> 
> To fix, only use page_mapping() for non-anon or file pages.  For anon
> pages wait until we find a vma in which the page is mapped and get the
> address_space from vm_file.
> 
> Fixes: b43a99900559 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
> synchronization")
> Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>

Mike,

1) with LTP move_pages12 (the MAP_PRIVATE version of the reproducer)
The patch below fixes the panic for me.
It didn't apply cleanly to the latest master, but the conflicts were easy to resolve.

2) with the MAP_SHARED version of the reproducer
It still hangs in user-space.
The v4.19 kernel appears to work fine, so I've started a bisect.

Regards,
Jan

> ---
>  mm/memory-failure.c | 15 +++++++++++----
>  mm/migrate.c        | 34 +++++++++++++++++++++++-----------
>  mm/rmap.c           | 15 ++++++++++++---
>  3 files changed, 46 insertions(+), 18 deletions(-)
> 
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index 93558fb981fb..f229cbd0b347 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -1032,14 +1032,21 @@ static bool hwpoison_user_mappings(struct page *p,
> unsigned long pfn,
>  		unmap_success = try_to_unmap(hpage, ttu);
>  	} else if (mapping) {
>  		/*
> -		 * For hugetlb pages, try_to_unmap could potentially call
> -		 * huge_pmd_unshare.  Because of this, take semaphore in
> -		 * write mode here and set TTU_RMAP_LOCKED to indicate we
> -		 * have taken the lock at this higer level.
> +		 * For file mappings, take semaphore in write mode here and
> +		 * set TTU_RMAP_LOCKED to let lower levels know we have taken
> +		 * the lock.  This is in case lower levels call
> +		 * huge_pmd_unshare.  Without this, try_to_unmap would only
> +		 * take the semaphore in read mode.
>  		 */
>  		i_mmap_lock_write(mapping);
>  		unmap_success = try_to_unmap(hpage, ttu|TTU_RMAP_LOCKED);
>  		i_mmap_unlock_write(mapping);
> +	} else {
> +		/*
> +		 * For huge page anon mappings, try_to_unmap_one will take the
> +		 * i_mmap_rwsem before calling huge_pmd_unshare if necessary.
> +		 */
> +		unmap_success = try_to_unmap(hpage, ttu);
>  	}
>  	if (!unmap_success)
>  		pr_err("Memory failure: %#lx: failed to unmap page (mapcount=%d)\n",
> diff --git a/mm/migrate.c b/mm/migrate.c
> index 725edaef238a..45d7dd0c9479 100644
> --- a/mm/migrate.c
> +++ b/mm/migrate.c
> @@ -1309,17 +1309,29 @@ static int unmap_and_move_huge_page(new_page_t
> get_new_page,
>  	if (page_mapped(hpage)) {
>  		struct address_space *mapping = page_mapping(hpage);
> 
> -		/*
> -		 * try_to_unmap could potentially call huge_pmd_unshare.
> -		 * Because of this, take semaphore in write mode here and
> -		 * set TTU_RMAP_LOCKED to let lower levels know we have
> -		 * taken the lock.
> -		 */
> -		i_mmap_lock_write(mapping);
> -		try_to_unmap(hpage,
> -			TTU_MIGRATION|TTU_IGNORE_MLOCK|TTU_IGNORE_ACCESS|
> -			TTU_RMAP_LOCKED);
> -		i_mmap_unlock_write(mapping);
> +		if (mapping) {
> +			/*
> +			 * For file mappings, take semaphore in write mode here
> +			 * and set TTU_RMAP_LOCKED to let lower levels know we
> +			 * have taken the lock.  This is in case lower levels
> +			 * call huge_pmd_unshare.  Without this, try_to_unmap
> +			 * would only take the semaphore in read mode.
> +			 */
> +			i_mmap_lock_write(mapping);
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS|TTU_RMAP_LOCKED);
> +			i_mmap_unlock_write(mapping);
> +		} else {
> +			/*
> +			 * For anon mappings, try_to_unmap_one will take the
> +			 * i_mmap_rwsem before calling huge_pmd_unshare if
> +			 * necessary.
> +			 */
> +			try_to_unmap(hpage,
> +				TTU_MIGRATION|TTU_IGNORE_MLOCK|
> +				TTU_IGNORE_ACCESS);
> +		}
>  		page_was_mapped = 1;
>  	}
> 
> diff --git a/mm/rmap.c b/mm/rmap.c
> index c566bd552535..b267cc084f92 100644
> --- a/mm/rmap.c
> +++ b/mm/rmap.c
> @@ -1375,11 +1375,17 @@ static bool try_to_unmap_one(struct page *page,
> struct
> vm_area_struct *vma,
>  		/*
>  		 * If sharing is possible, start and end will be adjusted
>  		 * accordingly.
> -		 *
> -		 * If called for a huge page, caller must hold i_mmap_rwsem
> -		 * in write mode as it is possible to call huge_pmd_unshare.
>  		 */
>  		adjust_range_if_pmd_sharing_possible(vma, &start, &end);
> +
> +		/*
> +		 * If called for a huge page file mapping, caller will hold
> +		 * i_mmap_rwsem in write mode.  For anon mappings, we must
> +		 * take the semaphore here.  All this is necessary because
> +		 * it is possible huge_pmd_unshare will 'unshare' a pmd.
> +		 */
> +		if (PageAnon(page))
> +			i_mmap_lock_write(vma->vm_file->f_mapping);
>  	}
>  	mmu_notifier_invalidate_range_start(vma->vm_mm, start, end);
> 
> @@ -1655,6 +1661,9 @@ static bool try_to_unmap_one(struct page *page, struct
> vm_area_struct *vma,
>  	}
> 
>  	mmu_notifier_invalidate_range_end(vma->vm_mm, start, end);
> +	/* For anon huge pages, we must unlock. */
> +	if (PageHuge(page) && PageAnon(page))
> +		i_mmap_unlock_write(vma->vm_file->f_mapping);
> 
>  	return ret;
>  }
> --
> 2.17.2
> 
> 

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug] problems with migration of huge pages with v4.20-10214-ge1ef035d272e
  2019-01-03 12:47       ` Jan Stancek
  2019-01-03 12:47         ` Jan Stancek
@ 2019-01-03 17:06         ` Jan Stancek
  2019-01-03 17:06           ` Jan Stancek
  2019-01-03 21:44           ` Mike Kravetz
  1 sibling, 2 replies; 10+ messages in thread
From: Jan Stancek @ 2019-01-03 17:06 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-mm, kirill shutemov, Andrew Morton, ltp, mhocko,
	Rachel Sibley, hughd, n-horiguchi, aarcange, aneesh kumar, dave,
	prakash sangappa, colin king



----- Original Message -----
<snip>

> > That commit does cause BUGs for migration and page poisoning of anon huge
> > pages.  The patch was trying to take care of i_mmap_rwsem locking outside
> > try_to_unmap infrastructure.  This is because try_to_unmap will take the
> > semaphore in read mode (for file mappings) and we really need it to be
> > taken in write mode.
> > 
> > The patch below continues to take the semaphore outside try_to_unmap for
> > the file mapping case.  For anon mappings, the locking is done as a special
> > case in try_to_unmap_one.  This is something I was trying to avoid as it
> > it harder to follow/understand.  Any suggestions on how to restructure this
> > or make it more clear are welcome.
> > 
> > Adding Andrew on Cc as he already sent the commit causing the BUGs
> > upstream.
> > 
> > From: Mike Kravetz <mike.kravetz@oracle.com>
> > 
> > hugetlbfs: fix migration and poisoning of anon huge pages
> > 
> > Expanded use of i_mmap_rwsem for pmd sharing synchronization incorrectly
> > used page_mapping() of anon huge pages to get to address_space
> > i_mmap_rwsem.  Since page_mapping() is NULL for pages of anon mappings,
> > an "unable to handle kernel NULL pointer" BUG would occur with stack
> > similar to:
> > 
> > RIP: 0010:down_write+0x1b/0x40
> > Call Trace:
> >  migrate_pages+0x81f/0xb90
> >  __ia32_compat_sys_migrate_pages+0x190/0x190
> >  do_move_pages_to_node.isra.53.part.54+0x2a/0x50
> >  kernel_move_pages+0x566/0x7b0
> >  __x64_sys_move_pages+0x24/0x30
> >  do_syscall_64+0x5b/0x180
> >  entry_SYSCALL_64_after_hwframe+0x44/0xa9
> > 
> > To fix, only use page_mapping() for non-anon or file pages.  For anon
> > pages wait until we find a vma in which the page is mapped and get the
> > address_space from vm_file.
> > 
> > Fixes: b43a99900559 ("hugetlbfs: use i_mmap_rwsem for more pmd sharing
> > synchronization")
> > Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com>
> 
> Mike,
> 
> 1) with LTP move_pages12 (MAP_PRIVATE version of reproducer)
> Patch below fixes the panic for me.
> It didn't apply cleanly to latest master, but conflicts were easy to resolve.
> 
> 2) with MAP_SHARED version of reproducer
> It still hangs in user-space.
> v4.19 kernel appears to work fine so I've started a bisect.

My bisect with the MAP_SHARED version arrived at the same 2 commits:
  c86aa7bbfd55 hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
  b43a99900559 hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization

Maybe a deadlock between the page lock and mapping->i_mmap_rwsem?

thread1:
  hugetlbfs_evict_inode
    i_mmap_lock_write(mapping);
    remove_inode_hugepages
      lock_page(page);

thread2:
  __unmap_and_move
    trylock_page(page) / lock_page(page)
      remove_migration_ptes
        rmap_walk_file
          i_mmap_lock_read(mapping);

Here's strace output:
<snip>
1196  11:27:16 mmap(NULL, 4194304, PROT_READ|PROT_WRITE, MAP_SHARED|MAP_ANONYMOUS|MAP_HUGETLB, -1, 0) = 0x7f646c400000
1197  11:27:16 set_robust_list(0x7f646d5b0e60, 24) = 0
1197  11:27:16 getppid()                = 1196
1197  11:27:16 move_pages(1196, 1024, [0x7f646c400000, 0x7f646c401000, 0x7f646c402000, 0x7f646c403000, 0x7f646c404000, 0x7f646c405000, 0x7f646c406000, 0x7f646c407000, 0x7f646c408000, 0x7f646c409000, 0x7f646c40a000, 0x7f646c40b000, 0x7f646c40c000, 0x7f646c40d000, 0x7f646c40e000, 0x7f646c40f000, 0x7f646c410000, 0x7f646c411000, 0x7f646c412000, 0x7f646c413000, 0x7f646c414000, 0x7f646c415000, 0x7f646c416000, 0x7f646c417000, 0x7f646c418000, 0x7f646c419000, 0x7f646c41a000, 0x7f646c41b000, 0x7f646c41c000, 0x7f646c41d000, 0x7f646c41e000, 0x7f646c41f000, ...], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...], [-ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, -ENOENT, ...], MPOL_MF_MOVE_ALL) = 0
1197  11:27:16 move_pages(1196, 1024, [0x7f646c400000, 0x7f646c401000, 0x7f646c402000, 0x7f646c403000, 0x7f646c404000, 0x7f646c405000, 0x7f646c406000, 0x7f646c407000, 0x7f646c408000, 0x7f646c409000, 0x7f646c40a000, 0x7f646c40b000, 0x7f646c40c000, 0x7f646c40d000, 0x7f646c40e000, 0x7f646c40f000, 0x7f646c410000, 0x7f646c411000, 0x7f646c412000, 0x7f646c413000, 0x7f646c414000, 0x7f646c415000, 0x7f646c416000, 0x7f646c417000, 0x7f646c418000, 0x7f646c419000, 0x7f646c41a000, 0x7f646c41b000, 0x7f646c41c000, 0x7f646c41d000, 0x7f646c41e000, 0x7f646c41f000, ...], [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...], [1, -EACCES, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, ...], MPOL_MF_MOVE_ALL) = 0
1197  11:27:16 move_pages(1196, 1024, [0x7f646c400000, 0x7f646c401000, 0x7f646c402000, 0x7f646c403000, 0x7f646c404000, 0x7f646c405000, 0x7f646c406000, 0x7f646c407000, 0x7f646c408000, 0x7f646c409000, 0x7f646c40a000, 0x7f646c40b000, 0x7f646c40c000, 0x7f646c40d000, 0x7f646c40e000, 0x7f646c40f000, 0x7f646c410000, 0x7f646c411000, 0x7f646c412000, 0x7f646c413000, 0x7f646c414000, 0x7f646c415000, 0x7f646c416000, 0x7f646c417000, 0x7f646c418000, 0x7f646c419000, 0x7f646c41a000, 0x7f646c41b000, 0x7f646c41c000, 0x7f646c41d000, 0x7f646c41e000, 0x7f646c41f000, ...], [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, ...],  <unfinished ...>
1196  11:27:16 munmap(0x7f646c400000, 4194304 <unfinished ...>
<hangs>

Regards,
Jan

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug] problems with migration of huge pages with v4.20-10214-ge1ef035d272e
  2019-01-03 17:06         ` Jan Stancek
  2019-01-03 17:06           ` Jan Stancek
@ 2019-01-03 21:44           ` Mike Kravetz
  2019-01-03 21:59             ` Andrew Morton
  1 sibling, 1 reply; 10+ messages in thread
From: Mike Kravetz @ 2019-01-03 21:44 UTC (permalink / raw)
  To: Jan Stancek, Andrew Morton
  Cc: linux-mm, kirill shutemov, ltp, mhocko, Rachel Sibley, hughd,
	n-horiguchi, aarcange, aneesh kumar, dave, prakash sangappa,
	colin king

On 1/3/19 9:06 AM, Jan Stancek wrote:
<snip>
>> 1) with LTP move_pages12 (MAP_PRIVATE version of reproducer)
>> Patch below fixes the panic for me.
>> It didn't apply cleanly to latest master, but conflicts were easy to resolve.
>>
>> 2) with MAP_SHARED version of reproducer
>> It still hangs in user-space.
>> v4.19 kernel appears to work fine so I've started a bisect.
> 
> My bisect with MAP_SHARED version arrived at same 2 commits:
>   c86aa7bbfd55 hugetlbfs: Use i_mmap_rwsem to fix page fault/truncate race
>   b43a99900559 hugetlbfs: use i_mmap_rwsem for more pmd sharing synchronization
> 
> Maybe a deadlock between page lock and mapping->i_mmap_rwsem?
> 
> thread1:
>   hugetlbfs_evict_inode
>     i_mmap_lock_write(mapping);
>     remove_inode_hugepages
>       lock_page(page);
> 
> thread2:
>   __unmap_and_move
>     trylock_page(page) / lock_page(page)
>       remove_migration_ptes
>         rmap_walk_file
>           i_mmap_lock_read(mapping);

Thanks Jan!  That is an ABBA deadlock. :(

Commit c86aa7bbfd55 ("Use i_mmap_rwsem to fix page fault/truncate race") is
the patch which causes remove_inode_hugepages to be called with i_mmap_rwsem
held in write mode.  Clearly, i_mmap_rwsem should not be held when calling
remove_inode_hugepages.  If you back out that patch, then the deadlock will
go away.
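
To spell out the inversion (a hand-annotated restatement of Jan's call
chains above, not actual kernel source), with A = mapping->i_mmap_rwsem
and B = the page lock:

thread1 (inode eviction, with c86aa7bbfd55 applied):
  i_mmap_lock_write(mapping);        /* holds A ... */
  remove_inode_hugepages
    lock_page(page);                 /* ... waits for B */

thread2 (migration):
  lock_page(page);                   /* holds B ... */
  remove_migration_ptes
    rmap_walk_file
      i_mmap_lock_read(mapping);     /* ... waits for A */

Neither side can make progress, which is why both the in-flight
move_pages() and the final munmap() in the reproducer hang.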

But, the whole point of that patch is to expand the locking so that
remove_inode_hugepages cannot race with a page fault.  If they do race,
hugetlbfs-specific metadata becomes inconsistent.  With some tweaks to
c86aa7bbfd55, I think we could make truncate/page fault races safe.  However,
the issue would still exist for hole punch/page fault races.  We need some
way to prevent page faults while in remove_inode_hugepages.

Andrew, it might be best to revert these patches.  I am not sure if all the
issues with this approach to synchronization can be fixed.  Fixing them would
likely require adding more 'special case' conditions to code paths.  The code
is already difficult to understand.  I'd like to step back and take another
look at the best way to fix these problems.  As mentioned before, the issues
these patches address have existed for at least 10 years.  AFAIK, they have
not been seen in real-world use cases.  They were discovered via code
inspection and can only be reproduced with highly targeted test programs.
So, waiting for another release cycle to get a better solution might be the
best approach.  I will continue to work on this, but if you agree that
backing out is the best approach for now, please let me know the process.
Do I simply send a 'revert' patch to you and the list?

-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 10+ messages in thread

* Re: [bug] problems with migration of huge pages with v4.20-10214-ge1ef035d272e
  2019-01-03 21:44           ` Mike Kravetz
@ 2019-01-03 21:59             ` Andrew Morton
  0 siblings, 0 replies; 10+ messages in thread
From: Andrew Morton @ 2019-01-03 21:59 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: Jan Stancek, linux-mm, kirill shutemov, ltp, mhocko,
	Rachel Sibley, hughd, n-horiguchi, aarcange, aneesh kumar, dave,
	prakash sangappa, colin king

On Thu, 3 Jan 2019 13:44:20 -0800 Mike Kravetz <mike.kravetz@oracle.com> wrote:

> Do I simply send a 'revert'
> patch to you and the list?

Probably that would be best: it gives us a changelog, makes sure that the
correct things are reverted, and means we have something which you have
tested.

^ permalink raw reply	[flat|nested] 10+ messages in thread

end of thread, other threads:[~2019-01-03 22:00 UTC | newest]

Thread overview: 10+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
     [not found] <1038135449.92986364.1546459244292.JavaMail.zimbra@redhat.com>
2019-01-02 20:30 ` [bug] problems with migration of huge pages with v4.20-10214-ge1ef035d272e Jan Stancek
2019-01-02 20:30   ` Jan Stancek
2019-01-02 21:24   ` Mike Kravetz
2019-01-03  1:44     ` Mike Kravetz
2019-01-03 12:47       ` Jan Stancek
2019-01-03 12:47         ` Jan Stancek
2019-01-03 17:06         ` Jan Stancek
2019-01-03 17:06           ` Jan Stancek
2019-01-03 21:44           ` Mike Kravetz
2019-01-03 21:59             ` Andrew Morton
