* resv_huge_page underflow with userfaultfd test @ 2021-05-07 21:21 Mina Almasry 2021-05-11 0:33 ` Mike Kravetz 0 siblings, 1 reply; 29+ messages in thread From: Mina Almasry @ 2021-05-07 21:21 UTC (permalink / raw) To: Mike Kravetz, Linux-MM, Axel Rasmussen, aarcange, peterx Hi folks, I ran into a bug that I'm not sure how to solve so I'm wondering if anyone has suggestions on what the issue could be and how to investigate. I added the WARN_ON_ONCE() here to catch instances of resv_huge_pages underflowing: diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 629aa4c2259c..7d763eed650f 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1165,7 +1165,21 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask); if (page && !avoid_reserve && vma_has_reserves(vma, chg)) { SetHPageRestoreReserve(page); + WARN_ON_ONCE(!h->resv_huge_pages); h->resv_huge_pages--; } And ran the userfaultfd selftests like so: echo 1024 > /proc/sys/vm/nr_hugepages mkdir -p /mnt/huge mount -t hugetlbfs none /mnt/huge ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 1024 200 /mnt/huge/userfaultfd_test And ran into this warning indicating this test does discover an underflow: [ 11.163403] ------------[ cut here ]------------ [ 11.163404] WARNING: CPU: 0 PID: 237 at mm/hugetlb.c:1178 alloc_huge_page+0x558/0x5a0 [ 11.163413] Modules linked in: [ 11.163419] CPU: 0 PID: 237 Comm: userfaultfd Not tainted 5.12.0-dbg-DEV #135 [ 11.163424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014 [ 11.163429] RIP: 0010:alloc_huge_page+0x558/0x5a0 [ 11.163432] Code: b0 00 0f 85 3d ff ff ff e9 2a ff ff ff be 01 00 00 00 48 89 df e8 18 e7 ff ff 48 f7 d8 4c 89 ef 48 89 c6 e8 da d7 ff ff eb 8c <0f> 0b 4d 8b 85 c0 00 00 00 e9 95 fd ff ff e8 35 59 84 00 4c 897 [ 11.163434] RSP: 0018:ffff94bb0073fc80 EFLAGS: 00010046 [ 11.163436] RAX: 0000000000000080 RBX: 0000000000000000 RCX: 5fa252c406a76700 [ 11.163438] RDX: 
c0000000ffff7fff RSI: 0000000000000004 RDI: 0000000000017ffd [ 11.163439] RBP: ffff94bb0073fcf8 R08: 0000000000000000 R09: ffffffff9813ba70 [ 11.163440] R10: 00000000ffff7fff R11: 0000000000000000 R12: ffff8ac7800558c8 [ 11.163442] R13: ffffffff993f8880 R14: 00007f0dfa200000 R15: ffffed85453e0000 [ 11.163443] FS: 00007f0d731fc700(0000) GS:ffff8acba9400000(0000) knlGS:0000000000000000 [ 11.163445] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 11.163448] CR2: 00007f0e65e00028 CR3: 0000000108d50003 CR4: 0000000000370ef0 [ 11.163452] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [ 11.163453] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [ 11.163455] Call Trace: [ 11.163468] hugetlb_mcopy_atomic_pte+0xcb/0x450 [ 11.163477] mcopy_atomic+0xa08/0xd60 [ 11.163480] ? __might_fault+0x56/0x80 [ 11.163493] userfaultfd_ioctl+0xb18/0xd60 [ 11.163502] __se_sys_ioctl+0x77/0xc0 [ 11.163507] __x64_sys_ioctl+0x1d/0x20 [ 11.163510] do_syscall_64+0x3f/0x80 [ 11.163515] entry_SYSCALL_64_after_hwframe+0x44/0xae [ 11.163519] RIP: 0033:0x45ec87 [ 11.163531] Code: 3c 1c 48 f7 d8 49 39 c4 72 b8 e8 64 63 03 00 85 c0 78 bd 48 83 c4 08 4c 89 e0 5b 41 5c c3 0f 1f 44 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 018 [ 11.163532] RSP: 002b:00007f0d731fc248 EFLAGS: 00000206 ORIG_RAX: 0000000000000010 [ 11.163534] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000045ec87 [ 11.163536] RDX: 00007f0d731fc290 RSI: 00000000c028aa03 RDI: 0000000000000004 [ 11.163537] RBP: 00007f0d731fc270 R08: 00000000004022b3 R09: 00007f0d731fc700 [ 11.163538] R10: 00007f0d731fc9d0 R11: 0000000000000206 R12: 00007fff610cd82e [ 11.163539] R13: 00007fff610cd82f R14: 00007f0d731fc400 R15: 0000000001002000 [ 11.163549] irq event stamp: 722 [ 11.163550] hardirqs last enabled at (721): [<ffffffff967cd41b>] kmem_cache_alloc_trace+0x1db/0x370 [ 11.163558] hardirqs last disabled at (722): [<ffffffff9700c052>] 
_raw_spin_lock_irq+0x32/0x80 [ 11.163560] softirqs last enabled at (130): [<ffffffff9654e0d6>] __irq_exit_rcu+0xf6/0x100 [ 11.163564] softirqs last disabled at (125): [<ffffffff9654e0d6>] __irq_exit_rcu+0xf6/0x100 [ 11.163567] ---[ end trace 358ac5c76c211ea1 ]--- Debugging further, I find resv_huge_pages underflows by 1 temporarily multiple times during the run of the test, but a __free_huge_page() is always subsequently called that wraps it back to 0. resv_huge_pages is always 0 at the end of the test. I initially looked at this suspecting a problem in the resv_huge_pages accounting, but the resv_huge_pages accounting seems fine in itself: it correctly decrements resv_huge_pages when a page is allocated from a reservation and correctly increments it back up when that page is freed. I'm not that familiar with the userfaultfd/hugetlb code so I was hoping to solicit some suggestions for what the issue could be. Things I've tried so far: - Adding code that prevents resv_huge_pages from underflowing causes the test to fail, so it seems the calling code in this test actually expects to be able to temporarily allocate 1 more page than the VMA has reserved, which seems like it may be a bug? - Modifying hugetlb_mcopy_atomic_pte() to not use reserved pages causes the test to fail again. Doing that and overprovisioning /proc/sys/vm/nr_hugepages causes the test to pass again, but I'm not sure that's right (not familiar with the code). - The failure reproduces as far back as 5.11, so it doesn't seem to be related to any recent changes. Thanks in advance! ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: resv_huge_page underflow with userfaultfd test 2021-05-07 21:21 resv_huge_page underflow with userfaultfd test Mina Almasry @ 2021-05-11 0:33 ` Mike Kravetz 2021-05-11 0:52 ` Mina Almasry 2021-05-12 2:25 ` Mike Kravetz 0 siblings, 2 replies; 29+ messages in thread From: Mike Kravetz @ 2021-05-11 0:33 UTC (permalink / raw) To: Mina Almasry, Linux-MM, Axel Rasmussen, aarcange, peterx On 5/7/21 2:21 PM, Mina Almasry wrote: > Hi folks, > > I ran into a bug that I'm not sure how to solve so I'm wondering if > anyone has suggestions on what the issue could be and how to > investigate. I added the WARN_ON_ONCE() here to catch instances of > resv_huge_pages underflowing: > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 629aa4c2259c..7d763eed650f 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1165,7 +1165,21 @@ static struct page > *dequeue_huge_page_vma(struct hstate *h, > page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask); > if (page && !avoid_reserve && vma_has_reserves(vma, chg)) { > SetHPageRestoreReserve(page); > + WARN_ON_ONCE(!h->resv_huge_pages); > h->resv_huge_pages--; > } > > And ran the userfaultfd selftests like so: > > echo 1024 > /proc/sys/vm/nr_hugepages > mkdir -p /mnt/huge > mount -t hugetlbfs none /mnt/huge > ./tools/testings/selftests/vm/userfaultfd hugetlb_shared 1024 200 > /mnt/huge/userfaultfd_test > > And run into this warning indicating this test does discover an underflow: > > [ 11.163403] ------------[ cut here ]------------ > [ 11.163404] WARNING: CPU: 0 PID: 237 at mm/hugetlb.c:1178 > alloc_huge_page+0x558/0x5a0 > [ 11.163413] Modules linked in: > [ 11.163419] CPU: 0 PID: 237 Comm: userfaultfd Not tainted 5.12.0-dbg-DEV #135 > [ 11.163424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > BIOS 1.14.0-2 04/01/2014 > [ 11.163429] RIP: 0010:alloc_huge_page+0x558/0x5a0 > [ 11.163432] Code: b0 00 0f 85 3d ff ff ff e9 2a ff ff ff be 01 00 > 00 00 48 89 df e8 18 e7 ff ff 48 f7 d8 4c 89 ef 48 89 c6 e8 da d7 ff > ff eb 
8c <0f> 0b 4d 8b 85 c0 00 00 00 e9 95 fd ff ff e8 35 59 84 00 4c > 897 > [ 11.163434] RSP: 0018:ffff94bb0073fc80 EFLAGS: 00010046 > [ 11.163436] RAX: 0000000000000080 RBX: 0000000000000000 RCX: 5fa252c406a76700 > [ 11.163438] RDX: c0000000ffff7fff RSI: 0000000000000004 RDI: > 0000000000017ffd > [ 11.163439] RBP: ffff94bb0073fcf8 R08: 0000000000000000 R09: > ffffffff9813ba70 > [ 11.163440] R10: 00000000ffff7fff R11: 0000000000000000 R12: > ffff8ac7800558c8 > [ 11.163442] R13: ffffffff993f8880 R14: 00007f0dfa200000 R15: > ffffed85453e0000 > [ 11.163443] FS: 00007f0d731fc700(0000) GS:ffff8acba9400000(0000) > knlGS:0000000000000000 > [ 11.163445] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > [ 11.163448] CR2: 00007f0e65e00028 CR3: 0000000108d50003 CR4: > 0000000000370ef0 > [ 11.163452] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > 0000000000000000 > [ 11.163453] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > 0000000000000400 > [ 11.163455] Call Trace: > [ 11.163468] hugetlb_mcopy_atomic_pte+0xcb/0x450 > [ 11.163477] mcopy_atomic+0xa08/0xd60 > [ 11.163480] ? 
__might_fault+0x56/0x80 > [ 11.163493] userfaultfd_ioctl+0xb18/0xd60 > [ 11.163502] __se_sys_ioctl+0x77/0xc0 > [ 11.163507] __x64_sys_ioctl+0x1d/0x20 > [ 11.163510] do_syscall_64+0x3f/0x80 > [ 11.163515] entry_SYSCALL_64_after_hwframe+0x44/0xae > [ 11.163519] RIP: 0033:0x45ec87 > [ 11.163531] Code: 3c 1c 48 f7 d8 49 39 c4 72 b8 e8 64 63 03 00 85 > c0 78 bd 48 83 c4 08 4c 89 e0 5b 41 5c c3 0f 1f 44 00 00 b8 10 00 00 > 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 > 018 > [ 11.163532] RSP: 002b:00007f0d731fc248 EFLAGS: 00000206 ORIG_RAX: > 0000000000000010 > [ 11.163534] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000045ec87 > [ 11.163536] RDX: 00007f0d731fc290 RSI: 00000000c028aa03 RDI: 0000000000000004 > [ 11.163537] RBP: 00007f0d731fc270 R08: 00000000004022b3 R09: 00007f0d731fc700 > [ 11.163538] R10: 00007f0d731fc9d0 R11: 0000000000000206 R12: 00007fff610cd82e > [ 11.163539] R13: 00007fff610cd82f R14: 00007f0d731fc400 R15: 0000000001002000 > [ 11.163549] irq event stamp: 722 > [ 11.163550] hardirqs last enabled at (721): [<ffffffff967cd41b>] > kmem_cache_alloc_trace+0x1db/0x370 > [ 11.163558] hardirqs last disabled at (722): [<ffffffff9700c052>] > _raw_spin_lock_irq+0x32/0x80 > [ 11.163560] softirqs last enabled at (130): [<ffffffff9654e0d6>] > __irq_exit_rcu+0xf6/0x100 > [ 11.163564] softirqs last disabled at (125): [<ffffffff9654e0d6>] > __irq_exit_rcu+0xf6/0x100 > [ 11.163567] ---[ end trace 358ac5c76c211ea1 ]--- > > Debugging further I find the resv_huge_pages underflows by 1 > temporarily during the run of the test multiple times, but a > __free_huge_page() is always subsequently called that overflows it > back to 0. resv_huge_pages is always 0 at the end of the test. 
I've > initially looked at this as I suspected a problem in the > resv_huge_pages accounting, but seems the resv_huge_pages accounting > is fine in itself as it correctly decrements resv_huge_pages when a > page is allocated from reservation and correctly increments it back up > when that page is freed. > > I'm not that familiar with the userfaultfd/hugetlb code so I was > hoping to solicit some suggestions for what the issue could be. Things > I've tried so far: > > - Adding code that prevents resv_huge_pages to underflow causes the > test to fail, so it seems in this test the calling code actually > expects to be able to temporarily allocate 1 more page than the VMA > has reserved, which seems like a bug maybe? > - Modifying hugetlb_mcopy_atomic_pte() to not use reserved pages > causes the test to fail again. Doin that and overprovisioning > /proc/sys/vm/nr_hugepages causes the test to pass again but I'm not > sure that's right (not familiar with the code). > - The failure gets reproduced as far back as 5.11, so it doesn't seem > to be related to any recent changes. Hi Mina, I am fairly confident the issue is with hugetlb_mcopy_atomic_pte. It does not detect/handle the case where a page cache page already exists when in MCOPY_ATOMIC_NORMAL mode. If you add a printk/warning after the failure of huge_add_to_page_cache, these will generally correspond to the underflow. From a reservation POV, if the page exists in the cache the reservation was already consumed. The call to alloc_huge_page will 'consume' another reservation which can lead to the underflow. As you noted, this underflow gets cleaned up in the error path. However, we should prevent it from happening as we do not want anyone making decisions on that underflow value. hugetlb_mcopy_atomic_pte should check for a page in the cache and if it exists use it in MCOPY_ATOMIC_NORMAL. This code is quite tricky and my first simple attempt at this did not work. I am happy to continue working on this. 
However, if you or anyone else wants to jump in and fix it, feel free. -- Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
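To make the suggested direction concrete, the check would sit in hugetlb_mcopy_atomic_pte() ahead of the page allocation. The sketch below is an untested illustration only: it assumes the existing hugetlbfs_pagecache_present() helper in mm/hugetlb.c is usable at this point, and the -EEXIST error choice is a guess, not an agreed-upon fix:

```c
/*
 * Untested sketch: in hugetlb_mcopy_atomic_pte(), bail out before
 * alloc_huge_page() if the page cache already holds a page for this
 * offset -- that page's reservation was consumed when it was added,
 * so allocating another page here double-consumes the reserve.
 */
if (vm_shared &&
    hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
	ret = -EEXIST;	/* error choice is a guess */
	goto out;
}

page = alloc_huge_page(dst_vma, dst_addr, 0);
```

Whether the right behavior is to fail, retry, or map the already-present page is exactly the tricky part noted above.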
* Re: resv_huge_page underflow with userfaultfd test 2021-05-11 0:33 ` Mike Kravetz @ 2021-05-11 0:52 ` Mina Almasry 2021-05-11 6:45 ` Axel Rasmussen 2021-05-12 2:25 ` Mike Kravetz 1 sibling, 1 reply; 29+ messages in thread From: Mina Almasry @ 2021-05-11 0:52 UTC (permalink / raw) To: Mike Kravetz; +Cc: Linux-MM, Axel Rasmussen, aarcange, peterx On Mon, May 10, 2021 at 5:33 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 5/7/21 2:21 PM, Mina Almasry wrote: > > Hi folks, > > > > I ran into a bug that I'm not sure how to solve so I'm wondering if > > anyone has suggestions on what the issue could be and how to > > investigate. I added the WARN_ON_ONCE() here to catch instances of > > resv_huge_pages underflowing: > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > index 629aa4c2259c..7d763eed650f 100644 > > --- a/mm/hugetlb.c > > +++ b/mm/hugetlb.c > > @@ -1165,7 +1165,21 @@ static struct page > > *dequeue_huge_page_vma(struct hstate *h, > > page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask); > > if (page && !avoid_reserve && vma_has_reserves(vma, chg)) { > > SetHPageRestoreReserve(page); > > + WARN_ON_ONCE(!h->resv_huge_pages); > > h->resv_huge_pages--; > > } > > > > And ran the userfaultfd selftests like so: > > > > echo 1024 > /proc/sys/vm/nr_hugepages > > mkdir -p /mnt/huge > > mount -t hugetlbfs none /mnt/huge > > ./tools/testings/selftests/vm/userfaultfd hugetlb_shared 1024 200 > > /mnt/huge/userfaultfd_test > > > > And run into this warning indicating this test does discover an underflow: > > > > [ 11.163403] ------------[ cut here ]------------ > > [ 11.163404] WARNING: CPU: 0 PID: 237 at mm/hugetlb.c:1178 > > alloc_huge_page+0x558/0x5a0 > > [ 11.163413] Modules linked in: > > [ 11.163419] CPU: 0 PID: 237 Comm: userfaultfd Not tainted 5.12.0-dbg-DEV #135 > > [ 11.163424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > > BIOS 1.14.0-2 04/01/2014 > > [ 11.163429] RIP: 0010:alloc_huge_page+0x558/0x5a0 > > [ 11.163432] Code: b0 00 0f 
85 3d ff ff ff e9 2a ff ff ff be 01 00 > > 00 00 48 89 df e8 18 e7 ff ff 48 f7 d8 4c 89 ef 48 89 c6 e8 da d7 ff > > ff eb 8c <0f> 0b 4d 8b 85 c0 00 00 00 e9 95 fd ff ff e8 35 59 84 00 4c > > 897 > > [ 11.163434] RSP: 0018:ffff94bb0073fc80 EFLAGS: 00010046 > > [ 11.163436] RAX: 0000000000000080 RBX: 0000000000000000 RCX: 5fa252c406a76700 > > [ 11.163438] RDX: c0000000ffff7fff RSI: 0000000000000004 RDI: > > 0000000000017ffd > > [ 11.163439] RBP: ffff94bb0073fcf8 R08: 0000000000000000 R09: > > ffffffff9813ba70 > > [ 11.163440] R10: 00000000ffff7fff R11: 0000000000000000 R12: > > ffff8ac7800558c8 > > [ 11.163442] R13: ffffffff993f8880 R14: 00007f0dfa200000 R15: > > ffffed85453e0000 > > [ 11.163443] FS: 00007f0d731fc700(0000) GS:ffff8acba9400000(0000) > > knlGS:0000000000000000 > > [ 11.163445] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > [ 11.163448] CR2: 00007f0e65e00028 CR3: 0000000108d50003 CR4: > > 0000000000370ef0 > > [ 11.163452] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > 0000000000000000 > > [ 11.163453] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > 0000000000000400 > > [ 11.163455] Call Trace: > > [ 11.163468] hugetlb_mcopy_atomic_pte+0xcb/0x450 > > [ 11.163477] mcopy_atomic+0xa08/0xd60 > > [ 11.163480] ? 
__might_fault+0x56/0x80 > > [ 11.163493] userfaultfd_ioctl+0xb18/0xd60 > > [ 11.163502] __se_sys_ioctl+0x77/0xc0 > > [ 11.163507] __x64_sys_ioctl+0x1d/0x20 > > [ 11.163510] do_syscall_64+0x3f/0x80 > > [ 11.163515] entry_SYSCALL_64_after_hwframe+0x44/0xae > > [ 11.163519] RIP: 0033:0x45ec87 > > [ 11.163531] Code: 3c 1c 48 f7 d8 49 39 c4 72 b8 e8 64 63 03 00 85 > > c0 78 bd 48 83 c4 08 4c 89 e0 5b 41 5c c3 0f 1f 44 00 00 b8 10 00 00 > > 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 > > 018 > > [ 11.163532] RSP: 002b:00007f0d731fc248 EFLAGS: 00000206 ORIG_RAX: > > 0000000000000010 > > [ 11.163534] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000045ec87 > > [ 11.163536] RDX: 00007f0d731fc290 RSI: 00000000c028aa03 RDI: 0000000000000004 > > [ 11.163537] RBP: 00007f0d731fc270 R08: 00000000004022b3 R09: 00007f0d731fc700 > > [ 11.163538] R10: 00007f0d731fc9d0 R11: 0000000000000206 R12: 00007fff610cd82e > > [ 11.163539] R13: 00007fff610cd82f R14: 00007f0d731fc400 R15: 0000000001002000 > > [ 11.163549] irq event stamp: 722 > > [ 11.163550] hardirqs last enabled at (721): [<ffffffff967cd41b>] > > kmem_cache_alloc_trace+0x1db/0x370 > > [ 11.163558] hardirqs last disabled at (722): [<ffffffff9700c052>] > > _raw_spin_lock_irq+0x32/0x80 > > [ 11.163560] softirqs last enabled at (130): [<ffffffff9654e0d6>] > > __irq_exit_rcu+0xf6/0x100 > > [ 11.163564] softirqs last disabled at (125): [<ffffffff9654e0d6>] > > __irq_exit_rcu+0xf6/0x100 > > [ 11.163567] ---[ end trace 358ac5c76c211ea1 ]--- > > > > Debugging further I find the resv_huge_pages underflows by 1 > > temporarily during the run of the test multiple times, but a > > __free_huge_page() is always subsequently called that overflows it > > back to 0. resv_huge_pages is always 0 at the end of the test. 
I've > > initially looked at this as I suspected a problem in the > > resv_huge_pages accounting, but seems the resv_huge_pages accounting > > is fine in itself as it correctly decrements resv_huge_pages when a > > page is allocated from reservation and correctly increments it back up > > when that page is freed. > > > > I'm not that familiar with the userfaultfd/hugetlb code so I was > > hoping to solicit some suggestions for what the issue could be. Things > > I've tried so far: > > > > - Adding code that prevents resv_huge_pages to underflow causes the > > test to fail, so it seems in this test the calling code actually > > expects to be able to temporarily allocate 1 more page than the VMA > > has reserved, which seems like a bug maybe? > > - Modifying hugetlb_mcopy_atomic_pte() to not use reserved pages > > causes the test to fail again. Doin that and overprovisioning > > /proc/sys/vm/nr_hugepages causes the test to pass again but I'm not > > sure that's right (not familiar with the code). > > - The failure gets reproduced as far back as 5.11, so it doesn't seem > > to be related to any recent changes. > > Hi Mina, > > I am fairly confident the issue is with hugetlb_mcopy_atomic_pte. It > does not detect/handle the case where a page cache page already exists > when in MCOPY_ATOMIC_NORMAL mode. If you add a printk/warning after the > failure of huge_add_to_page_cache, these will generally correspond to > the underflow. From a reservation POV, if the page exists in the cache > the reservation was already consumed. The call to alloc_huge_page will > 'consume' another reservation which can lead to the underflow. As you > noted, this underflow gets cleaned up in the error path. However, we > should prevent it from happening as we do not want anyone making > decisions on that underflow value. > > hugetlb_mcopy_atomic_pte should check for a page in the cache and if it > exists use it in MCOPY_ATOMIC_NORMAL. 
This code is quite tricky and my > first simple attempt at this did not work. I am happy to continue > working on this. However, if you or anyone else want to jump in and fix > feel free. Thank you! This is the guidance I was looking for. I'll jump in and try to fix it definitely but I'm learning the code, etc and will not be terribly quick about this unfortunately. I will start looking into this nevertheless and take a stab at it. Thanks so much! > -- > Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: resv_huge_page underflow with userfaultfd test 2021-05-11 0:52 ` Mina Almasry @ 2021-05-11 6:45 ` Axel Rasmussen 2021-05-11 7:08 ` Mina Almasry 2021-05-11 16:38 ` Mike Kravetz 0 siblings, 2 replies; 29+ messages in thread From: Axel Rasmussen @ 2021-05-11 6:45 UTC (permalink / raw) To: Mina Almasry; +Cc: Mike Kravetz, Linux-MM, Andrea Arcangeli, Peter Xu Thanks for the investigation, Mike! Mina, since hugetlb_mcopy_atomic_pte is specific to userfaultfd, I'm happy to take a deeper look at it this week as well. For context, we have seen the WARN_ON Mina described trigger in production before, but were never able to reproduce it. The userfaultfd self test turns out to reproduce it reliably, so the thinking up to this point was that it just happened to reproduce some non-userfaultfd-specific issue. But from Mike's description, it seems this bug is very specific to userfaultfd after all. :) On Mon, May 10, 2021 at 5:52 PM Mina Almasry <almasrymina@google.com> wrote: > > On Mon, May 10, 2021 at 5:33 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 5/7/21 2:21 PM, Mina Almasry wrote: > > > Hi folks, > > > > > > I ran into a bug that I'm not sure how to solve so I'm wondering if > > > anyone has suggestions on what the issue could be and how to > > > investigate. 
I added the WARN_ON_ONCE() here to catch instances of > > > resv_huge_pages underflowing: > > > > > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > > > index 629aa4c2259c..7d763eed650f 100644 > > > --- a/mm/hugetlb.c > > > +++ b/mm/hugetlb.c > > > @@ -1165,7 +1165,21 @@ static struct page > > > *dequeue_huge_page_vma(struct hstate *h, > > > page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask); > > > if (page && !avoid_reserve && vma_has_reserves(vma, chg)) { > > > SetHPageRestoreReserve(page); > > > + WARN_ON_ONCE(!h->resv_huge_pages); > > > h->resv_huge_pages--; > > > } > > > > > > And ran the userfaultfd selftests like so: > > > > > > echo 1024 > /proc/sys/vm/nr_hugepages > > > mkdir -p /mnt/huge > > > mount -t hugetlbfs none /mnt/huge > > > ./tools/testings/selftests/vm/userfaultfd hugetlb_shared 1024 200 > > > /mnt/huge/userfaultfd_test > > > > > > And run into this warning indicating this test does discover an underflow: > > > > > > [ 11.163403] ------------[ cut here ]------------ > > > [ 11.163404] WARNING: CPU: 0 PID: 237 at mm/hugetlb.c:1178 > > > alloc_huge_page+0x558/0x5a0 > > > [ 11.163413] Modules linked in: > > > [ 11.163419] CPU: 0 PID: 237 Comm: userfaultfd Not tainted 5.12.0-dbg-DEV #135 > > > [ 11.163424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), > > > BIOS 1.14.0-2 04/01/2014 > > > [ 11.163429] RIP: 0010:alloc_huge_page+0x558/0x5a0 > > > [ 11.163432] Code: b0 00 0f 85 3d ff ff ff e9 2a ff ff ff be 01 00 > > > 00 00 48 89 df e8 18 e7 ff ff 48 f7 d8 4c 89 ef 48 89 c6 e8 da d7 ff > > > ff eb 8c <0f> 0b 4d 8b 85 c0 00 00 00 e9 95 fd ff ff e8 35 59 84 00 4c > > > 897 > > > [ 11.163434] RSP: 0018:ffff94bb0073fc80 EFLAGS: 00010046 > > > [ 11.163436] RAX: 0000000000000080 RBX: 0000000000000000 RCX: 5fa252c406a76700 > > > [ 11.163438] RDX: c0000000ffff7fff RSI: 0000000000000004 RDI: > > > 0000000000017ffd > > > [ 11.163439] RBP: ffff94bb0073fcf8 R08: 0000000000000000 R09: > > > ffffffff9813ba70 > > > [ 11.163440] R10: 00000000ffff7fff 
R11: 0000000000000000 R12: > > > ffff8ac7800558c8 > > > [ 11.163442] R13: ffffffff993f8880 R14: 00007f0dfa200000 R15: > > > ffffed85453e0000 > > > [ 11.163443] FS: 00007f0d731fc700(0000) GS:ffff8acba9400000(0000) > > > knlGS:0000000000000000 > > > [ 11.163445] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 > > > [ 11.163448] CR2: 00007f0e65e00028 CR3: 0000000108d50003 CR4: > > > 0000000000370ef0 > > > [ 11.163452] DR0: 0000000000000000 DR1: 0000000000000000 DR2: > > > 0000000000000000 > > > [ 11.163453] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: > > > 0000000000000400 > > > [ 11.163455] Call Trace: > > > [ 11.163468] hugetlb_mcopy_atomic_pte+0xcb/0x450 > > > [ 11.163477] mcopy_atomic+0xa08/0xd60 > > > [ 11.163480] ? __might_fault+0x56/0x80 > > > [ 11.163493] userfaultfd_ioctl+0xb18/0xd60 > > > [ 11.163502] __se_sys_ioctl+0x77/0xc0 > > > [ 11.163507] __x64_sys_ioctl+0x1d/0x20 > > > [ 11.163510] do_syscall_64+0x3f/0x80 > > > [ 11.163515] entry_SYSCALL_64_after_hwframe+0x44/0xae > > > [ 11.163519] RIP: 0033:0x45ec87 > > > [ 11.163531] Code: 3c 1c 48 f7 d8 49 39 c4 72 b8 e8 64 63 03 00 85 > > > c0 78 bd 48 83 c4 08 4c 89 e0 5b 41 5c c3 0f 1f 44 00 00 b8 10 00 00 > > > 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 > > > 018 > > > [ 11.163532] RSP: 002b:00007f0d731fc248 EFLAGS: 00000206 ORIG_RAX: > > > 0000000000000010 > > > [ 11.163534] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000045ec87 > > > [ 11.163536] RDX: 00007f0d731fc290 RSI: 00000000c028aa03 RDI: 0000000000000004 > > > [ 11.163537] RBP: 00007f0d731fc270 R08: 00000000004022b3 R09: 00007f0d731fc700 > > > [ 11.163538] R10: 00007f0d731fc9d0 R11: 0000000000000206 R12: 00007fff610cd82e > > > [ 11.163539] R13: 00007fff610cd82f R14: 00007f0d731fc400 R15: 0000000001002000 > > > [ 11.163549] irq event stamp: 722 > > > [ 11.163550] hardirqs last enabled at (721): [<ffffffff967cd41b>] > > > kmem_cache_alloc_trace+0x1db/0x370 > > > [ 11.163558] hardirqs last disabled at (722): 
[<ffffffff9700c052>] > > > _raw_spin_lock_irq+0x32/0x80 > > > [ 11.163560] softirqs last enabled at (130): [<ffffffff9654e0d6>] > > > __irq_exit_rcu+0xf6/0x100 > > > [ 11.163564] softirqs last disabled at (125): [<ffffffff9654e0d6>] > > > __irq_exit_rcu+0xf6/0x100 > > > [ 11.163567] ---[ end trace 358ac5c76c211ea1 ]--- > > > > > > Debugging further I find the resv_huge_pages underflows by 1 > > > temporarily during the run of the test multiple times, but a > > > __free_huge_page() is always subsequently called that overflows it > > > back to 0. resv_huge_pages is always 0 at the end of the test. I've > > > initially looked at this as I suspected a problem in the > > > resv_huge_pages accounting, but seems the resv_huge_pages accounting > > > is fine in itself as it correctly decrements resv_huge_pages when a > > > page is allocated from reservation and correctly increments it back up > > > when that page is freed. > > > > > > I'm not that familiar with the userfaultfd/hugetlb code so I was > > > hoping to solicit some suggestions for what the issue could be. Things > > > I've tried so far: > > > > > > - Adding code that prevents resv_huge_pages to underflow causes the > > > test to fail, so it seems in this test the calling code actually > > > expects to be able to temporarily allocate 1 more page than the VMA > > > has reserved, which seems like a bug maybe? > > > - Modifying hugetlb_mcopy_atomic_pte() to not use reserved pages > > > causes the test to fail again. Doin that and overprovisioning > > > /proc/sys/vm/nr_hugepages causes the test to pass again but I'm not > > > sure that's right (not familiar with the code). > > > - The failure gets reproduced as far back as 5.11, so it doesn't seem > > > to be related to any recent changes. > > > > Hi Mina, > > > > I am fairly confident the issue is with hugetlb_mcopy_atomic_pte. It > > does not detect/handle the case where a page cache page already exists > > when in MCOPY_ATOMIC_NORMAL mode. 
If you add a printk/warning after the > > failure of huge_add_to_page_cache, these will generally correspond to > > the underflow. From a reservation POV, if the page exists in the cache > > the reservation was already consumed. The call to alloc_huge_page will > > 'consume' another reservation which can lead to the underflow. As you > > noted, this underflow gets cleaned up in the error path. However, we > > should prevent it from happening as we do not want anyone making > > decisions on that underflow value. > > > > hugetlb_mcopy_atomic_pte should check for a page in the cache and if it > > exists use it in MCOPY_ATOMIC_NORMAL. This code is quite tricky and my > > first simple attempt at this did not work. I am happy to continue > > working on this. However, if you or anyone else want to jump in and fix > > feel free. > > Thank you! This is the guidance I was looking for. I'll jump in and > try to fix it definitely but I'm learning the code, etc and will not > be terribly quick about this unfortunately. I will start looking into > this nevertheless and take a stab at it. > > Thanks so much! > > > -- > > Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: resv_huge_page underflow with userfaultfd test 2021-05-11 6:45 ` Axel Rasmussen @ 2021-05-11 7:08 ` Mina Almasry 0 siblings, 0 replies; 29+ messages in thread From: Mina Almasry @ 2021-05-11 7:08 UTC (permalink / raw) To: Axel Rasmussen; +Cc: Mike Kravetz, Linux-MM, Andrea Arcangeli, Peter Xu On Mon, May 10, 2021 at 11:46 PM Axel Rasmussen <axelrasmussen@google.com> wrote: > > Thanks for the investigation, Mike! > > Mina, since hugetlb_mcopy_atomic_pte is specific to userfaultfd, I'm > happy to take a deeper look at it this week as well. > > For context, we have seen the WARN_ON Mina described trigger in > production before, but were never able to reproduce it. The > userfaultfd self test turns out to reproduce it reliably, so the > thinking up to this point was that it just happened to reproduce some > non-userfaultfd-specific issue. But from Mike's description, it seems > this bug is very specific to userfaultfd after all. :) > That's fine, I'll take an initial look at least. If I end up being roped into userfaultfd stuff way over my head, I'll solicit your help or punt it over to you. Thanks! ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: resv_huge_page underflow with userfaultfd test 2021-05-11 6:45 ` Axel Rasmussen 2021-05-11 7:08 ` Mina Almasry @ 2021-05-11 16:38 ` Mike Kravetz 2021-05-11 19:08 ` Mina Almasry 1 sibling, 1 reply; 29+ messages in thread From: Mike Kravetz @ 2021-05-11 16:38 UTC (permalink / raw) To: Axel Rasmussen, Mina Almasry; +Cc: Linux-MM, Andrea Arcangeli, Peter Xu On 5/10/21 11:45 PM, Axel Rasmussen wrote: > Thanks for the investigation, Mike! > > Mina, since hugetlb_mcopy_atomic_pte is specific to userfaultfd, I'm > happy to take a deeper look at it this week as well. > > For context, we have seen the WARN_ON Mina described trigger in > production before, but were never able to reproduce it. The > userfaultfd self test turns out to reproduce it reliably, so the > thinking up to this point was that it just happened to reproduce some > non-userfaultfd-specific issue. But from Mike's description, it seems > this bug is very specific to userfaultfd after all. :) Certainly, this case is userfaultfd specific. However, I too recall seeing 'transient' underflows in the past. Pretty sure this was not userfaultfd specific. Specifically, when working on commit 22146c3ce989 "hugetlbfs: dirty pages as they are added to pagecache" I recall seeing transient underflow. After fixing the issue in 22146c3ce989, I could not reproduce transient underflows and stopped looking for the cause. We added code to a production kernel in an attempt to catch the issue: https://github.com/oracle/linux-uek/commit/bd697676290d91762ef2bf79832f653b44e6f83b#diff-fb6066ca63d9afdc2e3660c85e2dcc04cf31b37900ca9df2a1019ee8fa80dce0 I'll try running some non-userfaultfd specific tests to see if I can reproduce. -- Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: resv_huge_page underflow with userfaultfd test 2021-05-11 16:38 ` Mike Kravetz @ 2021-05-11 19:08 ` Mina Almasry 0 siblings, 0 replies; 29+ messages in thread From: Mina Almasry @ 2021-05-11 19:08 UTC (permalink / raw) To: Mike Kravetz; +Cc: Axel Rasmussen, Linux-MM, Andrea Arcangeli, Peter Xu On Tue, May 11, 2021 at 9:38 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 5/10/21 11:45 PM, Axel Rasmussen wrote: > > Thanks for the investigation, Mike! > > > > Mina, since hugetlb_mcopy_atomic_pte is specific to userfaultfd, I'm > > happy to take a deeper look at it this week as well. > > > > For context, we have seen the WARN_ON Mina described trigger in > > production before, but were never able to reproduce it. The > > userfaultfd self test turns out to reproduce it reliably, so the > > thinking up to this point was that it just happened to reproduce some > > non-userfaultfd-specific issue. But from Mike's description, it seems > > this bug is very specific to userfaultfd after all. :) > > Certainly, this case is userfaultfd specific. > > However, I too recall seeing 'transient' underflows in the past. Pretty > sure this was not userfaultfd specific. Specifically, when working on > commit 22146c3ce989 "hugetlbfs: dirty pages as they are added to pagecache" > I recall seeing transient underflow. After fixing the issue in 22146c3ce989, > I could not reproduce transient underflows and stopped looking for the > cause. We added code to a production kernel in an attmempt to catch > the issue: > https://github.com/oracle/linux-uek/commit/bd697676290d91762ef2bf79832f653b44e6f83b#diff-fb6066ca63d9afdc2e3660c85e2dcc04cf31b37900ca9df2a1019ee8fa80dce0 > We also have local changes that detect resv_huge_pages underflowing, and they trigger in production every few months and we don't use userfaultfd in production. We haven't been able to reproduce these issues though, and we run an older version of the kernel so I haven't had anything useful to share until now. 
I'm (quite tentatively, to be honest) considering proposing adding some of these WARN_ON_ONCE() checks to the kernel after I fix this issue and soliciting reproducers from the community to fix these underflows. I'm hesitant because I'm not confident I can fix the ensuing bug reports in a timely manner, and the volume of reports may be large. But we can cross that bridge when this issue is figured out :) > I'll try running some non-userfaultfd specific tests to see if I can > reproduce. > -- > Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: resv_huge_page underflow with userfaultfd test 2021-05-11 0:33 ` Mike Kravetz 2021-05-11 0:52 ` Mina Almasry @ 2021-05-12 2:25 ` Mike Kravetz 2021-05-12 2:35 ` Peter Xu 1 sibling, 1 reply; 29+ messages in thread From: Mike Kravetz @ 2021-05-12 2:25 UTC (permalink / raw) To: Mina Almasry, Linux-MM, Axel Rasmussen, aarcange, peterx On 5/10/21 5:33 PM, Mike Kravetz wrote: > On 5/7/21 2:21 PM, Mina Almasry wrote: >> I ran into a bug that I'm not sure how to solve so I'm wondering if >> anyone has suggestions on what the issue could be and how to >> investigate. I added the WARN_ON_ONCE() here to catch instances of >> resv_huge_pages underflowing: >> > > I am fairly confident the issue is with hugetlb_mcopy_atomic_pte. It > does not detect/handle the case where a page cache page already exists > when in MCOPY_ATOMIC_NORMAL mode. If you add a printk/warning after the > failure of huge_add_to_page_cache, these will generally correspond to > the underflow. From a reservation POV, if the page exists in the cache > the reservation was already consumed. The call to alloc_huge_page will > 'consume' another reservation which can lead to the underflow. As you > noted, this underflow gets cleaned up in the error path. However, we > should prevent it from happening as we do not want anyone making > decisions on that underflow value. > > hugetlb_mcopy_atomic_pte should check for a page in the cache and if it > exists use it in MCOPY_ATOMIC_NORMAL. This code is quite tricky and my > first simple attempt at this did not work. I am happy to continue > working on this. However, if you or anyone else wants to jump in and > fix it, feel free. I looked at this a bit more today and am not exactly sure of the expected behavior.
The situation is: - UFFDIO_COPY is called for hugetlb mapping - the dest address is in a shared mapping - there is a page in the cache associated with the address in the shared mapping Currently, the code will fail when trying to update the page cache as the entry already exists. The shm code appears to do the same. Quick question. Is this the expected behavior? Or, would you expect the UFFDIO_COPY to update the page in the page cache, and then resolve the fault/update the pte? -- Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: resv_huge_page underflow with userfaultfd test 2021-05-12 2:25 ` Mike Kravetz @ 2021-05-12 2:35 ` Peter Xu 2021-05-12 3:06 ` Mike Kravetz 0 siblings, 1 reply; 29+ messages in thread From: Peter Xu @ 2021-05-12 2:35 UTC (permalink / raw) To: Mike Kravetz; +Cc: Mina Almasry, Linux-MM, Axel Rasmussen, aarcange Mike, On Tue, May 11, 2021 at 07:25:39PM -0700, Mike Kravetz wrote: > I looked at this a bit more today and am not exactly sure of the > expected behavior. The situation is: > - UFFDIO_COPY is called for hugetlb mapping > - the dest address is in a shared mapping > - there is a page in the cache associated with the address in the > shared mapping > > Currently, the code will fail when trying to update the page cache as > the entry already exists. The shm code appears to do the same. > > Quick question. Is this the expected behavior? Or, would you expect > the UFFDIO_COPY to update the page in the page cache, and then resolve > the fault/update the pte? AFAICT that's the expected behavior, and it needs to be like that to avoid silent data corruption (if the page cache entry exists, the page is not "missing" at all, so it is not suitable for UFFDIO_COPY, which is only used for the uffd page-missing case). Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: resv_huge_page underflow with userfaultfd test 2021-05-12 2:35 ` Peter Xu @ 2021-05-12 3:06 ` Mike Kravetz [not found] ` <20210512065813.89270-1-almasrymina@google.com> 0 siblings, 1 reply; 29+ messages in thread From: Mike Kravetz @ 2021-05-12 3:06 UTC (permalink / raw) To: Peter Xu; +Cc: Mina Almasry, Linux-MM, Axel Rasmussen, aarcange On 5/11/21 7:35 PM, Peter Xu wrote: > Mike, > > On Tue, May 11, 2021 at 07:25:39PM -0700, Mike Kravetz wrote: >> I looked at this a bit more today and am not exactly sure of the >> expected behavior. The situation is: >> - UFFDIO_COPY is called for hugetlb mapping >> - the dest address is in a shared mapping >> - there is a page in the cache associated with the address in the >> shared mapping >> >> Currently, the code will fail when trying to update the page cache as >> the entry already exists. The shm code appears to do the same. >> >> Quick question. Is this the expected behavior? Or, would you expect >> the UFFDIO_COPY to update the page in the page cache, and then resolve >> the fault/update the pte? > > AFAICT that's the expected behavior, and it needs to be like that to avoid > silent data corruption (if the page cache entry exists, the page is not > "missing" at all, so it is not suitable for UFFDIO_COPY, which is only used > for the uffd page-missing case). Thanks, Thanks Peter. Making it fail in that case is pretty straightforward. BTW, in this case the page was not in the cache at the time of the original fault, which is why it is being handled as missing by userfaultfd. By the time the UFFDIO_COPY is requested, a page has been added to the cache. -- Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <20210512065813.89270-1-almasrymina@google.com>]
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY [not found] ` <20210512065813.89270-1-almasrymina@google.com> @ 2021-05-12 7:44 ` Mina Almasry [not found] ` <CAJHvVch0ZMapPVEc0Ge5V4KDgNDNhECbqwDi0y9XxsxFXQZ-gg@mail.gmail.com> 1 sibling, 0 replies; 29+ messages in thread From: Mina Almasry @ 2021-05-12 7:44 UTC (permalink / raw) To: Andrew Morton, Linux-MM, open list; +Cc: Mike Kravetz, Axel Rasmussen, Peter Xu On Tue, May 11, 2021 at 11:58 PM Mina Almasry <almasrymina@google.com> wrote: > > When hugetlb_mcopy_atomic_pte() is called with: > - mode==MCOPY_ATOMIC_NORMAL and, > - we already have a page in the page cache corresponding to the > associated address, > > We will allocate a huge page from the reserves, and then fail to insert it > into the cache and return -EEXIST. > > In this case, we need to return -EEXIST without allocating a new page as > the page already exists in the cache. Allocating the extra page causes > the resv_huge_pages to underflow temporarily until the extra page is > freed. > > Also, add the warning so that future similar instances of resv_huge_pages > underflowing will be caught. > > Also, minor drive-by cleanups to this code path: > - pagep is an out param never set by calling code, so delete code > assuming there could be a valid page in there. > - use hugetlbfs_pagecache_page() instead of repeating its > implementation. > > Tested using: > ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 1024 200 \ > /mnt/huge > > Test passes, and dmesg shows no underflow warnings. 
> > Signed-off-by: Mina Almasry <almasrymina@google.com> > Cc: Mike Kravetz <mike.kravetz@oracle.com> > Cc: Axel Rasmussen <axelrasmussen@google.com> > Cc: Peter Xu <peterx@redhat.com> > > --- > mm/hugetlb.c | 33 ++++++++++++++++++++------------- > 1 file changed, 20 insertions(+), 13 deletions(-) > > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 629aa4c2259c..40f4ad1bca29 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -1165,6 +1165,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, > page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask); > if (page && !avoid_reserve && vma_has_reserves(vma, chg)) { > SetHPageRestoreReserve(page); > + WARN_ON_ONCE(!h->resv_huge_pages); > h->resv_huge_pages--; > } > > @@ -4868,30 +4869,39 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > struct page **pagep) > { > bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE); > - struct address_space *mapping; > - pgoff_t idx; > + struct hstate *h = hstate_vma(dst_vma); > + struct address_space *mapping = dst_vma->vm_file->f_mapping; > + pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); > unsigned long size; > int vm_shared = dst_vma->vm_flags & VM_SHARED; > - struct hstate *h = hstate_vma(dst_vma); > pte_t _dst_pte; > spinlock_t *ptl; > - int ret; > + int ret = -ENOMEM; > struct page *page; > int writable; > > - mapping = dst_vma->vm_file->f_mapping; > - idx = vma_hugecache_offset(h, dst_vma, dst_addr); > + /* Out parameter. */ > + WARN_ON(*pagep); > > if (is_continue) { > ret = -EFAULT; > - page = find_lock_page(mapping, idx); > + page = hugetlbfs_pagecache_page(h, dst_vma, dst_addr); > if (!page) > goto out; > - } else if (!*pagep) { > - ret = -ENOMEM; > + } else { > + /* If a page already exists, then it's UFFDIO_COPY for > + * a non-missing case. Return -EEXIST. 
> + */ > + if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) { > + ret = -EEXIST; > + goto out; > + } > + > page = alloc_huge_page(dst_vma, dst_addr, 0); > - if (IS_ERR(page)) > + if (IS_ERR(page)) { > + ret = -ENOMEM; > goto out; > + } > > ret = copy_huge_page_from_user(page, > (const void __user *) src_addr, > @@ -4904,9 +4914,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > /* don't free the page */ > goto out; > } > - } else { > - page = *pagep; > - *pagep = NULL; > } > > /* > -- > 2.31.1.607.g51e8a6a459-goog I just realized I missed CCing Andrew and the mailing lists to this patch's review. I'll collect review comments from folks and send a v2 to the correct folks and mailing lists. ^ permalink raw reply [flat|nested] 29+ messages in thread
[parent not found: <CAJHvVch0ZMapPVEc0Ge5V4KDgNDNhECbqwDi0y9XxsxFXQZ-gg@mail.gmail.com>]
[parent not found: <c455d241-11f6-95a6-eb29-0ddd94eedbd7@oracle.com>]
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY [not found] ` <c455d241-11f6-95a6-eb29-0ddd94eedbd7@oracle.com> @ 2021-05-12 19:42 ` Mina Almasry 2021-05-12 20:14 ` Peter Xu 0 siblings, 1 reply; 29+ messages in thread From: Mina Almasry @ 2021-05-12 19:42 UTC (permalink / raw) To: Mike Kravetz, Andrew Morton, Linux-MM, open list; +Cc: Axel Rasmussen, Peter Xu On Wed, May 12, 2021 at 10:22 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 5/12/21 8:53 AM, Axel Rasmussen wrote: > > On Tue, May 11, 2021 at 11:58 PM Mina Almasry <almasrymina@google.com> wrote: > >> > >> When hugetlb_mcopy_atomic_pte() is called with: > >> - mode==MCOPY_ATOMIC_NORMAL and, > >> - we already have a page in the page cache corresponding to the > >> associated address, > >> > >> We will allocate a huge page from the reserves, and then fail to insert it > >> into the cache and return -EEXIST. > >> > >> In this case, we need to return -EEXIST without allocating a new page as > >> the page already exists in the cache. Allocating the extra page causes > >> the resv_huge_pages to underflow temporarily until the extra page is > >> freed. > >> > >> Also, add the warning so that future similar instances of resv_huge_pages > >> underflowing will be caught. > >> > >> Also, minor drive-by cleanups to this code path: > >> - pagep is an out param never set by calling code, so delete code > >> assuming there could be a valid page in there. > >> - use hugetlbfs_pagecache_page() instead of repeating its > >> implementation. > >> > >> Tested using: > >> ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 1024 200 \ > >> /mnt/huge > >> > >> Test passes, and dmesg shows no underflow warnings. 
> >> > >> Signed-off-by: Mina Almasry <almasrymina@google.com> > >> Cc: Mike Kravetz <mike.kravetz@oracle.com> > >> Cc: Axel Rasmussen <axelrasmussen@google.com> > >> Cc: Peter Xu <peterx@redhat.com> > >> > >> --- > >> mm/hugetlb.c | 33 ++++++++++++++++++++------------- > >> 1 file changed, 20 insertions(+), 13 deletions(-) > >> > >> diff --git a/mm/hugetlb.c b/mm/hugetlb.c > >> index 629aa4c2259c..40f4ad1bca29 100644 > >> --- a/mm/hugetlb.c > >> +++ b/mm/hugetlb.c > >> @@ -1165,6 +1165,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, > >> page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask); > >> if (page && !avoid_reserve && vma_has_reserves(vma, chg)) { > >> SetHPageRestoreReserve(page); > >> + WARN_ON_ONCE(!h->resv_huge_pages); > > This warning catches underflow in a relatively specific case. In a > previous email, you mentioned that you have seen underflow on production > systems several times. Was it this specific case? I am also assuming > that the underflow you saw was transient and corrected itself. The > value did not remain negative? > > As mentioned above, this warning only catches the specific case where > resv_huge_pages goes negative in this routine. If we believe this is > possible, then there are likely more cases where resv_huge_pages is simply > decremented when it should not be. For example: resv_huge_pages temporarily > goes to 2034 from 2035 when it should not. Also, there are several > other places where resv_huge_pages could incorrectly be modified and go > negative. > My only motivation for adding this particular warning is to make sure this particular issue remains fixed and doesn't get re-introduced in the future. If that's not too useful then I can remove it, no problem; I'm not too attached to it or anything. > I would prefer not to add this warning unless you have seen this > specific case in production or some other environments. If so, then > please add the specifics.
I am not opposed to adding warnings or code to > detect underflow or other accounting issues. We just need to make sure > they are likely to provide useful data. > I've actually looked at all the resv_huge_pages underflow issues we have internally, and upon a closer look I find that they are all on kernels so old they don't have 1b1edc140dc7 ("hugetlbfs: dirty pages as they are added to pagecache") or any of the others patches that fixed resv_huge_pages issues recently. I can't seem to find instances new enough that they would be useful to look at, so I take back what I said earlier. If any underflow issues pop up on our newer kernels I'll bring this up again, but for now, it seems it's just this issue related to userfaultfd. Sorry for the noise :( > >> h->resv_huge_pages--; > >> } > >> > >> @@ -4868,30 +4869,39 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > >> struct page **pagep) > >> { > >> bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE); > >> - struct address_space *mapping; > >> - pgoff_t idx; > >> + struct hstate *h = hstate_vma(dst_vma); > >> + struct address_space *mapping = dst_vma->vm_file->f_mapping; > >> + pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); > >> unsigned long size; > >> int vm_shared = dst_vma->vm_flags & VM_SHARED; > >> - struct hstate *h = hstate_vma(dst_vma); > >> pte_t _dst_pte; > >> spinlock_t *ptl; > >> - int ret; > >> + int ret = -ENOMEM; > >> struct page *page; > >> int writable; > >> > >> - mapping = dst_vma->vm_file->f_mapping; > >> - idx = vma_hugecache_offset(h, dst_vma, dst_addr); > >> + /* Out parameter. */ > >> + WARN_ON(*pagep); > > > > I don't think this warning works, because we do set *pagep, in the > > copy_huge_page_from_user failure case. In that case, the following > > happens: > > > > 1. We set *pagep, and return immediately. > > 2. Our caller notices this particular error, drops mmap_lock, and then > > calls us again with *pagep set. 
> > > > In this path, we're supposed to just re-use this existing *pagep > > instead of allocating a second new page. > > > > I think this also means we need to keep the "else" case where *pagep > > is set below. > > > > +1 to Peter's comment. > Gah, sorry about that. I'll fix in v2. > -- > Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-12 19:42 ` Mina Almasry @ 2021-05-12 20:14 ` Peter Xu 2021-05-12 21:31 ` Mike Kravetz 0 siblings, 1 reply; 29+ messages in thread From: Peter Xu @ 2021-05-12 20:14 UTC (permalink / raw) To: Mina Almasry Cc: Mike Kravetz, Andrew Morton, Linux-MM, open list, Axel Rasmussen Mina, On Wed, May 12, 2021 at 12:42:32PM -0700, Mina Almasry wrote: > > >> @@ -4868,30 +4869,39 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > > >> struct page **pagep) > > >> { > > >> bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE); > > >> - struct address_space *mapping; > > >> - pgoff_t idx; > > >> + struct hstate *h = hstate_vma(dst_vma); > > >> + struct address_space *mapping = dst_vma->vm_file->f_mapping; > > >> + pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); > > >> unsigned long size; > > >> int vm_shared = dst_vma->vm_flags & VM_SHARED; > > >> - struct hstate *h = hstate_vma(dst_vma); > > >> pte_t _dst_pte; > > >> spinlock_t *ptl; > > >> - int ret; > > >> + int ret = -ENOMEM; > > >> struct page *page; > > >> int writable; > > >> > > >> - mapping = dst_vma->vm_file->f_mapping; > > >> - idx = vma_hugecache_offset(h, dst_vma, dst_addr); > > >> + /* Out parameter. */ > > >> + WARN_ON(*pagep); > > > > > > I don't think this warning works, because we do set *pagep, in the > > > copy_huge_page_from_user failure case. In that case, the following > > > happens: > > > > > > 1. We set *pagep, and return immediately. > > > 2. Our caller notices this particular error, drops mmap_lock, and then > > > calls us again with *pagep set. > > > > > > In this path, we're supposed to just re-use this existing *pagep > > > instead of allocating a second new page. > > > > > > I think this also means we need to keep the "else" case where *pagep > > > is set below. > > > > > > > +1 to Peter's comment. > > > > Gah, sorry about that. I'll fix in v2. 
I have a question regarding v1: how do you guarantee huge_add_to_page_cache() won't fail again even if checked before page alloc? Say, what if the page cache got inserted after hugetlbfs_pagecache_present() (which is newly added in your v1) but before huge_add_to_page_cache()? I also have a feeling that we've been trying to work around something else, but I can't tell yet as I'll probably need to read a bit more/better on how hugetlb does the accounting and also on reservations. Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-12 20:14 ` Peter Xu @ 2021-05-12 21:31 ` Mike Kravetz 2021-05-12 21:52 ` Mina Almasry 0 siblings, 1 reply; 29+ messages in thread From: Mike Kravetz @ 2021-05-12 21:31 UTC (permalink / raw) To: Peter Xu, Mina Almasry; +Cc: Andrew Morton, Linux-MM, open list, Axel Rasmussen On 5/12/21 1:14 PM, Peter Xu wrote: > On Wed, May 12, 2021 at 12:42:32PM -0700, Mina Almasry wrote: >>>>> @@ -4868,30 +4869,39 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, >>>>> + WARN_ON(*pagep); >>>> >>>> I don't think this warning works, because we do set *pagep, in the >>>> copy_huge_page_from_user failure case. In that case, the following >>>> happens: >>>> >>>> 1. We set *pagep, and return immediately. >>>> 2. Our caller notices this particular error, drops mmap_lock, and then >>>> calls us again with *pagep set. >>>> >>>> In this path, we're supposed to just re-use this existing *pagep >>>> instead of allocating a second new page. >>>> >>>> I think this also means we need to keep the "else" case where *pagep >>>> is set below. >>>> >>> >>> +1 to Peter's comment. >>> Apologies to Axel (and Peter) as that comment was from Axel. >> >> Gah, sorry about that. I'll fix in v2. > > I have a question regarding v1: how do you guarantee huge_add_to_page_cache() > won't fail again even if checked before page alloc? Say, what if the page > cache got inserted after hugetlbfs_pagecache_present() (which is newly added in > your v1) but before huge_add_to_page_cache()? In the caller (__mcopy_atomic_hugetlb) we obtain the hugetlb fault mutex before calling this routine. This should prevent changes to the cache while in the routine. However, things get complicated in the case where copy_huge_page_from_user fails. In this case, we will return to the caller which will drop mmap_lock and the hugetlb fault mutex before doing the copy. After dropping the mutex, someone could populate the cache. 
This would result in the same situation where two reserves are 'temporarily' consumed for the same mapping offset. By the time we get to the second call to hugetlb_mcopy_atomic_pte where the previously allocated page is passed in, it is too late. -- Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-12 21:31 ` Mike Kravetz @ 2021-05-12 21:52 ` Mina Almasry 0 siblings, 0 replies; 29+ messages in thread From: Mina Almasry @ 2021-05-12 21:52 UTC (permalink / raw) To: Mike Kravetz; +Cc: Peter Xu, Andrew Morton, Linux-MM, open list, Axel Rasmussen On Wed, May 12, 2021 at 2:31 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 5/12/21 1:14 PM, Peter Xu wrote: > > On Wed, May 12, 2021 at 12:42:32PM -0700, Mina Almasry wrote: > >>>>> @@ -4868,30 +4869,39 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > >>>>> + WARN_ON(*pagep); > >>>> > >>>> I don't think this warning works, because we do set *pagep, in the > >>>> copy_huge_page_from_user failure case. In that case, the following > >>>> happens: > >>>> > >>>> 1. We set *pagep, and return immediately. > >>>> 2. Our caller notices this particular error, drops mmap_lock, and then > >>>> calls us again with *pagep set. > >>>> > >>>> In this path, we're supposed to just re-use this existing *pagep > >>>> instead of allocating a second new page. > >>>> > >>>> I think this also means we need to keep the "else" case where *pagep > >>>> is set below. > >>>> > >>> > >>> +1 to Peter's comment. > >>> > > Apologies to Axel (and Peter) as that comment was from Axel. > > >> > >> Gah, sorry about that. I'll fix in v2. > > > > I have a question regarding v1: how do you guarantee huge_add_to_page_cache() > > won't fail again even if checked before page alloc? Say, what if the page > > cache got inserted after hugetlbfs_pagecache_present() (which is newly added in > > your v1) but before huge_add_to_page_cache()? > > In the caller (__mcopy_atomic_hugetlb) we obtain the hugetlb fault mutex > before calling this routine. This should prevent changes to the cache > while in the routine. > > However, things get complicated in the case where copy_huge_page_from_user > fails. 
In this case, we will return to the caller which will drop mmap_lock > and the hugetlb fault mutex before doing the copy. After dropping the > mutex, someone could populate the cache. This would result in the same > situation where two reserves are 'temporarily' consumed for the same > mapping offset. By the time we get to the second call to > hugetlb_mcopy_atomic_pte where the previously allocated page is passed > in, it is too late. > Thanks. I tried locally to allocate a page, then add it into the cache, *then* copy its contents (dropping that lock if that fails). That also has the test passing, but I'm not sure if I'm causing a fire somewhere else by having a page in the cache that has uninitialized contents. The only other code that checks the cache seems to be the hugetlb_fault/hugetlb_cow code. I'm reading that code to try to understand if I'm breaking that code doing this. > -- > Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY @ 2021-05-13 23:43 Mina Almasry 2021-05-13 23:49 ` Mina Almasry 0 siblings, 1 reply; 29+ messages in thread From: Mina Almasry @ 2021-05-13 23:43 UTC (permalink / raw) Cc: Mina Almasry, Axel Rasmussen, Peter Xu, linux-mm, Mike Kravetz, Andrew Morton, linux-kernel When hugetlb_mcopy_atomic_pte() is called with: - mode==MCOPY_ATOMIC_NORMAL and, - we already have a page in the page cache corresponding to the associated address, We will allocate a huge page from the reserves, and then fail to insert it into the cache and return -EEXIST. In this case, we need to return -EEXIST without allocating a new page as the page already exists in the cache. Allocating the extra page causes the resv_huge_pages to underflow temporarily until the extra page is freed. To fix this, we check whether a page already exists in the cache and, if not, allocate a new page and insert it into the cache immediately while holding the lock. After that we copy the contents into the page. As a side effect of this, pages may exist in the cache for which the copy failed, and for these pages PageUptodate(page) == false. Modify code that queries the cache to handle this correctly. Tested using: ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 1024 200 \ /mnt/huge Test passes, and dmesg shows no underflow warnings.
Signed-off-by: Mina Almasry <almasrymina@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: linux-mm@kvack.org Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org --- fs/hugetlbfs/inode.c | 2 +- mm/hugetlb.c | 109 +++++++++++++++++++++++-------------------- 2 files changed, 60 insertions(+), 51 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index a2a42335e8fd..cc027c335242 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -346,7 +346,7 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to) /* Find the page */ page = find_lock_page(mapping, index); - if (unlikely(page == NULL)) { + if (unlikely(page == NULL || !PageUptodate(page))) { /* * We have a HOLE, zero out the user-buffer for the * length of the hole or request. diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 629aa4c2259c..a5a5fbf7ac25 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -4543,7 +4543,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, retry: page = find_lock_page(mapping, idx); - if (!page) { + if (!page || !PageUptodate(page)) { /* Check for page in userfault range */ if (userfaultfd_missing(vma)) { ret = hugetlb_handle_userfault(vma, mapping, idx, @@ -4552,26 +4552,30 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, goto out; } - page = alloc_huge_page(vma, haddr, 0); - if (IS_ERR(page)) { - /* - * Returning error will result in faulting task being - * sent SIGBUS. The hugetlb fault mutex prevents two - * tasks from racing to fault in the same page which - * could result in false unable to allocate errors. - * Page migration does not take the fault mutex, but - * does a clear then write of pte's under page table - * lock. Page fault code could race with migration, - * notice the clear pte and try to allocate a page - * here. 
Before returning error, get ptl and make - * sure there really is no pte entry. - */ - ptl = huge_pte_lock(h, mm, ptep); - ret = 0; - if (huge_pte_none(huge_ptep_get(ptep))) - ret = vmf_error(PTR_ERR(page)); - spin_unlock(ptl); - goto out; + if (!page) { + page = alloc_huge_page(vma, haddr, 0); + if (IS_ERR(page)) { + /* + * Returning error will result in faulting task + * being sent SIGBUS. The hugetlb fault mutex + * prevents two tasks from racing to fault in + * the same page which could result in false + * unable to allocate errors. Page migration + * does not take the fault mutex, but does a + * clear then write of pte's under page table + * lock. Page fault code could race with + * migration, notice the clear pte and try to + * allocate a page here. Before returning + * error, get ptl and make sure there really is + * no pte entry. + */ + ptl = huge_pte_lock(h, mm, ptep); + ret = 0; + if (huge_pte_none(huge_ptep_get(ptep))) + ret = vmf_error(PTR_ERR(page)); + spin_unlock(ptl); + goto out; + } } clear_huge_page(page, address, pages_per_huge_page(h)); __SetPageUptodate(page); @@ -4868,31 +4872,55 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, struct page **pagep) { bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE); - struct address_space *mapping; - pgoff_t idx; + struct hstate *h = hstate_vma(dst_vma); + struct address_space *mapping = dst_vma->vm_file->f_mapping; + pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); unsigned long size; int vm_shared = dst_vma->vm_flags & VM_SHARED; - struct hstate *h = hstate_vma(dst_vma); pte_t _dst_pte; spinlock_t *ptl; - int ret; + int ret = -ENOMEM; struct page *page; int writable; - mapping = dst_vma->vm_file->f_mapping; - idx = vma_hugecache_offset(h, dst_vma, dst_addr); - if (is_continue) { ret = -EFAULT; - page = find_lock_page(mapping, idx); - if (!page) + page = hugetlbfs_pagecache_page(h, dst_vma, dst_addr); + if (!page) { + ret = -ENOMEM; goto out; + } } else if (!*pagep) { - ret = -ENOMEM; + /* 
If a page already exists, then it's UFFDIO_COPY for + * a non-missing case. Return -EEXIST. + */ + if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) { + ret = -EEXIST; + goto out; + } + page = alloc_huge_page(dst_vma, dst_addr, 0); if (IS_ERR(page)) goto out; + /* Add shared, newly allocated pages to the page cache. */ + if (vm_shared) { + size = i_size_read(mapping->host) >> huge_page_shift(h); + ret = -EFAULT; + if (idx >= size) + goto out; + + /* + * Serialization between remove_inode_hugepages() and + * huge_add_to_page_cache() below happens through the + * hugetlb_fault_mutex_table that here must be hold by + * the caller. + */ + ret = huge_add_to_page_cache(page, mapping, idx); + if (ret) + goto out; + } + ret = copy_huge_page_from_user(page, (const void __user *) src_addr, pages_per_huge_page(h), false); @@ -4916,24 +4944,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, */ __SetPageUptodate(page); - /* Add shared, newly allocated pages to the page cache. */ - if (vm_shared && !is_continue) { - size = i_size_read(mapping->host) >> huge_page_shift(h); - ret = -EFAULT; - if (idx >= size) - goto out_release_nounlock; - - /* - * Serialization between remove_inode_hugepages() and - * huge_add_to_page_cache() below happens through the - * hugetlb_fault_mutex_table that here must be hold by - * the caller. - */ - ret = huge_add_to_page_cache(page, mapping, idx); - if (ret) - goto out_release_nounlock; - } - ptl = huge_pte_lockptr(h, dst_mm, dst_pte); spin_lock(ptl); @@ -4994,7 +5004,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, spin_unlock(ptl); if (vm_shared || is_continue) unlock_page(page); -out_release_nounlock: put_page(page); goto out; } -- 2.31.1.751.gd2f1c929bd-goog ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-13 23:43 Mina Almasry @ 2021-05-13 23:49 ` Mina Almasry 2021-05-14 0:14 ` Mike Kravetz 0 siblings, 1 reply; 29+ messages in thread From: Mina Almasry @ 2021-05-13 23:49 UTC (permalink / raw) Cc: Axel Rasmussen, Peter Xu, Linux-MM, Mike Kravetz, Andrew Morton, open list On Thu, May 13, 2021 at 4:43 PM Mina Almasry <almasrymina@google.com> wrote: > > When hugetlb_mcopy_atomic_pte() is called with: > - mode==MCOPY_ATOMIC_NORMAL and, > - we already have a page in the page cache corresponding to the > associated address, > > We will allocate a huge page from the reserves, and then fail to insert it > into the cache and return -EEXIST. In this case, we need to return -EEXIST > without allocating a new page as the page already exists in the cache. > Allocating the extra page causes the resv_huge_pages to underflow temporarily > until the extra page is freed. > > To fix this, we check whether a page already exists in the cache and, if not, > allocate a new page and insert it into the cache immediately while holding > the lock. After that we copy the contents into the page. > > As a side effect of this, pages may exist in the cache for which the > copy failed, and for these pages PageUptodate(page) == false. Modify code > that queries the cache to handle this correctly. > To be honest, I'm not sure I've done this bit correctly. Please take a look and let me know what you think. It may be overly complicated to have !PageUptodate() pages in the cache and ask the rest of the code to handle that edge case correctly, but I'm not sure how else to fix this issue. > Tested using: > ./tools/testing/selftests/vm/userfaultfd hugetlb_shared 1024 200 \ > /mnt/huge > > Test passes, and dmesg shows no underflow warnings.
> > Signed-off-by: Mina Almasry <almasrymina@google.com> > Cc: Axel Rasmussen <axelrasmussen@google.com> > Cc: Peter Xu <peterx@redhat.com> > Cc: linux-mm@kvack.org > Cc: Mike Kravetz <mike.kravetz@oracle.com> > Cc: Andrew Morton <akpm@linux-foundation.org> > Cc: linux-mm@kvack.org > Cc: linux-kernel@vger.kernel.org > > --- > fs/hugetlbfs/inode.c | 2 +- > mm/hugetlb.c | 109 +++++++++++++++++++++++-------------------- > 2 files changed, 60 insertions(+), 51 deletions(-) > > diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c > index a2a42335e8fd..cc027c335242 100644 > --- a/fs/hugetlbfs/inode.c > +++ b/fs/hugetlbfs/inode.c > @@ -346,7 +346,7 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to) > > /* Find the page */ > page = find_lock_page(mapping, index); > - if (unlikely(page == NULL)) { > + if (unlikely(page == NULL || !PageUptodate(page))) { > /* > * We have a HOLE, zero out the user-buffer for the > * length of the hole or request. > diff --git a/mm/hugetlb.c b/mm/hugetlb.c > index 629aa4c2259c..a5a5fbf7ac25 100644 > --- a/mm/hugetlb.c > +++ b/mm/hugetlb.c > @@ -4543,7 +4543,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, > > retry: > page = find_lock_page(mapping, idx); > - if (!page) { > + if (!page || !PageUptodate(page)) { > /* Check for page in userfault range */ > if (userfaultfd_missing(vma)) { > ret = hugetlb_handle_userfault(vma, mapping, idx, > @@ -4552,26 +4552,30 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, > goto out; > } > > - page = alloc_huge_page(vma, haddr, 0); > - if (IS_ERR(page)) { > - /* > - * Returning error will result in faulting task being > - * sent SIGBUS. The hugetlb fault mutex prevents two > - * tasks from racing to fault in the same page which > - * could result in false unable to allocate errors. > - * Page migration does not take the fault mutex, but > - * does a clear then write of pte's under page table > - * lock. 
Page fault code could race with migration, > - * notice the clear pte and try to allocate a page > - * here. Before returning error, get ptl and make > - * sure there really is no pte entry. > - */ > - ptl = huge_pte_lock(h, mm, ptep); > - ret = 0; > - if (huge_pte_none(huge_ptep_get(ptep))) > - ret = vmf_error(PTR_ERR(page)); > - spin_unlock(ptl); > - goto out; > + if (!page) { > + page = alloc_huge_page(vma, haddr, 0); > + if (IS_ERR(page)) { > + /* > + * Returning error will result in faulting task > + * being sent SIGBUS. The hugetlb fault mutex > + * prevents two tasks from racing to fault in > + * the same page which could result in false > + * unable to allocate errors. Page migration > + * does not take the fault mutex, but does a > + * clear then write of pte's under page table > + * lock. Page fault code could race with > + * migration, notice the clear pte and try to > + * allocate a page here. Before returning > + * error, get ptl and make sure there really is > + * no pte entry. 
> + */ > + ptl = huge_pte_lock(h, mm, ptep); > + ret = 0; > + if (huge_pte_none(huge_ptep_get(ptep))) > + ret = vmf_error(PTR_ERR(page)); > + spin_unlock(ptl); > + goto out; > + } > } > clear_huge_page(page, address, pages_per_huge_page(h)); > __SetPageUptodate(page); > @@ -4868,31 +4872,55 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > struct page **pagep) > { > bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE); > - struct address_space *mapping; > - pgoff_t idx; > + struct hstate *h = hstate_vma(dst_vma); > + struct address_space *mapping = dst_vma->vm_file->f_mapping; > + pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); > unsigned long size; > int vm_shared = dst_vma->vm_flags & VM_SHARED; > - struct hstate *h = hstate_vma(dst_vma); > pte_t _dst_pte; > spinlock_t *ptl; > - int ret; > + int ret = -ENOMEM; > struct page *page; > int writable; > > - mapping = dst_vma->vm_file->f_mapping; > - idx = vma_hugecache_offset(h, dst_vma, dst_addr); > - > if (is_continue) { > ret = -EFAULT; > - page = find_lock_page(mapping, idx); > - if (!page) > + page = hugetlbfs_pagecache_page(h, dst_vma, dst_addr); > + if (!page) { > + ret = -ENOMEM; > goto out; > + } > } else if (!*pagep) { > - ret = -ENOMEM; > + /* If a page already exists, then it's UFFDIO_COPY for > + * a non-missing case. Return -EEXIST. > + */ > + if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) { > + ret = -EEXIST; > + goto out; > + } > + > page = alloc_huge_page(dst_vma, dst_addr, 0); > if (IS_ERR(page)) > goto out; > > + /* Add shared, newly allocated pages to the page cache. */ > + if (vm_shared) { > + size = i_size_read(mapping->host) >> huge_page_shift(h); > + ret = -EFAULT; > + if (idx >= size) > + goto out; > + > + /* > + * Serialization between remove_inode_hugepages() and > + * huge_add_to_page_cache() below happens through the > + * hugetlb_fault_mutex_table that here must be hold by > + * the caller. 
> + */ > + ret = huge_add_to_page_cache(page, mapping, idx); > + if (ret) > + goto out; > + } > + > ret = copy_huge_page_from_user(page, > (const void __user *) src_addr, > pages_per_huge_page(h), false); > @@ -4916,24 +4944,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > */ > __SetPageUptodate(page); > > - /* Add shared, newly allocated pages to the page cache. */ > - if (vm_shared && !is_continue) { > - size = i_size_read(mapping->host) >> huge_page_shift(h); > - ret = -EFAULT; > - if (idx >= size) > - goto out_release_nounlock; > - > - /* > - * Serialization between remove_inode_hugepages() and > - * huge_add_to_page_cache() below happens through the > - * hugetlb_fault_mutex_table that here must be hold by > - * the caller. > - */ > - ret = huge_add_to_page_cache(page, mapping, idx); > - if (ret) > - goto out_release_nounlock; > - } > - > ptl = huge_pte_lockptr(h, dst_mm, dst_pte); > spin_lock(ptl); > > @@ -4994,7 +5004,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > spin_unlock(ptl); > if (vm_shared || is_continue) > unlock_page(page); > -out_release_nounlock: > put_page(page); > goto out; > } > -- > 2.31.1.751.gd2f1c929bd-goog ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-13 23:49 ` Mina Almasry @ 2021-05-14 0:14 ` Mike Kravetz 2021-05-14 0:23 ` Mina Almasry 2021-05-20 19:18 ` Mina Almasry 0 siblings, 2 replies; 29+ messages in thread From: Mike Kravetz @ 2021-05-14 0:14 UTC (permalink / raw) To: Mina Almasry; +Cc: Axel Rasmussen, Peter Xu, Linux-MM, Andrew Morton, open list On 5/13/21 4:49 PM, Mina Almasry wrote: > On Thu, May 13, 2021 at 4:43 PM Mina Almasry <almasrymina@google.com> wrote: >> >> When hugetlb_mcopy_atomic_pte() is called with: >> - mode==MCOPY_ATOMIC_NORMAL and, >> - we already have a page in the page cache corresponding to the >> associated address, >> >> We will allocate a huge page from the reserves, and then fail to insert it >> into the cache and return -EEXIST. In this case, we need to return -EEXIST >> without allocating a new page as the page already exists in the cache. >> Allocating the extra page causes the resv_huge_pages to underflow temporarily >> until the extra page is freed. >> >> To fix this we check if a page exists in the cache, and allocate it and >> insert it in the cache immediately while holding the lock. After that we >> copy the contents into the page. >> >> As a side effect of this, pages may exist in the cache for which the >> copy failed and for these pages PageUptodate(page) == false. Modify code >> that query the cache to handle this correctly. >> > > To be honest, I'm not sure I've done this bit correctly. Please take a > look and let me know what you think. It may be too overly complicated > to have !PageUptodate() pages in the cache and ask the rest of the > code to handle that edge case correctly, but I'm not sure how else to > fix this issue. > I think you just moved the underflow from hugetlb_mcopy_atomic_pte to hugetlb_no_page. Why? Consider the case where there is only one reserve left and someone does the MCOPY_ATOMIC_NORMAL for the address. 
We will allocate the page and consume the reserve (reserve count == 0) and insert the page into the cache. Now, if the copy_huge_page_from_user fails we must drop the locks/fault mutex to do the copy. While locks are dropped, someone faults on the address and ends up in hugetlb_no_page. The page is in the cache but not up to date, so we go down the allocate new page path and will decrement the reserve count again to cause underflow. How about this approach? - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte that you added. That will catch the race where the page was added to the cache before entering the routine. - With the above check in place, we only need to worry about the case where copy_huge_page_from_user fails and we must drop locks. In this case we: - Free the page previously allocated. - Allocate a 'temporary' huge page without consuming reserves. I'm thinking of something similar to page migration. - Drop the locks and let the copy_huge_page_from_user be done to the temporary page. - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the *pagep case) we need to once again check hugetlbfs_pagecache_present. - We then try to allocate the huge page which will consume the reserve. If successful, copy contents of temporary page to newly allocated page. Free temporary page. There may be issues with this, and I have not given it deep thought. It does abuse the temporary huge page concept, but perhaps no more than page migration. Things do slow down if the extra page allocation and copy is required, but that would only be the case if copy_huge_page_from_user needs to be done without locks. Not sure, but hoping that is rare. -- Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-14 0:14 ` Mike Kravetz @ 2021-05-14 0:23 ` Mina Almasry 2021-05-14 4:02 ` Mike Kravetz 2021-05-20 19:18 ` Mina Almasry 1 sibling, 1 reply; 29+ messages in thread From: Mina Almasry @ 2021-05-14 0:23 UTC (permalink / raw) To: Mike Kravetz; +Cc: Axel Rasmussen, Peter Xu, Linux-MM, Andrew Morton, open list On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 5/13/21 4:49 PM, Mina Almasry wrote: > > On Thu, May 13, 2021 at 4:43 PM Mina Almasry <almasrymina@google.com> wrote: > >> > >> When hugetlb_mcopy_atomic_pte() is called with: > >> - mode==MCOPY_ATOMIC_NORMAL and, > >> - we already have a page in the page cache corresponding to the > >> associated address, > >> > >> We will allocate a huge page from the reserves, and then fail to insert it > >> into the cache and return -EEXIST. In this case, we need to return -EEXIST > >> without allocating a new page as the page already exists in the cache. > >> Allocating the extra page causes the resv_huge_pages to underflow temporarily > >> until the extra page is freed. > >> > >> To fix this we check if a page exists in the cache, and allocate it and > >> insert it in the cache immediately while holding the lock. After that we > >> copy the contents into the page. > >> > >> As a side effect of this, pages may exist in the cache for which the > >> copy failed and for these pages PageUptodate(page) == false. Modify code > >> that query the cache to handle this correctly. > >> > > > > To be honest, I'm not sure I've done this bit correctly. Please take a > > look and let me know what you think. It may be too overly complicated > > to have !PageUptodate() pages in the cache and ask the rest of the > > code to handle that edge case correctly, but I'm not sure how else to > > fix this issue. > > > > I think you just moved the underflow from hugetlb_mcopy_atomic_pte to > hugetlb_no_page. Why? 
> > Consider the case where there is only one reserve left and someone does > the MCOPY_ATOMIC_NORMAL for the address. We will allocate the page and > consume the reserve (reserve count == 0) and insert the page into the > cache. Now, if the copy_huge_page_from_user fails we must drop the > locks/fault mutex to do the copy. While locks are dropped, someone > faults on the address and ends up in hugetlb_no_page. The page is in > the cache but not up to date, so we go down the allocate new page path > and will decrement the reserve count again to cause underflow. > For what it's worth, I think I fixed the underflow with this patch, not moved it. I added a check in hugetlb_no_page() such that if we find a page in the cache with !PageUptodate(page), we will reuse that page instead of allocating a new one and decrementing the count again. Running the test with the WARN_ON_ONCE locally shows no warnings again. > How about this approach? I'll give it a shot for sure. FWIW on first glance it looks more complicated than what I have here, but my guess is I'm not doing the !PageUptodate() handling correctly and that's why it seems this solution is simpler. I'll give it a shot though. > - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte > that you added. That will catch the race where the page was added to > the cache before entering the routine. > - With the above check in place, we only need to worry about the case > where copy_huge_page_from_user fails and we must drop locks. In this > case we: > - Free the page previously allocated. > - Allocate a 'temporary' huge page without consuming reserves. I'm > thinking of something similar to page migration. > - Drop the locks and let the copy_huge_page_from_user be done to the > temporary page. > - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the > *pagep case) we need to once again check > hugetlbfs_pagecache_present.
> - We then try to allocate the huge page which will consume the > reserve. If successful, copy contents of temporary page to newly > allocated page. Free temporary page. > > There may be issues with this, and I have not given it deep thought. It > does abuse the temporary huge page concept, but perhaps no more than > page migration. Things do slow down if the extra page allocation and > copy is required, but that would only be the case if copy_huge_page_from_user > needs to be done without locks. Not sure, but hoping that is rare. > -- > Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-14 0:23 ` Mina Almasry @ 2021-05-14 4:02 ` Mike Kravetz 2021-05-14 12:31 ` Peter Xu 0 siblings, 1 reply; 29+ messages in thread From: Mike Kravetz @ 2021-05-14 4:02 UTC (permalink / raw) To: Mina Almasry; +Cc: Axel Rasmussen, Peter Xu, Linux-MM, Andrew Morton, open list On 5/13/21 5:23 PM, Mina Almasry wrote: > On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: >> >> On 5/13/21 4:49 PM, Mina Almasry wrote: >>> On Thu, May 13, 2021 at 4:43 PM Mina Almasry <almasrymina@google.com> wrote: >>>> >>>> When hugetlb_mcopy_atomic_pte() is called with: >>>> - mode==MCOPY_ATOMIC_NORMAL and, >>>> - we already have a page in the page cache corresponding to the >>>> associated address, >>>> >>>> We will allocate a huge page from the reserves, and then fail to insert it >>>> into the cache and return -EEXIST. In this case, we need to return -EEXIST >>>> without allocating a new page as the page already exists in the cache. >>>> Allocating the extra page causes the resv_huge_pages to underflow temporarily >>>> until the extra page is freed. >>>> >>>> To fix this we check if a page exists in the cache, and allocate it and >>>> insert it in the cache immediately while holding the lock. After that we >>>> copy the contents into the page. >>>> >>>> As a side effect of this, pages may exist in the cache for which the >>>> copy failed and for these pages PageUptodate(page) == false. Modify code >>>> that query the cache to handle this correctly. >>>> >>> >>> To be honest, I'm not sure I've done this bit correctly. Please take a >>> look and let me know what you think. It may be too overly complicated >>> to have !PageUptodate() pages in the cache and ask the rest of the >>> code to handle that edge case correctly, but I'm not sure how else to >>> fix this issue. >>> >> >> I think you just moved the underflow from hugetlb_mcopy_atomic_pte to >> hugetlb_no_page. Why? 
>> >> Consider the case where there is only one reserve left and someone does >> the MCOPY_ATOMIC_NORMAL for the address. We will allocate the page and >> consume the reserve (reserve count == 0) and insert the page into the >> cache. Now, if the copy_huge_page_from_user fails we must drop the >> locks/fault mutex to do the copy. While locks are dropped, someone >> faults on the address and ends up in hugetlb_no_page. The page is in >> the cache but not up to date, so we go down the allocate new page path >> and will decrement the reserve count again to cause underflow. >> > > For what it's worth, I think I fixed the underflow with this patch, > not moved it. I added a check in hugetlb_no_page() such that if we > find a page in the cache with !PageUptodate(page), we will reuse that > page instead of allocating a new one and decrementing the count again. > Running the test with the WARN_ONCE_ON locally shows no warnings > again. My bad, sorry I missed that. I am also concerned with the semantics of this approach and what happens when a fault races with the userfaultfd copy. Previously I asked Peter if we could/should use a page found in the cache for the copy. His answer was as follows: AFAICT that's the expected behavior, and it need to be like that so as to avoid silent data corruption (if the page cache existed, it means the page is not "missing" at all, then it does not suite for a UFFDIO_COPY as it's only used for uffd page missing case). The interesting thing in your approach is that there was no page in the cache before the start of the userfaultfd copy. Yet, because we drop the locks we could race with a fault that would normally add a page to the cache. Unless I am reading your patch incorrectly again, the !PageUptodate(page) path in hugetlb_no_page is going to clear the page in the cache. This clearing would race with the copy to page done by the userfaultfd code. > >> How about this approach? > > I'll give it a shot for sure. 
FWIW on first glance it looks more > complicated that what I have here, but my guess I'm not doing the > !PageUptodate() handling correctly and that's why it seems this > solution is simpler. I'll give it a shot though. Ok, I will look into this more as well. Unfortunately, there does not appear to be a simple elegant solution. -- Mike Kravetz > >> - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte >> that you added. That will catch the race where the page was added to >> the cache before entering the routine. >> - With the above check in place, we only need to worry about the case >> where copy_huge_page_from_user fails and we must drop locks. In this >> case we: >> - Free the page previously allocated. >> - Allocate a 'temporary' huge page without consuming reserves. I'm >> thinking of something similar to page migration. >> - Drop the locks and let the copy_huge_page_from_user be done to the >> temporary page. >> - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the >> *pagep case) we need to once again check >> hugetlbfs_pagecache_present. >> - We then try to allocate the huge page which will consume the >> reserve. If successful, copy contents of temporary page to newly >> allocated page. Free temporary page. >> >> There may be issues with this, and I have not given it deep thought. It >> does abuse the temporary huge page concept, but perhaps no more than >> page migration. Things do slow down if the extra page allocation and >> copy is required, but that would only be the case if copy_huge_page_from_user >> needs to be done without locks. Not sure, but hoping that is rare. >> -- >> Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-14 4:02 ` Mike Kravetz @ 2021-05-14 12:31 ` Peter Xu 2021-05-14 17:56 ` Mike Kravetz 0 siblings, 1 reply; 29+ messages in thread From: Peter Xu @ 2021-05-14 12:31 UTC (permalink / raw) To: Mike Kravetz Cc: Mina Almasry, Axel Rasmussen, Linux-MM, Andrew Morton, open list Hi, Mike, On Thu, May 13, 2021 at 09:02:15PM -0700, Mike Kravetz wrote: [...] > I am also concerned with the semantics of this approach and what happens > when a fault races with the userfaultfd copy. Previously I asked Peter > if we could/should use a page found in the cache for the copy. His > answer was as follows: > > AFAICT that's the expected behavior, and it need to be like that so as to avoid > silent data corruption (if the page cache existed, it means the page is not > "missing" at all, then it does not suite for a UFFDIO_COPY as it's only used > for uffd page missing case). I didn't follow the rest discussion in depth yet... but just to mention that the above answer was for the question whether we can "update the page in the page cache", rather than "use a page found in the page cache". I think reuse the page should be fine, however it'll definitely break existing user interface (as it'll expect -EEXIST for now - we have kselftest covers that), meanwhile I don't see why the -EEXIST bothers a lot: it still tells the user that this page was filled in already. Normally it was filled in by another UFFDIO_COPY (as we could have multiple uffd service threads) along with a valid pte, then this userspace thread can simply skip this message as it means the event has been handled by some other servicing thread. 
(This also reminded me that there won't be a chance of UFFDIO_COPY race on page no page fault at least, since no page fault will always go into the uffd missing handling rather than filling in the page cache for a VM_UFFD_MISSING vma; while mmap read lock should guarantee VM_UFFD_MISSING be persistent) Thanks, -- Peter Xu ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-14 12:31 ` Peter Xu @ 2021-05-14 17:56 ` Mike Kravetz 2021-05-14 18:30 ` Axel Rasmussen 2021-05-14 19:16 ` Peter Xu 0 siblings, 2 replies; 29+ messages in thread From: Mike Kravetz @ 2021-05-14 17:56 UTC (permalink / raw) To: Peter Xu; +Cc: Mina Almasry, Axel Rasmussen, Linux-MM, Andrew Morton, open list On 5/14/21 5:31 AM, Peter Xu wrote: > Hi, Mike, > > On Thu, May 13, 2021 at 09:02:15PM -0700, Mike Kravetz wrote: > > [...] > >> I am also concerned with the semantics of this approach and what happens >> when a fault races with the userfaultfd copy. Previously I asked Peter >> if we could/should use a page found in the cache for the copy. His >> answer was as follows: >> >> AFAICT that's the expected behavior, and it need to be like that so as to avoid >> silent data corruption (if the page cache existed, it means the page is not >> "missing" at all, then it does not suite for a UFFDIO_COPY as it's only used >> for uffd page missing case). > > I didn't follow the rest discussion in depth yet... but just to mention that > the above answer was for the question whether we can "update the page in the > page cache", rather than "use a page found in the page cache". > > I think reuse the page should be fine, however it'll definitely break existing > user interface (as it'll expect -EEXIST for now - we have kselftest covers > that), meanwhile I don't see why the -EEXIST bothers a lot: it still tells the > user that this page was filled in already. Normally it was filled in by > another UFFDIO_COPY (as we could have multiple uffd service threads) along with > a valid pte, then this userspace thread can simply skip this message as it > means the event has been handled by some other servicing thread. 
> > (This also reminded me that there won't be a chance of UFFDIO_COPY race on page > no page fault at least, since no page fault will always go into the uffd > missing handling rather than filling in the page cache for a VM_UFFD_MISSING > vma; while mmap read lock should guarantee VM_UFFD_MISSING be persistent) Perhaps I am missing something. Since this is a shared mapping, can we not have a 'regular' mapping to the same range that is uffd registered? And, that regular mappings could fault and race with the uffd copy code? -- Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-14 17:56 ` Mike Kravetz @ 2021-05-14 18:30 ` Axel Rasmussen 2021-05-14 19:16 ` Peter Xu 1 sibling, 0 replies; 29+ messages in thread From: Axel Rasmussen @ 2021-05-14 18:30 UTC (permalink / raw) To: Mike Kravetz; +Cc: Peter Xu, Mina Almasry, Linux-MM, Andrew Morton, open list It's complicated and would take some more time for me to be certain, but after looking for half an hour or so this morning, I agree with Mike that such a race is possible. That is, we may back out into the retry path, and drop mmap_lock, and leave a situation where a page is in the cache, but we have !PageUptodate(). hugetlb_mcopy_atomic_pte clearly handles the VM_SHARED case, so I don't see a reason why there can't be another (non-userfaultfd-registered) mapping. If it were faulted at the right time, it seems like such a fault would indeed zero the page, and then the UFFDIO_COPY retry (once it acquired the lock again) would try to reuse it. On Fri, May 14, 2021 at 10:56 AM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 5/14/21 5:31 AM, Peter Xu wrote: > > Hi, Mike, > > > > On Thu, May 13, 2021 at 09:02:15PM -0700, Mike Kravetz wrote: > > > > [...] > > > >> I am also concerned with the semantics of this approach and what happens > >> when a fault races with the userfaultfd copy. Previously I asked Peter > >> if we could/should use a page found in the cache for the copy. His > >> answer was as follows: > >> > >> AFAICT that's the expected behavior, and it need to be like that so as to avoid > >> silent data corruption (if the page cache existed, it means the page is not > >> "missing" at all, then it does not suite for a UFFDIO_COPY as it's only used > >> for uffd page missing case). > > > > I didn't follow the rest discussion in depth yet... 
but just to mention that > > the above answer was for the question whether we can "update the page in the > > page cache", rather than "use a page found in the page cache". > > > > I think reuse the page should be fine, however it'll definitely break existing > > user interface (as it'll expect -EEXIST for now - we have kselftest covers > > that), meanwhile I don't see why the -EEXIST bothers a lot: it still tells the > > user that this page was filled in already. Normally it was filled in by > > another UFFDIO_COPY (as we could have multiple uffd service threads) along with > > a valid pte, then this userspace thread can simply skip this message as it > > means the event has been handled by some other servicing thread. > > > > (This also reminded me that there won't be a chance of UFFDIO_COPY race on page > > no page fault at least, since no page fault will always go into the uffd > > missing handling rather than filling in the page cache for a VM_UFFD_MISSING > > vma; while mmap read lock should guarantee VM_UFFD_MISSING be persistent) > > Perhaps I am missing something. > > Since this is a shared mapping, can we not have a 'regular' mapping to > the same range that is uffd registered? And, that regular mappings could > fault and race with the uffd copy code? > > -- > Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-14 17:56 ` Mike Kravetz 2021-05-14 18:30 ` Axel Rasmussen @ 2021-05-14 19:16 ` Peter Xu 1 sibling, 0 replies; 29+ messages in thread From: Peter Xu @ 2021-05-14 19:16 UTC (permalink / raw) To: Mike Kravetz Cc: Mina Almasry, Axel Rasmussen, Linux-MM, Andrew Morton, open list On Fri, May 14, 2021 at 10:56:36AM -0700, Mike Kravetz wrote: > Since this is a shared mapping, can we not have a 'regular' mapping to > the same range that is uffd registered? And, that regular mappings could > fault and race with the uffd copy code? Yeah we can.. Please ignore the paragraph in the parenthesis. Sorry. -- Peter Xu ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-14 0:14 ` Mike Kravetz 2021-05-14 0:23 ` Mina Almasry @ 2021-05-20 19:18 ` Mina Almasry 2021-05-20 19:21 ` Mina Almasry 1 sibling, 1 reply; 29+ messages in thread From: Mina Almasry @ 2021-05-20 19:18 UTC (permalink / raw) To: Mike Kravetz; +Cc: Axel Rasmussen, Peter Xu, Linux-MM, Andrew Morton, open list On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 5/13/21 4:49 PM, Mina Almasry wrote: > > On Thu, May 13, 2021 at 4:43 PM Mina Almasry <almasrymina@google.com> wrote: > >> > >> When hugetlb_mcopy_atomic_pte() is called with: > >> - mode==MCOPY_ATOMIC_NORMAL and, > >> - we already have a page in the page cache corresponding to the > >> associated address, > >> > >> We will allocate a huge page from the reserves, and then fail to insert it > >> into the cache and return -EEXIST. In this case, we need to return -EEXIST > >> without allocating a new page as the page already exists in the cache. > >> Allocating the extra page causes the resv_huge_pages to underflow temporarily > >> until the extra page is freed. > >> > >> To fix this we check if a page exists in the cache, and allocate it and > >> insert it in the cache immediately while holding the lock. After that we > >> copy the contents into the page. > >> > >> As a side effect of this, pages may exist in the cache for which the > >> copy failed and for these pages PageUptodate(page) == false. Modify code > >> that query the cache to handle this correctly. > >> > > > > To be honest, I'm not sure I've done this bit correctly. Please take a > > look and let me know what you think. It may be too overly complicated > > to have !PageUptodate() pages in the cache and ask the rest of the > > code to handle that edge case correctly, but I'm not sure how else to > > fix this issue. > > > > I think you just moved the underflow from hugetlb_mcopy_atomic_pte to > hugetlb_no_page. Why? 
> > Consider the case where there is only one reserve left and someone does > the MCOPY_ATOMIC_NORMAL for the address. We will allocate the page and > consume the reserve (reserve count == 0) and insert the page into the > cache. Now, if the copy_huge_page_from_user fails we must drop the > locks/fault mutex to do the copy. While locks are dropped, someone > faults on the address and ends up in hugetlb_no_page. The page is in > the cache but not up to date, so we go down the allocate new page path > and will decrement the reserve count again to cause underflow. > > How about this approach? > - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte > that you added. That will catch the race where the page was added to > the cache before entering the routine. > - With the above check in place, we only need to worry about the case > where copy_huge_page_from_user fails and we must drop locks. In this > case we: > - Free the page previously allocated. > - Allocate a 'temporary' huge page without consuming reserves. I'm > thinking of something similar to page migration. > - Drop the locks and let the copy_huge_page_from_user be done to the > temporary page. > - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the > *pagep case) we need to once again check > hugetlbfs_pagecache_present. > - We then try to allocate the huge page which will consume the > reserve. If successful, copy contents of temporary page to newly > allocated page. Free temporary page. > > There may be issues with this, and I have not given it deep thought. It > does abuse the temporary huge page concept, but perhaps no more than > page migration. Things do slow down if the extra page allocation and > copy is required, but that would only be the case if copy_huge_page_from_user > needs to be done without locks. Not sure, but hoping that is rare. Just following up this a bit: I've implemented this approach locally, and with it it's passing the test as-is. 
When I hack the code such that the copy in hugetlb_mcopy_atomic_pte() always fails, I run into this edge case, which causes resv_huge_pages to underflow again (this time permanently):

- hugetlb_no_page() is called on an index and a page is allocated and inserted into the cache, consuming the reservation.
- remove_huge_page() is called on this index and the page is removed from the cache.
- hugetlb_mcopy_atomic_pte() is called on this index; we do not find the page in the cache, we trigger this code path, and the copy fails.
- The allocations in this code path seem to double-consume the reservation, and resv_huge_pages underflows.

I'm looking at this edge case to understand why a prior remove_huge_page() causes my code to underflow resv_huge_pages. > -- > Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
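[Editorial note: the double-accounting sequence discussed above can be modelled in a few lines of userspace C. This is an illustrative toy sketch, not kernel code — every name is invented for the example, and a single counter stands in for h->resv_huge_pages:]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: one counter stands in for h->resv_huge_pages and one
 * flag stands in for "a page already sits in the page cache at this
 * index".  Nothing here is kernel code; all names are invented. */
static long toy_resv_huge_pages;
static bool toy_page_in_cache;

/* Mimics dequeue_huge_page_vma() consuming a reserve on allocation. */
static void toy_alloc_consuming_reserve(void)
{
	toy_resv_huge_pages--;	/* goes negative if no reserve is left */
}

/* Mimics freeing a page whose HPageRestoreReserve flag is set:
 * the reserve is given back. */
static void toy_free_restoring_reserve(void)
{
	toy_resv_huge_pages++;
}

/* The buggy order of operations: allocate from the reserves first,
 * discover -EEXIST afterwards.  Returns the lowest value the counter
 * reached, i.e. the depth of the transient underflow. */
static long toy_mcopy_then_eexist(void)
{
	long low;

	toy_alloc_consuming_reserve();
	low = toy_resv_huge_pages;	/* window where the counter is off */
	if (toy_page_in_cache)		/* cache insert fails with -EEXIST */
		toy_free_restoring_reserve();	/* extra page freed later */
	return low;
}
```

With the counter at zero and a page already in the cache, the model dips to -1 before recovering — the same transient window the WARN_ON_ONCE() in the original report catches.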
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-20 19:18 ` Mina Almasry @ 2021-05-20 19:21 ` Mina Almasry 2021-05-20 20:00 ` Mike Kravetz 0 siblings, 1 reply; 29+ messages in thread From: Mina Almasry @ 2021-05-20 19:21 UTC (permalink / raw) To: Mike Kravetz; +Cc: Axel Rasmussen, Peter Xu, Linux-MM, Andrew Morton, open list On Thu, May 20, 2021 at 12:18 PM Mina Almasry <almasrymina@google.com> wrote: > > On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 5/13/21 4:49 PM, Mina Almasry wrote: > > > On Thu, May 13, 2021 at 4:43 PM Mina Almasry <almasrymina@google.com> wrote: > > >> > > >> When hugetlb_mcopy_atomic_pte() is called with: > > >> - mode==MCOPY_ATOMIC_NORMAL and, > > >> - we already have a page in the page cache corresponding to the > > >> associated address, > > >> > > >> We will allocate a huge page from the reserves, and then fail to insert it > > >> into the cache and return -EEXIST. In this case, we need to return -EEXIST > > >> without allocating a new page as the page already exists in the cache. > > >> Allocating the extra page causes the resv_huge_pages to underflow temporarily > > >> until the extra page is freed. > > >> > > >> To fix this we check if a page exists in the cache, and allocate it and > > >> insert it in the cache immediately while holding the lock. After that we > > >> copy the contents into the page. > > >> > > >> As a side effect of this, pages may exist in the cache for which the > > >> copy failed and for these pages PageUptodate(page) == false. Modify code > > >> that query the cache to handle this correctly. > > >> > > > > > > To be honest, I'm not sure I've done this bit correctly. Please take a > > > look and let me know what you think. It may be too overly complicated > > > to have !PageUptodate() pages in the cache and ask the rest of the > > > code to handle that edge case correctly, but I'm not sure how else to > > > fix this issue. 
> > > > > > > I think you just moved the underflow from hugetlb_mcopy_atomic_pte to > > hugetlb_no_page. Why? > > > > Consider the case where there is only one reserve left and someone does > > the MCOPY_ATOMIC_NORMAL for the address. We will allocate the page and > > consume the reserve (reserve count == 0) and insert the page into the > > cache. Now, if the copy_huge_page_from_user fails we must drop the > > locks/fault mutex to do the copy. While locks are dropped, someone > > faults on the address and ends up in hugetlb_no_page. The page is in > > the cache but not up to date, so we go down the allocate new page path > > and will decrement the reserve count again to cause underflow. > > > > How about this approach? > > - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte > > that you added. That will catch the race where the page was added to > > the cache before entering the routine. > > - With the above check in place, we only need to worry about the case > > where copy_huge_page_from_user fails and we must drop locks. In this > > case we: > > - Free the page previously allocated. > > - Allocate a 'temporary' huge page without consuming reserves. I'm > > thinking of something similar to page migration. > > - Drop the locks and let the copy_huge_page_from_user be done to the > > temporary page. > > - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the > > *pagep case) we need to once again check > > hugetlbfs_pagecache_present. > > - We then try to allocate the huge page which will consume the > > reserve. If successful, copy contents of temporary page to newly > > allocated page. Free temporary page. > > > > There may be issues with this, and I have not given it deep thought. It > > does abuse the temporary huge page concept, but perhaps no more than > > page migration. 
Things do slow down if the extra page allocation and > > copy is required, but that would only be the case if copy_huge_page_from_user > > needs to be done without locks. Not sure, but hoping that is rare. > > Just following up this a bit: I've implemented this approach locally, > and with it it's passing the test as-is. When I hack the code such > that the copy in hugetlb_mcopy_atomic_pte() always fails, I run into > this edge case, which causes resv_huge_pages to underflow again (this > time permemantly): > > - hugetlb_no_page() is called on an index and a page is allocated and > inserted into the cache consuming the reservation. > - remove_huge_page() is called on this index and the page is removed from cache. > - hugetlb_mcopy_atomic_pte() is called on this index, we do not find > the page in the cache and we trigger this code patch and the copy > fails. > - The allocations in this code path seem to double consume the > reservation and resv_huge_pages underflows. > > I'm looking at this edge case to understand why a prior > remove_huge_page() causes my code to underflow resv_huge_pages. > I should also mention, without a prior remove_huge_page() this code path works fine, so it seems the fact that the reservation is consumed before causes trouble, but I'm not sure why yet. > > -- > > Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-20 19:21 ` Mina Almasry @ 2021-05-20 20:00 ` Mike Kravetz 2021-05-20 20:31 ` Mina Almasry 0 siblings, 1 reply; 29+ messages in thread From: Mike Kravetz @ 2021-05-20 20:00 UTC (permalink / raw) To: Mina Almasry; +Cc: Axel Rasmussen, Peter Xu, Linux-MM, Andrew Morton, open list On 5/20/21 12:21 PM, Mina Almasry wrote: > On Thu, May 20, 2021 at 12:18 PM Mina Almasry <almasrymina@google.com> wrote: >> >> On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: >>> >>> How about this approach? >>> - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte >>> that you added. That will catch the race where the page was added to >>> the cache before entering the routine. >>> - With the above check in place, we only need to worry about the case >>> where copy_huge_page_from_user fails and we must drop locks. In this >>> case we: >>> - Free the page previously allocated. >>> - Allocate a 'temporary' huge page without consuming reserves. I'm >>> thinking of something similar to page migration. >>> - Drop the locks and let the copy_huge_page_from_user be done to the >>> temporary page. >>> - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the >>> *pagep case) we need to once again check >>> hugetlbfs_pagecache_present. >>> - We then try to allocate the huge page which will consume the >>> reserve. If successful, copy contents of temporary page to newly >>> allocated page. Free temporary page. >>> >>> There may be issues with this, and I have not given it deep thought. It >>> does abuse the temporary huge page concept, but perhaps no more than >>> page migration. Things do slow down if the extra page allocation and >>> copy is required, but that would only be the case if copy_huge_page_from_user >>> needs to be done without locks. Not sure, but hoping that is rare. 
>> >> Just following up this a bit: I've implemented this approach locally, >> and with it it's passing the test as-is. When I hack the code such >> that the copy in hugetlb_mcopy_atomic_pte() always fails, I run into >> this edge case, which causes resv_huge_pages to underflow again (this >> time permemantly): >> >> - hugetlb_no_page() is called on an index and a page is allocated and >> inserted into the cache consuming the reservation. >> - remove_huge_page() is called on this index and the page is removed from cache. >> - hugetlb_mcopy_atomic_pte() is called on this index, we do not find >> the page in the cache and we trigger this code patch and the copy >> fails. >> - The allocations in this code path seem to double consume the >> reservation and resv_huge_pages underflows. >> >> I'm looking at this edge case to understand why a prior >> remove_huge_page() causes my code to underflow resv_huge_pages. >> > > I should also mention, without a prior remove_huge_page() this code > path works fine, so it seems the fact that the reservation is consumed > before causes trouble, but I'm not sure why yet. > Hi Mina, How about quickly posting the code? I may be able to provide more suggestions if I can see the actual code. -- Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
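[Editorial note: the retry protocol Mike proposes above — give the reserve back on copy failure, do the copy into a temporary page outside the reserves, and consume the reserve once on re-entry — can be sketched as a userspace toy model. All names are invented for the example; the invariant to check is that each successful UFFDIO_COPY consumes exactly one reserve, however the copy goes:]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model of the proposed retry protocol (illustrative names only). */
static long toy_resv = 1;	/* one reserve left: the hard case */

static void toy_alloc_reserved(void)  { toy_resv--; }
static void toy_free_reserved(void)   { toy_resv++; }
/* Temporary page: allocated and freed outside the reserve pool,
 * analogous to a migration-style temporary huge page. */
static void toy_alloc_temporary(void) { }
static void toy_free_temporary(void)  { }

/* First entry: allocate from reserves; if the user copy faults, give
 * the reserve back and hand a temporary page to the caller instead. */
static bool toy_first_pass(bool copy_faults)
{
	toy_alloc_reserved();
	if (copy_faults) {
		toy_free_reserved();	/* reserve returned to the pool */
		toy_alloc_temporary();	/* copy target while unlocked */
		return false;		/* caller retries (*pagep case) */
	}
	return true;			/* copy done, reserve consumed */
}

/* Retry entry: the temporary page now holds the data; consume the
 * reserve once, copy temp -> new page, then drop the temp page. */
static void toy_retry_pass(void)
{
	toy_alloc_reserved();
	toy_free_temporary();
}
```

In both the fast path and the faulting path the counter ends at zero, so the reserve is consumed exactly once per successful copy.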
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-20 20:00 ` Mike Kravetz @ 2021-05-20 20:31 ` Mina Almasry 2021-05-21 2:05 ` Mina Almasry 0 siblings, 1 reply; 29+ messages in thread From: Mina Almasry @ 2021-05-20 20:31 UTC (permalink / raw) To: Mike Kravetz; +Cc: Axel Rasmussen, Peter Xu, Linux-MM, Andrew Morton, open list [-- Attachment #1: Type: text/plain, Size: 3317 bytes --] On Thu, May 20, 2021 at 1:00 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > On 5/20/21 12:21 PM, Mina Almasry wrote: > > On Thu, May 20, 2021 at 12:18 PM Mina Almasry <almasrymina@google.com> wrote: > >> > >> On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > >>> > >>> How about this approach? > >>> - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte > >>> that you added. That will catch the race where the page was added to > >>> the cache before entering the routine. > >>> - With the above check in place, we only need to worry about the case > >>> where copy_huge_page_from_user fails and we must drop locks. In this > >>> case we: > >>> - Free the page previously allocated. > >>> - Allocate a 'temporary' huge page without consuming reserves. I'm > >>> thinking of something similar to page migration. > >>> - Drop the locks and let the copy_huge_page_from_user be done to the > >>> temporary page. > >>> - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the > >>> *pagep case) we need to once again check > >>> hugetlbfs_pagecache_present. > >>> - We then try to allocate the huge page which will consume the > >>> reserve. If successful, copy contents of temporary page to newly > >>> allocated page. Free temporary page. > >>> > >>> There may be issues with this, and I have not given it deep thought. It > >>> does abuse the temporary huge page concept, but perhaps no more than > >>> page migration. 
Things do slow down if the extra page allocation and > >>> copy is required, but that would only be the case if copy_huge_page_from_user > >>> needs to be done without locks. Not sure, but hoping that is rare. > >> > >> Just following up this a bit: I've implemented this approach locally, > >> and with it it's passing the test as-is. When I hack the code such > >> that the copy in hugetlb_mcopy_atomic_pte() always fails, I run into > >> this edge case, which causes resv_huge_pages to underflow again (this > >> time permemantly): > >> > >> - hugetlb_no_page() is called on an index and a page is allocated and > >> inserted into the cache consuming the reservation. > >> - remove_huge_page() is called on this index and the page is removed from cache. > >> - hugetlb_mcopy_atomic_pte() is called on this index, we do not find > >> the page in the cache and we trigger this code patch and the copy > >> fails. > >> - The allocations in this code path seem to double consume the > >> reservation and resv_huge_pages underflows. > >> > >> I'm looking at this edge case to understand why a prior > >> remove_huge_page() causes my code to underflow resv_huge_pages. > >> > > > > I should also mention, without a prior remove_huge_page() this code > > path works fine, so it seems the fact that the reservation is consumed > > before causes trouble, but I'm not sure why yet. > > > > Hi Mina, > > How about quickly posting the code? I may be able to provide more > suggestions if I can see the actual code. Sure thing, attached my patch so far. It's quite messy with prints everywhere and VM_BUG_ON() in error paths that I'm not handling yet. I've also hacked the code so that the hugetlb_mcopy_atomic_pte() copy always fails so I exercise that code path. 
> -- > Mike Kravetz [-- Attachment #2: 0001-mm-hugetlb-fix-resv_huge_pages-underflow-on-UFFDIO_C.patch --] [-- Type: application/octet-stream, Size: 8530 bytes --] From ff86c690f9236d8ba74cb2eb6f529b6354a96223 Mon Sep 17 00:00:00 2001 From: Mina Almasry <almasrymina@google.com> Date: Tue, 11 May 2021 17:01:50 -0700 Subject: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY test passes. With force copy fail, test fails resv underflows permenantly. Seems pages are added, removed, then not found in cache double consuming the reservation. Signed-off-by: Mina Almasry <almasrymina@google.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: linux-mm@kvack.org Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: linux-mm@kvack.org Cc: linux-kernel@vger.kernel.org Change-Id: Ida20f35fdfa7fce598582dcfa199845115eaac18 --- fs/hugetlbfs/inode.c | 6 +++ mm/hugetlb.c | 124 ++++++++++++++++++++++++++++++++++++++----- 2 files changed, 116 insertions(+), 14 deletions(-) diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c index a2a42335e8fd..ba0b3fe88f18 100644 --- a/fs/hugetlbfs/inode.c +++ b/fs/hugetlbfs/inode.c @@ -525,6 +525,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart, * to be adjusted. 
*/ VM_BUG_ON(PagePrivate(page)); + printk("%s:%d:%s removing huge page, idx=%d h->resv_huge_pages=%d\n", + __FILE__, __LINE__, __func__, index, + h->resv_huge_pages); remove_huge_page(page); freed++; if (!truncate_op) { @@ -532,6 +535,9 @@ static void remove_inode_hugepages(struct inode *inode, loff_t lstart, index, index + 1, 1))) hugetlb_fix_reserve_counts(inode); } + printk("%s:%d:%s removed huge page, idx=%d h->resv_huge_pages=%d\n", + __FILE__, __LINE__, __func__, index, + h->resv_huge_pages); unlock_page(page); if (!truncate_op) diff --git a/mm/hugetlb.c b/mm/hugetlb.c index 629aa4c2259c..2d9b557cbafa 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -93,6 +93,28 @@ static inline bool subpool_is_free(struct hugepage_subpool *spool) return true; } +/* + * Gigantic pages are so large that we do not guarantee that page++ pointer + * arithmetic will work across the entire page. We need something more + * specialized. + */ +static void __copy_gigantic_page(struct page *dst, struct page *src, + int nr_pages) +{ + int i; + struct page *dst_base = dst; + struct page *src_base = src; + + for (i = 0; i < nr_pages;) { + cond_resched(); + copy_highpage(dst, src); + + i++; + dst = mem_map_next(dst, dst_base, i); + src = mem_map_next(src, src_base, i); + } +} + static inline void unlock_or_release_subpool(struct hugepage_subpool *spool, unsigned long irq_flags) { @@ -1165,6 +1187,7 @@ static struct page *dequeue_huge_page_vma(struct hstate *h, page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask); if (page && !avoid_reserve && vma_has_reserves(vma, chg)) { SetHPageRestoreReserve(page); + WARN_ON_ONCE(!h->resv_huge_pages); h->resv_huge_pages--; } @@ -2405,6 +2428,7 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, int ret, idx; struct hugetlb_cgroup *h_cg; bool deferred_reserve; + pgoff_t pageidx = vma_hugecache_offset(h, vma, addr); idx = hstate_index(h); /* @@ -2508,6 +2532,9 @@ struct page *alloc_huge_page(struct vm_area_struct *vma, 
hugetlb_cgroup_uncharge_page_rsvd(hstate_index(h), pages_per_huge_page(h), page); } + printk("%s:%d:%s allocated page idx=%d, deferred_reserve=%d, avoid_reserve=%d\n", + __FILE__, __LINE__, __func__, pageidx, deferred_reserve, + avoid_reserve); return page; out_uncharge_cgroup: @@ -4571,6 +4598,8 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, if (huge_pte_none(huge_ptep_get(ptep))) ret = vmf_error(PTR_ERR(page)); spin_unlock(ptl); + printk("%s:%d:%s failed \n", __FILE__, __LINE__, + __func__); goto out; } clear_huge_page(page, address, pages_per_huge_page(h)); @@ -4578,8 +4607,12 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm, new_page = true; if (vma->vm_flags & VM_MAYSHARE) { + printk("%s:%d:%s adding page to cache idx=%d mapping=%px\n", + __FILE__, __LINE__, __func__, idx, mapping); int err = huge_add_to_page_cache(page, mapping, idx); if (err) { + printk("%s:%d:%s failed adding page to cache idx=%d\n", + __FILE__, __LINE__, __func__, idx); put_page(page); if (err == -EEXIST) goto retry; @@ -4868,44 +4901,102 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, struct page **pagep) { bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE); - struct address_space *mapping; - pgoff_t idx; + struct hstate *h = hstate_vma(dst_vma); + struct address_space *mapping = dst_vma->vm_file->f_mapping; + pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr); unsigned long size; int vm_shared = dst_vma->vm_flags & VM_SHARED; - struct hstate *h = hstate_vma(dst_vma); pte_t _dst_pte; spinlock_t *ptl; - int ret; + int ret = -ENOMEM; struct page *page; int writable; - - mapping = dst_vma->vm_file->f_mapping; - idx = vma_hugecache_offset(h, dst_vma, dst_addr); + unsigned long resv_temp = 0; if (is_continue) { ret = -EFAULT; page = find_lock_page(mapping, idx); - if (!page) + if (!page) { + ret = -ENOMEM; goto out; + } } else if (!*pagep) { - ret = -ENOMEM; + /* If a page already exists, then it's UFFDIO_COPY for + * a non-missing case. Return -EEXIST. 
+ */ + if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) { + ret = -EEXIST; + printk("%s:%d:%s found in cache. idx=%d mapping=%px\n", + __FILE__, __LINE__, __func__, idx, + dst_vma->vm_file->f_mapping); + goto out; + } + + printk("%s:%d:%s not found in cache. Allocating consuming reservation idx=%d mapping=%px\n", + __FILE__, __LINE__, __func__, idx, + dst_vma->vm_file->f_mapping); page = alloc_huge_page(dst_vma, dst_addr, 0); - if (IS_ERR(page)) + BUG_ON(IS_ERR(page)); + if (IS_ERR(page)) { + printk("%s:%d:%s page allocation failed\n", __FILE__, + __LINE__, __func__); goto out; + } +#if 0 ret = copy_huge_page_from_user(page, - (const void __user *) src_addr, - pages_per_huge_page(h), false); + (const void __user *)src_addr, + pages_per_huge_page(h), false); +#else + ret = -ENOENT; +#endif /* fallback to copy_from_user outside mmap_lock */ if (unlikely(ret)) { ret = -ENOENT; + printk("%s:%d:%s copy failed, freeing page idx=%d\n", + __FILE__, __LINE__, __func__, idx); + resv_temp = h->resv_huge_pages; + put_page(page); + + /* Reallocate the page outside the reserves. 
*/ + struct mempolicy *mpol; + nodemask_t *nodemask; + gfp_t gfp_mask = htlb_alloc_mask(h); + int node = huge_node(dst_vma, dst_addr, gfp_mask, &mpol, + &nodemask); + resv_temp = h->resv_huge_pages; + page = alloc_migrate_huge_page(h, gfp_mask, node, + nodemask); + VM_BUG_ON(h->resv_huge_pages != resv_temp); + if (IS_ERR(page)) { + VM_BUG_ON(true); + ret = -ENOMEM; + printk("%s:%d:%s failed allocating migrate_huge_page\n", + __FILE__, __LINE__, __func__); + goto out; + } *pagep = page; /* don't free the page */ goto out; } } else { - page = *pagep; + if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) { + printk("%s:%d:%s found huge_page in cache idx=%d mapping=%px\n", + __FILE__, __LINE__, __func__, idx, + dst_vma->vm_file->f_mapping); + put_page(*pagep); + ret = -EEXIST; + goto out; + } + + printk("%s:%d:%s not found in cache, allocating consuming reservation idx=%d mapping=%px\n", + __FILE__, __LINE__, __func__, idx, + dst_vma->vm_file->f_mapping); + page = alloc_huge_page(dst_vma, dst_addr, 0); + VM_BUG_ON(IS_ERR(page)); + __copy_gigantic_page(page, *pagep, pages_per_huge_page(h)); + put_page(*pagep); *pagep = NULL; } @@ -4929,9 +5020,14 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, * hugetlb_fault_mutex_table that here must be hold by * the caller. */ + printk("%s:%d:%s adding page to cache idx=%d mapping=%px\n", + __FILE__, __LINE__, __func__, idx, mapping); ret = huge_add_to_page_cache(page, mapping, idx); - if (ret) + if (ret) { + printk("%s:%d:%s failed adding to cache\n", __FILE__, + __LINE__, __func__); goto out_release_nounlock; + } } ptl = huge_pte_lockptr(h, dst_mm, dst_pte); -- 2.31.1.818.g46aad6cb9e-goog ^ permalink raw reply related [flat|nested] 29+ messages in thread
* Re: [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY 2021-05-20 20:31 ` Mina Almasry @ 2021-05-21 2:05 ` Mina Almasry 0 siblings, 0 replies; 29+ messages in thread From: Mina Almasry @ 2021-05-21 2:05 UTC (permalink / raw) To: Mike Kravetz; +Cc: Axel Rasmussen, Peter Xu, Linux-MM, Andrew Morton, open list On Thu, May 20, 2021 at 1:31 PM Mina Almasry <almasrymina@google.com> wrote: > > On Thu, May 20, 2021 at 1:00 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > > > On 5/20/21 12:21 PM, Mina Almasry wrote: > > > On Thu, May 20, 2021 at 12:18 PM Mina Almasry <almasrymina@google.com> wrote: > > >> > > >> On Thu, May 13, 2021 at 5:14 PM Mike Kravetz <mike.kravetz@oracle.com> wrote: > > >>> > > >>> How about this approach? > > >>> - Keep the check for hugetlbfs_pagecache_present in hugetlb_mcopy_atomic_pte > > >>> that you added. That will catch the race where the page was added to > > >>> the cache before entering the routine. > > >>> - With the above check in place, we only need to worry about the case > > >>> where copy_huge_page_from_user fails and we must drop locks. In this > > >>> case we: > > >>> - Free the page previously allocated. > > >>> - Allocate a 'temporary' huge page without consuming reserves. I'm > > >>> thinking of something similar to page migration. > > >>> - Drop the locks and let the copy_huge_page_from_user be done to the > > >>> temporary page. > > >>> - When reentering hugetlb_mcopy_atomic_pte after dropping locks (the > > >>> *pagep case) we need to once again check > > >>> hugetlbfs_pagecache_present. > > >>> - We then try to allocate the huge page which will consume the > > >>> reserve. If successful, copy contents of temporary page to newly > > >>> allocated page. Free temporary page. > > >>> > > >>> There may be issues with this, and I have not given it deep thought. It > > >>> does abuse the temporary huge page concept, but perhaps no more than > > >>> page migration. 
Things do slow down if the extra page allocation and > > >>> copy is required, but that would only be the case if copy_huge_page_from_user > > >>> needs to be done without locks. Not sure, but hoping that is rare. > > >> > > >> Just following up this a bit: I've implemented this approach locally, > > >> and with it it's passing the test as-is. When I hack the code such > > >> that the copy in hugetlb_mcopy_atomic_pte() always fails, I run into > > >> this edge case, which causes resv_huge_pages to underflow again (this > > >> time permemantly): > > >> > > >> - hugetlb_no_page() is called on an index and a page is allocated and > > >> inserted into the cache consuming the reservation. > > >> - remove_huge_page() is called on this index and the page is removed from cache. > > >> - hugetlb_mcopy_atomic_pte() is called on this index, we do not find > > >> the page in the cache and we trigger this code patch and the copy > > >> fails. > > >> - The allocations in this code path seem to double consume the > > >> reservation and resv_huge_pages underflows. > > >> > > >> I'm looking at this edge case to understand why a prior > > >> remove_huge_page() causes my code to underflow resv_huge_pages. > > >> > > > > > > I should also mention, without a prior remove_huge_page() this code > > > path works fine, so it seems the fact that the reservation is consumed > > > before causes trouble, but I'm not sure why yet. > > > > > > > Hi Mina, > > > > How about quickly posting the code? I may be able to provide more > > suggestions if I can see the actual code. > > Sure thing, attached my patch so far. It's quite messy with prints > everywhere and VM_BUG_ON() in error paths that I'm not handling yet. > I've also hacked the code so that the hugetlb_mcopy_atomic_pte() copy > always fails so I exercise that code path. > Of course right after I send out my patch, I figure out what's wrong. 
It turns out freeing the allocated page when the copy fails is extremely subtle (or at least it was hard to figure out why it was broken). On the error path I need to do:

restore_reserve_on_error(h, dst_vma, dst_addr, page);
if (!HPageRestoreReserve(page)) {
	hugetlb_unreserve_pages(mapping, idx, idx + 1, 1);
}
put_page(page);

This is because even though we always allocate asking for a page from the reserves, the page may not actually come from the reserves if the VMA has no reservation for this index (which is the case if the page was previously allocated by hugetlb_no_page() and then removed with remove_huge_page()). So we need to handle both cases correctly. Sorry for the noise, but this was hard to track down. A patch should be incoming soon unless I run into other issues on a closer look. > > -- > > Mike Kravetz ^ permalink raw reply [flat|nested] 29+ messages in thread
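[Editorial note: the two cases in this error path can be modelled in userspace. Below is an illustrative toy sketch — plain C, every name invented for the example, nothing here is kernel code — of why the free must either restore the reserve or drop the file-level reservation, depending on whether the page actually consumed one:]

```c
#include <assert.h>
#include <stdbool.h>

/* Toy model: toy_resv mirrors h->resv_huge_pages; toy_region mirrors
 * whether the file holds an unconsumed reservation for the index. */
static long toy_resv;
static long toy_region;

struct toy_page { bool restore_reserve; };

/* Mimics alloc_huge_page(): if a reservation exists it is consumed and
 * the page is marked to restore it on free; if not (e.g. it was already
 * consumed by hugetlb_no_page() and the page later removed), the
 * allocation records a new file-level reservation entry instead. */
static struct toy_page toy_alloc(void)
{
	struct toy_page p = { .restore_reserve = false };

	if (toy_region) {
		toy_resv--;
		p.restore_reserve = true;
	} else {
		toy_region = 1;	/* entry added by the allocation */
	}
	return p;
}

/* The fixed error path: a page that consumed a reserve gives it back;
 * a page that did not must instead drop the reservation entry the
 * allocation added, mirroring the hugetlb_unreserve_pages() call. */
static void toy_free_on_error(struct toy_page *p)
{
	if (p->restore_reserve)
		toy_resv++;
	else
		toy_region = 0;
}
```

Handling only one of the two branches is exactly the bug: the counters drift apart whenever the page did not really come from the reserves.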
end of thread, newest: ~2021-05-21 2:06 UTC

Thread overview: 29+ messages

2021-05-07 21:21 resv_huge_page underflow with userfaultfd test Mina Almasry
2021-05-11 0:33 ` Mike Kravetz
2021-05-11 0:52 ` Mina Almasry
2021-05-11 6:45 ` Axel Rasmussen
2021-05-11 7:08 ` Mina Almasry
2021-05-11 16:38 ` Mike Kravetz
2021-05-11 19:08 ` Mina Almasry
2021-05-12 2:25 ` Mike Kravetz
2021-05-12 2:35 ` Peter Xu
2021-05-12 3:06 ` Mike Kravetz
[not found] ` <20210512065813.89270-1-almasrymina@google.com>
2021-05-12 7:44 ` [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY Mina Almasry
[not found] ` <CAJHvVch0ZMapPVEc0Ge5V4KDgNDNhECbqwDi0y9XxsxFXQZ-gg@mail.gmail.com>
[not found] ` <c455d241-11f6-95a6-eb29-0ddd94eedbd7@oracle.com>
2021-05-12 19:42 ` Mina Almasry
2021-05-12 20:14 ` Peter Xu
2021-05-12 21:31 ` Mike Kravetz
2021-05-12 21:52 ` Mina Almasry
2021-05-13 23:43 Mina Almasry
2021-05-13 23:49 ` Mina Almasry
2021-05-14 0:14 ` Mike Kravetz
2021-05-14 0:23 ` Mina Almasry
2021-05-14 4:02 ` Mike Kravetz
2021-05-14 12:31 ` Peter Xu
2021-05-14 17:56 ` Mike Kravetz
2021-05-14 18:30 ` Axel Rasmussen
2021-05-14 19:16 ` Peter Xu
2021-05-20 19:18 ` Mina Almasry
2021-05-20 19:21 ` Mina Almasry
2021-05-20 20:00 ` Mike Kravetz
2021-05-20 20:31 ` Mina Almasry
2021-05-21 2:05 ` Mina Almasry