* resv_huge_page underflow with userfaultfd test
@ 2021-05-07 21:21 Mina Almasry
From: Mina Almasry @ 2021-05-07 21:21 UTC
  To: Mike Kravetz, Linux-MM, Axel Rasmussen, aarcange, peterx

Hi folks,

I ran into a bug that I'm not sure how to solve, so I'm wondering if
anyone has suggestions on what the issue could be and how to
investigate it. I added the WARN_ON_ONCE() below to catch instances of
resv_huge_pages underflowing:

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 629aa4c2259c..7d763eed650f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -1165,7 +1165,21 @@ static struct page *dequeue_huge_page_vma(struct hstate *h,
        page = dequeue_huge_page_nodemask(h, gfp_mask, nid, nodemask);
        if (page && !avoid_reserve && vma_has_reserves(vma, chg)) {
                SetHPageRestoreReserve(page);
+               WARN_ON_ONCE(!h->resv_huge_pages);
                h->resv_huge_pages--;
        }

Then I ran the userfaultfd selftest like so:

echo 1024 > /proc/sys/vm/nr_hugepages
mkdir -p /mnt/huge
mount -t hugetlbfs none /mnt/huge
./tools/testing/selftests/vm/userfaultfd hugetlb_shared 1024 200 /mnt/huge/userfaultfd_test

I then ran into this warning, indicating that the test does hit an underflow:

[   11.163403] ------------[ cut here ]------------
[   11.163404] WARNING: CPU: 0 PID: 237 at mm/hugetlb.c:1178 alloc_huge_page+0x558/0x5a0
[   11.163413] Modules linked in:
[   11.163419] CPU: 0 PID: 237 Comm: userfaultfd Not tainted 5.12.0-dbg-DEV #135
[   11.163424] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-2 04/01/2014
[   11.163429] RIP: 0010:alloc_huge_page+0x558/0x5a0
[   11.163432] Code: b0 00 0f 85 3d ff ff ff e9 2a ff ff ff be 01 00 00 00 48 89 df e8 18 e7 ff ff 48 f7 d8 4c 89 ef 48 89 c6 e8 da d7 ff ff eb 8c <0f> 0b 4d 8b 85 c0 00 00 00 e9 95 fd ff ff e8 35 59 84 00 4c 897
[   11.163434] RSP: 0018:ffff94bb0073fc80 EFLAGS: 00010046
[   11.163436] RAX: 0000000000000080 RBX: 0000000000000000 RCX: 5fa252c406a76700
[   11.163438] RDX: c0000000ffff7fff RSI: 0000000000000004 RDI: 0000000000017ffd
[   11.163439] RBP: ffff94bb0073fcf8 R08: 0000000000000000 R09: ffffffff9813ba70
[   11.163440] R10: 00000000ffff7fff R11: 0000000000000000 R12: ffff8ac7800558c8
[   11.163442] R13: ffffffff993f8880 R14: 00007f0dfa200000 R15: ffffed85453e0000
[   11.163443] FS:  00007f0d731fc700(0000) GS:ffff8acba9400000(0000) knlGS:0000000000000000
[   11.163445] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   11.163448] CR2: 00007f0e65e00028 CR3: 0000000108d50003 CR4: 0000000000370ef0
[   11.163452] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   11.163453] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   11.163455] Call Trace:
[   11.163468]  hugetlb_mcopy_atomic_pte+0xcb/0x450
[   11.163477]  mcopy_atomic+0xa08/0xd60
[   11.163480]  ? __might_fault+0x56/0x80
[   11.163493]  userfaultfd_ioctl+0xb18/0xd60
[   11.163502]  __se_sys_ioctl+0x77/0xc0
[   11.163507]  __x64_sys_ioctl+0x1d/0x20
[   11.163510]  do_syscall_64+0x3f/0x80
[   11.163515]  entry_SYSCALL_64_after_hwframe+0x44/0xae
[   11.163519] RIP: 0033:0x45ec87
[   11.163531] Code: 3c 1c 48 f7 d8 49 39 c4 72 b8 e8 64 63 03 00 85 c0 78 bd 48 83 c4 08 4c 89 e0 5b 41 5c c3 0f 1f 44 00 00 b8 10 00 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 018
[   11.163532] RSP: 002b:00007f0d731fc248 EFLAGS: 00000206 ORIG_RAX: 0000000000000010
[   11.163534] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 000000000045ec87
[   11.163536] RDX: 00007f0d731fc290 RSI: 00000000c028aa03 RDI: 0000000000000004
[   11.163537] RBP: 00007f0d731fc270 R08: 00000000004022b3 R09: 00007f0d731fc700
[   11.163538] R10: 00007f0d731fc9d0 R11: 0000000000000206 R12: 00007fff610cd82e
[   11.163539] R13: 00007fff610cd82f R14: 00007f0d731fc400 R15: 0000000001002000
[   11.163549] irq event stamp: 722
[   11.163550] hardirqs last  enabled at (721): [<ffffffff967cd41b>] kmem_cache_alloc_trace+0x1db/0x370
[   11.163558] hardirqs last disabled at (722): [<ffffffff9700c052>] _raw_spin_lock_irq+0x32/0x80
[   11.163560] softirqs last  enabled at (130): [<ffffffff9654e0d6>] __irq_exit_rcu+0xf6/0x100
[   11.163564] softirqs last disabled at (125): [<ffffffff9654e0d6>] __irq_exit_rcu+0xf6/0x100
[   11.163567] ---[ end trace 358ac5c76c211ea1 ]---

Debugging further, I find that resv_huge_pages underflows by 1
temporarily multiple times during the run of the test, but a
__free_huge_page() always follows that brings it back to 0, and
resv_huge_pages is 0 at the end of the test. I initially looked at this
because I suspected a problem in the resv_huge_pages accounting, but the
accounting itself seems fine: it correctly decrements resv_huge_pages
when a page is allocated from a reservation and correctly increments it
back when that page is freed.
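
For context: resv_huge_pages is (as far as I can tell) the unsigned long
counter in struct hstate, so the "underflow by 1" really means the
counter wraps until the matching free. A rough illustration, not actual
kernel code:

	unsigned long resv = 0;
	resv--;	/* wraps to ULONG_MAX; this is what the WARN_ON_ONCE above catches */
	resv++;	/* the later __free_huge_page() brings it back to 0 */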

I'm not that familiar with the userfaultfd/hugetlb code, so I was
hoping to solicit some suggestions for what the issue could be. Things
I've tried so far:

- Adding code that prevents resv_huge_pages from underflowing causes the
test to fail, so it seems the calling code in this test actually expects
to be able to temporarily allocate 1 more page than the VMA has
reserved, which seems like it may be a bug.
- Modifying hugetlb_mcopy_atomic_pte() to not use reserved pages also
causes the test to fail. Doing that and overprovisioning
/proc/sys/vm/nr_hugepages makes the test pass again, but I'm not sure
that's right (I'm not familiar with the code).
- The failure reproduces as far back as 5.11, so it doesn't seem to be
related to any recent changes.

Thanks in advance!


* [PATCH] mm, hugetlb: fix resv_huge_pages underflow on UFFDIO_COPY
@ 2021-05-13 23:43 ` Mina Almasry
From: Mina Almasry @ 2021-05-13 23:43 UTC
  Cc: Mina Almasry, Axel Rasmussen, Peter Xu, linux-mm, Mike Kravetz,
	Andrew Morton, linux-kernel

When hugetlb_mcopy_atomic_pte() is called with:
- mode == MCOPY_ATOMIC_NORMAL, and
- a page already present in the page cache for the associated address,

we allocate a huge page from the reserves, then fail to insert it into
the cache and return -EEXIST. In this case, we need to return -EEXIST
without allocating a new page, as the page already exists in the cache.
Allocating the extra page causes resv_huge_pages to underflow temporarily
until the extra page is freed.
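
Roughly, the pre-patch ordering that produces the transient underflow is
the following (a trimmed sketch of the code removed below, not a literal
excerpt):

	page = alloc_huge_page(dst_vma, dst_addr, 0);	/* may consume a reserve */
	copy_huge_page_from_user(page, (const void __user *)src_addr,
				 pages_per_huge_page(h), false);
	__SetPageUptodate(page);
	if (vm_shared && !is_continue) {
		/*
		 * If a page for idx is already in the cache, this fails
		 * with -EEXIST, and the reserve consumed above is only
		 * restored when the extra page is freed.
		 */
		ret = huge_add_to_page_cache(page, mapping, idx);
		if (ret)
			goto out_release_nounlock;
	}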

To fix this, we check whether a page already exists in the cache, and if
not, we allocate the new page and insert it into the cache immediately
while holding the lock. Only after that do we copy the contents into the
page.

As a side effect, pages may now exist in the cache for which the copy
failed; for these pages, PageUptodate(page) == false. Modify the code
that queries the cache to handle this correctly.
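
In short, cache lookups now treat a present-but-not-uptodate page like a
hole, along the lines of the hunks below (sketch, not a literal excerpt):

	page = find_lock_page(mapping, index);
	if (unlikely(page == NULL || !PageUptodate(page))) {
		/* handle as a hole / missing page */
	}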

Tested using:
./tools/testing/selftests/vm/userfaultfd hugetlb_shared 1024 200 \
/mnt/huge

Test passes, and dmesg shows no underflow warnings.

Signed-off-by: Mina Almasry <almasrymina@google.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Peter Xu <peterx@redhat.com>
Cc: linux-mm@kvack.org
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: linux-kernel@vger.kernel.org

---
 fs/hugetlbfs/inode.c |   2 +-
 mm/hugetlb.c         | 109 +++++++++++++++++++++++--------------------
 2 files changed, 60 insertions(+), 51 deletions(-)

diff --git a/fs/hugetlbfs/inode.c b/fs/hugetlbfs/inode.c
index a2a42335e8fd..cc027c335242 100644
--- a/fs/hugetlbfs/inode.c
+++ b/fs/hugetlbfs/inode.c
@@ -346,7 +346,7 @@ static ssize_t hugetlbfs_read_iter(struct kiocb *iocb, struct iov_iter *to)

 		/* Find the page */
 		page = find_lock_page(mapping, index);
-		if (unlikely(page == NULL)) {
+		if (unlikely(page == NULL || !PageUptodate(page))) {
 			/*
 			 * We have a HOLE, zero out the user-buffer for the
 			 * length of the hole or request.
diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 629aa4c2259c..a5a5fbf7ac25 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -4543,7 +4543,7 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,

 retry:
 	page = find_lock_page(mapping, idx);
-	if (!page) {
+	if (!page || !PageUptodate(page)) {
 		/* Check for page in userfault range */
 		if (userfaultfd_missing(vma)) {
 			ret = hugetlb_handle_userfault(vma, mapping, idx,
@@ -4552,26 +4552,30 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			goto out;
 		}

-		page = alloc_huge_page(vma, haddr, 0);
-		if (IS_ERR(page)) {
-			/*
-			 * Returning error will result in faulting task being
-			 * sent SIGBUS.  The hugetlb fault mutex prevents two
-			 * tasks from racing to fault in the same page which
-			 * could result in false unable to allocate errors.
-			 * Page migration does not take the fault mutex, but
-			 * does a clear then write of pte's under page table
-			 * lock.  Page fault code could race with migration,
-			 * notice the clear pte and try to allocate a page
-			 * here.  Before returning error, get ptl and make
-			 * sure there really is no pte entry.
-			 */
-			ptl = huge_pte_lock(h, mm, ptep);
-			ret = 0;
-			if (huge_pte_none(huge_ptep_get(ptep)))
-				ret = vmf_error(PTR_ERR(page));
-			spin_unlock(ptl);
-			goto out;
+		if (!page) {
+			page = alloc_huge_page(vma, haddr, 0);
+			if (IS_ERR(page)) {
+				/*
+				 * Returning error will result in faulting task
+				 * being sent SIGBUS.  The hugetlb fault mutex
+				 * prevents two tasks from racing to fault in
+				 * the same page which could result in false
+				 * unable to allocate errors.  Page migration
+				 * does not take the fault mutex, but does a
+				 * clear then write of pte's under page table
+				 * lock.  Page fault code could race with
+				 * migration, notice the clear pte and try to
+				 * allocate a page here.  Before returning
+				 * error, get ptl and make sure there really is
+				 * no pte entry.
+				 */
+				ptl = huge_pte_lock(h, mm, ptep);
+				ret = 0;
+				if (huge_pte_none(huge_ptep_get(ptep)))
+					ret = vmf_error(PTR_ERR(page));
+				spin_unlock(ptl);
+				goto out;
+			}
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
 		__SetPageUptodate(page);
@@ -4868,31 +4872,55 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 			    struct page **pagep)
 {
 	bool is_continue = (mode == MCOPY_ATOMIC_CONTINUE);
-	struct address_space *mapping;
-	pgoff_t idx;
+	struct hstate *h = hstate_vma(dst_vma);
+	struct address_space *mapping = dst_vma->vm_file->f_mapping;
+	pgoff_t idx = vma_hugecache_offset(h, dst_vma, dst_addr);
 	unsigned long size;
 	int vm_shared = dst_vma->vm_flags & VM_SHARED;
-	struct hstate *h = hstate_vma(dst_vma);
 	pte_t _dst_pte;
 	spinlock_t *ptl;
-	int ret;
+	int ret = -ENOMEM;
 	struct page *page;
 	int writable;

-	mapping = dst_vma->vm_file->f_mapping;
-	idx = vma_hugecache_offset(h, dst_vma, dst_addr);
-
 	if (is_continue) {
 		ret = -EFAULT;
-		page = find_lock_page(mapping, idx);
-		if (!page)
+		page = hugetlbfs_pagecache_page(h, dst_vma, dst_addr);
+		if (!page) {
+			ret = -ENOMEM;
 			goto out;
+		}
 	} else if (!*pagep) {
-		ret = -ENOMEM;
+		/* If a page already exists, then it's UFFDIO_COPY for
+		 * a non-missing case. Return -EEXIST.
+		 */
+		if (hugetlbfs_pagecache_present(h, dst_vma, dst_addr)) {
+			ret = -EEXIST;
+			goto out;
+		}
+
 		page = alloc_huge_page(dst_vma, dst_addr, 0);
 		if (IS_ERR(page))
 			goto out;

+		/* Add shared, newly allocated pages to the page cache. */
+		if (vm_shared) {
+			size = i_size_read(mapping->host) >> huge_page_shift(h);
+			ret = -EFAULT;
+			if (idx >= size)
+				goto out;
+
+			/*
+			 * Serialization between remove_inode_hugepages() and
+			 * huge_add_to_page_cache() below happens through the
+			 * hugetlb_fault_mutex_table that here must be hold by
+			 * the caller.
+			 */
+			ret = huge_add_to_page_cache(page, mapping, idx);
+			if (ret)
+				goto out;
+		}
+
 		ret = copy_huge_page_from_user(page,
 						(const void __user *) src_addr,
 						pages_per_huge_page(h), false);
@@ -4916,24 +4944,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	 */
 	__SetPageUptodate(page);

-	/* Add shared, newly allocated pages to the page cache. */
-	if (vm_shared && !is_continue) {
-		size = i_size_read(mapping->host) >> huge_page_shift(h);
-		ret = -EFAULT;
-		if (idx >= size)
-			goto out_release_nounlock;
-
-		/*
-		 * Serialization between remove_inode_hugepages() and
-		 * huge_add_to_page_cache() below happens through the
-		 * hugetlb_fault_mutex_table that here must be hold by
-		 * the caller.
-		 */
-		ret = huge_add_to_page_cache(page, mapping, idx);
-		if (ret)
-			goto out_release_nounlock;
-	}
-
 	ptl = huge_pte_lockptr(h, dst_mm, dst_pte);
 	spin_lock(ptl);

@@ -4994,7 +5004,6 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm,
 	spin_unlock(ptl);
 	if (vm_shared || is_continue)
 		unlock_page(page);
-out_release_nounlock:
 	put_page(page);
 	goto out;
 }
--
2.31.1.751.gd2f1c929bd-goog


