[PATCH 0/3] mm/hugetlb: Fix selftest failures with write check

linux-kernel.vger.kernel.org archive mirror
 help / color / mirror / Atom feed

* [PATCH 0/3] mm/hugetlb: Fix selftest failures with write check
@ 2022-10-03 15:56 Peter Xu
  2022-10-03 15:56 ` [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling Peter Xu
                   ` (2 more replies)
  0 siblings, 3 replies; 11+ messages in thread
From: Peter Xu @ 2022-10-03 15:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: peterx, Andrea Arcangeli, Mike Rapoport, David Hildenbrand,
	Andrew Morton, Axel Rasmussen, Mike Kravetz, Nadav Amit

Currently akpm mm-unstable fails with uffd hugetlb private mapping test
randomly on a write check.

The initial bisection of that points to the recent pmd unshare series, but
it turns out there's no direction relationship with the series but only
some timing change caused the race to start trigger.

The race should be fixed in patch 1.  Patch 2 is a trivial cleanup on the
similar race with hugetlb migrations, patch 3 comment on the write check
so when anyone read it again it'll be clear why it's there.

Thanks,

Peter Xu (3):
  mm/hugetlb: Fix race condition of uffd missing/minor handling
  mm/hugetlb: Use hugetlb_pte_stable in migration race check
  mm/selftest: uffd: Explain the write missing fault check

 mm/hugetlb.c                             | 62 ++++++++++++++++++++----
 tools/testing/selftests/vm/userfaultfd.c | 22 ++++++++-
 2 files changed, 73 insertions(+), 11 deletions(-)

-- 
2.37.3

^ permalink raw reply	[flat|nested] 11+ messages in thread

* [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling
  2022-10-03 15:56 [PATCH 0/3] mm/hugetlb: Fix selftest failures with write check Peter Xu
@ 2022-10-03 15:56 ` Peter Xu
  2022-10-03 17:20   ` Mike Kravetz
  2022-10-03 21:00   ` Nadav Amit
  2022-10-03 15:56 ` [PATCH 2/3] mm/hugetlb: Use hugetlb_pte_stable in migration race check Peter Xu
  2022-10-03 15:56 ` [PATCH 3/3] mm/selftest: uffd: Explain the write missing fault check Peter Xu
  2 siblings, 2 replies; 11+ messages in thread
From: Peter Xu @ 2022-10-03 15:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: peterx, Andrea Arcangeli, Mike Rapoport, David Hildenbrand,
	Andrew Morton, Axel Rasmussen, Mike Kravetz, Nadav Amit

After the recent rework patchset of hugetlb locking on pmd sharing,
kselftest for userfaultfd sometimes fails on hugetlb private tests with
unexpected write fault checks.

It turns out there's nothing wrong within the locking series regarding this
matter, but it could have changed the timing of threads so it can trigger
an old bug.

The real bug is when we call hugetlb_no_page() we're not with the pgtable
lock.  It means we're reading the pte values lockless.  It's perfectly fine
in most cases because before we do normal page allocations we'll take the
lock and check pte_same() again.  However before that, there are actually
two paths on userfaultfd missing/minor handling that may directly move on
with the fault process without checking the pte values.

It means for these two paths we may be generating an uffd message based on
an unstable pte, while an unstable pte can legally be anything as long as
the modifier holds the pgtable lock.

One example, which is also what happened in the failing kselftest and
caused the test failure, is that for private mappings CoW can happen on one
page.  CoW requires pte being cleared before being replaced with a new page
for TLB coherency, but then there can be a race condition:

        thread 1                              thread 2
        --------                              --------

      hugetlb_fault                         hugetlb_fault
        private pte RO
        hugetlb_wp
          pgtable_lock()
          huge_ptep_clear_flush
                                              pte=NULL
                                              hugetlb_no_page
                                                generate uffd missing event
                                                even if page existed!!
          set_huge_pte_at
          pgtable_unlock()

Fix this by recheck the pte after pgtable lock for both userfaultfd missing
& minor fault paths.

This bug should have been around starting from uffd hugetlb introduced, so
attaching a Fixes to the commit.  Also attach another Fixes to the minor
support commit for easier tracking.

Note that userfaultfd is actually fine with false positives (e.g. caused by
pte changed), but not wrong logical events (e.g. caused by reading a pte
during changing).  The latter can confuse the userspace, so the strictness
is very much preferred.  E.g., MISSING event should never happen on the
page after UFFDIO_COPY has correctly installed the page and returned.

Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Mike Kravetz <mike.kravetz@oracle.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>
Cc: Nadav Amit <nadav.amit@gmail.com>
Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
Fixes: 7677f7fd8be7 ("userfaultfd: add minor fault registration mode")
Co-developed-by: Mike Kravetz <mike.kravetz@oracle.com>
Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++------
 1 file changed, 49 insertions(+), 6 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index 9679fe519b90..fa3fcdb0c4b8 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5521,6 +5521,23 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
 	return ret;
 }
 
+/*
+ * Recheck pte with pgtable lock.  Returns true if pte didn't change, or
+ * false if pte changed or is changing.
+ */
+static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm,
+			       pte_t *ptep, pte_t old_pte)
+{
+	spinlock_t *ptl;
+	bool same;
+
+	ptl = huge_pte_lock(h, mm, ptep);
+	same = pte_same(huge_ptep_get(ptep), old_pte);
+	spin_unlock(ptl);
+
+	return same;
+}
+
 static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			struct vm_area_struct *vma,
 			struct address_space *mapping, pgoff_t idx,
@@ -5562,9 +5579,30 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			goto out;
 		/* Check for page in userfault range */
 		if (userfaultfd_missing(vma)) {
-			ret = hugetlb_handle_userfault(vma, mapping, idx,
-						       flags, haddr, address,
-						       VM_UFFD_MISSING);
+			/*
+			 * Since hugetlb_no_page() was examining pte
+			 * without pgtable lock, we need to re-test under
+			 * lock because the pte may not be stable and could
+			 * have changed from under us.  Try to detect
+			 * either changed or during-changing ptes and retry
+			 * properly when needed.
+			 *
+			 * Note that userfaultfd is actually fine with
+			 * false positives (e.g. caused by pte changed),
+			 * but not wrong logical events (e.g. caused by
+			 * reading a pte during changing).  The latter can
+			 * confuse the userspace, so the strictness is very
+			 * much preferred.  E.g., MISSING event should
+			 * never happen on the page after UFFDIO_COPY has
+			 * correctly installed the page and returned.
+			 */
+			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
+				ret = hugetlb_handle_userfault(
+				    vma, mapping, idx, flags, haddr,
+				    address, VM_UFFD_MISSING);
+			else
+				/* Retry the fault */
+				ret = 0;
 			goto out;
 		}
 
@@ -5634,9 +5672,14 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 		if (userfaultfd_minor(vma)) {
 			unlock_page(page);
 			put_page(page);
-			ret = hugetlb_handle_userfault(vma, mapping, idx,
-						       flags, haddr, address,
-						       VM_UFFD_MINOR);
+			/* See comment in userfaultfd_missing() block above */
+			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
+				ret = hugetlb_handle_userfault(
+				    vma, mapping, idx, flags, haddr,
+				    address, VM_UFFD_MINOR);
+			else
+				/* Retry the fault */
+				ret = 0;
 			goto out;
 		}
 	}
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 2/3] mm/hugetlb: Use hugetlb_pte_stable in migration race check
  2022-10-03 15:56 [PATCH 0/3] mm/hugetlb: Fix selftest failures with write check Peter Xu
  2022-10-03 15:56 ` [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling Peter Xu
@ 2022-10-03 15:56 ` Peter Xu
  2022-10-03 15:56 ` [PATCH 3/3] mm/selftest: uffd: Explain the write missing fault check Peter Xu
  2 siblings, 0 replies; 11+ messages in thread
From: Peter Xu @ 2022-10-03 15:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: peterx, Andrea Arcangeli, Mike Rapoport, David Hildenbrand,
	Andrew Morton, Axel Rasmussen, Mike Kravetz, Nadav Amit

After hugetlb_pte_stable() introduced, we can also rewrite the migration
race condition against page allocation to use the new helper too.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 mm/hugetlb.c | 7 +++----
 1 file changed, 3 insertions(+), 4 deletions(-)

diff --git a/mm/hugetlb.c b/mm/hugetlb.c
index fa3fcdb0c4b8..e762c5369a6f 100644
--- a/mm/hugetlb.c
+++ b/mm/hugetlb.c
@@ -5620,11 +5620,10 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
 			 * here.  Before returning error, get ptl and make
 			 * sure there really is no pte entry.
 			 */
-			ptl = huge_pte_lock(h, mm, ptep);
-			ret = 0;
-			if (huge_pte_none(huge_ptep_get(ptep)))
+			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
 				ret = vmf_error(PTR_ERR(page));
-			spin_unlock(ptl);
+			else
+				ret = 0;
 			goto out;
 		}
 		clear_huge_page(page, address, pages_per_huge_page(h));
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* [PATCH 3/3] mm/selftest: uffd: Explain the write missing fault check
  2022-10-03 15:56 [PATCH 0/3] mm/hugetlb: Fix selftest failures with write check Peter Xu
  2022-10-03 15:56 ` [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling Peter Xu
  2022-10-03 15:56 ` [PATCH 2/3] mm/hugetlb: Use hugetlb_pte_stable in migration race check Peter Xu
@ 2022-10-03 15:56 ` Peter Xu
  2 siblings, 0 replies; 11+ messages in thread
From: Peter Xu @ 2022-10-03 15:56 UTC (permalink / raw)
  To: linux-kernel, linux-mm
  Cc: peterx, Andrea Arcangeli, Mike Rapoport, David Hildenbrand,
	Andrew Morton, Axel Rasmussen, Mike Kravetz, Nadav Amit

It's not obvious why we had a write check for each of the missing messages,
especially when it should be a locking op.  Add a rich comment for that,
and also try to explain its good side and limitations, so that if someone
hit it again for either a bug or a different glibc impl there'll be some
clue to start with.

Signed-off-by: Peter Xu <peterx@redhat.com>
---
 tools/testing/selftests/vm/userfaultfd.c | 22 +++++++++++++++++++++-
 1 file changed, 21 insertions(+), 1 deletion(-)

diff --git a/tools/testing/selftests/vm/userfaultfd.c b/tools/testing/selftests/vm/userfaultfd.c
index 74babdbc02e5..297f250c1d95 100644
--- a/tools/testing/selftests/vm/userfaultfd.c
+++ b/tools/testing/selftests/vm/userfaultfd.c
@@ -774,7 +774,27 @@ static void uffd_handle_page_fault(struct uffd_msg *msg,
 		continue_range(uffd, msg->arg.pagefault.address, page_size);
 		stats->minor_faults++;
 	} else {
-		/* Missing page faults */
+		/*
+		 * Missing page faults.
+		 *
+		 * Here we force a write check for each of the missing mode
+		 * faults.  It's guaranteed because the only threads that
+		 * will trigger uffd faults are the locking threads, and
+		 * their first instruction to touch the missing page will
+		 * always be pthread_mutex_lock().
+		 *
+		 * Note that here we relied on an NPTL glibc impl detail to
+		 * always read the lock type at the entry of the lock op
+		 * (pthread_mutex_t.__data.__type, offset 0x10) before
+		 * doing any locking operations to guarantee that.  It's
+		 * actually not good to rely on this impl detail because
+		 * logically a pthread-compatible lib can implement the
+		 * locks without types and we can fail when linking with
+		 * them.  However since we used to find bugs with this
+		 * strict check we still keep it around.  Hopefully this
+		 * could be a good hint when it fails again.  If one day
+		 * it'll break on some other impl of glibc we'll revisit.
+		 */
 		if (msg->arg.pagefault.flags & UFFD_PAGEFAULT_FLAG_WRITE)
 			err("unexpected write fault");
 
-- 
2.37.3


^ permalink raw reply related	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling
  2022-10-03 15:56 ` [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling Peter Xu
@ 2022-10-03 17:20   ` Mike Kravetz
  2022-10-03 21:27     ` Peter Xu
  2022-10-03 21:00   ` Nadav Amit
  1 sibling, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2022-10-03 17:20 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Andrea Arcangeli, Mike Rapoport,
	David Hildenbrand, Andrew Morton, Axel Rasmussen, Nadav Amit

On 10/03/22 11:56, Peter Xu wrote:
> After the recent rework patchset of hugetlb locking on pmd sharing,
> kselftest for userfaultfd sometimes fails on hugetlb private tests with
> unexpected write fault checks.
> 
> It turns out there's nothing wrong within the locking series regarding this
> matter, but it could have changed the timing of threads so it can trigger
> an old bug.
> 
> The real bug is when we call hugetlb_no_page() we're not with the pgtable
> lock.  It means we're reading the pte values lockless.  It's perfectly fine
> in most cases because before we do normal page allocations we'll take the
> lock and check pte_same() again.  However before that, there are actually
> two paths on userfaultfd missing/minor handling that may directly move on
> with the fault process without checking the pte values.
> 
> It means for these two paths we may be generating an uffd message based on
> an unstable pte, while an unstable pte can legally be anything as long as
> the modifier holds the pgtable lock.
> 
> One example, which is also what happened in the failing kselftest and
> caused the test failure, is that for private mappings CoW can happen on one
> page.  CoW requires pte being cleared before being replaced with a new page
> for TLB coherency, but then there can be a race condition:
> 
>         thread 1                              thread 2
>         --------                              --------
> 
>       hugetlb_fault                         hugetlb_fault
>         private pte RO
>         hugetlb_wp
>           pgtable_lock()
>           huge_ptep_clear_flush
>                                               pte=NULL
>                                               hugetlb_no_page
>                                                 generate uffd missing event
>                                                 even if page existed!!
>           set_huge_pte_at
>           pgtable_unlock()

Thanks for working on this Peter!

I agree with this patch, but I suspect the above race is not possible.  Why?
In both cases, we take the hugetlb fault mutex when processing a huegtlb
page fault.  This means only one thread can execute the fault code for
a specific mapping/index at a time.  This is why I was so confused, and may
remain a bit confused about how we end up with userfault processing a write
fault.  It was part of the reason for my 'unclear' wording about this being
more about cpus not seeing updated values.  Note that we do drop the fault
mutex before calling handle_usefault, but by then we have already made the
'missing' determination.

Thoughts?  Perhaps, I am still confused.
-- 
Mike Kravetz

> 
> Fix this by recheck the pte after pgtable lock for both userfaultfd missing
> & minor fault paths.
> 
> This bug should have been around starting from uffd hugetlb introduced, so
> attaching a Fixes to the commit.  Also attach another Fixes to the minor
> support commit for easier tracking.
> 
> Note that userfaultfd is actually fine with false positives (e.g. caused by
> pte changed), but not wrong logical events (e.g. caused by reading a pte
> during changing).  The latter can confuse the userspace, so the strictness
> is very much preferred.  E.g., MISSING event should never happen on the
> page after UFFDIO_COPY has correctly installed the page and returned.
> 
> Cc: Andrea Arcangeli <aarcange@redhat.com>
> Cc: Mike Kravetz <mike.kravetz@oracle.com>
> Cc: Axel Rasmussen <axelrasmussen@google.com>
> Cc: Nadav Amit <nadav.amit@gmail.com>
> Fixes: 1a1aad8a9b7b ("userfaultfd: hugetlbfs: add userfaultfd hugetlb hook")
> Fixes: 7677f7fd8be7 ("userfaultfd: add minor fault registration mode")
> Co-developed-by: Mike Kravetz <mike.kravetz@oracle.com>
> Signed-off-by: Peter Xu <peterx@redhat.com>
> ---
>  mm/hugetlb.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++------
>  1 file changed, 49 insertions(+), 6 deletions(-)
> 
> diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> index 9679fe519b90..fa3fcdb0c4b8 100644
> --- a/mm/hugetlb.c
> +++ b/mm/hugetlb.c
> @@ -5521,6 +5521,23 @@ static inline vm_fault_t hugetlb_handle_userfault(struct vm_area_struct *vma,
>  	return ret;
>  }
>  
> +/*
> + * Recheck pte with pgtable lock.  Returns true if pte didn't change, or
> + * false if pte changed or is changing.
> + */
> +static bool hugetlb_pte_stable(struct hstate *h, struct mm_struct *mm,
> +			       pte_t *ptep, pte_t old_pte)
> +{
> +	spinlock_t *ptl;
> +	bool same;
> +
> +	ptl = huge_pte_lock(h, mm, ptep);
> +	same = pte_same(huge_ptep_get(ptep), old_pte);
> +	spin_unlock(ptl);
> +
> +	return same;
> +}
> +
>  static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  			struct vm_area_struct *vma,
>  			struct address_space *mapping, pgoff_t idx,
> @@ -5562,9 +5579,30 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  			goto out;
>  		/* Check for page in userfault range */
>  		if (userfaultfd_missing(vma)) {
> -			ret = hugetlb_handle_userfault(vma, mapping, idx,
> -						       flags, haddr, address,
> -						       VM_UFFD_MISSING);
> +			/*
> +			 * Since hugetlb_no_page() was examining pte
> +			 * without pgtable lock, we need to re-test under
> +			 * lock because the pte may not be stable and could
> +			 * have changed from under us.  Try to detect
> +			 * either changed or during-changing ptes and retry
> +			 * properly when needed.
> +			 *
> +			 * Note that userfaultfd is actually fine with
> +			 * false positives (e.g. caused by pte changed),
> +			 * but not wrong logical events (e.g. caused by
> +			 * reading a pte during changing).  The latter can
> +			 * confuse the userspace, so the strictness is very
> +			 * much preferred.  E.g., MISSING event should
> +			 * never happen on the page after UFFDIO_COPY has
> +			 * correctly installed the page and returned.
> +			 */
> +			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
> +				ret = hugetlb_handle_userfault(
> +				    vma, mapping, idx, flags, haddr,
> +				    address, VM_UFFD_MISSING);
> +			else
> +				/* Retry the fault */
> +				ret = 0;
>  			goto out;
>  		}
>  
> @@ -5634,9 +5672,14 @@ static vm_fault_t hugetlb_no_page(struct mm_struct *mm,
>  		if (userfaultfd_minor(vma)) {
>  			unlock_page(page);
>  			put_page(page);
> -			ret = hugetlb_handle_userfault(vma, mapping, idx,
> -						       flags, haddr, address,
> -						       VM_UFFD_MINOR);
> +			/* See comment in userfaultfd_missing() block above */
> +			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
> +				ret = hugetlb_handle_userfault(
> +				    vma, mapping, idx, flags, haddr,
> +				    address, VM_UFFD_MINOR);
> +			else
> +				/* Retry the fault */
> +				ret = 0;
>  			goto out;
>  		}
>  	}
> -- 
> 2.37.3
> 

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling
  2022-10-03 15:56 ` [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling Peter Xu
  2022-10-03 17:20   ` Mike Kravetz
@ 2022-10-03 21:00   ` Nadav Amit
  2022-10-03 21:16     ` Peter Xu
  1 sibling, 1 reply; 11+ messages in thread
From: Nadav Amit @ 2022-10-03 21:00 UTC (permalink / raw)
  To: Peter Xu
  Cc: LKML, Linux MM, Andrea Arcangeli, Mike Rapoport,
	David Hildenbrand, Andrew Morton, Axel Rasmussen, Mike Kravetz

On Oct 3, 2022, at 8:56 AM, Peter Xu <peterx@redhat.com> wrote:

> 
> +			 */
> +			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
> +				ret = hugetlb_handle_userfault(
> +				    vma, mapping, idx, flags, haddr,
> +				    address, VM_UFFD_MISSING);
> +			else
> +				/* Retry the fault */
> +				ret = 0;

Might be unrelated, but at least for the sake of consistency if not
potential GUP issues, don’t you want to return VM_FAULT_RETRY ?


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling
  2022-10-03 21:00   ` Nadav Amit
@ 2022-10-03 21:16     ` Peter Xu
  2022-10-03 21:32       ` Nadav Amit
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Xu @ 2022-10-03 21:16 UTC (permalink / raw)
  To: Nadav Amit
  Cc: LKML, Linux MM, Andrea Arcangeli, Mike Rapoport,
	David Hildenbrand, Andrew Morton, Axel Rasmussen, Mike Kravetz

On Mon, Oct 03, 2022 at 02:00:36PM -0700, Nadav Amit wrote:
> On Oct 3, 2022, at 8:56 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> > 
> > +			 */
> > +			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
> > +				ret = hugetlb_handle_userfault(
> > +				    vma, mapping, idx, flags, haddr,
> > +				    address, VM_UFFD_MISSING);
> > +			else
> > +				/* Retry the fault */
> > +				ret = 0;
> 
> Might be unrelated, but at least for the sake of consistency if not
> potential GUP issues, don’t you want to return VM_FAULT_RETRY ?

VM_FAULT_RETRY implies releasing of mmap sem, while we didn't?

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling
  2022-10-03 17:20   ` Mike Kravetz
@ 2022-10-03 21:27     ` Peter Xu
  2022-10-03 21:45       ` Mike Kravetz
  0 siblings, 1 reply; 11+ messages in thread
From: Peter Xu @ 2022-10-03 21:27 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-kernel, linux-mm, Andrea Arcangeli, Mike Rapoport,
	David Hildenbrand, Andrew Morton, Axel Rasmussen, Nadav Amit

On Mon, Oct 03, 2022 at 10:20:29AM -0700, Mike Kravetz wrote:
> On 10/03/22 11:56, Peter Xu wrote:
> > After the recent rework patchset of hugetlb locking on pmd sharing,
> > kselftest for userfaultfd sometimes fails on hugetlb private tests with
> > unexpected write fault checks.
> > 
> > It turns out there's nothing wrong within the locking series regarding this
> > matter, but it could have changed the timing of threads so it can trigger
> > an old bug.
> > 
> > The real bug is when we call hugetlb_no_page() we're not with the pgtable
> > lock.  It means we're reading the pte values lockless.  It's perfectly fine
> > in most cases because before we do normal page allocations we'll take the
> > lock and check pte_same() again.  However before that, there are actually
> > two paths on userfaultfd missing/minor handling that may directly move on
> > with the fault process without checking the pte values.
> > 
> > It means for these two paths we may be generating an uffd message based on
> > an unstable pte, while an unstable pte can legally be anything as long as
> > the modifier holds the pgtable lock.
> > 
> > One example, which is also what happened in the failing kselftest and
> > caused the test failure, is that for private mappings CoW can happen on one
> > page.  CoW requires pte being cleared before being replaced with a new page
> > for TLB coherency, but then there can be a race condition:
> > 
> >         thread 1                              thread 2
> >         --------                              --------
> > 
> >       hugetlb_fault                         hugetlb_fault
> >         private pte RO
> >         hugetlb_wp
> >           pgtable_lock()
> >           huge_ptep_clear_flush
> >                                               pte=NULL
> >                                               hugetlb_no_page
> >                                                 generate uffd missing event
> >                                                 even if page existed!!
> >           set_huge_pte_at
> >           pgtable_unlock()
> 
> Thanks for working on this Peter!
> 
> I agree with this patch, but I suspect the above race is not possible.  Why?
> In both cases, we take the hugetlb fault mutex when processing a huegtlb
> page fault.  This means only one thread can execute the fault code for
> a specific mapping/index at a time.  This is why I was so confused, and may
> remain a bit confused about how we end up with userfault processing a write
> fault.  It was part of the reason for my 'unclear' wording about this being
> more about cpus not seeing updated values.  Note that we do drop the fault
> mutex before calling handle_usefault, but by then we have already made the
> 'missing' determination.
> 
> Thoughts?  Perhaps, I am still confused.

It's my fault to have the commit message wrong, sorry. :) And thanks for
raising this question, I could have overlooked that.

It turns out it's not the CoW that's clearing the pte... it's the
wr-protect with huge_ptep_modify_prot_start().  So the race is with
UFFDIO_WRITEPROTECT, not CoW.

Obviously when I was tracking the hpte changes I overlooked
huge_ptep_get_and_clear(), only seeing the CoW path and I'm pretty sure
that's already a bug which was obvious enough.  I didn't prove they
happened at the same time during the MISSING event.

Then after I further looked at the CoW code I start to question myself on
why CoW would trigger at all even with an available fault mutex, since for
private mappings mapcount should be 1:

	if (page_mapcount(old_page) == 1 && PageAnon(old_page)) {
		if (!PageAnonExclusive(old_page))
			page_move_anon_rmap(old_page, vma);
		if (likely(!unshare))
			set_huge_ptep_writable(vma, haddr, ptep);

		delayacct_wpcopy_end();
		return 0;
	}

There could still be something else I didn't catch, even though that may
not be relevant to this patchset anymore.

I'll wait a few more hours for other reviewers, then prepare a new version
with the corrected commit message.

Thanks,

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling
  2022-10-03 21:16     ` Peter Xu
@ 2022-10-03 21:32       ` Nadav Amit
  0 siblings, 0 replies; 11+ messages in thread
From: Nadav Amit @ 2022-10-03 21:32 UTC (permalink / raw)
  To: Peter Xu
  Cc: LKML, Linux MM, Andrea Arcangeli, Mike Rapoport,
	David Hildenbrand, Andrew Morton, Axel Rasmussen, Mike Kravetz

On Oct 3, 2022, at 2:16 PM, Peter Xu <peterx@redhat.com> wrote:

> On Mon, Oct 03, 2022 at 02:00:36PM -0700, Nadav Amit wrote:
>> On Oct 3, 2022, at 8:56 AM, Peter Xu <peterx@redhat.com> wrote:
>> 
>>> +			 */
>>> +			if (hugetlb_pte_stable(h, mm, ptep, old_pte))
>>> +				ret = hugetlb_handle_userfault(
>>> +				    vma, mapping, idx, flags, haddr,
>>> +				    address, VM_UFFD_MISSING);
>>> +			else
>>> +				/* Retry the fault */
>>> +				ret = 0;
>> 
>> Might be unrelated, but at least for the sake of consistency if not
>> potential GUP issues, don’t you want to return VM_FAULT_RETRY ?
> 
> VM_FAULT_RETRY implies releasing of mmap sem, while we didn't?

Of course. My bad.


^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling
  2022-10-03 21:27     ` Peter Xu
@ 2022-10-03 21:45       ` Mike Kravetz
  2022-10-04  0:28         ` Peter Xu
  0 siblings, 1 reply; 11+ messages in thread
From: Mike Kravetz @ 2022-10-03 21:45 UTC (permalink / raw)
  To: Peter Xu
  Cc: linux-kernel, linux-mm, Andrea Arcangeli, Mike Rapoport,
	David Hildenbrand, Andrew Morton, Axel Rasmussen, Nadav Amit

On 10/03/22 17:27, Peter Xu wrote:
> On Mon, Oct 03, 2022 at 10:20:29AM -0700, Mike Kravetz wrote:
> > On 10/03/22 11:56, Peter Xu wrote:
> > > After the recent rework patchset of hugetlb locking on pmd sharing,
> > > kselftest for userfaultfd sometimes fails on hugetlb private tests with
> > > unexpected write fault checks.
> > > 
> > > It turns out there's nothing wrong within the locking series regarding this
> > > matter, but it could have changed the timing of threads so it can trigger
> > > an old bug.
> > > 
> > > The real bug is when we call hugetlb_no_page() we're not with the pgtable
> > > lock.  It means we're reading the pte values lockless.  It's perfectly fine
> > > in most cases because before we do normal page allocations we'll take the
> > > lock and check pte_same() again.  However before that, there are actually
> > > two paths on userfaultfd missing/minor handling that may directly move on
> > > with the fault process without checking the pte values.
> > > 
> > > It means for these two paths we may be generating an uffd message based on
> > > an unstable pte, while an unstable pte can legally be anything as long as
> > > the modifier holds the pgtable lock.
> > > 
> > > One example, which is also what happened in the failing kselftest and
> > > caused the test failure, is that for private mappings CoW can happen on one
> > > page.  CoW requires pte being cleared before being replaced with a new page
> > > for TLB coherency, but then there can be a race condition:
> > > 
> > >         thread 1                              thread 2
> > >         --------                              --------
> > > 
> > >       hugetlb_fault                         hugetlb_fault
> > >         private pte RO
> > >         hugetlb_wp
> > >           pgtable_lock()
> > >           huge_ptep_clear_flush
> > >                                               pte=NULL
> > >                                               hugetlb_no_page
> > >                                                 generate uffd missing event
> > >                                                 even if page existed!!
> > >           set_huge_pte_at
> > >           pgtable_unlock()
> > 
> > Thanks for working on this Peter!
> > 
> > I agree with this patch, but I suspect the above race is not possible.  Why?
> > In both cases, we take the hugetlb fault mutex when processing a huegtlb
> > page fault.  This means only one thread can execute the fault code for
> > a specific mapping/index at a time.  This is why I was so confused, and may
> > remain a bit confused about how we end up with userfault processing a write
> > fault.  It was part of the reason for my 'unclear' wording about this being
> > more about cpus not seeing updated values.  Note that we do drop the fault
> > mutex before calling handle_usefault, but by then we have already made the
> > 'missing' determination.
> > 
> > Thoughts?  Perhaps, I am still confused.
> 
> It's my fault to have the commit message wrong, sorry. :) And thanks for
> raising this question, I could have overlooked that.
> 
> It turns out it's not the CoW that's clearing the pte... it's the
> wr-protect with huge_ptep_modify_prot_start().  So the race is with
> UFFDIO_WRITEPROTECT, not CoW.

Thank you!  Now I understand.

This also explains why the new locking exposes the race.
hugetlb_change_protection needs to take the i_mmap_sema in write mode because
it could unshare pmds.  Previously, hugetlb page faults took i_mmap_sema in
read mode so this race could not happen.
-- 
Mike Kravetz

^ permalink raw reply	[flat|nested] 11+ messages in thread

* Re: [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling
  2022-10-03 21:45       ` Mike Kravetz
@ 2022-10-04  0:28         ` Peter Xu
  0 siblings, 0 replies; 11+ messages in thread
From: Peter Xu @ 2022-10-04  0:28 UTC (permalink / raw)
  To: Mike Kravetz
  Cc: linux-kernel, linux-mm, Andrea Arcangeli, Mike Rapoport,
	David Hildenbrand, Andrew Morton, Axel Rasmussen, Nadav Amit

On Mon, Oct 03, 2022 at 02:45:47PM -0700, Mike Kravetz wrote:
> This also explains why the new locking exposes the race.
> hugetlb_change_protection needs to take the i_mmap_sema in write mode because
> it could unshare pmds.  Previously, hugetlb page faults took i_mmap_sema in
> read mode so this race could not happen.

Makes sense, thanks for explaining.

-- 
Peter Xu


^ permalink raw reply	[flat|nested] 11+ messages in thread

end of thread, other threads:[~2022-10-04  0:28 UTC | newest]

Thread overview: 11+ messages (download: mbox.gz / follow: Atom feed)
-- links below jump to the message on this page --
2022-10-03 15:56 [PATCH 0/3] mm/hugetlb: Fix selftest failures with write check Peter Xu
2022-10-03 15:56 ` [PATCH 1/3] mm/hugetlb: Fix race condition of uffd missing/minor handling Peter Xu
2022-10-03 17:20   ` Mike Kravetz
2022-10-03 21:27     ` Peter Xu
2022-10-03 21:45       ` Mike Kravetz
2022-10-04  0:28         ` Peter Xu
2022-10-03 21:00   ` Nadav Amit
2022-10-03 21:16     ` Peter Xu
2022-10-03 21:32       ` Nadav Amit
2022-10-03 15:56 ` [PATCH 2/3] mm/hugetlb: Use hugetlb_pte_stable in migration race check Peter Xu
2022-10-03 15:56 ` [PATCH 3/3] mm/selftest: uffd: Explain the write missing fault check Peter Xu

This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).