Re: [PATCH] mm: Add an explicit smp_wmb() to UFFDIO_CONTINUE

From: James Houghton <jthoughton@google.com>
To: Peter Xu <peterx@redhat.com>
Cc: Axel Rasmussen <axelrasmussen@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	 Muchun Song <songmuchun@bytedance.com>,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH] mm: Add an explicit smp_wmb() to UFFDIO_CONTINUE
Date: Wed, 6 Mar 2024 09:57:17 -0800	[thread overview]
Message-ID: <CADrL8HWaeEwoUov9y0weGCTNPffN4nbURDetF_+s2bF+MttKsw@mail.gmail.com> (raw)
In-Reply-To: <Zefl5mJ32IxxYtaF@x1n>

On Tue, Mar 5, 2024 at 7:41 PM Peter Xu <peterx@redhat.com> wrote:
>
> On Wed, Mar 06, 2024 at 12:15:10AM +0000, James Houghton wrote:
> > I've tried to see if I can legitimately get a user to read stale data,
> > and a few attempts with this test[2] have been unsuccessful.
>
> AFAICT that won't easily reproduce even if the problem existed, as we
> contain so many implict memory barriers here and there.  E.g. right at the
> entry of ioctl(), mmget_not_zero() already contains a full ordering
> constraint:
>
> /**
>  * atomic_inc_not_zero() - atomic increment unless zero with full ordering
>  * @v: pointer to atomic_t

Oh yes, of course. Thanks for pointing this out. So we definitely
don't need a Fixes.

>
> I was expecting the syscall routine will guarantee an ordering already but
> indeed I can't find any.  I also checked up Intel's spec and SYSCALL inst
> document only has one paragraph on ordering:
>
>         Instruction ordering. Instructions following a SYSCALL may be
>         fetched from memory before earlier instructions complete execution,
>         but they will not execute (even speculatively) until all
>         instructions prior to the SYSCALL have completed execution (the
>         later instructions may execute before data stored by the earlier
>         instructions have become globally visible).
>
> I guess it implies a hardware reordering is indeed possible in this case?

That's my understanding as well.

>
> >
> > [1]: commit 153132571f02 ("userfaultfd/shmem: support UFFDIO_CONTINUE for shmem")
> > [2]: https://gist.github.com/48ca/38d0665b0f1a6319a56507dc73a173f9
> >
> >  mm/hugetlb.c     | 15 +++++++++------
> >  mm/userfaultfd.c | 18 ++++++++++++++++++
> >  2 files changed, 27 insertions(+), 6 deletions(-)
> >
> > diff --git a/mm/hugetlb.c b/mm/hugetlb.c
> > index bb17e5c22759..533bf6b2d94d 100644
> > --- a/mm/hugetlb.c
> > +++ b/mm/hugetlb.c
> > @@ -6779,12 +6779,15 @@ int hugetlb_mfill_atomic_pte(pte_t *dst_pte,
> >               }
> >       }
> >
> > -     /*
> > -      * The memory barrier inside __folio_mark_uptodate makes sure that
> > -      * preceding stores to the page contents become visible before
> > -      * the set_pte_at() write.
> > -      */
> > -     __folio_mark_uptodate(folio);
> > +     if (!is_continue) {
> > +             /*
> > +              * The memory barrier inside __folio_mark_uptodate makes sure
> > +              * that preceding stores to the page contents become visible
> > +              * before the set_pte_at() write.
> > +              */
> > +             __folio_mark_uptodate(folio);
>
> Can we move the comment above the "if", explaining both conditions?

Yes, I'll do that. I think the explanation for
WARN_ON_ONCE(!folio_test_uptodate(folio)) is:

    We only need to `__folio_mark_uptodate(folio)` if we have
allocated a new folio, and HugeTLB pages will always be Uptodate if
they are in the pagecache.

We could even drop the WARN_ON_ONCE.

>
> > +     } else
> > +             WARN_ON_ONCE(!folio_test_uptodate(folio));
> >
> >       /* Add shared, newly allocated pages to the page cache. */
> >       if (vm_shared && !is_continue) {
> > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
> > index 503ea77c81aa..d515b640ca48 100644
> > --- a/mm/userfaultfd.c
> > +++ b/mm/userfaultfd.c
> > @@ -531,6 +531,10 @@ static __always_inline ssize_t mfill_atomic_hugetlb(
> >                       goto out_unlock;
> >       }
> >
> > +     if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> > +             /* See the comment in mfill_atomic. */
> > +             smp_wmb();
> > +
> >       while (src_addr < src_start + len) {
> >               BUG_ON(dst_addr >= dst_start + len);
> >
> > @@ -743,6 +747,20 @@ static __always_inline ssize_t mfill_atomic(struct userfaultfd_ctx *ctx,
> >           uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> >               goto out_unlock;
> >
> > +     if (uffd_flags_mode_is(flags, MFILL_ATOMIC_CONTINUE))
> > +             /*
> > +              * A caller might reasonably assume that UFFDIO_CONTINUE
> > +              * contains a wmb() to ensure that any writes to the
> > +              * about-to-be-mapped page by the thread doing the
> > +              * UFFDIO_CONTINUE are guaranteed to be visible to subsequent
> > +              * reads of the page through the newly mapped address.
> > +              *
> > +              * For MFILL_ATOMIC_COPY, the wmb() is done for each COPYed
> > +              * page. We can do the wmb() now for CONTINUE as the user has
> > +              * already prepared the page contents.
> > +              */
> > +             smp_wmb();
> > +
>
> Why you did it twice separately?  Can we still share the code?
>
> I'm wildly guessing: I don't worry on an extra wmb() in failure paths, as
> that's never a performance concern to make failure slightly slower, IMHO.

You're right, I shouldn't care so much about making the slow paths a
little bit slower. I'll move the wmb earlier so that we only need it
in one place.

>
> Thanks,

Thanks Peter!