Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations

From: Peter Xu <peterx@redhat.com>
To: Nadav Amit <nadav.amit@gmail.com>
Cc: Linux MM <linux-mm@kvack.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	David Hildenbrand <david@redhat.com>,
	Mike Rapoport <rppt@linux.ibm.com>
Subject: Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations
Date: Tue, 21 Jun 2022 14:10:35 -0400	[thread overview]
Message-ID: <YrIJm6kRhcHtupwZ@xz-m1.local> (raw)
In-Reply-To: <C2C87385-20DB-44FC-8785-09DADF731253@gmail.com>

On Tue, Jun 21, 2022 at 10:14:00AM -0700, Nadav Amit wrote:
> On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@redhat.com> wrote:
> 
> > Hi, Nadav,
> > 
> > On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
> >> From: Nadav Amit <namit@vmware.com>
> >> 
> >> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
> >> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
> >> However, setting read-only PTEs as dirty can have several undesired
> >> implications.
> >> 
> >> First, setting read-only PTEs as dirty, can cause these PTEs to become
> >> writable during mprotect() syscall. See in change_pte_range():
> >> 
> >> 	/* Avoid taking write faults for known dirty pages */
> >> 	if (dirty_accountable && pte_dirty(ptent) &&
> >> 			(pte_soft_dirty(ptent) ||
> >> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
> >> 		ptent = pte_mkwrite(ptent);
> >> 	}
> > 
> > IMHO this is not really the direct reason to add the feature to "allow user
> > to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
> > IIUC what's missing is the pte_uffd_wp() check that I should have added in
> > the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..
> > 
> > But since this is fixed by David's patch to optimize mprotect() altogether
> > which checks pte_uffd_wp() (and afaict that's only needed after the
> > shmem+hugetlb patchset, not before), so I think we're safe now with all the
> > branches.
> > 
> > So IMHO we don't need to mention this as it's kind of misleading. It'll be
> > welcomed if you want to recover the pte_dirty behavior in
> > mfill_atomic_install_pte() but probably this is not the right patch for it?
> 
> Sorry, I will remove it from the commit log.
> 
> I am not sure whether there should be an additional patch to revert (or
> partially revert) 9ae0f87d009ca and to be backported. What do you say?

I think we discussed it offlist, I'd not bother but I don't have a strong
opinion.  Please feel free to go with your preference.

Just to mention that it might not be a clean revert since at that time we
don't have continue and uffd-wp on files, so we may need to add the
complete check back if we're going to make it.  Please also do proper swap
tests for both anon and shmem with things like memcg, and when you posted
the patches I can do some tests too.

> 
> >> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
> >> See try_to_unmap_one():
> >> 
> >> 	/*
> >> 	 * Page is dirty. Flush the TLB if a writable entry
> >> 	 * potentially exists to avoid CPU writes after IO
> >> 	 * starts and then write it out here.
> >> 	 */
> >> 	try_to_unmap_flush_dirty();
> >> 
> >> Similarly batching TLB flushed might be prevented in zap_pte_range():
> >> 
> >> 	if (!PageAnon(page)) {
> >> 		if (pte_dirty(ptent)) {
> >> 			force_flush = 1;
> >> 			set_page_dirty(page);
> >> 		}
> >> 	...
> > 
> > I still keep the pure question here (which I asked in the private reply to
> > you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
> > here. I'll rewrite what I have in the private email to you here again that
> > I think should be the write thing to do here..
> > 
> > if (!PageAnon(page)) {
> > if (pte_dirty(ptent)) {
> > set_page_dirty(page);
> > if (pte_write(ptent))
> > force_flush = 1;
> > }
> > }
> 
> The “Second” part regards a perf issue, not a correctness issue. As I told
> you offline, I think if makes sense to look both on pte_dirty() and
> pte_write(), but I am afraid of other architectures than x86, which I know.

I don't really see why this logic is arch-specific.  I meant, as long as
pte_write() returns a value that means "this page is writable", I still
think it's making sense.

> After our discussion, I looked at other architectures, and it does look
> safe. So I will add an additional patch for that (with your suggested-by).

Only if you agree with what I thought, thanks.  And that could be a
separate patch for sure out of this one even if to come.

> 
> > I also mentioned the other example of doing mprotect(PROT_READ) upon a
> > dirty pte can also create a pte with dirty=1 and write=0 which should be
> > the same condition we have here with uffd, afaict. So it's at least not a
> > solo problem for uffd, and I still think if we treat this tlb flush issue
> > as a perf bug we should consider fixing up the tlb flush code instead
> > because otherwise the "mprotect(PROT_READ) upon dirty pte" will also be
> > able to hit it.
> > 
> > Meanwhile, I'll have the same comment that this is not helping any reviewer
> > to understand why we need to grant user ability to conditionally set dirty
> > bit from uABI, so I think it's better to drop this paragraph too for this
> > patch alone.
> 
> Ok. Point taken.
> 
> > 
> >> In general, setting a PTE as dirty seems for read-only entries might be
> >> dangerous. It should be reminded the dirty-COW vulnerability mitigation
> >> also relies on the dirty bit being set only after COW (although it does
> >> not appear to apply to userfaultfd).
> >> 
> >> To summarize, setting the dirty bit for read-only PTEs is dangerous. But
> >> even if we only consider writable pages, always setting the dirty bit or
> >> always leaving it clear, does not seem as the best policy. Leaving the
> >> bit clear introduces overhead on the first write-access to set the bit.
> >> Setting the bit for pages the are eventually not written to can require
> >> more TLB flushes.
> > 
> > And IMHO only this paragraph is the real and proper reasoning for this
> > patch..
> 
> Will do.
> 
> > 
> >> @@ -1902,10 +1908,10 @@ static int userfaultfd_continue(struct userfaultfd_ctx *ctx, unsigned long arg)
> >> 	 uffdio_continue.range.start) {
> >> 		goto out;
> >> 	}
> >> -	if (uffdio_continue.mode & ~UFFDIO_CONTINUE_MODE_DONTWAKE)
> >> +	if (uffdio_continue.mode & ~(UFFDIO_CONTINUE_MODE_DONTWAKE))
> >> 		goto out;
> >> 
> >> -	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY;
> >> +	uffd_flags = UFFD_FLAGS_ACCESS_LIKELY | UFFD_FLAGS_WRITE_LIKELY;
> > 
> > Setting dirty by default for CONTINUE may make some sense (unlike young),
> > at least to keep the old behavior of the code where the pte was
> > unconditionally set.
> > 
> > But there's another thought a real CONTINUE user modifies the page cache
> > elsewhere so logically the PageDirty should be set elsewhere already
> > anyways, in most cases when the userapp is updating the page cache with
> > another mapping before installing pgtables for this mapping. Then this
> > dirty is not required, it seems.
> > 
> > If we're going to export this to uABI, I'm wondering maybe it'll be nicer
> > to not apply either young or dirty bit for CONTINUE, because fundamentally
> > losing dirty bit doesn't sound risky, meanwhile the user app should have
> > the best knowledge of what to do (whether page was requested or it's just
> > during a pre-faulting in the background).
> > 
> > Axel may have some better thoughts.
> 
> I think that perhaps I did not communicate well enough the reason it makes
> sense to set the dirty-bit.
> 
> Setting the access-bit and dirty-bit takes quite some time. So regardless of
> whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
> and later you access the page - you are going to pay ~550 cycles for
> nothing.
> 
> For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2
> 
> If you want me to allow userspace to provide a hint, let me know.

You did communicate well, so probably it's me that didn't. :)

Yes I think it'll be nicer to allow userspace provide the same hint for
UFFDIO_CONTINUE, the reason as I explained in the other thread on the young
bit (or say ACCESS_LIKELY), that UFFDIO_CONTINUE can (at least in my
current qemu impl [1]) proactively be used to install pgtables even if
they're code pages and they may not be accessed in the near future.  So
IMHO we should treat UFFDIO_CONTINUE the same as UFFDIO_COPY, IMHO, from
this point of view.

There's just still some differences on young/dirty here so I'm not sure
whether the idea applies to both, but I think at least it applies to young
bit (ACCESS_LIKELY).

[1] https://github.com/xzpeter/qemu/commit/b7758ad55af42b5364796362e96f4a06b51d9582

Thanks,

-- 
Peter Xu