Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations

From: Nadav Amit <nadav.amit@gmail.com>
To: Peter Xu <peterx@redhat.com>
Cc: Linux MM <linux-mm@kvack.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	David Hildenbrand <david@redhat.com>,
	Mike Rapoport <rppt@linux.ibm.com>
Subject: Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations
Date: Tue, 21 Jun 2022 11:30:52 -0700	[thread overview]
Message-ID: <58E53A1D-466C-4A2B-8E10-2993EFF60856@gmail.com> (raw)
In-Reply-To: <YrIJm6kRhcHtupwZ@xz-m1.local>

On Jun 21, 2022, at 11:10 AM, Peter Xu <peterx@redhat.com> wrote:

> On Tue, Jun 21, 2022 at 10:14:00AM -0700, Nadav Amit wrote:
>> On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@redhat.com> wrote:
>> 
>>> Hi, Nadav,
>>> 
>>> On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>> 
>>>> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
>>>> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
>>>> However, setting read-only PTEs as dirty can have several undesired
>>>> implications.
>>>> 
>>>> First, setting read-only PTEs as dirty, can cause these PTEs to become
>>>> writable during mprotect() syscall. See in change_pte_range():
>>>> 
>>>> 	/* Avoid taking write faults for known dirty pages */
>>>> 	if (dirty_accountable && pte_dirty(ptent) &&
>>>> 			(pte_soft_dirty(ptent) ||
>>>> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
>>>> 		ptent = pte_mkwrite(ptent);
>>>> 	}
>>> 
>>> IMHO this is not really the direct reason to add the feature to "allow user
>>> to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
>>> IIUC what's missing is the pte_uffd_wp() check that I should have added in
>>> the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..
>>> 
>>> But since this is fixed by David's patch to optimize mprotect() altogether
>>> which checks pte_uffd_wp() (and afaict that's only needed after the
>>> shmem+hugetlb patchset, not before), so I think we're safe now with all the
>>> branches.
>>> 
>>> So IMHO we don't need to mention this as it's kind of misleading. It'll be
>>> welcomed if you want to recover the pte_dirty behavior in
>>> mfill_atomic_install_pte() but probably this is not the right patch for it?
>> 
>> Sorry, I will remove it from the commit log.
>> 
>> I am not sure whether there should be an additional patch to revert (or
>> partially revert) 9ae0f87d009ca and to be backported. What do you say?
> 
> I think we discussed it offlist, I'd not bother but I don't have a strong
> opinion. Please feel free to go with your preference.
> 
> Just to mention that it might not be a clean revert since at that time we
> don't have continue and uffd-wp on files, so we may need to add the
> complete check back if we're going to make it. Please also do proper swap
> tests for both anon and shmem with things like memcg, and when you posted
> the patches I can do some tests too.

That’s what I was afraid of. It moves it down in my priority list...

>>>> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
>>>> See try_to_unmap_one():
>>>> 
>>>> 	/*
>>>> 	 * Page is dirty. Flush the TLB if a writable entry
>>>> 	 * potentially exists to avoid CPU writes after IO
>>>> 	 * starts and then write it out here.
>>>> 	 */
>>>> 	try_to_unmap_flush_dirty();
>>>> 
>>>> Similarly batching TLB flushed might be prevented in zap_pte_range():
>>>> 
>>>> 	if (!PageAnon(page)) {
>>>> 		if (pte_dirty(ptent)) {
>>>> 			force_flush = 1;
>>>> 			set_page_dirty(page);
>>>> 		}
>>>> 	...
>>> 
>>> I still keep the pure question here (which I asked in the private reply to
>>> you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
>>> here. I'll rewrite what I have in the private email to you here again that
>>> I think should be the write thing to do here..
>>> 
>>> if (!PageAnon(page)) {
>>> if (pte_dirty(ptent)) {
>>> set_page_dirty(page);
>>> if (pte_write(ptent))
>>> force_flush = 1;
>>> }
>>> }
>> 
>> The “Second” part regards a perf issue, not a correctness issue. As I told
>> you offline, I think if makes sense to look both on pte_dirty() and
>> pte_write(), but I am afraid of other architectures than x86, which I know.
> 
> I don't really see why this logic is arch-specific. I meant, as long as
> pte_write() returns a value that means "this page is writable", I still
> think it's making sense.
> 
>> After our discussion, I looked at other architectures, and it does look
>> safe. So I will add an additional patch for that (with your suggested-by).
> 
> Only if you agree with what I thought, thanks. And that could be a
> separate patch for sure out of this one even if to come.

I do agree. I did not quite understand your second sentence. I guess you
meant not to combine it into this patchset (or at least that’s what I
thought and I attribute it to you).

>> 
>> 
>> I think that perhaps I did not communicate well enough the reason it makes
>> sense to set the dirty-bit.
>> 
>> Setting the access-bit and dirty-bit takes quite some time. So regardless of
>> whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
>> and later you access the page - you are going to pay ~550 cycles for
>> nothing.
>> 
>> For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2
>> 
>> If you want me to allow userspace to provide a hint, let me know.
> 
> You did communicate well, so probably it's me that didn't. :)
> 
> Yes I think it'll be nicer to allow userspace provide the same hint for
> UFFDIO_CONTINUE, the reason as I explained in the other thread on the young
> bit (or say ACCESS_LIKELY), that UFFDIO_CONTINUE can (at least in my
> current qemu impl [1]) proactively be used to install pgtables even if
> they're code pages and they may not be accessed in the near future. So
> IMHO we should treat UFFDIO_CONTINUE the same as UFFDIO_COPY, IMHO, from
> this point of view.
> 
> There's just still some differences on young/dirty here so I'm not sure
> whether the idea applies to both, but I think at least it applies to young
> bit (ACCESS_LIKELY).
> 
> [1] https://github.com/xzpeter/qemu/commit/b7758ad55af42b5364796362e96f4a06b51d9582

We have a saying in Hebrew: “When you explain slowly, I understand quickly”.
Thanks for explaining. I will add hints for UFFDIO_CONTINUE as well.