All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nadav Amit <nadav.amit@gmail.com>
To: Peter Xu <peterx@redhat.com>
Cc: Linux MM <linux-mm@kvack.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	David Hildenbrand <david@redhat.com>,
	Mike Rapoport <rppt@linux.ibm.com>
Subject: Re: [RFC PATCH v2 3/5] userfaultfd: introduce write-likely mode for copy/wp operations
Date: Tue, 21 Jun 2022 11:30:52 -0700	[thread overview]
Message-ID: <58E53A1D-466C-4A2B-8E10-2993EFF60856@gmail.com> (raw)
In-Reply-To: <YrIJm6kRhcHtupwZ@xz-m1.local>

On Jun 21, 2022, at 11:10 AM, Peter Xu <peterx@redhat.com> wrote:

> On Tue, Jun 21, 2022 at 10:14:00AM -0700, Nadav Amit wrote:
>> On Jun 21, 2022, at 9:38 AM, Peter Xu <peterx@redhat.com> wrote:
>> 
>>> Hi, Nadav,
>>> 
>>> On Sun, Jun 19, 2022 at 04:34:47PM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>> 
>>>> Commit 9ae0f87d009ca ("mm/shmem: unconditionally set pte dirty in
>>>> mfill_atomic_install_pte") has set PTEs as dirty as its title indicates.
>>>> However, setting read-only PTEs as dirty can have several undesired
>>>> implications.
>>>> 
>>>> First, setting read-only PTEs as dirty, can cause these PTEs to become
>>>> writable during mprotect() syscall. See in change_pte_range():
>>>> 
>>>> 	/* Avoid taking write faults for known dirty pages */
>>>> 	if (dirty_accountable && pte_dirty(ptent) &&
>>>> 			(pte_soft_dirty(ptent) ||
>>>> 			 !(vma->vm_flags & VM_SOFTDIRTY))) {
>>>> 		ptent = pte_mkwrite(ptent);
>>>> 	}
>>> 
>>> IMHO this is not really the direct reason to add the feature to "allow user
>>> to specify whether dirty bit be set for UFFDIO_COPY/... ioctl", because
>>> IIUC what's missing is the pte_uffd_wp() check that I should have added in
>>> the shmem+hugetlb uffd-wp series in change_pte_range() but I missed..
>>> 
>>> But since this is fixed by David's patch to optimize mprotect() altogether
>>> which checks pte_uffd_wp() (and afaict that's only needed after the
>>> shmem+hugetlb patchset, not before), so I think we're safe now with all the
>>> branches.
>>> 
>>> So IMHO we don't need to mention this as it's kind of misleading. It'll be
>>> welcomed if you want to recover the pte_dirty behavior in
>>> mfill_atomic_install_pte() but probably this is not the right patch for it?
>> 
>> Sorry, I will remove it from the commit log.
>> 
>> I am not sure whether there should be an additional patch to revert (or
>> partially revert) 9ae0f87d009ca and to be backported. What do you say?
> 
> I think we discussed it offlist, I'd not bother but I don't have a strong
> opinion. Please feel free to go with your preference.
> 
> Just to mention that it might not be a clean revert since at that time we
> don't have continue and uffd-wp on files, so we may need to add the
> complete check back if we're going to make it. Please also do proper swap
> tests for both anon and shmem with things like memcg, and when you posted
> the patches I can do some tests too.

That’s what I was afraid of. It moves it down in my priority list...

>>>> Second, unmapping read-only dirty PTEs often prevents TLB flush batching.
>>>> See try_to_unmap_one():
>>>> 
>>>> 	/*
>>>> 	 * Page is dirty. Flush the TLB if a writable entry
>>>> 	 * potentially exists to avoid CPU writes after IO
>>>> 	 * starts and then write it out here.
>>>> 	 */
>>>> 	try_to_unmap_flush_dirty();
>>>> 
>>>> Similarly batching TLB flushed might be prevented in zap_pte_range():
>>>> 
>>>> 	if (!PageAnon(page)) {
>>>> 		if (pte_dirty(ptent)) {
>>>> 			force_flush = 1;
>>>> 			set_page_dirty(page);
>>>> 		}
>>>> 	...
>>> 
>>> I still keep the pure question here (which I asked in the private reply to
>>> you) on why we'd like only pte_dirty() but not pte_write() && pte_dirty()
>>> here. I'll rewrite what I have in the private email to you here again that
>>> I think should be the write thing to do here..
>>> 
>>> if (!PageAnon(page)) {
>>> if (pte_dirty(ptent)) {
>>> set_page_dirty(page);
>>> if (pte_write(ptent))
>>> force_flush = 1;
>>> }
>>> }
>> 
>> The “Second” part regards a perf issue, not a correctness issue. As I told
>> you offline, I think if makes sense to look both on pte_dirty() and
>> pte_write(), but I am afraid of other architectures than x86, which I know.
> 
> I don't really see why this logic is arch-specific. I meant, as long as
> pte_write() returns a value that means "this page is writable", I still
> think it's making sense.
> 
>> After our discussion, I looked at other architectures, and it does look
>> safe. So I will add an additional patch for that (with your suggested-by).
> 
> Only if you agree with what I thought, thanks. And that could be a
> separate patch for sure out of this one even if to come.

I do agree. I did not quite understand your second sentence. I guess you
meant not to combine it into this patchset (or at least that’s what I
thought and I attribute it to you).

>> 
>> 
>> I think that perhaps I did not communicate well enough the reason it makes
>> sense to set the dirty-bit.
>> 
>> Setting the access-bit and dirty-bit takes quite some time. So regardless of
>> whether you set the PageDirty, if you didn’t see access-bit and dirty-bit
>> and later you access the page - you are going to pay ~550 cycles for
>> nothing.
>> 
>> For reference: https://marc.info/?l=linux-kernel&m=146582237922378&w=2
>> 
>> If you want me to allow userspace to provide a hint, let me know.
> 
> You did communicate well, so probably it's me that didn't. :)
> 
> Yes I think it'll be nicer to allow userspace provide the same hint for
> UFFDIO_CONTINUE, the reason as I explained in the other thread on the young
> bit (or say ACCESS_LIKELY), that UFFDIO_CONTINUE can (at least in my
> current qemu impl [1]) proactively be used to install pgtables even if
> they're code pages and they may not be accessed in the near future. So
> IMHO we should treat UFFDIO_CONTINUE the same as UFFDIO_COPY, IMHO, from
> this point of view.
> 
> There's just still some differences on young/dirty here so I'm not sure
> whether the idea applies to both, but I think at least it applies to young
> bit (ACCESS_LIKELY).
> 
> [1] https://github.com/xzpeter/qemu/commit/b7758ad55af42b5364796362e96f4a06b51d9582

We have a saying in Hebrew: “When you explain slowly, I understand quickly”.
Thanks for explaining. I will add hints for UFFDIO_CONTINUE as well.



  reply	other threads:[~2022-06-21 18:31 UTC|newest]

Thread overview: 24+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-19 23:34 [RFC PATCH v2 0/5] userfaultfd: support access/write hints Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 1/5] userfaultfd: introduce uffd_flags Nadav Amit
2022-06-21  8:41   ` David Hildenbrand
2022-06-21 15:31     ` Peter Xu
2022-06-21 15:29   ` Peter Xu
2022-06-21 17:41     ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 2/5] userfaultfd: introduce access-likely mode for copy/wp operations Nadav Amit
2022-06-20 10:33   ` kernel test robot
2022-06-21  8:48   ` David Hildenbrand
2022-06-21 15:42     ` Peter Xu
2022-06-21 17:27     ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 3/5] userfaultfd: introduce write-likely " Nadav Amit
2022-06-21 16:38   ` Peter Xu
2022-06-21 17:14     ` Nadav Amit
2022-06-21 18:10       ` Peter Xu
2022-06-21 18:30         ` Nadav Amit [this message]
2022-06-21 18:43           ` Peter Xu
2022-06-19 23:34 ` [RFC PATCH v2 4/5] userfaultfd: zero access/write hints Nadav Amit
2022-06-20 18:06   ` kernel test robot
2022-06-21 17:04   ` Peter Xu
2022-06-21 17:17     ` Nadav Amit
2022-06-21 17:56       ` Peter Xu
2022-06-21 17:58         ` Nadav Amit
2022-06-19 23:34 ` [RFC PATCH v2 5/5] selftest/userfaultfd: test read/write hints Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=58E53A1D-466C-4A2B-8E10-2993EFF60856@gmail.com \
    --to=nadav.amit@gmail.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=peterx@redhat.com \
    --cc=rppt@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.