All of lore.kernel.org
 help / color / mirror / Atom feed
From: Nadav Amit <namit@vmware.com>
To: Peter Xu <peterx@redhat.com>
Cc: Linux MM <linux-mm@kvack.org>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Hugh Dickins <hughd@google.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Axel Rasmussen <axelrasmussen@google.com>,
	David Hildenbrand <david@redhat.com>,
	Mike Rapoport <rppt@linux.ibm.com>
Subject: Re: [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations
Date: Fri, 24 Jun 2022 00:03:38 +0000	[thread overview]
Message-ID: <18BCC23E-B344-41A8-926D-A49D768485AF@vmware.com> (raw)
In-Reply-To: <YrT8ISwJ+3x6nIKz@xz-m1.local>

On Jun 23, 2022, at 4:49 PM, Peter Xu <peterx@redhat.com> wrote:

> On Thu, Jun 23, 2022 at 11:35:00PM +0000, Nadav Amit wrote:
>> On Jun 23, 2022, at 4:24 PM, Peter Xu <peterx@redhat.com> wrote:
>> 
>>> ⚠ External Email
>>> 
>>> On Wed, Jun 22, 2022 at 11:50:35AM -0700, Nadav Amit wrote:
>>>> From: Nadav Amit <namit@vmware.com>
>>>> 
>>>> Using a PTE on x86 with cleared access-bit (aka young-bit)
>>>> takes ~600 cycles more than when the access bit is set. At the same
>>>> time, setting the access-bit for memory that is not used (e.g.,
>>>> prefetched) can introduce greater overheads, as the prefetched memory is
>>>> reclaimed later than it should be.
>>>> 
>>>> Userfaultfd currently does not set the access-bit (excluding the
>>>> huge-pages case). Arguably, it is best to let the user control whether
>>>> the access bit should be set or not. The expected use is to request
>>>> userfaultfd to set the access-bit when the copy/wp operation is done to
>>>> resolve a page-fault, and not to set the access-bit when the memory is
>>>> prefetched.
>>>> 
>>>> Introduce UFFDIO_[op]_ACCESS_LIKELY to enable userspace to request the
>>>> young bit to be set.
>>>> 
>>>> Cc: Mike Kravetz <mike.kravetz@oracle.com>
>>>> Cc: Hugh Dickins <hughd@google.com>
>>>> Cc: Andrew Morton <akpm@linux-foundation.org>
>>>> Cc: Axel Rasmussen <axelrasmussen@google.com>
>>>> Cc: Peter Xu <peterx@redhat.com>
>>>> Cc: David Hildenbrand <david@redhat.com>
>>>> Cc: Mike Rapoport <rppt@linux.ibm.com>
>>>> Signed-off-by: Nadav Amit <namit@vmware.com>
>>> 
>>> Hmm.. is the hugetlb code overlooked (for both of the hints), or maybe I
>>> missed it? Do we need to cover them too? Thanks,
>> 
>> I do not know what the value of not setting the PTE’s access/dirty when it
>> comes to performance. The pages won’t be swapped out, just as you wrote in
>> your comment in hugetlb_mcopy_atomic_pte():
>> 
>>        /*
>>         * Always mark UFFDIO_COPY page dirty; note that this may not be
>>         * extremely important for hugetlbfs for now since swapping is not
>>         * supported, but we should still be clear in that this page cannot be
>>         * thrown away at will, even if write bit not set.
>>         */
>>        _dst_pte = huge_pte_mkdirty(_dst_pte);
>>        _dst_pte = pte_mkyoung(_dst_pte);
>> 
>> If you want for consistency/robustness not to set dirty on read-only
>> entries, that’s something that I can do.
> 
> We have more than one options here I guess.
> 
> We could have the hints not available for hugetlbfs, but then we'll both
> need to add special document for hugetlbfs (when you write the man-page,
> let's say) and also it'll be probably better to fail the ioctls with hints
> when applying to hugetlb vmas.
> 
> Leaving them silent to hugetlbfs is another option, it just sounds weird to
> me without any kind of documentation so I like this one least.
> 
> Or we could make them work for hugetlbfs too.  Not saying that swap will
> work some day (but it still could?!) it just seems all consistent as you
> said.
> 
> So I'd prefer the last one, but only if you agree.

My take is that hints are hints. Following David’s (or was it yours?)
feedback, I fixed the description to indicate that this is merely a hint and
removed all references to dirty/access bits. The kernel therefore can ignore
the hint when it wants to or use it in any other way. I fully agree that
this gives the kernel the ability to change the behavior as needed.

Note that for write-protected 4KB zero-page (where we share the zero-page)
we always set the access-bit, regardless of the hint, because it makes
sense: the zero-page is not swappable and therefore the access-bit is set.

I think that the lesser user-facing documentation there is on how the
feature is *exactly* used by the kernel - is better from an API point of
view.

So I see no reason to fail or be forced not to set a page as young, just
because a hint was *not* provided. This would even be a regression in the
behavior. The hint is actually always respected right now, it is just that
even if you do not provide the hint, the access/dirty is set.

The only consistency I think worth thinking about is with the dirty-bit, and
I can add it if you want. Note that the access-bit (in x86) might be set
speculatively in contrast to the dirty-bit is only set atomically with a
real access. That’s the reason I think it may make sense not to set the
dirty without a hint.

Is that acceptable? Access-bit always set, dirty-bit according to hint?


  reply	other threads:[~2022-06-24  0:03 UTC|newest]

Thread overview: 34+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2022-06-22 18:50 [PATCH v1 0/5] userfaultfd: support access/write hints Nadav Amit
2022-06-22 18:50 ` [PATCH v1 1/5] userfaultfd: introduce uffd_flags Nadav Amit
2022-06-23 21:57   ` Peter Xu
2022-06-23 22:04     ` Nadav Amit
2022-06-22 18:50 ` [PATCH v1 2/5] userfaultfd: introduce access-likely mode for common operations Nadav Amit
2022-06-23 23:24   ` Peter Xu
2022-06-23 23:35     ` Nadav Amit
2022-06-23 23:49       ` Peter Xu
2022-06-24  0:03         ` Nadav Amit [this message]
2022-06-24  2:05           ` Peter Xu
2022-06-24  2:42             ` Nadav Amit
2022-06-24 21:58               ` Peter Xu
2022-06-24 22:17                 ` Peter Xu
2022-06-25  7:49                   ` Nadav Amit
2022-06-27 13:12                     ` Peter Xu
2022-06-27 13:27                       ` David Hildenbrand
2022-06-27 14:59                         ` Peter Xu
2022-06-27 23:37                       ` Nadav Amit
2022-06-28 10:55                         ` David Hildenbrand
2022-06-28 19:15                         ` Peter Xu
2022-06-28 20:30                           ` Nadav Amit
2022-06-28 20:56                             ` Peter Xu
2022-06-28 21:03                               ` Nadav Amit
2022-06-28 21:12                                 ` Peter Xu
2022-06-28 21:15                                   ` Nadav Amit
2022-07-12  6:19   ` Nadav Amit
2022-07-12 14:56     ` Peter Xu
2022-07-13  1:09       ` Nadav Amit
2022-07-13 16:02         ` Peter Xu
2022-07-13 16:49           ` Nadav Amit
2022-06-22 18:50 ` [PATCH v1 3/5] userfaultfd: introduce write-likely mode for uffd operations Nadav Amit
2022-06-22 18:50 ` [PATCH v1 4/5] userfaultfd: zero access/write hints Nadav Amit
2022-06-23 23:34   ` Peter Xu
2022-06-22 18:50 ` [PATCH v1 5/5] selftest/userfaultfd: test read/write hints Nadav Amit

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=18BCC23E-B344-41A8-926D-A49D768485AF@vmware.com \
    --to=namit@vmware.com \
    --cc=akpm@linux-foundation.org \
    --cc=axelrasmussen@google.com \
    --cc=david@redhat.com \
    --cc=hughd@google.com \
    --cc=linux-mm@kvack.org \
    --cc=mike.kravetz@oracle.com \
    --cc=peterx@redhat.com \
    --cc=rppt@linux.ibm.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is an external index of several public inboxes,
see mirroring instructions on how to clone and mirror
all data and code used by this external index.