From: Mike Rapoport <rppt@kernel.org>
To: Axel Rasmussen <axelrasmussen@google.com>
Cc: Alejandro Colomar <alx.manpages@gmail.com>,
	Michael Kerrisk <mtk.manpages@gmail.com>,
	Andrea Arcangeli <aarcange@redhat.com>,
	Andrew Morton <akpm@linux-foundation.org>,
	Hugh Dickins <hughd@google.com>,
	Mike Kravetz <mike.kravetz@oracle.com>,
	Peter Xu <peterx@redhat.com>, LKML <linux-kernel@vger.kernel.org>,
	linux-man@vger.kernel.org, Linux MM <linux-mm@kvack.org>
Subject: Re: [PATCH v2] ioctl_userfaultfd.2, userfaultfd.2: add minor fault mode
Date: Mon, 2 Aug 2021 14:21:42 +0300
Message-ID: <YQfVRuV2Ab2rlKVI@kernel.org>
In-Reply-To: <CAJHvVcjzi-7Wvrho1LqWiQC2WNbtg0XGf6-JBRcDZS1=banbVA@mail.gmail.com>

(added man-pages maintainers)

On Tue, Jul 27, 2021 at 09:32:34AM -0700, Axel Rasmussen wrote:
> Any remaining issues with this patch? I just realized today it was
> never merged. 5.13 (which contains this new feature) was released some
> weeks ago.
> 
> On Fri, Jun 4, 2021 at 12:56 PM Axel Rasmussen <axelrasmussen@google.com> wrote:
> >
> > Userfaultfd minor fault mode is supported starting from Linux 5.13.
> >
> > This commit adds a description of the new mode, as well as the new ioctl
> > used to resolve such faults. The two go hand-in-hand: one can't resolve
> > a minor fault without continue, and continue can't be used to resolve
> > any other kind of fault.
> >
> > This patch covers just the hugetlbfs implementation (in 5.13). Support
> > for shmem is forthcoming, but as it has not yet made it into a kernel
> > release candidate, it will be added in a future commit.
> >
> > Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> > ---
> >  man2/ioctl_userfaultfd.2 | 125 ++++++++++++++++++++++++++++++++++++---
> >  man2/userfaultfd.2       |  79 ++++++++++++++++++++-----
> >  2 files changed, 182 insertions(+), 22 deletions(-)
> >
> > diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2
> > index 504f61d4b..7b990c24a 100644
> > --- a/man2/ioctl_userfaultfd.2
> > +++ b/man2/ioctl_userfaultfd.2
> > @@ -214,6 +214,10 @@ memory accesses to the regions registered with userfaultfd.
> >  If this feature bit is set,
> >  .I uffd_msg.pagefault.feat.ptid
> >  will be set to the faulted thread ID for each page-fault message.
> > +.TP
> > +.BR UFFD_FEATURE_MINOR_HUGETLBFS " (since Linux 5.13)"
> > +If this feature bit is set, the kernel supports registering userfaultfd ranges
> > +in minor mode on hugetlbfs-backed memory areas.
> >  .PP
> >  The returned
> >  .I ioctls
> > @@ -240,6 +244,11 @@ operation is supported.
> >  The
> >  .B UFFDIO_WRITEPROTECT
> >  operation is supported.
> > +.TP
> > +.B 1 << _UFFDIO_CONTINUE
> > +The
> > +.B UFFDIO_CONTINUE
> > +operation is supported.
> >  .PP
> >  This
> >  .BR ioctl (2)
> > @@ -278,14 +287,8 @@ by the current kernel version.
> >  (Since Linux 4.3.)
> >  Register a memory address range with the userfaultfd object.
> >  The pages in the range must be "compatible".
> > -.PP
> > -Up to Linux kernel 4.11,
> > -only private anonymous ranges are compatible for registering with
> > -.BR UFFDIO_REGISTER .
> > -.PP
> > -Since Linux 4.11,
> > -hugetlbfs and shared memory ranges are also compatible with
> > -.BR UFFDIO_REGISTER .
> > +Please refer to the list of register modes below for the compatible memory
> > +backends for each mode.
> >  .PP
> >  The
> >  .I argp
> > @@ -324,9 +327,16 @@ the specified range:
> >  .TP
> >  .B UFFDIO_REGISTER_MODE_MISSING
> >  Track page faults on missing pages.
> > +Since Linux 4.3, only private anonymous ranges are compatible.
> > +Since Linux 4.11, hugetlbfs and shared memory ranges are also compatible.
> >  .TP
> >  .B UFFDIO_REGISTER_MODE_WP
> >  Track page faults on write-protected pages.
> > +Since Linux 5.7, only private anonymous ranges are compatible.
> > +.TP
> > +.B UFFDIO_REGISTER_MODE_MINOR
> > +Track minor page faults.
> > +Since Linux 5.13, only hugetlbfs ranges are compatible.
> >  .PP
> >  If the operation is successful, the kernel modifies the
> >  .I ioctls
> > @@ -735,6 +745,105 @@ or not registered with userfaultfd write-protect mode.
> >  .TP
> >  .B EFAULT
> >  Encountered a generic fault during processing.
> > +.\"
> > +.SS UFFDIO_CONTINUE
> > +(Since Linux 5.13.)
> > +Resolve a minor page fault by installing page table entries for existing pages
> > +in the page cache.
> > +.PP
> > +The
> > +.I argp
> > +argument is a pointer to a
> > +.I uffdio_continue
> > +structure as shown below:
> > +.PP
> > +.in +4n
> > +.EX
> > +struct uffdio_continue {
> > +    struct uffdio_range range; /* Range to install PTEs for and continue */
> > +    __u64 mode;                /* Flags controlling the behavior of continue */
> > +    __s64 mapped;              /* Number of bytes mapped, or negated error */
> > +};
> > +.EE
> > +.in
> > +.PP
> > +The following value may be bitwise ORed in
> > +.I mode
> > +to change the behavior of the
> > +.B UFFDIO_CONTINUE
> > +operation:
> > +.TP
> > +.B UFFDIO_CONTINUE_MODE_DONTWAKE
> > +Do not wake up the thread that waits for page-fault resolution.
> > +.PP
> > +The
> > +.I mapped
> > +field is used by the kernel to return the number of bytes
> > +that were actually mapped, or an error in the same manner as
> > +.BR UFFDIO_COPY .
> > +If the value returned in the
> > +.I mapped
> > +field doesn't match the value that was specified in
> > +.IR range.len ,
> > +the operation fails with the error
> > +.BR EAGAIN .
> > +The
> > +.I mapped
> > +field is output-only;
> > +it is not read by the
> > +.B UFFDIO_CONTINUE
> > +operation.
> > +.PP
> > +This
> > +.BR ioctl (2)
> > +operation returns 0 on success.
> > +In this case, the entire area was mapped.
> > +On error, \-1 is returned and
> > +.I errno
> > +is set to indicate the error.
> > +Possible errors include:
> > +.TP
> > +.B EAGAIN
> > +The number of bytes mapped (i.e., the value returned in the
> > +.I mapped
> > +field) does not equal the value that was specified in the
> > +.I range.len
> > +field.
> > +.TP
> > +.B EINVAL
> > +Either
> > +.I range.start
> > +or
> > +.I range.len
> > +was not a multiple of the system page size; or
> > +.I range.len
> > +was zero; or the range specified was invalid.
> > +.TP
> > +.B EINVAL
> > +An invalid bit was specified in the
> > +.I mode
> > +field.
> > +.TP
> > +.B EEXIST
> > +One or more pages were already mapped in the given range.
> > +.TP
> > +.B ENOENT
> > +The faulting process has changed its virtual memory layout simultaneously with
> > +an outstanding
> > +.B UFFDIO_CONTINUE
> > +operation.
> > +.TP
> > +.B ENOMEM
> > +Allocating memory needed to set up the page table mappings failed.
> > +.TP
> > +.B EFAULT
> > +No existing page could be found in the page cache for the given range.
> > +.TP
> > +.B ESRCH
> > +The faulting process has exited at the time of a
> > +.B UFFDIO_CONTINUE
> > +operation.
> > +.\"
> >  .SH RETURN VALUE
> >  See descriptions of the individual operations, above.
> >  .SH ERRORS
> > diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2
> > index 593c189d8..07f53c6ff 100644
> > --- a/man2/userfaultfd.2
> > +++ b/man2/userfaultfd.2
> > @@ -78,7 +78,7 @@ all memory ranges that were registered with the object are unregistered
> >  and unread events are flushed.
> >  .\"
> >  .PP
> > -Userfaultfd supports two modes of registration:
> > +Userfaultfd supports three modes of registration:
> >  .TP
> >  .BR UFFDIO_REGISTER_MODE_MISSING " (since 4.10)"
> >  When registered with
> > @@ -92,6 +92,18 @@ or an
> >  .B UFFDIO_ZEROPAGE
> >  ioctl.
> >  .TP
> > +.BR UFFDIO_REGISTER_MODE_MINOR " (since 5.13)"
> > +When registered with
> > +.B UFFDIO_REGISTER_MODE_MINOR
> > +mode, user-space will receive a page-fault notification
> > +when a minor page fault occurs.
> > +That is, when a backing page is in the page cache, but
> > +page table entries don't yet exist.
> > +The faulted thread will be stopped from execution until the page fault is
> > +resolved from user-space by an
> > +.B UFFDIO_CONTINUE
> > +ioctl.
> > +.TP
> >  .BR UFFDIO_REGISTER_MODE_WP " (since 5.7)"
> >  When registered with
> >  .B UFFDIO_REGISTER_MODE_WP
> > @@ -212,9 +224,10 @@ a page fault occurring in the requested memory range, and satisfying
> >  the mode defined at the registration time, will be forwarded by the kernel to
> >  the user-space application.
> >  The application can then use the
> > -.B UFFDIO_COPY
> > +.BR UFFDIO_COPY ,
> > +.BR UFFDIO_ZEROPAGE ,
> >  or
> > -.B UFFDIO_ZEROPAGE
> > +.B UFFDIO_CONTINUE
> >  .BR ioctl (2)
> >  operations to resolve the page fault.
> >  .PP
> > @@ -318,6 +331,43 @@ should have the flag
> >  cleared upon the faulted page or range.
> >  .PP
> >  Write-protect mode supports only private anonymous memory.
> > +.\"
> > +.SS Userfaultfd minor fault mode (since 5.13)
> > +Since Linux 5.13, userfaultfd supports minor fault mode.
> > +In this mode, fault messages are produced not for major faults (where the
> > +page was missing), but rather for minor faults, where a page exists in the page
> > +cache, but the page table entries are not yet present.
> > +The user needs to first check availability of this feature using
> > +.B UFFDIO_API
> > +ioctl against the feature bit
> > +.B UFFD_FEATURE_MINOR_HUGETLBFS
> > +before using this feature.
> > +.PP
> > +To register with userfaultfd minor fault mode, the user needs to initiate the
> > +.B UFFDIO_REGISTER
> > +ioctl with mode
> > +.B UFFDIO_REGISTER_MODE_MINOR
> > +set.
> > +.PP
> > +When a minor fault occurs, user-space will receive a page-fault notification
> > +whose
> > +.I uffd_msg.pagefault.flags
> > +will have the
> > +.B UFFD_PAGEFAULT_FLAG_MINOR
> > +flag set.
> > +.PP
> > +To resolve a minor page fault, the handler should decide whether or not the
> > +existing page contents need to be modified first.
> > +If so, this should be done in-place via a second, non-userfaultfd-registered
> > +mapping to the same backing page (e.g., by mapping the hugetlbfs file twice).
> > +Once the page is considered "up to date", the fault can be resolved by
> > +initiating an
> > +.B UFFDIO_CONTINUE
> > +ioctl, which installs the page table entries and (by default) wakes up the
> > +faulting thread(s).
> > +.PP
> > +Minor fault mode supports only hugetlbfs-backed memory.
> > +.\"
> >  .SS Reading from the userfaultfd structure
> >  Each
> >  .BR read (2)
> > @@ -456,19 +506,20 @@ For
> >  the following flag may appear:
> >  .RS
> >  .TP
> > -.B UFFD_PAGEFAULT_FLAG_WRITE
> > -If the address is in a range that was registered with the
> > -.B UFFDIO_REGISTER_MODE_MISSING
> > -flag (see
> > -.BR ioctl_userfaultfd (2))
> > -and this flag is set, this a write fault;
> > -otherwise it is a read fault.
> > +.B UFFD_PAGEFAULT_FLAG_WP
> > +If this flag is set, then the fault was a write-protect fault.
> >  .TP
> > +.B UFFD_PAGEFAULT_FLAG_MINOR
> > +If this flag is set, then the fault was a minor fault.
> > +.TP
> > +.B UFFD_PAGEFAULT_FLAG_WRITE
> > +If this flag is set, then the fault was a write fault.
> > +.HP
> > +If neither
> >  .B UFFD_PAGEFAULT_FLAG_WP
> > -If the address is in a range that was registered with the
> > -.B UFFDIO_REGISTER_MODE_WP
> > -flag, when this bit is set, it means it is a write-protect fault.
> > -Otherwise it is a page-missing fault.
> > +nor
> > +.B UFFD_PAGEFAULT_FLAG_MINOR
> > +is set, then the fault was a missing fault.
> >  .RE
> >  .TP
> >  .I pagefault.feat.pid
> > --
> > 2.32.0.rc1.229.g3e70b5a671-goog
> >
> 
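The flag rules in the last userfaultfd.2 hunk above (WP means a write-protect fault, MINOR means a minor fault, neither means a missing fault) could be sketched roughly as below. This is only an illustration, not code from the patch: the SK_* names are hypothetical, and the bit values are local copies that I am assuming match the UFFD_PAGEFAULT_FLAG_* definitions in <linux/userfaultfd.h>, so that the sketch builds without 5.13 headers.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical local copies of the UFFD_PAGEFAULT_FLAG_* bits (assumed to
 * match <linux/userfaultfd.h>); real code should use the header instead. */
#define SK_PAGEFAULT_FLAG_WRITE (UINT64_C(1) << 0)
#define SK_PAGEFAULT_FLAG_WP    (UINT64_C(1) << 1)
#define SK_PAGEFAULT_FLAG_MINOR (UINT64_C(1) << 2)

enum sk_fault_kind { SK_FAULT_MISSING, SK_FAULT_WP, SK_FAULT_MINOR };

/* Classify uffd_msg.pagefault.flags: WP and MINOR are tested explicitly,
 * and a fault with neither bit set is treated as a missing fault. */
static inline enum sk_fault_kind sk_classify_fault(uint64_t flags)
{
    if (flags & SK_PAGEFAULT_FLAG_WP)
        return SK_FAULT_WP;
    if (flags & SK_PAGEFAULT_FLAG_MINOR)
        return SK_FAULT_MINOR;
    return SK_FAULT_MISSING;
}

/* The WRITE bit is independent of the fault kind: it only says whether
 * the faulting access was a write. */
static inline int sk_is_write_fault(uint64_t flags)
{
    return (flags & SK_PAGEFAULT_FLAG_WRITE) != 0;
}

/* Human-readable name for logging. */
static inline const char *sk_fault_name(enum sk_fault_kind kind)
{
    static const char *const names[] = { "missing", "write-protect", "minor" };

    assert(kind == SK_FAULT_MISSING || kind == SK_FAULT_WP ||
           kind == SK_FAULT_MINOR);
    return names[kind];
}
```

A handler would call sk_classify_fault() on each message read from the userfaultfd and answer a minor fault with UFFDIO_CONTINUE rather than UFFDIO_COPY or UFFDIO_ZEROPAGE.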

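Similarly, the UFFDIO_CONTINUE return convention described in the patch (0 on success, EAGAIN when the bytes reported in "mapped" differ from range.len) suggests a retry loop along these lines. This is a sketch under assumptions: the struct below is a hypothetical local mirror of the uffdio_continue layout shown in the patch, the sk_* names are made up, and the policy of skipping the already-mapped prefix before retrying is my own choice rather than something the man page mandates.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* Hypothetical mirror of the structure shown in the patch; real code
 * should use struct uffdio_continue from <linux/userfaultfd.h>. */
struct sk_range { uint64_t start; uint64_t len; };
struct sk_continue {
    struct sk_range range; /* Range to install PTEs for */
    uint64_t mode;         /* e.g. the DONTWAKE flag */
    int64_t mapped;        /* Bytes mapped, or negated error (output only) */
};

enum sk_next { SK_DONE, SK_RETRY, SK_FAIL };

/* Decide what to do after ioctl(uffd, UFFDIO_CONTINUE, c) returned `ret`
 * with errno `err`. On a partial mapping the range is advanced past the
 * bytes that were mapped, so that a retry covers only the remainder. */
static inline enum sk_next sk_after_continue(int ret, int err,
                                             struct sk_continue *c)
{
    assert(c);

    if (ret == 0)
        return SK_DONE;          /* the entire range was mapped */
    if (err != EAGAIN)
        return SK_FAIL;          /* EINVAL, EEXIST, ENOENT, ... */

    if (c->mapped > 0) {
        if ((uint64_t)c->mapped >= c->range.len)
            return SK_FAIL;      /* inconsistent result, do not loop */
        c->range.start += (uint64_t)c->mapped;
        c->range.len -= (uint64_t)c->mapped;
    }
    c->mapped = 0;               /* output-only field, reset for clarity */
    return SK_RETRY;
}
```

Whether EEXIST should count as a failure or as "already resolved by someone else" depends on the handler; the sketch takes the conservative route and reports it as a failure.
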
-- 
Sincerely yours,
Mike.
