From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-16.7 required=3.0 tests=BAYES_00,DKIMWL_WL_HIGH, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,INCLUDES_CR_TRAILER,INCLUDES_PATCH, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS autolearn=unavailable autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 97C59C4320A for ; Mon, 2 Aug 2021 11:21:51 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 7E1EB60FC2 for ; Mon, 2 Aug 2021 11:21:51 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233479AbhHBLV7 (ORCPT ); Mon, 2 Aug 2021 07:21:59 -0400 Received: from mail.kernel.org ([198.145.29.99]:58304 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233464AbhHBLV6 (ORCPT ); Mon, 2 Aug 2021 07:21:58 -0400 Received: by mail.kernel.org (Postfix) with ESMTPSA id 57BA760FD8; Mon, 2 Aug 2021 11:21:46 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1627903309; bh=jDlHV014x9rAROwwlwoYLKLmQ2dAtFwsKHY5yX1IK/A=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=S/Z+ouTznhLgGXpjUal4LeZf7gFfh4KeOeww37tTRIzs29GPGAhvLHb6sXTCkTMPm CjFoIbr0gbl7e3C0Z4hTGctcisLAOdrqvsg3wey68AHXeRIY01J43LAZBr14zAfcic d0IrHn3EuoBRquAXICqbhTCHF2tM4pvKodlrqTOG4tF/OFvE3o2yw+/+4WfPgXEBgd SoFMs2o4OX627XObWc3VD/nmKz452CHT5CtR9ydN63PUgf8DkvIGZDBfnJDSMzGcEb oarYkTrvZ/pswu6iHRYw8H0G/K4ckuauRguFvkXg7gX5llCbij5+Y+BQWqsipvYrGv tK+PUGKShg0Ug== Date: Mon, 2 Aug 2021 14:21:42 +0300 From: Mike Rapoport To: Axel Rasmussen Cc: Alejandro Colomar , Michael Kerrisk , Andrea Arcangeli , Andrew Morton , Hugh Dickins , Mike Kravetz , Peter Xu , LKML , linux-man@vger.kernel.org, Linux MM Subject: Re: [PATCH v2] ioctl_userfaultfd.2, userfaultfd.2: add minor fault mode Message-ID: References: <20210604195622.1249588-1-axelrasmussen@google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org (added man-pages maintainers) On Tue, Jul 27, 2021 at 09:32:34AM -0700, Axel Rasmussen wrote: > Any remaining issues with this patch? I just realized today it was > never merged. 5.13 (which contains this new feature) was released some > weeks ago. > > On Fri, Jun 4, 2021 at 12:56 PM Axel Rasmussen wrote: > > > > Userfaultfd minor fault mode is supported starting from Linux 5.13. > > > > This commit adds a description of the new mode, as well as the new ioctl > > used to resolve such faults. The two go hand-in-hand: one can't resolve > > a minor fault without continue, and continue can't be used to resolve > > any other kind of fault. > > > > This patch covers just the hugetlbfs implementation (in 5.13). Support > > for shmem is forthcoming, but as it has not yet made it into a kernel > > release candidate, it will be added in a future commit. > > > > Signed-off-by: Axel Rasmussen > > --- > > man2/ioctl_userfaultfd.2 | 125 ++++++++++++++++++++++++++++++++++++--- > > man2/userfaultfd.2 | 79 ++++++++++++++++++++----- > > 2 files changed, 182 insertions(+), 22 deletions(-) > > > > diff --git a/man2/ioctl_userfaultfd.2 b/man2/ioctl_userfaultfd.2 > > index 504f61d4b..7b990c24a 100644 > > --- a/man2/ioctl_userfaultfd.2 > > +++ b/man2/ioctl_userfaultfd.2 > > @@ -214,6 +214,10 @@ memory accesses to the regions registered with userfaultfd. > > If this feature bit is set, > > .I uffd_msg.pagefault.feat.ptid > > will be set to the faulted thread ID for each page-fault message. > > +.TP > > +.BR UFFD_FEATURE_MINOR_HUGETLBFS " (since Linux 5.13)" > > +If this feature bit is set, the kernel supports registering userfaultfd ranges > > +in minor mode on hugetlbfs-backed memory areas. > > .PP > > The returned > > .I ioctls > > @@ -240,6 +244,11 @@ operation is supported. > > The > > .B UFFDIO_WRITEPROTECT > > operation is supported. > > +.TP > > +.B 1 << _UFFDIO_CONTINUE > > +The > > +.B UFFDIO_CONTINUE > > +operation is supported. > > .PP > > This > > .BR ioctl (2) > > @@ -278,14 +287,8 @@ by the current kernel version. > > (Since Linux 4.3.) > > Register a memory address range with the userfaultfd object. > > The pages in the range must be "compatible". > > -.PP > > -Up to Linux kernel 4.11, > > -only private anonymous ranges are compatible for registering with > > -.BR UFFDIO_REGISTER . > > -.PP > > -Since Linux 4.11, > > -hugetlbfs and shared memory ranges are also compatible with > > -.BR UFFDIO_REGISTER . > > +Please refer to the list of register modes below for the compatible memory > > +backends for each mode. > > .PP > > The > > .I argp > > @@ -324,9 +327,16 @@ the specified range: > > .TP > > .B UFFDIO_REGISTER_MODE_MISSING > > Track page faults on missing pages. > > +Since Linux 4.3, only private anonymous ranges are compatible. > > +Since Linux 4.11, hugetlbfs and shared memory ranges are also compatible. > > .TP > > .B UFFDIO_REGISTER_MODE_WP > > Track page faults on write-protected pages. > > +Since Linux 5.7, only private anonymous ranges are compatible. > > +.TP > > +.B UFFDIO_REGISTER_MODE_MINOR > > +Track minor page faults. > > +Since Linux 5.13, only hugetlbfs ranges are compatible. > > .PP > > If the operation is successful, the kernel modifies the > > .I ioctls > > @@ -735,6 +745,105 @@ or not registered with userfaultfd write-protect mode. > > .TP > > .B EFAULT > > Encountered a generic fault during processing. > > +.\" > > +.SS UFFDIO_CONTINUE > > +(Since Linux 5.13.) > > +Resolve a minor page fault by installing page table entries for existing pages > > +in the page cache. > > +.PP > > +The > > +.I argp > > +argument is a pointer to a > > +.I uffdio_continue > > +structure as shown below: > > +.PP > > +.in +4n > > +.EX > > +struct uffdio_continue { > > + struct uffdio_range range; /* Range to install PTEs for and continue */ > > + __u64 mode; /* Flags controlling the behavior of continue */ > > + __s64 mapped; /* Number of bytes mapped, or negated error */ > > +}; > > +.EE > > +.in > > +.PP > > +The following value may be bitwise ORed in > > +.IR mode > > +to change the behavior of the > > +.B UFFDIO_CONTINUE > > +operation: > > +.TP > > +.B UFFDIO_CONTINUE_MODE_DONTWAKE > > +Do not wake up the thread that waits for page-fault resolution. > > +.PP > > +The > > +.I mapped > > +field is used by the kernel to return the number of bytes > > +that were actually mapped, or an error in the same manner as > > +.BR UFFDIO_COPY . > > +If the value returned in the > > +.I mapped > > +field doesn't match the value that was specified in > > +.IR range.len , > > +the operation fails with the error > > +.BR EAGAIN . > > +The > > +.I mapped > > +field is output-only; > > +it is not read by the > > +.B UFFDIO_CONTINUE > > +operation. > > +.PP > > +This > > +.BR ioctl (2) > > +operation returns 0 on success. > > +In this case, the entire area was mapped. > > +On error, \-1 is returned and > > +.I errno > > +is set to indicate the error. > > +Possible errors include: > > +.TP > > +.B EAGAIN > > +The number of bytes mapped (i.e., the value returned in the > > +.I mapped > > +field) does not equal the value that was specified in the > > +.I range.len > > +field. > > +.TP > > +.B EINVAL > > +Either > > +.I range.start > > +or > > +.I range.len > > +was not a multiple of the system page size; or > > +.I range.len > > +was zero; or the range specified was invalid. > > +.TP > > +.B EINVAL > > +An invalid bit was specified in the > > +.IR mode > > +field. > > +.TP > > +.B EEXIST > > +One or more pages were already mapped in the given range. > > +.TP > > +.B ENOENT > > +The faulting process has changed its virtual memory layout simultaneously with > > +an outstanding > > +.B UFFDIO_CONTINUE > > +operation. > > +.TP > > +.B ENOMEM > > +Allocating memory needed to setup the page table mappings failed. > > +.TP > > +.B EFAULT > > +No existing page could be found in the page cache for the given range. > > +.TP > > +.BR ESRCH > > +The faulting process has exited at the time of a > > +.B UFFDIO_CONTINUE > > +operation. > > +.\" > > .SH RETURN VALUE > > See descriptions of the individual operations, above. > > .SH ERRORS > > diff --git a/man2/userfaultfd.2 b/man2/userfaultfd.2 > > index 593c189d8..07f53c6ff 100644 > > --- a/man2/userfaultfd.2 > > +++ b/man2/userfaultfd.2 > > @@ -78,7 +78,7 @@ all memory ranges that were registered with the object are unregistered > > and unread events are flushed. > > .\" > > .PP > > -Userfaultfd supports two modes of registration: > > +Userfaultfd supports three modes of registration: > > .TP > > .BR UFFDIO_REGISTER_MODE_MISSING " (since 4.10)" > > When registered with > > @@ -92,6 +92,18 @@ or an > > .B UFFDIO_ZEROPAGE > > ioctl. > > .TP > > +.BR UFFDIO_REGISTER_MODE_MINOR " (since 5.13)" > > +When registered with > > +.B UFFDIO_REGISTER_MODE_MINOR > > +mode, user-space will receive a page-fault notification > > +when a minor page fault occurs. > > +That is, when a backing page is in the page cache, but > > +page table entries don't yet exist. > > +The faulted thread will be stopped from execution until the page fault is > > +resolved from user-space by an > > +.B UFFDIO_CONTINUE > > +ioctl. > > +.TP > > .BR UFFDIO_REGISTER_MODE_WP " (since 5.7)" > > When registered with > > .B UFFDIO_REGISTER_MODE_WP > > @@ -212,9 +224,10 @@ a page fault occurring in the requested memory range, and satisfying > > the mode defined at the registration time, will be forwarded by the kernel to > > the user-space application. > > The application can then use the > > -.B UFFDIO_COPY > > +.B UFFDIO_COPY , > > +.B UFFDIO_ZEROPAGE , > > or > > -.B UFFDIO_ZEROPAGE > > +.B UFFDIO_CONTINUE > > .BR ioctl (2) > > operations to resolve the page fault. > > .PP > > @@ -318,6 +331,43 @@ should have the flag > > cleared upon the faulted page or range. > > .PP > > Write-protect mode supports only private anonymous memory. > > +.\" > > +.SS Userfaultfd minor fault mode (since 5.13) > > +Since Linux 5.13, userfaultfd supports minor fault mode. > > +In this mode, fault messages are produced not for major faults (where the > > +page was missing), but rather for minor faults, where a page exists in the page > > +cache, but the page table entries are not yet present. > > +The user needs to first check availability of this feature using > > +.B UFFDIO_API > > +ioctl against the feature bit > > +.B UFFD_FEATURE_MINOR_HUGETLBFS > > +before using this feature. > > +.PP > > +To register with userfaultfd minor fault mode, the user needs to initiate the > > +.B UFFDIO_REGISTER > > +ioctl with mode > > +.B UFFD_REGISTER_MODE_MINOR > > +set. > > +.PP > > +When a minor fault occurs, user-space will receive a page-fault notification > > +whose > > +.I uffd_msg.pagefault.flags > > +will have the > > +.B UFFD_PAGEFAULT_FLAG_MINOR > > +flag set. > > +.PP > > +To resolve a minor page fault, the handler should decide whether or not the > > +existing page contents need to be modified first. > > +If so, this should be done in-place via a second, non-userfaultfd-registered > > +mapping to the same backing page (e.g., by mapping the hugetlbfs file twice). > > +Once the page is considered "up to date", the fault can be resolved by > > +initiating an > > +.B UFFDIO_CONTINUE > > +ioctl, which installs the page table entries and (by default) wakes up the > > +faulting thread(s). > > +.PP > > +Minor fault mode supports only hugetlbfs-backed memory. > > +.\" > > .SS Reading from the userfaultfd structure > > Each > > .BR read (2) > > @@ -456,19 +506,20 @@ For > > the following flag may appear: > > .RS > > .TP > > -.B UFFD_PAGEFAULT_FLAG_WRITE > > -If the address is in a range that was registered with the > > -.B UFFDIO_REGISTER_MODE_MISSING > > -flag (see > > -.BR ioctl_userfaultfd (2)) > > -and this flag is set, this a write fault; > > -otherwise it is a read fault. > > +.B UFFD_PAGEFAULT_FLAG_WP > > +If this flag is set, then the fault was a write-protect fault. > > .TP > > +.B UFFD_PAGEFAULT_FLAG_MINOR > > +If this flag is set, then the fault was a minor fault. > > +.TP > > +.B UFFD_PAGEFAULT_FLAG_WRITE > > +If this flag is set, then the fault was a write fault. > > +.HP > > +If neither > > .B UFFD_PAGEFAULT_FLAG_WP > > -If the address is in a range that was registered with the > > -.B UFFDIO_REGISTER_MODE_WP > > -flag, when this bit is set, it means it is a write-protect fault. > > -Otherwise it is a page-missing fault. > > +nor > > +.B UFFD_PAGEFAULT_FLAG_MINOR > > +are set, then the fault was a missing fault. > > .RE > > .TP > > .I pagefault.feat.pid > > -- > > 2.32.0.rc1.229.g3e70b5a671-goog > > > -- Sincerely yours, Mike.