From: Peter Xu <peterx@redhat.com>
To: Axel Rasmussen <axelrasmussen@google.com>
Cc: "Alexander Viro" <viro@zeniv.linux.org.uk>,
"Alexey Dobriyan" <adobriyan@gmail.com>,
"Andrea Arcangeli" <aarcange@redhat.com>,
"Andrew Morton" <akpm@linux-foundation.org>,
"Anshuman Khandual" <anshuman.khandual@arm.com>,
"Catalin Marinas" <catalin.marinas@arm.com>,
"Chinwen Chang" <chinwen.chang@mediatek.com>,
"Huang Ying" <ying.huang@intel.com>,
"Ingo Molnar" <mingo@redhat.com>, "Jann Horn" <jannh@google.com>,
"Jerome Glisse" <jglisse@redhat.com>,
"Lokesh Gidra" <lokeshgidra@google.com>,
"Matthew Wilcox (Oracle)" <willy@infradead.org>,
"Michael Ellerman" <mpe@ellerman.id.au>,
"Michal Koutný" <mkoutny@suse.com>,
"Michel Lespinasse" <walken@google.com>,
"Mike Kravetz" <mike.kravetz@oracle.com>,
"Mike Rapoport" <rppt@linux.vnet.ibm.com>,
"Nicholas Piggin" <npiggin@gmail.com>, "Shaohua Li" <shli@fb.com>,
"Shawn Anastasio" <shawn@anastas.io>,
"Steven Rostedt" <rostedt@goodmis.org>,
"Steven Price" <steven.price@arm.com>,
"Vlastimil Babka" <vbabka@suse.cz>,
linux-kernel@vger.kernel.org, linux-fsdevel@vger.kernel.org,
linux-mm@kvack.org, "Adam Ruprecht" <ruprecht@google.com>,
"Cannon Matthews" <cannonmatthews@google.com>,
"Dr . David Alan Gilbert" <dgilbert@redhat.com>,
"David Rientjes" <rientjes@google.com>,
"Mina Almasry" <almasrymina@google.com>,
"Oliver Upton" <oupton@google.com>
Subject: Re: [PATCH v4 05/10] userfaultfd: add minor fault registration mode
Date: Mon, 8 Feb 2021 19:00:58 -0500
Message-ID: <20210209000058.GA78818@xz-x1>
In-Reply-To: <20210204183433.1431202-6-axelrasmussen@google.com>

On Thu, Feb 04, 2021 at 10:34:28AM -0800, Axel Rasmussen wrote:
> This feature allows userspace to intercept "minor" faults. By "minor"
> faults, I mean the following situation:
>
> Let there exist two mappings (i.e., VMAs) to the same page(s). One of
> the mappings is registered with userfaultfd (in minor mode), and the
> other is not. Via the non-UFFD mapping, the underlying pages have
> already been allocated & filled with some contents. The UFFD mapping
> has not yet been faulted in; when it is touched for the first time,
> this results in what I'm calling a "minor" fault. As a concrete
> example, when working with hugetlbfs, we have huge_pte_none(), but
> find_lock_page() finds an existing page.
>
> This commit adds the new registration mode, and sets the relevant flag
> on the VMAs being registered. In the hugetlb fault path, if we find
> that we have huge_pte_none(), but find_lock_page() does indeed find an
> existing page, then we have a "minor" fault, and if the VMA has the
> userfaultfd registration flag, we call into userfaultfd to handle it.
>
> Why add a new registration mode, as opposed to adding a feature to
> MISSING registration, like UFFD_FEATURE_SIGBUS?
>
> - The semantics are significantly different. UFFDIO_COPY or
> UFFDIO_ZEROPAGE do not make sense for these minor faults; userspace
> would instead just memset() or memcpy() or whatever via the non-UFFD
> mapping. Unlike MISSING registration, MINOR registration only makes
> sense for hugetlbfs (or, in the future, shmem), as this is the only
> way to get two VMAs to a single set of underlying pages.
>
> - Doing so would make handle_userfault()'s "reason" argument confusing.
> We'd pass in "MISSING" even if the pages weren't really missing.
>
> Signed-off-by: Axel Rasmussen <axelrasmussen@google.com>
> ---
> fs/proc/task_mmu.c | 1 +
> fs/userfaultfd.c | 81 ++++++++++++++++++++------------
> include/linux/mm.h | 1 +
> include/linux/userfaultfd_k.h | 15 +++++-
> include/trace/events/mmflags.h | 1 +
> include/uapi/linux/userfaultfd.h | 15 +++++-
> mm/hugetlb.c | 32 +++++++++++++
> 7 files changed, 112 insertions(+), 34 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 602e3a52884d..94e951ea3e03 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
> [ilog2(VM_MTE)] = "mt",
> [ilog2(VM_MTE_ALLOWED)] = "",
> #endif
> + [ilog2(VM_UFFD_MINOR)] = "ui",
> #ifdef CONFIG_ARCH_HAS_PKEYS
> /* These come out via ProtectionKey: */
> [ilog2(VM_PKEY_BIT0)] = "",
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index a0f66e12026b..c643cf13d957 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -197,24 +197,21 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
> msg_init(&msg);
> msg.event = UFFD_EVENT_PAGEFAULT;
> msg.arg.pagefault.address = address;
> + /*
> + * These flags indicate why the userfault occurred:
> + * - UFFD_PAGEFAULT_FLAG_WP indicates a write protect fault.
> + * - UFFD_PAGEFAULT_FLAG_MINOR indicates a minor fault.
> + * - Neither of these flags being set indicates a MISSING fault.
> + *
> + * Separately, UFFD_PAGEFAULT_FLAG_WRITE indicates it was a write
> + * fault. Otherwise, it was a read fault.
> + */
> if (flags & FAULT_FLAG_WRITE)
> - /*
> - * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
> - * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WRITE
> - * was not set in a UFFD_EVENT_PAGEFAULT, it means it
> - * was a read fault, otherwise if set it means it's
> - * a write fault.
> - */
> msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
> if (reason & VM_UFFD_WP)
> - /*
> - * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
> - * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WP was
> - * not set in a UFFD_EVENT_PAGEFAULT, it means it was
> - * a missing fault, otherwise if set it means it's a
> - * write protect fault.
> - */
> msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
> + if (reason & VM_UFFD_MINOR)
> + msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_MINOR;
> if (features & UFFD_FEATURE_THREAD_ID)
> msg.arg.pagefault.feat.ptid = task_pid_vnr(current);
> return msg;
> @@ -401,8 +398,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>
> BUG_ON(ctx->mm != mm);
>
> - VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
> - VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
> + /* Any unrecognized flag is a bug. */
> + VM_BUG_ON(reason & ~__VM_UFFD_FLAGS);
> + /* 0 or > 1 flags set is a bug; we expect exactly 1. */
> + VM_BUG_ON(!reason || !!(reason & (reason - 1)));
>
> if (ctx->features & UFFD_FEATURE_SIGBUS)
> goto out;
> @@ -612,7 +611,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
> for (vma = mm->mmap; vma; vma = vma->vm_next)
> if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> + vma->vm_flags &= ~__VM_UFFD_FLAGS;
> }
> mmap_write_unlock(mm);
>
> @@ -644,7 +643,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
> octx = vma->vm_userfaultfd_ctx.ctx;
> if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> + vma->vm_flags &= ~__VM_UFFD_FLAGS;
> return 0;
> }
>
> @@ -726,7 +725,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
> } else {
> /* Drop uffd context if remap feature not enabled */
> vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> - vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> + vma->vm_flags &= ~__VM_UFFD_FLAGS;
> }
> }
>
> @@ -867,12 +866,12 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
> for (vma = mm->mmap; vma; vma = vma->vm_next) {
> cond_resched();
> BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
> - !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> + !!(vma->vm_flags & __VM_UFFD_FLAGS));
> if (vma->vm_userfaultfd_ctx.ctx != ctx) {
> prev = vma;
> continue;
> }
> - new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
> + new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
> prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
> new_flags, vma->anon_vma,
> vma->vm_file, vma->vm_pgoff,
> @@ -1305,9 +1304,29 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
> unsigned long vm_flags)
> {
> /* FIXME: add WP support to hugetlbfs and shmem */
> - return vma_is_anonymous(vma) ||
> - ((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
> - !(vm_flags & VM_UFFD_WP));
> + if (vm_flags & VM_UFFD_WP) {
> + if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
> + return false;
> + }
> +
> + if (vm_flags & VM_UFFD_MINOR) {
> + /*
> + * The use case for minor registration (intercepting minor
> + * faults) is to handle the case where a page is present, but
> + * needs to be modified before it can be used. This only makes
> + * sense when you have two mappings to the same underlying
> + * pages (one UFFD registered, one not), but the memory doesn't
> + * have to be shared (consider one process mapping a hugetlbfs
> + * file with MAP_SHARED, and then a second process doing
> + * MAP_PRIVATE).

No strong opinion, but I'd drop this whole chunk of the comment:

- "what is a minor fault" should already be covered in the documentation file.

- "two mappings" seems slightly superfluous too, since we could still use minor
  faults with TRUNCATE+UFFDIO_COPY, if we wanted to? Maybe?

- "memory doesn't have to be shared" reads a bit odd when no code here checks
  against "shared" at all, I'd say. :)

The FIXME below it is fine.

If you agree with the above, feel free to add my r-b after dropping the chunk:

Reviewed-by: Peter Xu <peterx@redhat.com>

Thanks,

--
Peter Xu
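
For readers following the quoted hunks, here is a minimal standalone sketch of
two ideas from the patch: the "exactly one reason bit" check that replaces the
old XOR test in handle_userfault(), and how the pagefault flags described in
the new userfault_msg() comment map to a fault kind. All names and bit values
below (REASON_*, PF_FLAG_*, reason_is_valid(), classify_fault(),
fault_is_write()) are hypothetical stand-ins chosen for illustration; they are
not the kernel's VM_UFFD_* or UFFD_PAGEFAULT_FLAG_* definitions, and this is
not code from the series.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the VM_UFFD_* reason bits (values assumed). */
#define REASON_MISSING (1UL << 0)
#define REASON_WP      (1UL << 1)
#define REASON_MINOR   (1UL << 2)
#define REASON_ALL     (REASON_MISSING | REASON_WP | REASON_MINOR)

/*
 * Mirrors the intent of the two VM_BUG_ON() checks in the patch:
 * no unrecognized bits, and exactly one recognized bit set.
 */
bool reason_is_valid(unsigned long reason)
{
	if (reason & ~REASON_ALL)
		return false;
	/*
	 * reason & (reason - 1) clears the lowest set bit, so a nonzero
	 * result means more than one bit was set.
	 */
	return reason != 0 && (reason & (reason - 1)) == 0;
}

/* Hypothetical stand-ins for UFFD_PAGEFAULT_FLAG_* (values assumed). */
#define PF_FLAG_WRITE (1ULL << 0)
#define PF_FLAG_WP    (1ULL << 1)
#define PF_FLAG_MINOR (1ULL << 2)

enum fault_kind { FAULT_MISSING, FAULT_WP, FAULT_MINOR };

/*
 * Per the comment added to userfault_msg(): WP means a write-protect
 * fault, MINOR means a minor fault, and neither means a MISSING fault.
 */
enum fault_kind classify_fault(uint64_t flags)
{
	if (flags & PF_FLAG_WP)
		return FAULT_WP;
	if (flags & PF_FLAG_MINOR)
		return FAULT_MINOR;
	return FAULT_MISSING;
}

/* The WRITE flag is orthogonal: it only says read vs. write access. */
bool fault_is_write(uint64_t flags)
{
	return (flags & PF_FLAG_WRITE) != 0;
}
```

With these stand-ins, classify_fault(PF_FLAG_WRITE | PF_FLAG_MINOR) yields
FAULT_MINOR while fault_is_write() reports true for the same value, and
reason_is_valid() rejects both 0 and REASON_MISSING | REASON_WP.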