Date: Mon, 8 Feb 2021 19:00:58 -0500
From: Peter Xu
To: Axel Rasmussen
Cc: Alexander Viro, Alexey Dobriyan, Andrea Arcangeli, Andrew Morton,
	Anshuman Khandual, Catalin Marinas, Chinwen Chang, Huang Ying,
	Ingo Molnar, Jann Horn, Jerome Glisse, Lokesh Gidra,
	"Matthew Wilcox (Oracle)", Michael Ellerman, Michal Koutný,
	Michel Lespinasse, Mike Kravetz, Mike Rapoport, Nicholas Piggin,
	Shaohua Li, Shawn Anastasio, Steven Rostedt, Steven Price,
	Vlastimil Babka, linux-kernel@vger.kernel.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org, Adam Ruprecht,
	Cannon Matthews, "Dr .
	David Alan Gilbert", David Rientjes, Mina Almasry, Oliver Upton
Subject: Re: [PATCH v4 05/10] userfaultfd: add minor fault registration mode
Message-ID: <20210209000058.GA78818@xz-x1>
References: <20210204183433.1431202-1-axelrasmussen@google.com>
	<20210204183433.1431202-6-axelrasmussen@google.com>
In-Reply-To: <20210204183433.1431202-6-axelrasmussen@google.com>

On Thu, Feb 04, 2021 at 10:34:28AM -0800, Axel Rasmussen wrote:
> This feature allows userspace to intercept "minor" faults. By "minor"
> faults, I mean the following situation:
>
> Let there exist two mappings (i.e., VMAs) to the same page(s). One of
> the mappings is registered with userfaultfd (in minor mode), and the
> other is not. Via the non-UFFD mapping, the underlying pages have
> already been allocated & filled with some contents. The UFFD mapping
> has not yet been faulted in; when it is touched for the first time,
> this results in what I'm calling a "minor" fault. As a concrete
> example, when working with hugetlbfs, we have huge_pte_none(), but
> find_lock_page() finds an existing page.
>
> This commit adds the new registration mode, and sets the relevant flag
> on the VMAs being registered. In the hugetlb fault path, if we find
> that we have huge_pte_none(), but find_lock_page() does indeed find an
> existing page, then we have a "minor" fault, and if the VMA has the
> userfaultfd registration flag, we call into userfaultfd to handle it.
>
> Why add a new registration mode, as opposed to adding a feature to
> MISSING registration, like UFFD_FEATURE_SIGBUS?
>
> - The semantics are significantly different. UFFDIO_COPY or
>   UFFDIO_ZEROPAGE do not make sense for these minor faults; userspace
>   would instead just memset() or memcpy() or whatever via the non-UFFD
>   mapping. Unlike MISSING registration, MINOR registration only makes
>   sense for hugetlbfs (or, in the future, shmem), as this is the only
>   way to get two VMAs to a single set of underlying pages.
>
> - Doing so would make handle_userfault()'s "reason" argument confusing.
>   We'd pass in "MISSING" even if the pages weren't really missing.
>
> Signed-off-by: Axel Rasmussen
> ---
>  fs/proc/task_mmu.c               |  1 +
>  fs/userfaultfd.c                 | 81 ++++++++++++++++++++------------
>  include/linux/mm.h               |  1 +
>  include/linux/userfaultfd_k.h    | 15 +++++-
>  include/trace/events/mmflags.h   |  1 +
>  include/uapi/linux/userfaultfd.h | 15 +++++-
>  mm/hugetlb.c                     | 32 +++++++++++++
>  7 files changed, 112 insertions(+), 34 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 602e3a52884d..94e951ea3e03 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -651,6 +651,7 @@ static void show_smap_vma_flags(struct seq_file *m, struct vm_area_struct *vma)
>  		[ilog2(VM_MTE)]		= "mt",
>  		[ilog2(VM_MTE_ALLOWED)]	= "",
>  #endif
> +		[ilog2(VM_UFFD_MINOR)]	= "ui",
>  #ifdef CONFIG_ARCH_HAS_PKEYS
>  		/* These come out via ProtectionKey: */
>  		[ilog2(VM_PKEY_BIT0)]	= "",
> diff --git a/fs/userfaultfd.c b/fs/userfaultfd.c
> index a0f66e12026b..c643cf13d957 100644
> --- a/fs/userfaultfd.c
> +++ b/fs/userfaultfd.c
> @@ -197,24 +197,21 @@ static inline struct uffd_msg userfault_msg(unsigned long address,
>  	msg_init(&msg);
>  	msg.event = UFFD_EVENT_PAGEFAULT;
>  	msg.arg.pagefault.address = address;
> +	/*
> +	 * These flags indicate why the userfault occurred:
> +	 * - UFFD_PAGEFAULT_FLAG_WP indicates a write protect fault.
> +	 * - UFFD_PAGEFAULT_FLAG_MINOR indicates a minor fault.
> +	 * - Neither of these flags being set indicates a MISSING fault.
> +	 *
> +	 * Separately, UFFD_PAGEFAULT_FLAG_WRITE indicates it was a write
> +	 * fault. Otherwise, it was a read fault.
> +	 */
>  	if (flags & FAULT_FLAG_WRITE)
> -		/*
> -		 * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
> -		 * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WRITE
> -		 * was not set in a UFFD_EVENT_PAGEFAULT, it means it
> -		 * was a read fault, otherwise if set it means it's
> -		 * a write fault.
> -		 */
>  		msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WRITE;
>  	if (reason & VM_UFFD_WP)
> -		/*
> -		 * If UFFD_FEATURE_PAGEFAULT_FLAG_WP was set in the
> -		 * uffdio_api.features and UFFD_PAGEFAULT_FLAG_WP was
> -		 * not set in a UFFD_EVENT_PAGEFAULT, it means it was
> -		 * a missing fault, otherwise if set it means it's a
> -		 * write protect fault.
> -		 */
>  		msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_WP;
> +	if (reason & VM_UFFD_MINOR)
> +		msg.arg.pagefault.flags |= UFFD_PAGEFAULT_FLAG_MINOR;
>  	if (features & UFFD_FEATURE_THREAD_ID)
>  		msg.arg.pagefault.feat.ptid = task_pid_vnr(current);
>  	return msg;
> @@ -401,8 +398,10 @@ vm_fault_t handle_userfault(struct vm_fault *vmf, unsigned long reason)
>
>  	BUG_ON(ctx->mm != mm);
>
> -	VM_BUG_ON(reason & ~(VM_UFFD_MISSING|VM_UFFD_WP));
> -	VM_BUG_ON(!(reason & VM_UFFD_MISSING) ^ !!(reason & VM_UFFD_WP));
> +	/* Any unrecognized flag is a bug. */
> +	VM_BUG_ON(reason & ~__VM_UFFD_FLAGS);
> +	/* 0 or > 1 flags set is a bug; we expect exactly 1. */
> +	VM_BUG_ON(!reason || !!(reason & (reason - 1)));
>
>  	if (ctx->features & UFFD_FEATURE_SIGBUS)
>  		goto out;
> @@ -612,7 +611,7 @@ static void userfaultfd_event_wait_completion(struct userfaultfd_ctx *ctx,
>  		for (vma = mm->mmap; vma; vma = vma->vm_next)
>  			if (vma->vm_userfaultfd_ctx.ctx == release_new_ctx) {
>  				vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> -				vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> +				vma->vm_flags &= ~__VM_UFFD_FLAGS;
>  			}
>  		mmap_write_unlock(mm);
>
> @@ -644,7 +643,7 @@ int dup_userfaultfd(struct vm_area_struct *vma, struct list_head *fcs)
>  	octx = vma->vm_userfaultfd_ctx.ctx;
>  	if (!octx || !(octx->features & UFFD_FEATURE_EVENT_FORK)) {
>  		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> -		vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> +		vma->vm_flags &= ~__VM_UFFD_FLAGS;
>  		return 0;
>  	}
>
> @@ -726,7 +725,7 @@ void mremap_userfaultfd_prep(struct vm_area_struct *vma,
>  	} else {
>  		/* Drop uffd context if remap feature not enabled */
>  		vma->vm_userfaultfd_ctx = NULL_VM_UFFD_CTX;
> -		vma->vm_flags &= ~(VM_UFFD_WP | VM_UFFD_MISSING);
> +		vma->vm_flags &= ~__VM_UFFD_FLAGS;
>  	}
>  }
>
> @@ -867,12 +866,12 @@ static int userfaultfd_release(struct inode *inode, struct file *file)
>  	for (vma = mm->mmap; vma; vma = vma->vm_next) {
>  		cond_resched();
>  		BUG_ON(!!vma->vm_userfaultfd_ctx.ctx ^
> -		       !!(vma->vm_flags & (VM_UFFD_MISSING | VM_UFFD_WP)));
> +		       !!(vma->vm_flags & __VM_UFFD_FLAGS));
>  		if (vma->vm_userfaultfd_ctx.ctx != ctx) {
>  			prev = vma;
>  			continue;
>  		}
> -		new_flags = vma->vm_flags & ~(VM_UFFD_MISSING | VM_UFFD_WP);
> +		new_flags = vma->vm_flags & ~__VM_UFFD_FLAGS;
>  		prev = vma_merge(mm, prev, vma->vm_start, vma->vm_end,
>  				 new_flags, vma->anon_vma,
>  				 vma->vm_file, vma->vm_pgoff,
> @@ -1305,9 +1304,29 @@ static inline bool vma_can_userfault(struct vm_area_struct *vma,
>  				     unsigned long vm_flags)
>  {
>  	/* FIXME: add WP support to hugetlbfs and shmem */
> -	return vma_is_anonymous(vma) ||
> -		((is_vm_hugetlb_page(vma) || vma_is_shmem(vma)) &&
> -		 !(vm_flags & VM_UFFD_WP));
> +	if (vm_flags & VM_UFFD_WP) {
> +		if (is_vm_hugetlb_page(vma) || vma_is_shmem(vma))
> +			return false;
> +	}
> +
> +	if (vm_flags & VM_UFFD_MINOR) {
> +		/*
> +		 * The use case for minor registration (intercepting minor
> +		 * faults) is to handle the case where a page is present, but
> +		 * needs to be modified before it can be used. This only makes
> +		 * sense when you have two mappings to the same underlying
> +		 * pages (one UFFD registered, one not), but the memory doesn't
> +		 * have to be shared (consider one process mapping a hugetlbfs
> +		 * file with MAP_SHARED, and then a second process doing
> +		 * MAP_PRIVATE).

No strong opinion, but I'd drop the whole chunk of comment here..

- "what is minor fault" should be covered in the documentation file
  already.

- "two mappings" seems slightly superfluous too, since we can still use
  minor fault with TRUNCATE+UFFDIO_COPY.. if we want? maybe?

- "memory doesn't have to be shared" would be a bit odd too if saying that
  without any code checking against "shared" at all, I'd say. :)

The FIXME below it is fine.

If you agree with above, feel free to add my r-b after dropping the chunk:

Reviewed-by: Peter Xu

Thanks,

-- 
Peter Xu
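
An illustrative aside, not part of the patch or the review: the two new
VM_BUG_ON() checks quoted from handle_userfault() together require that
"reason" carries exactly one recognized flag. Below is a minimal userspace
sketch of that test. The SKETCH_* names, their bit values, and
reason_is_valid() are all hypothetical, chosen only for illustration; the
kernel's real VM_UFFD_* values differ.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Placeholder bit values for illustration only. These are assumptions,
 * not the kernel's actual VM_UFFD_* definitions.
 */
#define SKETCH_UFFD_MISSING	(1UL << 0)
#define SKETCH_UFFD_WP		(1UL << 1)
#define SKETCH_UFFD_MINOR	(1UL << 2)
#define SKETCH_UFFD_FLAGS \
	(SKETCH_UFFD_MISSING | SKETCH_UFFD_WP | SKETCH_UFFD_MINOR)

/*
 * Return true when `reason` holds exactly one recognized flag, i.e. when
 * neither of the two conditions checked by the patch would trigger.
 */
static bool reason_is_valid(unsigned long reason)
{
	/* Any bit outside the recognized mask is rejected. */
	if (reason & ~SKETCH_UFFD_FLAGS)
		return false;
	/*
	 * reason & (reason - 1) clears the lowest set bit, so the result is
	 * nonzero only when more than one bit is set. Zero needs its own
	 * check because 0 & (0 - 1) is also 0.
	 */
	if (!reason || (reason & (reason - 1)))
		return false;
	return true;
}

/* Small self-check exercising the helper above. */
void sketch_self_check(void)
{
	assert(reason_is_valid(SKETCH_UFFD_MINOR));
	assert(!reason_is_valid(0));
	assert(!reason_is_valid(SKETCH_UFFD_MISSING | SKETCH_UFFD_WP));
}
```

Unlike the old XOR-based check, which only worked for two flags, the
power-of-two test stays correct however many registration modes are added
to the mask.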