From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-18.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, INCLUDES_PATCH,MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=ham autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 51757C433DB for ; Wed, 10 Feb 2021 18:01:00 +0000 (UTC) Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by mail.kernel.org (Postfix) with ESMTP id DC7BE64D9E for ; Wed, 10 Feb 2021 18:00:59 +0000 (UTC) DMARC-Filter: OpenDMARC Filter v1.3.2 mail.kernel.org DC7BE64D9E Authentication-Results: mail.kernel.org; dmarc=fail (p=reject dis=none) header.from=google.com Authentication-Results: mail.kernel.org; spf=pass smtp.mailfrom=owner-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix) id 6432C6B006C; Wed, 10 Feb 2021 13:00:59 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 5F39E6B006E; Wed, 10 Feb 2021 13:00:59 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 4E29B6B0070; Wed, 10 Feb 2021 13:00:59 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0161.hostedemail.com [216.40.44.161]) by kanga.kvack.org (Postfix) with ESMTP id 392786B006C for ; Wed, 10 Feb 2021 13:00:59 -0500 (EST) Received: from smtpin24.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay01.hostedemail.com (Postfix) with ESMTP id ECCAF183F8681 for ; Wed, 10 Feb 2021 18:00:58 +0000 (UTC) X-FDA: 77803124196.24.team31_2a08cb827612 Received: from filter.hostedemail.com (10.5.16.251.rfc1918.com [10.5.16.251]) by smtpin24.hostedemail.com (Postfix) with ESMTP id C94EE1A4A5 for ; Wed, 10 Feb 2021 18:00:58 +0000 (UTC) X-HE-Tag: team31_2a08cb827612 X-Filterd-Recvd-Size: 12178 Received: from mail-io1-f42.google.com (mail-io1-f42.google.com [209.85.166.42]) by imf36.hostedemail.com (Postfix) with ESMTP for ; Wed, 10 Feb 2021 18:00:58 +0000 (UTC) Received: by mail-io1-f42.google.com with SMTP id e133so2821639iof.8 for ; Wed, 10 Feb 2021 10:00:58 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=o49AbbaqNtmBNLl+e+wTukbuIeuXAQ703tVisLKMY9c=; b=VnBvbp2kKLGhA2qV6eCfPPdlSh0iITgW/Ydl0Kov5nh6W1YxOmH+Ik/0n9wAqnrnLh LX0jtFzkMrTa3yi3gbWkvJk33UG+i07lV5sfsGdwYPKoGKfeg60nkwwSvdBANDftw6+T OevAAzoj++555XcWpQBgfTaxVoqlRmevJnnDfpRCfWwW2q/wPA+848sPmJeejxfmqWWr bgNLqtxSh7yCPEG7xQPxKBjdp7EpjOjoT7losdxwYbVb0rm1ytkUo72a+gzsnmUfTVPl siAAUbZh2GPRkb2UJ85C31an6GT3FqzTQX55e0tXmLhbDZMFU7NhOJqbJAojJMKVWOlS jDJg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=o49AbbaqNtmBNLl+e+wTukbuIeuXAQ703tVisLKMY9c=; b=Hl2PBqzfZ6iuQjIjdhK6srgJ9tARupGS+QabggojfSu+mpKn19b5yfh/Ils1jBWiLC KQZaoMYucOsQukGGChfvMNr9LrzccsjqdSTU8wQWZ9P00TFEguhokB05qFSSo+Bmnx7Y zPnEp40mZYa/V3nl2St9Ymg/lUjcdxa9O+RDy174n/y/E2ihh/ksOdyR75tUmsRCUYt3 zqvVOqhDJqnxJZ3HaiOzIPP20VnDpyJfVN61NZJagKCHF7TL0VN2UQyHSdg6ELIDjMwF kJXTb6Msj5Rynag38PtHDo/IKFLS880pKmnEdIqARGFmeC2GsjajqzMAIDUAg/fitKuh SaqQ== X-Gm-Message-State: AOAM533FtvPQJ8DWKCkDo0GhuD88HYX14u+ya1XRHWGOHVT9UBzCeHQb Gwpscn/hVVynlYk8+Qvn7Dc11EV/uUVzdOAu6UaA8g== X-Google-Smtp-Source: ABdhPJz14jAl0ZNEsgW94gY4Oc38D/91s0rn8mt639JdpQP4MMeQ3rmxqg/qx1Smz8psQGuJotvWDsnmDxhOsAhhLRA= X-Received: by 2002:a5e:8807:: with SMTP id l7mr1921757ioj.23.1612980057418; Wed, 10 Feb 2021 10:00:57 -0800 (PST) MIME-Version: 1.0 References: <20210204183433.1431202-1-axelrasmussen@google.com> <20210204183433.1431202-9-axelrasmussen@google.com> <20210208235411.GC71523@xz-x1> In-Reply-To: <20210208235411.GC71523@xz-x1> From: Axel Rasmussen Date: Wed, 10 Feb 2021 10:00:21 -0800 Message-ID: Subject: Re: [PATCH v4 08/10] userfaultfd: add UFFDIO_CONTINUE ioctl To: Peter Xu Cc: Alexander Viro , Alexey Dobriyan , Andrea Arcangeli , Andrew Morton , Anshuman Khandual , Catalin Marinas , Chinwen Chang , Huang Ying , Ingo Molnar , Jann Horn , Jerome Glisse , Lokesh Gidra , "Matthew Wilcox (Oracle)" , Michael Ellerman , =?UTF-8?Q?Michal_Koutn=C3=BD?= , Michel Lespinasse , Mike Kravetz , Mike Rapoport , Nicholas Piggin , Shaohua Li , Shawn Anastasio , Steven Rostedt , Steven Price , Vlastimil Babka , LKML , linux-fsdevel@vger.kernel.org, Linux MM , Adam Ruprecht , Cannon Matthews , "Dr . David Alan Gilbert" , David Rientjes , Mina Almasry , Oliver Upton Content-Type: text/plain; charset="UTF-8" X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Mon, Feb 8, 2021 at 3:54 PM Peter Xu wrote: > > On Thu, Feb 04, 2021 at 10:34:31AM -0800, Axel Rasmussen wrote: > > +enum mcopy_atomic_mode { > > + /* A normal copy_from_user into the destination range. */ > > + MCOPY_ATOMIC_NORMAL, > > + /* Don't copy; map the destination range to the zero page. */ > > + MCOPY_ATOMIC_ZEROPAGE, > > + /* Just setup the dst_vma, without modifying the underlying page(s). */ > > "setup the dst_vma" sounds odd. How about "install pte with the existing page > in the page cache"? > > > + MCOPY_ATOMIC_CONTINUE, > > +}; > > [...] > > > @@ -4749,22 +4754,27 @@ int hugetlb_mcopy_atomic_pte(struct mm_struct *dst_mm, > > hugepage_add_new_anon_rmap(page, dst_vma, dst_addr); > > } > > > > - _dst_pte = make_huge_pte(dst_vma, page, dst_vma->vm_flags & VM_WRITE); > > - if (dst_vma->vm_flags & VM_WRITE) > > + dst_pte_flags = dst_vma->vm_flags & VM_WRITE; > > + /* For CONTINUE on a non-shared VMA, don't set VM_WRITE for CoW. */ > > + if (mode == MCOPY_ATOMIC_CONTINUE && !vm_shared) > > + dst_pte_flags &= ~VM_WRITE; > > I agree it should work but it's odd to explicitly remove a VM_WRITE bit, since > imho what we want to do is not changing vma or vma flags but deciding whether > to keep the write bit in the ptes. How about as simple as: > > bool writable; > > if (mode == MCOPY_ATOMIC_CONTINUE && !vm_shared) > writable = false; > else > writable = dst_vma->vm_flags & VM_WRITE; > > _dst_pte = make_huge_pte(dst_vma, page, writable); > if (writable) > _dst_pte = huge_pte_mkdirty(_dst_pte); > > ? > > > + _dst_pte = make_huge_pte(dst_vma, page, dst_pte_flags); > > + if (dst_pte_flags & VM_WRITE) > > _dst_pte = huge_pte_mkdirty(_dst_pte); > > _dst_pte = pte_mkyoung(_dst_pte); > > > > set_huge_pte_at(dst_mm, dst_addr, dst_pte, _dst_pte); > > > > (void)huge_ptep_set_access_flags(dst_vma, dst_addr, dst_pte, _dst_pte, > > - dst_vma->vm_flags & VM_WRITE); > > + dst_pte_flags); > > hugetlb_count_add(pages_per_huge_page(h), dst_mm); > > > > /* No need to invalidate - it was non-present before */ > > update_mmu_cache(dst_vma, dst_addr, dst_pte); > > > > spin_unlock(ptl); > > - set_page_huge_active(page); > > + if (mode != MCOPY_ATOMIC_CONTINUE) > > + set_page_huge_active(page); > > This has been changed to SetHPageMigratable(page) in akpm-next by Mike's new > series. So maybe it's time to rebase your series to that starting from the > next post. > > > if (vm_shared) > > unlock_page(page); > > After removing the shared restriction, I think we need: > > if (vm_shared || (mode == MCOPY_ATOMIC_CONTINUE)) > unlock_page(page); > > Since we seem to check (mode == MCOPY_ATOMIC_CONTINUE) a lot, maybe we can > introduce a temp var for that too. > > > ret = 0; > > diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c > > index b2ce61c1b50d..7bf83ffa456b 100644 > > --- a/mm/userfaultfd.c > > +++ b/mm/userfaultfd.c > > @@ -207,7 +207,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, > > unsigned long dst_start, > > unsigned long src_start, > > unsigned long len, > > - bool zeropage) > > + enum mcopy_atomic_mode mode) > > { > > int vm_alloc_shared = dst_vma->vm_flags & VM_SHARED; > > int vm_shared = dst_vma->vm_flags & VM_SHARED; > > @@ -227,7 +227,7 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, > > * by THP. Since we can not reliably insert a zero page, this > > * feature is not supported. > > */ > > - if (zeropage) { > > + if (mode == MCOPY_ATOMIC_ZEROPAGE) { > > mmap_read_unlock(dst_mm); > > return -EINVAL; > > } > > @@ -273,8 +273,6 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, > > } > > > > while (src_addr < src_start + len) { > > - pte_t dst_pteval; > > - > > BUG_ON(dst_addr >= dst_start + len); > > > > /* > > @@ -297,16 +295,17 @@ static __always_inline ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, > > goto out_unlock; > > } > > > > - err = -EEXIST; > > - dst_pteval = huge_ptep_get(dst_pte); > > - if (!huge_pte_none(dst_pteval)) { > > - mutex_unlock(&hugetlb_fault_mutex_table[hash]); > > - i_mmap_unlock_read(mapping); > > - goto out_unlock; > > + if (mode != MCOPY_ATOMIC_CONTINUE) { > > + if (!huge_pte_none(huge_ptep_get(dst_pte))) { > > Maybe merge the two "if"s? > > > + err = -EEXIST; > > + mutex_unlock(&hugetlb_fault_mutex_table[hash]); > > + i_mmap_unlock_read(mapping); > > + goto out_unlock; > > + } > > } > > > > err = hugetlb_mcopy_atomic_pte(dst_mm, dst_pte, dst_vma, > > - dst_addr, src_addr, &page); > > + dst_addr, src_addr, mode, &page); > > > > mutex_unlock(&hugetlb_fault_mutex_table[hash]); > > i_mmap_unlock_read(mapping); > > @@ -408,7 +407,7 @@ extern ssize_t __mcopy_atomic_hugetlb(struct mm_struct *dst_mm, > > unsigned long dst_start, > > unsigned long src_start, > > unsigned long len, > > - bool zeropage); > > + enum mcopy_atomic_mode mode); > > #endif /* CONFIG_HUGETLB_PAGE */ > > > > static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, > > @@ -417,10 +416,14 @@ static __always_inline ssize_t mfill_atomic_pte(struct mm_struct *dst_mm, > > unsigned long dst_addr, > > unsigned long src_addr, > > struct page **page, > > - bool zeropage, > > + enum mcopy_atomic_mode mode, > > bool wp_copy) > > { > > ssize_t err; > > + bool zeropage = (mode == MCOPY_ATOMIC_ZEROPAGE); > > + > > + if (mode == MCOPY_ATOMIC_CONTINUE) > > + return -EINVAL; > > So you still passed in the mode into mfill_atomic_pte() just to make sure > CONTINUE is not called there. It's okay, but again I think it's not extremely > necessary: we should make sure to fail early at the entry of uffdio_continue() > by checking against the vma type to be hugetlb, rather than reaching here. Hmm, it's not quite as simple as that. We don't have the dst_vma yet in uffdio_continue(), __mcopy_atomic looks it up. I'd prefer not to look it up in uffdio_continue(), because I think that means changing the API so all the ioctls look up the vma, and then plumb it into __mcopy_atomic. (We don't want to look it up twice, since each lookup has to traverse the rbtree.) This is complicated too by the fact that the ioctl handlers would need to perform various validation / checks - e.g., acquiring mmap_lock, dealing with *mmap_changing, validating the range, .... We can move the enforcement up one more layer, into __mcopy_atomic, easily enough, though. The other comments above I agree with, so I'll send a v5. :) > > Thanks, > > -- > Peter Xu >