From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-13.3 required=3.0 tests=BAYES_00,DKIMWL_WL_MED, DKIM_SIGNED,DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, MAILING_LIST_MULTI,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_DKIM_WL autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id 6A240C433ED for ; Tue, 13 Apr 2021 04:41:02 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 3806A613AE for ; Tue, 13 Apr 2021 04:41:02 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S229791AbhDMElT (ORCPT ); Tue, 13 Apr 2021 00:41:19 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:42980 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S229779AbhDMElT (ORCPT ); Tue, 13 Apr 2021 00:41:19 -0400 Received: from mail-io1-xd2d.google.com (mail-io1-xd2d.google.com [IPv6:2607:f8b0:4864:20::d2d]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 75AD2C06138C for ; Mon, 12 Apr 2021 21:41:00 -0700 (PDT) Received: by mail-io1-xd2d.google.com with SMTP id x16so15811517iob.1 for ; Mon, 12 Apr 2021 21:41:00 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20161025; h=mime-version:references:in-reply-to:from:date:message-id:subject:to :cc; bh=Bx8fq9RMVJeYgjbVUnNNsjq4NPDKlddkrDot/R9W3U0=; b=qKzEmUjCHxF9v7Gk82y4AynFWSLYAJC6SJRUorlaJEsI9Xns8UJNnx8nRAVPggQx0s LWJzTdXinMJNElG07Ww6Bnew9ELqWfMsI6yYw3jmvxBi5cXn2u2EMnnr9aZsXkwP6mTZ gaRf9nEztlW+Cn3CNcpGrJ9+gDcMngs9Nng8vE4gCYcpyqqhqI3V7U28TkEo9FmwJFwc jktzgI39xnAJx6LnHhRef1VUg9LKH28vrRXbzDYq1lnlVfxFGHaTaKlFx2Ah0D1uih1n c7zD8YQ/ssMJ61nT9aM1ExgPBPmLRWtwpTGBl1pxSvDXa96TffWdKP483bmZZ7+Bc1sT vXMQ== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:references:in-reply-to:from:date :message-id:subject:to:cc; bh=Bx8fq9RMVJeYgjbVUnNNsjq4NPDKlddkrDot/R9W3U0=; b=DaXBLvrbTR38wLgmPNQbZso6k0qqDLreQVlhTefHFi0bbzOev5omQHSQ0vE/ByO86M B18Gtv3UYLKlq6dZVsuvFOexB2wWQ9z58NaV7mcl+xeXA1qxCATw3CzIgp4S5tC5i6GS m1X2QMBk/1PaT8MKeV2B3X0slVfJ2we9USDHMKiYOhZTwmsfO6OIXaoRbsydhE0CauXU 8rJpPMjIjrEWDsIH1BNzCNOZCjdazr+Fmdhh0tu3c1+Ps8LuwWbJc96891szuriMv5uV 60KUOM5mJ1QKBG/gGM1PPcC2/0/0j0nqSnPwFTh40X9UxqzZh1f/ySPQSFbHBAjHlNa+ 5eXg== X-Gm-Message-State: AOAM5328l0V3QOMUPuYzHZ3Jrfufy56XWcor/mB4vQqGnGxSrzzx0zmd UC2Wi6VppKjM+k+OglJAEP4/ZieJagzINTUuco8tBg== X-Google-Smtp-Source: ABdhPJwXI/HKUS1nvH+kTJZ0qN/bCT+G1Iz1obFXECV6bQgKoKWtAB2XGAaDRQDS+bHpG6sbHn3DJbPfx/d0Y/AwLxo= X-Received: by 2002:a05:6602:2dce:: with SMTP id l14mr6765893iow.23.1618288859391; Mon, 12 Apr 2021 21:40:59 -0700 (PDT) MIME-Version: 1.0 References: <20210408234327.624367-1-axelrasmussen@google.com> <20210408234327.624367-5-axelrasmussen@google.com> <20210412231736.GA1002612@xz-x1> In-Reply-To: <20210412231736.GA1002612@xz-x1> From: Axel Rasmussen Date: Mon, 12 Apr 2021 21:40:22 -0700 Message-ID: Subject: Re: [PATCH 4/9] userfaultfd/shmem: support UFFDIO_CONTINUE for shmem To: Peter Xu Cc: Alexander Viro , Andrea Arcangeli , Andrew Morton , Daniel Colascione , Hugh Dickins , Jerome Glisse , Joe Perches , Lokesh Gidra , Mike Kravetz , Mike Rapoport , Shaohua Li , Shuah Khan , Stephen Rothwell , Wang Qing , linux-api@vger.kernel.org, linux-fsdevel@vger.kernel.org, LKML , linux-kselftest@vger.kernel.org, Linux MM , Brian Geffon , "Dr . David Alan Gilbert" , Mina Almasry , Oliver Upton Content-Type: text/plain; charset="UTF-8" Precedence: bulk List-ID: X-Mailing-List: linux-kselftest@vger.kernel.org On Mon, Apr 12, 2021 at 4:17 PM Peter Xu wrote: > > On Thu, Apr 08, 2021 at 04:43:22PM -0700, Axel Rasmussen wrote: > > +/* > > + * Install PTEs, to map dst_addr (within dst_vma) to page. > > + * > > + * This function handles MCOPY_ATOMIC_CONTINUE (which is always file-backed), > > + * whether or not dst_vma is VM_SHARED. It also handles the more general > > + * MCOPY_ATOMIC_NORMAL case, when dst_vma is *not* VM_SHARED (it may be file > > + * backed, or not). > > + * > > + * Note that MCOPY_ATOMIC_NORMAL for a VM_SHARED dst_vma is handled by > > + * shmem_mcopy_atomic_pte instead. > > + */ > > +static int mcopy_atomic_install_ptes(struct mm_struct *dst_mm, pmd_t *dst_pmd, > > + struct vm_area_struct *dst_vma, > > + unsigned long dst_addr, struct page *page, > > + bool newly_allocated, bool wp_copy) > > +{ > > + int ret; > > + pte_t _dst_pte, *dst_pte; > > + int writable; > > + bool vm_shared = dst_vma->vm_flags & VM_SHARED; > > + spinlock_t *ptl; > > + struct inode *inode; > > + pgoff_t offset, max_off; > > + > > + _dst_pte = mk_pte(page, dst_vma->vm_page_prot); > > + writable = dst_vma->vm_flags & VM_WRITE; > > + /* For private, non-anon we need CoW (don't write to page cache!) */ > > + if (!vma_is_anonymous(dst_vma) && !vm_shared) > > + writable = 0; > > + > > + if (writable || vma_is_anonymous(dst_vma)) > > + _dst_pte = pte_mkdirty(_dst_pte); > > + if (writable) { > > + if (wp_copy) > > + _dst_pte = pte_mkuffd_wp(_dst_pte); > > + else > > + _dst_pte = pte_mkwrite(_dst_pte); > > + } else if (vm_shared) { > > + /* > > + * Since we didn't pte_mkdirty(), mark the page dirty or it > > + * could be freed from under us. We could do this > > + * unconditionally, but doing it only if !writable is faster. > > + */ > > + set_page_dirty(page); > > + } > > + > > + dst_pte = pte_offset_map_lock(dst_mm, dst_pmd, dst_addr, &ptl); > > + > > + if (vma_is_shmem(dst_vma)) { > > + /* The shmem MAP_PRIVATE case requires checking the i_size */ > > When you start to use this function in the last patch it'll be needed too even > if MAP_SHARED? > > How about directly state the reason of doing this ("serialize against truncate > with the PT lock") instead of commenting about "who will need it"? > > > + inode = dst_vma->vm_file->f_inode; > > + offset = linear_page_index(dst_vma, dst_addr); > > + max_off = DIV_ROUND_UP(i_size_read(inode), PAGE_SIZE); > > + ret = -EFAULT; > > + if (unlikely(offset >= max_off)) > > + goto out_unlock; > > + } > > [...] > > > +/* Handles UFFDIO_CONTINUE for all shmem VMAs (shared or private). */ > > +static int mcontinue_atomic_pte(struct mm_struct *dst_mm, > > + pmd_t *dst_pmd, > > + struct vm_area_struct *dst_vma, > > + unsigned long dst_addr, > > + bool wp_copy) > > +{ > > + struct inode *inode = file_inode(dst_vma->vm_file); > > + pgoff_t pgoff = linear_page_index(dst_vma, dst_addr); > > + struct page *page; > > + int ret; > > + > > + ret = shmem_getpage(inode, pgoff, &page, SGP_READ); > > SGP_READ looks right, as we don't want page allocation. However I noticed > there's very slight difference when the page was just fallocated: > > /* fallocated page? */ > if (page && !PageUptodate(page)) { > if (sgp != SGP_READ) > goto clear; > unlock_page(page); > put_page(page); > page = NULL; > hindex = index; > } > > I think it won't happen for your case since the page should be uptodate already > (the other thread should check and modify the page before CONTINUE), but still > raise this up, since if the page was allocated it smells better to still > install the fallocated page (do we need to clear the page and SetUptodate)? Sorry for the somewhat rambling thought process: My first thought is, I don't really know what PageUptodate means for shmem pages. If I understand correctly, normally we say PageUptodate() if the in memory data is more recent or equivalent to the on-disk data. But, shmem pages are entirely in memory - they are file backed in name only, in some sense. fallocate() does all sorts of things so the comment to me seems a bit ambiguous, but it seems the implication is that we're worried specifically about the case where the shmem page was recently allocated with fallocate(mode=0)? In that case, do we use !PageUptodate() to denote that the page has been allocated, but its contents are undefined? I suppose that would make sense, as the action "goto clear;" generally memset()-s the page to zero it, and then calls SetPageUptodate(). Okay so let's say the following sequence of events happens: 1. Userspace calls fallocate(mode=0) to allocate some shmem pages. 2. Another thread, via a UFFD-registered mapping, manages to trigger a minor fault on one such page, while we still have !PageUptodate(). (I'm not 100% sure this can happen, but let's say it can.) 3. UFFD handler thread gets the minor fault event, and for whatever (buggy?) reason does nothing - it doesn't modify the page, it just calls CONTINUE. I think if we get to this point, zeroing the page, returning it, and setting up the PTEs seems somewhat reasonable to me. I suppose alternatively we could notice that this happened and return an error to the caller? I'm hesitant to mess with the behavior of shmem_getpage_gfp() to make such a thing happen though. I do think if we're going to set up the PTEs instead of returning an error, we definitely do need to clear and SetPageUptodate() the page first. In conclusion, I think this behavior is correct. > > -- > Peter Xu >