Date: Tue, 15 Sep 2020 15:13:46 -0400
From: Peter Xu
To: Jason Gunthorpe
Cc: Linus Torvalds, Leon Romanovsky, Linux-MM, Linux Kernel Mailing List,
    "Maya B . Gokhale", Yang Shi, Marty Mcfadden, Kirill Shutemov,
    Oleg Nesterov, Jann Horn, Jan Kara, Kirill Tkhai, Andrea Arcangeli,
    Christoph Hellwig, Andrew Morton
Subject: Re: [PATCH 1/4] mm: Trial do_wp_page() simplification
Message-ID: <20200915191346.GD2949@xz-x1>
References: <20200914143829.GA1424636@nvidia.com>
 <20200914183436.GD30881@xz-x1>
 <20200914211515.GA5901@xz-x1>
 <20200914225542.GO904879@nvidia.com>
 <20200914232851.GH1221970@ziepe.ca>
 <20200915145040.GA2949@xz-x1>
 <20200915160553.GJ1221970@ziepe.ca>
 <20200915182933.GM1221970@ziepe.ca>
In-Reply-To: <20200915182933.GM1221970@ziepe.ca>

On Tue, Sep 15, 2020 at 03:29:33PM -0300, Jason Gunthorpe wrote:
> On Tue, Sep 15, 2020 at 01:05:53PM -0300, Jason
Gunthorpe wrote:
> > On Tue, Sep 15, 2020 at 10:50:40AM -0400, Peter Xu wrote:
> > > On Mon, Sep 14, 2020 at 08:28:51PM -0300, Jason Gunthorpe wrote:
> > > > Yes, this stuff does pin_user_pages_fast() and MADV_DONTFORK
> > > > together. It sets FOLL_FORCE and FOLL_WRITE to get an exclusive copy
> > > > of the page and MADV_DONTFORK was needed to ensure that a future fork
> > > > doesn't establish a COW that would break the DMA by moving the
> > > > physical page over to the fork. DMA should stay with the process that
> > > > called pin_user_pages_fast() (Is MADV_DONTFORK still needed with
> > > > recent years work to GUP/etc? It is a pretty terrible ancient thing)
> > >
> > > ... Now I'm more confused on what has happened.
> >
> > I'm going to try to confirm that the MADV_DONTFORK is actually being
> > done by userspace properly, more later.
>
> It turns out the test is broken and does not call MADV_DONTFORK when
> doing forks - it is an opt-in it didn't do.
>
> It looks to me like this patch makes it much more likely that the COW
> break after page pinning will end up moving the pinned physical page
> to the fork while before it was not very common. Does that make sense?

My understanding is that the fix should not matter much for the current
failing test case, as long as it uses FOLL_FORCE & FOLL_WRITE.  However, what
I'm not sure about is the case where the RDMA/DMA buffers are designed for
pure reads from userspace.  E.g. for vfio I'm looking at vaddr_get_pfn(),
where I believe such pure read buffers will be a GUP with FOLL_PIN and
!FOLL_WRITE, which will finally be passed to pin_user_pages_remote().  So
what I'm worried about is something like this:

  1. Proc A gets a private anon page X for DMA, mapcount==refcount==1.

  2. Proc A fork()s and gives birth to proc B, page X will now have
     mapcount==refcount==2, write-protected.  Proc B quits.  Page X goes
     back to mapcount==refcount==1 (note! without WRITE bits set in the
     PTE).

  3. pin_user_pages(write=false) for page X.
     Since it's with !FORCE & !WRITE, no COW is needed.  Refcount==2 after
     that.

  4. Pass these pages to the device.  We either set up the IOMMU page table
     or just use the PFNs, which is not important imho - the most important
     thing is that the device will DMA into page X no matter what.

  5. Some thread of proc A writes to page X, triggering COW since it's
     write-protected with mapcount==1 && refcount==2.  The HVA that points
     to page X will be changed to point to another page Y after the COW.

  6. Device DMA happens, data resides on X.  Proc A can never get the data,
     though, because it's looking at page Y now.

If this is a problem, we may still need the fix patch (though maybe not as
urgently as before).  But I'd like to double-check, just in case I missed
some obvious facts above.

>
> Given that the tests are wrong it seems like broken userspace,
> however, it also worked reliably for a fairly long time.

IMHO it worked because the page used for RDMA has mapcount==1, so it was
previously reused as-is even after the fork without MADV_DONTFORK and after
the child quits.  However, logically it should really be protected by
MADV_DONTFORK rather than being reused.

Thanks,

-- 
Peter Xu
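
To make the counting in steps 1-6 concrete, here is a toy model of the two counters.  It is only a sketch, not kernel code: ToyPage, reuse_by_mapcount(), reuse_by_refcount() and page_before_write() are made-up names, and the two "rules" are simplified assumptions based on the description above (reuse when mapcount==1 before the patch, reuse only when refcount==1 with it).  It ignores locking, swap cache, THP and everything else the real write-fault path has to handle.

```python
from dataclasses import dataclass


@dataclass
class ToyPage:
    """Toy stand-in for the two counters in the steps above; not struct page."""
    mapcount: int = 1  # number of PTEs mapping the page
    refcount: int = 1  # mappings plus extra references such as GUP pins


def reuse_by_mapcount(page: ToyPage) -> bool:
    # Assumed old behaviour: a write fault reuses the page when one PTE maps it.
    return page.mapcount == 1


def reuse_by_refcount(page: ToyPage) -> bool:
    # Assumed behaviour with the patch: reuse only if nobody else holds a reference.
    return page.refcount == 1


def page_before_write() -> ToyPage:
    """Replay steps 1-3 and return page X as it looks just before step 5."""
    x = ToyPage()      # step 1: private anon page, mapcount==refcount==1
    x.mapcount += 1    # step 2: fork() maps X into proc B
    x.refcount += 1
    x.mapcount -= 1    # step 2: proc B quits
    x.refcount -= 1
    x.refcount += 1    # step 3: read-only pin, no COW
    return x


if __name__ == "__main__":
    x = page_before_write()
    print(x)
    print("mapcount rule:", "reuse X" if reuse_by_mapcount(x) else "copy to Y")
    print("refcount rule:", "reuse X" if reuse_by_refcount(x) else "copy to Y")
```

Under these assumptions the mapcount rule keeps proc A on page X at step 5, while the refcount rule copies to page Y, which is the divergence from the device's DMA target described in step 6.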