From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org X-Spam-Level: X-Spam-Status: No, score=-5.8 required=3.0 tests=BAYES_00,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS autolearn=no autolearn_force=no version=3.4.0 Received: from mail.kernel.org (mail.kernel.org [198.145.29.99]) by smtp.lore.kernel.org (Postfix) with ESMTP id A72EEC433E0 for ; Mon, 11 Jan 2021 14:31:27 +0000 (UTC) Received: from vger.kernel.org (vger.kernel.org [23.128.96.18]) by mail.kernel.org (Postfix) with ESMTP id 67F412251F for ; Mon, 11 Jan 2021 14:31:27 +0000 (UTC) Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1731039AbhAKOb0 (ORCPT ); Mon, 11 Jan 2021 09:31:26 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37892 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729869AbhAKObY (ORCPT ); Mon, 11 Jan 2021 09:31:24 -0500 Received: from mail-il1-x12f.google.com (mail-il1-x12f.google.com [IPv6:2607:f8b0:4864:20::12f]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id D3526C061786 for ; Mon, 11 Jan 2021 06:30:43 -0800 (PST) Received: by mail-il1-x12f.google.com with SMTP id y13so10607702ilm.12 for ; Mon, 11 Jan 2021 06:30:43 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ziepe.ca; s=google; h=date:from:to:cc:subject:message-id:references:mime-version :content-disposition:in-reply-to; bh=KMF8Z4FWkjNeZljl8WD2NdixP/23YL2yC6tkyI8SqoM=; b=fB3ODE+P38k+rxpwzuWHcrxJ3NGma2vpXW+RePxBICF1ySjkFRi+HDqfENHHdUUxXu 5/W0YM0RUxgcjOCJpK6o7vHAZh99xLsK9/5VApHbf1rYHP2ed3vV2CHw9Mx5raouv8k6 7KEOPJruEadjnWd+ngnafRaaDZ4Gaoedm0YEgV6wNLoVtjpoV+i6y1UC2hbqTSDq5iTl j9PsTPR+jeqbPHKdd3BaTkPjUaliLFrAQRtNwawQseTOj2f4w4BYVSIyvVAeoxWfG8j1 84NxuB1aWa4DCv7+zxuAu3wheOYhngp69+BNvawrn8bDym20dQBkO/VpTDFlZimLwbMi 4nkA== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:date:from:to:cc:subject:message-id:references :mime-version:content-disposition:in-reply-to; bh=KMF8Z4FWkjNeZljl8WD2NdixP/23YL2yC6tkyI8SqoM=; b=uaTE67J4Gz78Z8B/YPEWXD16iKNWRlXaK76GWia9Qcnpy0sQbBMStVr+IIHi8LWUIw VDO5IH//ekJaCUwSGndtxhTZ/eBLBXYKzngEUHXgCPRsM1N6P6+G8Xy/+sximK2Iz/4V q5mlx4GmTxpdALNJ2ew5S+NOacxorPBSDp5c+k8/obDcOAsxjvx3wkggAX0TUwYHgNQR tfdj6kaYvRd0bVIK2P9PEBdGG8Qz2Ip8Nf8YW4YPxAypnGIJhegJQff9W3wDcx9e1s3/ Msuj6fAzXB8jqvoFf1ZSLfU7L5oniXwL1OZ8rDZq6y+C3W2tUuvHcDUt8aUCu71RyGoU G3TA== X-Gm-Message-State: AOAM531N6qH+sIHwtK0aoYMD7B7EDeWAccXhP3p/1U1lnE448PvV0DIm 4BGxTjDlCOWaBUbXXk8RS2/EAg== X-Google-Smtp-Source: ABdhPJzqdEDH3WPfRNU/XsnpU8jeOuA8kd5tFyq5xCR+hHYuQb5mo0xFb8H4iIUDd1JPA9p2Ol+row== X-Received: by 2002:a92:d40a:: with SMTP id q10mr15882339ilm.20.1610375443138; Mon, 11 Jan 2021 06:30:43 -0800 (PST) Received: from ziepe.ca (hlfxns017vw-142-162-115-133.dhcp-dynamic.fibreop.ns.bellaliant.net. [142.162.115.133]) by smtp.gmail.com with ESMTPSA id j65sm16018698ilg.53.2021.01.11.06.30.42 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 11 Jan 2021 06:30:42 -0800 (PST) Received: from jgg by mlx with local (Exim 4.94) (envelope-from ) id 1kyyDR-005WFX-K9; Mon, 11 Jan 2021 10:30:41 -0400 Date: Mon, 11 Jan 2021 10:30:41 -0400 From: Jason Gunthorpe To: Andrea Arcangeli Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Yu Zhao , Andy Lutomirski , Peter Xu , Pavel Emelyanov , Mike Kravetz , Mike Rapoport , Minchan Kim , Will Deacon , Peter Zijlstra , Linus Torvalds , Hugh Dickins , "Kirill A. Shutemov" , Matthew Wilcox , Oleg Nesterov , Jann Horn , Kees Cook , John Hubbard , Leon Romanovsky , Jan Kara , Kirill Tkhai Subject: Re: [PATCH 0/2] page_count can't be used to decide when wp_page_copy Message-ID: <20210111143041.GI504133@ziepe.ca> References: <20210107200402.31095-1-aarcange@redhat.com> <20210107202525.GD504133@ziepe.ca> <20210108133649.GE504133@ziepe.ca> <20210108181945.GF504133@ziepe.ca> <20210109004255.GG504133@ziepe.ca> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Fri, Jan 08, 2021 at 09:50:08PM -0500, Andrea Arcangeli wrote: > For all those that aren't using mmu notifier and that rely solely on > page pins, they still require privilege, except they do through /dev/ > permissions. It is normal that the dev nodes are a+rw so it doesn't really require privilege in any real sense. > Actually the mmu notifier doesn't strictly require pins, it only > requires GUP. All users tend to use FOLL_GET just as a safety > precaution (I already tried to optimize away the two atomics per GUP, > but we were naked by the KVM maintainer that didn't want to take the > risk, I would have, but it's a fair point indeed, obviously it's safer > with the pin plus the mmu notifier, two is safer than one). I'm not sure what holding the pin will do to reduce risk? If you get into a situation where you are stuffing a page into the SMMU that is not in the CPU's MMU then everything is lost. Holding a pin while carrying a page from the CPU page table to the SMMU just ensures that page isn't freed until it is installed, but once installed you are back to being broken. > I'm not sure how any copy-user could obviate a secondary MMU mapping, > mappings and copies are mutually exclusive. Any copy would be breaking > memory coherency in this environment. Because most places need to copy from user to stable kernel memory before processing data under user control. You can't just cast a user controlled pointer to a kstruct and use it - that is very likely a security bug. Still, the general version is something like kmap: map = user_map_setup(user_ptr, length) kptr = user_map_enter(map) [use kptr] user_map_leave(map, kptr) And inside it could use mmu notifiers, or gup, or whatever. user_map_setup() would register the notifier and user_map_enter() would validate the cache'd page pointer and block cached invalidation until user_map_leave(). > The primary concern with the mmu notifier in io_uring is the > take_all_locks latency. Just enabling mmu_notifier takes a performance hit on the entire process too, it is not such a simple decision.. We'd need benchmarks against a database or scientific application to see how negative the notifier actually becomes. > The problem with the mmu notifier as an universal solution, for > example is that it can't wait for I/O completion of O_DIRECT since it > has no clue where the put_page is to wait for it, otherwise we could > avoid even the FOLL_GET for O_DIRECT and guarantee the I/O has to be > completed before paging or anything can unmap the page under I/O from > the pagetable. GPU is already doing something like this, waiting in a notifier invalidate callback for DMA to finish before allowing invalidate to complete. It is horrendously complicated and I'm not sure blocking invalidate for a long time is actually much better for the MM.. > I see the incompatibility you describe as problem we have today, in > the present, and that will fade with time. > > Reminds me when we had >4G of RAM and 32bit devices doing DMA. How > many 32bit devices are there now? I'm not so sure anymore. A few years ago OpenCAPI and PCI PRI seemed like good things, but now with experience they carry pretty bad performance hits to use them. Lots of places are skipping them. CXL offers another chance at this, so we'll see again in another 5 years or so if it works out. It is not any easy problem to solve from a HW perspective. > We're not talking here about any random PCI device, we're talking here > about very special and very advanced devices that need to have "long > term" GUP pins in order to operate, not the usual nvme/gigabit device > where GUP pins are never long term. Beyond RDMA, netdev's XDP uses FOLL_LONGTERM, so do various video devices, lots of things related to virtualization like vfio, vdpa and vhost. I think this is a bit defeatist to say it doesn't matter. If anything as time goes on it seems to be growing, not shrinking currently. > The point is that if you do echo ... >/proc/self/clear_refs on your > pid, that has any FOLL_LONGTERM on its mm, it'll just cause your > device driver to go out of sync with the mm. It'll see the old pages, > before the spurious COWs. The CPU will use new pages (the spurious > COWs). But if you do that then clear-refs isn't going to work they way it thought either - this first needs some explanation for how clear_refs is supposed to work when DMA WRITE is active on the page. I'd certainly say causing a loss of synchrony is not acceptable, so if we keep Linus's version of COW then clear_refs has to not write protect pages under DMA. > > secondary-mmu drivers using mmu notifier should not trigger this logic > > and should not restrict write protect. > > That's a great point. I didn't think the mmu notifier will invalidate > the secondary MMU and ultimately issue a GUP after the wp_copy_page to > keep it in sync. It had better, or mmu notifiers are broken, right? > The funny thing that doesn't make sense is that wp_copy_page will only > be invoked because the PIN was left by KVM on the page for that extra > safety I was talking about earlier. Yes, with the COW change if kvm cares about this inefficiency it should not have the unnecessary pin. > You clearly contemplate the existance of a read mode, long term. That > is also completely compatible with wrprotection. We talked about a read mode, but we didn't flesh it out. It is not unconditionally compatible with wrprotect - most likely you still can't write protect a page under READ DMA because when you eventually take the COW there will be ambiguous situations that will break the synchrony. Jason