Date: Tue, 27 Sep 2022 22:47:29 +0000
From: Sean Christopherson
To: Fuad Tabba
Cc: Chao Peng, David Hildenbrand, kvm@vger.kernel.org,
	linux-kernel@vger.kernel.org, linux-mm@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
	linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini,
	Jonathan Corbet, Vitaly Kuznetsov, Wanpeng Li, Jim Mattson,
	Joerg Roedel, Thomas Gleixner, Ingo Molnar, Borislav Petkov,
	x86@kernel.org, "H. Peter Anvin", Hugh Dickins, Jeff Layton,
	"J. Bruce Fields", Andrew Morton, Shuah Khan, Mike Rapoport,
	Steven Price, "Maciej S. Szmigiero", Vlastimil Babka,
	Vishal Annapurve, Yu Zhang, "Kirill A. Shutemov", luto@kernel.org,
	jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com,
	aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com,
	Quentin Perret, Michael Roth, mhocko@suse.com, Muchun Song,
	wei.w.wang@intel.com, Will Deacon, Marc Zyngier
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com>
	<20220915142913.2213336-2-chao.p.peng@linux.intel.com>
	<20220926142330.GC2658254@chaop.bj.intel.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=us-ascii
Content-Disposition: inline

On Mon, Sep 26, 2022, Fuad Tabba wrote:
> Hi,
>
> On Mon, Sep 26, 2022 at 3:28 PM Chao Peng wrote:
> >
> > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > >
> > > >   1. Not be supported for TDX or SEV-SNP because they don't allow
> > > >      adding non-zero memory into the guest (after pre-boot phase).
> > > >
> > > >   2. Be mutually exclusive with shared<=>private conversions, and is
> > > >      allowed if and only if the entire gfn range of the associated
> > > >      memslot is shared.
> > >
> > > In general I think that this would work with pKVM. However, limiting
> > > private<->shared conversions to the granularity of a whole memslot
> > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > concept of memslots. For example, in pKVM right now, when a guest
> > > shares back its restricted DMA pool with the host it does so at the
> > > page-level.

Y'all are killing me :-)

Isn't the guest enlightened?  E.g. can't you tell the guest "thou shalt
share at granularity X"?  With KVM's newfangled scalable memslots and
per-vCPU MRU slot, X doesn't even have to be that high to get reasonable
performance, e.g. assuming the DMA pool is at most 2GiB and X is 2MiB,
that's "only" 2GiB / 2MiB = 1024 memslots, which is supposed to work just
fine in KVM (rough sketch of the carving at the end of this mail).

> > > pKVM would also need a way to make an fd accessible again
> > > when shared back, which I think isn't possible with this patch.
> >
> > But does pKVM really want to mmap/munmap a new region at the page-level,
> > that can cause VMA fragmentation if the conversion is frequent as I see.
> > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > be the same issue.
>
> pKVM doesn't really need to unmap the memory. What is really important
> is that the memory is not GUP'able.

Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
otherwise KVM wouldn't be able to get the PFN to map into guest memory.

The problem is that gup() and "mapped" are tied together.  So yes, pKVM
doesn't strictly need to unmap memory _in the untrusted host_, but since
mapped==guppable, the end result is the same.

Emphasis above because pKVM still needs to unmap the memory _somewhere_.
IIUC, the current approach is to do that only in the stage-2 page tables,
i.e. only in the context of the hypervisor.  Which is also the source of
the gup() problems; the untrusted kernel is blissfully unaware that the
memory is inaccessible.

Any approach that moves some of that information into the untrusted kernel
so that the kernel can protect itself will incur fragmentation in the VMAs
(toy demo at the end of this mail).  Well, unless all of guest memory
becomes unguppable, but that's likely not a viable option.