From: Fuad Tabba <tabba@google.com>
Date: Fri, 30 Sep 2022 17:19:00 +0100
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
To: Sean Christopherson
Cc: Chao Peng, David Hildenbrand, kvm@vger.kernel.org, linux-kernel@vger.kernel.org,
    linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org,
    linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini, Jonathan Corbet,
    Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel, Thomas Gleixner,
    Ingo Molnar, Borislav Petkov,
    x86@kernel.org, "H. Peter Anvin", Hugh Dickins, Jeff Layton, "J. Bruce Fields",
    Andrew Morton, Shuah Khan, Mike Rapoport, Steven Price, "Maciej S. Szmigiero",
    Vlastimil Babka, Vishal Annapurve, Yu Zhang, "Kirill A. Shutemov",
    luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com,
    ak@linux.intel.com, aarcange@redhat.com, ddutile@redhat.com,
    dhildenb@redhat.com, Quentin Perret, Michael Roth, mhocko@suse.com,
    Muchun Song, wei.w.wang@intel.com, Will Deacon, Marc Zyngier

Hi,

On Tue, Sep 27, 2022 at 11:47 PM Sean Christopherson wrote:
>
> On Mon, Sep 26, 2022, Fuad Tabba wrote:
> > Hi,
> >
> > On Mon, Sep 26, 2022 at 3:28 PM Chao Peng wrote:
> > >
> > > On Fri, Sep 23, 2022 at 04:19:46PM +0100, Fuad Tabba wrote:
> > > > > Then on the KVM side, its mmap_start() + mmap_end() sequence would:
> > > > >
> > > > > 1. Not be supported for TDX or SEV-SNP because they don't allow adding non-zero
> > > > >    memory into the guest (after pre-boot phase).
> > > > >
> > > > > 2. Be mutually exclusive with shared<=>private conversions, and is allowed if
> > > > >    and only if the entire gfn range of the associated memslot is shared.
> > > >
> > > > In general I think that this would work with pKVM. However, limiting
> > > > private<->shared conversions to the granularity of a whole memslot
> > > > might be difficult to handle in pKVM, since the guest doesn't have the
> > > > concept of memslots. For example, in pKVM right now, when a guest
> > > > shares back its restricted DMA pool with the host it does so at the
> > > > page-level.
>
> Y'all are killing me :-)

:D

> Isn't the guest enlightened? E.g. can't you tell the guest "thou shalt share at
> granularity X"? With KVM's newfangled scalable memslots and per-vCPU MRU slot,
> X doesn't even have to be that high to get reasonable performance, e.g. assuming
> the DMA pool is at most 2GiB, that's "only" 1024 memslots, which is supposed to
> work just fine in KVM.

The guest is potentially enlightened, but the host doesn't necessarily
know which memslot the guest might want to share back, since it doesn't
know where the guest might want to place the DMA pool. If I understand
this correctly, for this to work, all memslots would need to be the
same size and sharing would always need to happen at that granularity.

Moreover, for something like a small DMA pool this might scale, but I'm
not sure about potential future workloads (e.g., multimedia in-place
sharing).

> > > > pKVM would also need a way to make an fd accessible again
> > > > when shared back, which I think isn't possible with this patch.
> > >
> > > But does pKVM really want to mmap/munmap a new region at the page-level?
> > > That can cause VMA fragmentation if the conversion is frequent, as far
> > > as I can see. Even with a KVM ioctl for mapping as mentioned below, I
> > > think there will be the same issue.
> >
> > pKVM doesn't really need to unmap the memory. What is really important
> > is that the memory is not GUP'able.
>
> Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> otherwise KVM wouldn't be able to get the PFN to map into guest memory.
>
> The problem is that gup() and "mapped" are tied together. So yes, pKVM doesn't
> strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> the end result is the same.
>
> Emphasis above because pKVM still needs to unmap the memory _somewhere_. IIUC, the
> current approach is to do that only in the stage-2 page tables, i.e. only in the
> context of the hypervisor. Which is also the source of the gup() problems; the
> untrusted kernel is blissfully unaware that the memory is inaccessible.
>
> Any approach that moves some of that information into the untrusted kernel so that
> the kernel can protect itself will incur fragmentation in the VMAs. Well, unless
> all of guest memory becomes unguppable, but that's likely not a viable option.

Actually, for pKVM, there is no need for the guest memory to be GUP'able
at all if we use the new inaccessible_get_pfn(). This of course goes back
to what I'd mentioned before in v7: it seems that representing the memslot
memory as a file descriptor should be orthogonal to whether the memory is
shared or private, rather than having a private_fd for private memory and
userspace_addr for shared memory. The host could then map or unmap the
shared/private memory using the fd, which would also give it the freedom
to, for example, unmap shared memory when it isn't needed. (I've put rough
sketches of both points in a P.S. below.)

Cheers,
/fuad
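
P.S. To make the inaccessible_get_pfn() point a bit more concrete, here
is a rough sketch of what I have in mind for a hypervisor-side backend.
This is not pKVM code: hyp_stage2_map() is a made-up placeholder for
pKVM's stage-2 mapping/donation path, and the inaccessible_get_pfn() /
inaccessible_put_pfn() signatures are just my reading of this patch, so
please treat them as assumptions rather than gospel.

/*
 * Sketch only: resolve a guest page through the inaccessible memfd
 * instead of via gup().  The page never has to be mapped into a host
 * VMA, so there is nothing for the untrusted host to gup() in the
 * first place.
 */
static int backend_map_guest_page(struct file *memfd, pgoff_t index, u64 ipa)
{
        pfn_t pfn;
        int order, ret;

        /* Look up the page via the fd (interface added by this patch). */
        ret = inaccessible_get_pfn(memfd, index, &pfn, &order);
        if (ret)
                return ret;

        /* Hand the pfn to the hypervisor to map into the guest's stage-2. */
        ret = hyp_stage2_map(ipa, pfn_t_to_pfn(pfn), order); /* placeholder */

        /* Drop the reference taken by inaccessible_get_pfn(). */
        inaccessible_put_pfn(memfd, pfn);
        return ret;
}

The shared<->private state still needs tracking somewhere, of course,
but none of it would depend on the memory being gup()'able by the host.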
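
P.P.S. And for the "orthogonal" suggestion, here is a purely hypothetical
shape for a memslot where one fd describes the slot's memory regardless
of whether a given range is currently shared or private. This is not what
this series proposes, and the struct/field names are invented, but
hopefully it clarifies what I mean:

/* Hypothetical, names invented for illustration only. */
struct kvm_userspace_memory_region_fd {
        __u32 slot;
        __u32 flags;
        __u64 guest_phys_addr;
        __u64 memory_size;
        __u32 fd;               /* backs the whole slot, shared and private */
        __u32 pad;
        __u64 fd_offset;        /* offset into the fd for this slot */
};

Shared<->private conversions would then only change how ranges of that
fd are treated (mapped into the host or not), rather than switching
between userspace_addr and a private_fd.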