From mboxrd@z Thu Jan 1 00:00:00 1970
MIME-Version: 1.0
References: <20220915142913.2213336-1-chao.p.peng@linux.intel.com> <20220915142913.2213336-2-chao.p.peng@linux.intel.com> <20220926142330.GC2658254@chaop.bj.intel.com>
In-Reply-To: 
From: Fuad Tabba 
Date: Wed, 19 Oct 2022 16:04:57 +0100
Message-ID: 
Subject: Re: [PATCH v8 1/8] mm/memfd: Introduce userspace inaccessible memfd
To: Sean Christopherson 
Cc: Chao Peng , David Hildenbrand , kvm@vger.kernel.org, linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-fsdevel@vger.kernel.org, linux-api@vger.kernel.org, linux-doc@vger.kernel.org, qemu-devel@nongnu.org, Paolo Bonzini , Jonathan Corbet , Vitaly Kuznetsov , Wanpeng Li , Jim Mattson , Joerg Roedel , Thomas Gleixner , Ingo Molnar , Borislav Petkov , x86@kernel.org, "H . Peter Anvin" , Hugh Dickins , Jeff Layton , "J . Bruce Fields" , Andrew Morton , Shuah Khan , Mike Rapoport , Steven Price , "Maciej S . Szmigiero" , Vlastimil Babka , Vishal Annapurve , Yu Zhang , "Kirill A .
Shutemov" , luto@kernel.org, jun.nakajima@intel.com, dave.hansen@intel.com, ak@linux.intel.com, aarcange@redhat.com, ddutile@redhat.com, dhildenb@redhat.com, Quentin Perret , Michael Roth , mhocko@suse.com, Muchun Song , wei.w.wang@intel.com, Will Deacon , Marc Zyngier 
Content-Type: text/plain; charset="UTF-8"
X-Mailing-List: linux-kernel@vger.kernel.org

Hi,

On Tue, Oct 18, 2022 at 1:34 AM Sean Christopherson wrote:
>
> On Fri, Sep 30, 2022, Fuad Tabba wrote:
> > > > > > pKVM would also need a way to make an fd accessible again
> > > > > > when shared back, which I think isn't possible with this patch.
> > > > >
> > > > > But does pKVM really want to mmap/munmap a new region at the page level?
> > > > > That can cause VMA fragmentation if the conversion is frequent, as I see it.
> > > > > Even with a KVM ioctl for mapping as mentioned below, I think there will
> > > > > be the same issue.
> > > >
> > > > pKVM doesn't really need to unmap the memory. What is really important
> > > > is that the memory is not GUP'able.
> > >
> > > Well, not entirely unguppable, just unguppable without a magic FOLL_* flag,
> > > otherwise KVM wouldn't be able to get the PFN to map into guest memory.
> > >
> > > The problem is that gup() and "mapped" are tied together. So yes, pKVM doesn't
> > > strictly need to unmap memory _in the untrusted host_, but since mapped==guppable,
> > > the end result is the same.
> > >
> > > Emphasis above because pKVM still needs to unmap the memory _somewhere_. IIUC, the
> > > current approach is to do that only in the stage-2 page tables, i.e. only in the
> > > context of the hypervisor. Which is also the source of the gup() problems; the
> > > untrusted kernel is blissfully unaware that the memory is inaccessible.
> > >
> > > Any approach that moves some of that information into the untrusted kernel so that
> > > the kernel can protect itself will incur fragmentation in the VMAs.
> > > Well, unless all of guest memory becomes unguppable, but that's likely not a
> > > viable option.
> >
> > Actually, for pKVM, there is no need for the guest memory to be GUP'able at
> > all if we use the new inaccessible_get_pfn().
>
> Ya, I was referring to pKVM without UPM / inaccessible memory.
>
> Jumping back to blocking gup(), what about using the same tricks as secretmem to
> block gup()? E.g. compare vm_ops to block regular gup() and a_ops to block fast
> gup() on struct page? With a Kconfig that's selected by pKVM (which would also
> need its own Kconfig), e.g. CONFIG_INACCESSIBLE_MAPPABLE_MEM, there would be zero
> performance overhead for non-pKVM kernels, i.e. hooking gup() shouldn't be
> controversial.
>
> I suspect the fast gup() path could even be optimized to avoid the page_mapping()
> lookup by adding a PG_inaccessible flag that's defined iff the TBD Kconfig is
> selected. I'm guessing pKVM isn't expected to be deployed on massive NUMA systems
> anytime soon, so there should be plenty of page flags to go around.
>
> Blocking gup() instead of trying to play refcount games when converting back to
> private would eliminate the need to put heavy restrictions on mapping, as the goal
> of those was purely to simplify the KVM implementation, e.g. the "one mapping per
> memslot" thing would go away entirely.

My implementation of mmap for inaccessible_fops was setting VM_PFNMAP. That
said, I realized that it might be adding an unnecessary restriction, and I have
now changed it to do it the secretmem way. That's straightforward and works
well.

> > This of course goes back to what I'd mentioned before in v7; it seems that
> > representing the memslot memory as a file descriptor should be orthogonal to
> > whether the memory is shared or private, rather than a private_fd for private
> > memory and the userspace_addr for shared memory.
>
> I also explored the idea of backing any guest memory with an fd, but came to
> the conclusion that private memory needs a separate handle[1], at least on x86.
>
> For SNP and TDX, even though the GPA is the same (ignoring the fact that SNP and
> TDX steal GPA bits to differentiate private vs. shared), the two types need to be
> treated as separate mappings[2]. Post-boot, converting is lossy in both directions,
> so even conceptually they are two distinct pages that just happen to share (some)
> GPA bits.
>
> To allow conversions, i.e. changing which mapping to use, without memslot updates,
> KVM needs to let userspace provide both mappings in a single memslot. So while
> fd-based memory is an orthogonal concept, e.g. we could add fd-based shared memory,
> KVM would still need a dedicated private handle.
>
> For pKVM, the fd doesn't strictly need to be mutually exclusive with the existing
> userspace_addr, but since the private_fd is going to be added for x86, I think it
> makes sense to use that instead of adding generic fd-based memory for pKVM's use
> case (which is arguably still "private" memory, but with special semantics).
>
> [1] https://lore.kernel.org/all/YulTH7bL4MwT5v5K@google.com
> [2] https://lore.kernel.org/all/869622df-5bf6-0fbb-cac4-34c6ae7df119@kernel.org

As long as the API does not impose this limit (which would imply pKVM is
misusing it), then I agree. I think that's why renaming it to something like
"restricted" might be clearer.

Thanks,
/fuad