From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Sean Christopherson, Paolo Bonzini
Cc: Vitaly Kuznetsov, Wanpeng Li, Jim Mattson, Joerg Roedel,
 kvm@vger.kernel.org, linux-kernel@vger.kernel.org, Borislav Petkov,
 Andy Lutomirski, Andrew Morton, Andi Kleen, David Rientjes,
 Vlastimil Babka, Tom Lendacky, Thomas Gleixner, Peter Zijlstra,
 Ingo Molnar, Varad Gautam, Dario Faggioli, x86@kernel.org,
 linux-mm@kvack.org, linux-coco@lists.linux.dev, "Kirill A. Shutemov",
 Kuppuswamy Sathyanarayanan, Dave Hansen, Yu Zhang
Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory
Date: Thu, 26 Aug 2021 12:15:48 +0200
Message-ID: <307d385a-a263-276f-28eb-4bc8dd287e32@redhat.com>
In-Reply-To: <20210824005248.200037-1-seanjc@google.com>
References: <20210824005248.200037-1-seanjc@google.com>

On 24.08.21 02:52, Sean Christopherson wrote:
> The goal of this RFC is to try and align KVM, mm, and anyone else with skin in the
> game, on an acceptable direction for supporting guest private memory, e.g. for
> Intel's TDX.  The TDX architecture effectively allows KVM guests to crash the
> host if guest private memory is accessible to host userspace, and thus does not
> play nice with KVM's existing approach of pulling the pfn and mapping level from
> the host page tables.
>
> This is by no means a complete patch; it's a rough sketch of the KVM changes that
> would be needed.  The kernel side of things is completely omitted from the patch;
> the design concept is below.
>
> There's also a fair bit of hand waving on implementation details that shouldn't
> fundamentally change the overall ABI, e.g. how the backing store will ensure
> there are no mappings when "converting" to guest private.

This is a lot of complexity and rather advanced approaches (not saying
they are bad, just that we are trying to teach the whole stack something
completely new).

What I think would really help is a list of requirements, such that
everybody is aware of what we actually want to achieve. Let me start:

GFN: Guest Frame Number
EPFN: Encrypted Physical Frame Number

1) An EPFN must not get mapped into more than one VM: it belongs exactly
   to one VM. It must neither be shared between VMs across processes nor
   between VMs within a process.

2) User space (well, and actually the kernel) must never access an EPFN:

   - If we go for an fd, essentially all operations (read/write) have
     to fail.
   - If we have to map an EPFN into user space page tables (e.g., to
     simplify KVM), we could only allow fake swap entries such that
     "there is something" but it cannot be accessed and is flagged
     accordingly.
   - /proc/kcore and friends have to be careful as well and should not
     read this memory. So there has to be a way to flag these pages.

3) We need a way to express the GFN<->EPFN mapping and essentially
   assign an EPFN to a GFN.

4) Once we have assigned an EPFN to a GFN, that assignment must no
   longer change. Further, an EPFN must not get assigned to multiple
   GFNs.

5) There has to be a way to "replace" encrypted parts by "shared" parts
   and the other way around.

What else?

> Background
> ==========
>
> This is a loose continuation of Kirill's RFC[*] to support TDX guest private
> memory by tracking guest memory at the 'struct page' level.  This proposal is the
> result of several offline discussions that were prompted by Andy Lutomirski's
> concerns with tracking via 'struct page':
>
>   1. The kernel wouldn't easily be able to enforce a 1:1 page:guest association,
>      let alone a 1:1 pfn:gfn mapping.

Well, it could with some help from higher layers. Someone has to do the
tracking. Marking EPFNs as EPFNs can actually be very helpful, e.g.,
allowing /proc/kcore to just not touch such pages. Whether we want to do
all the tracking in the struct page is a different story.

>   2. Does not work for memory that isn't backed by 'struct page', e.g. if devices
>      gain support for exposing encrypted memory regions to guests.

Let's keep it simple. If a struct page is right now what we need to
properly track it, so be it. If not, good. But let's not make this a
requirement right from the start if it's stuff for the far future.

>   3. Does not help march toward page migration or swap support (though it doesn't
>      hurt either).

"Does not help towards world peace (though it doesn't hurt either)."

Maybe let's ignore that for now, as it doesn't seem to be required to
get something reasonable running.

> [*] https://lkml.kernel.org/r/20210416154106.23721-1-kirill.shutemov@linux.intel.com
>
> Concept
> =======
>
> Guest private memory must be backed by an "enlightened" file descriptor, where
> "enlightened" means the implementing subsystem supports a one-way "conversion" to
> guest private memory and provides bi-directional hooks to communicate directly
> with KVM.  Creating a private fd doesn't necessarily have to be a conversion, e.g. it
> could also be a flag provided at file creation, a property of the file system itself,
> etc...

Doesn't sound too crazy. Maybe even introduce memfd_encrypted() if
extending the other ones turns out too complicated.
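Just to illustrate what I mean, a rough userspace sketch --
memfd_encrypted() obviously doesn't exist, the syscall number and the
flags argument below are completely made up, and on a real kernel the
call will simply fail:

/*
 * Hypothetical only: memfd_encrypted() does not exist. The syscall
 * number and the flags argument are invented to illustrate the
 * semantics; on a real kernel this fails with ENOSYS.
 */
#include <stdio.h>
#include <unistd.h>
#include <errno.h>
#include <sys/syscall.h>

#define __NR_memfd_encrypted 451	/* assumption: unassigned nr */

static int memfd_encrypted(unsigned int flags)
{
	return (int)syscall(__NR_memfd_encrypted, flags);
}

int main(void)
{
	int fd = memfd_encrypted(0);

	if (fd < 0) {
		perror("memfd_encrypted");	/* ENOSYS today */
		return 1;
	}

	/* Size may be set exactly once, before any mmap(). */
	if (ftruncate(fd, 2UL << 30))
		perror("ftruncate");

	/* read()/write() on the fd are supposed to fail ... */
	char buf[16];
	if (read(fd, buf, sizeof(buf)) < 0)
		printf("read failed as intended: errno=%d\n", errno);

	close(fd);
	return 0;
}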
> Before a private fd can be mapped into a KVM guest, it must be paired 1:1 with a
> KVM guest, i.e. multiple guests cannot share a fd.  At pairing, KVM and the fd's
> subsystem exchange a set of function pointers to allow KVM to call into the subsystem,
> e.g. to translate gfn->pfn, and vice versa to allow the subsystem to call into KVM,
> e.g. to invalidate/move/swap a gfn range.
>
> Mapping a private fd in host userspace is disallowed, i.e. there is never a host
> virtual address associated with the fd and thus no userspace page tables pointing
> at the private memory.

To keep the primary vs. secondary MMU thing working, I think it would
actually be nice to go with special swap entries instead; it just keeps
most things working as expected. But let's see where we end up.

> Pinning _from KVM_ is not required.  If the backing store supports page migration
> and/or swap, it can query the KVM-provided function pointers to see if KVM supports
> the operation.  If the operation is not supported (this will be the case initially
> in KVM), the backing store is responsible for ensuring correct functionality.
>
> Unmapping guest memory, e.g. to prevent use-after-free, is handled via a callback
> from the backing store to KVM.  KVM will employ techniques similar to those it uses
> for mmu_notifiers to ensure the guest cannot access freed memory.
>
> A key point is that, unlike similar failed proposals of the past, e.g. /dev/mktme,
> existing backing stores can be enlightened; a from-scratch implementation is not
> required (though it would obviously be possible as well).

Right. But if it's just a bad fit, let's do something new. Just like we
did with memfd_secret.

> One idea for extending existing backing stores, e.g. HugeTLBFS and tmpfs, is
> to add F_SEAL_GUEST, which would convert the entire file to guest private memory
> and either fail if the current size is non-zero or truncate the size to zero.

While possible, I actually do have the feeling that we eventually want
something new, as the semantics are just too different. But let's see.
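To spell out how that conversion could look from userspace, a small
sketch, assuming the seal would ride on the existing F_ADD_SEALS
mechanism -- F_SEAL_GUEST and its bit value are invented here, current
kernels reject it with EINVAL:

/*
 * Sketch: F_SEAL_GUEST is the seal proposed in this RFC; it does not
 * exist in the kernel and the bit value below is invented.
 */
#define _GNU_SOURCE
#include <stdio.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/mman.h>

#ifndef F_SEAL_GUEST
#define F_SEAL_GUEST 0x0020	/* assumption: next free F_SEAL_* bit */
#endif

int main(void)
{
	int fd = memfd_create("guest-private", MFD_ALLOW_SEALING);

	if (fd < 0) {
		perror("memfd_create");
		return 1;
	}

	/*
	 * Per the proposal the seal either fails on a non-zero size or
	 * truncates to zero, so seal while the file is still empty.
	 */
	if (fcntl(fd, F_ADD_SEALS, F_SEAL_GUEST) < 0)
		perror("F_ADD_SEALS");	/* EINVAL on today's kernels */

	/* Size the now-private file; from here on only KVM could map it. */
	if (ftruncate(fd, 1UL << 30))
		perror("ftruncate");

	close(fd);
	return 0;
}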
> KVM
> ===
>
> Guest private memory is managed as a new address space, i.e. as a different set of
> memslots, similar to how KVM has a separate memory view for when a guest vCPU is
> executing in virtual SMM.  SMM is mutually exclusive with guest private memory.
>
> The fd (the actual integer) is provided to KVM when a private memslot is added
> via KVM_SET_USER_MEMORY_REGION.  This is when the aforementioned pairing occurs.
>
> By default, KVM memslot lookups will be "shared", only specific touchpoints will
> be modified to work with private memslots, e.g. guest page faults.  All host
> accesses to guest memory, e.g. for emulation, will thus look for shared memory
> and naturally fail without attempting copy_to/from_user() if the guest attempts
> to coerce KVM into accessing private memory.  Note, avoiding copy_to/from_user() and
> friends isn't strictly necessary, it's more of a happy side effect.
>
> A new KVM exit reason, e.g. KVM_EXIT_MEMORY_ERROR, and data struct in vcpu->run
> is added to propagate illegal accesses (see above) and implicit conversions
> to userspace (see below).  Note, the new exit reason + struct can also be used to
> support several other feature requests in KVM[1][2].
>
> The guest may explicitly or implicitly request KVM to map a shared/private variant
> of a GFN.  An explicit map request is done via hypercall (out of scope for this
> proposal as both TDX and SNP ABIs define such a hypercall).  An implicit map request
> is triggered simply by the guest accessing the shared/private variant, which KVM
> sees as a guest page fault (EPT violation or #NPF).  Ideally only explicit requests
> would be supported, but neither TDX nor SNP require this in their guest<->host ABIs.
>
> For implicit or explicit mappings, if a memslot is found that fully covers the
> requested range (which is a single gfn for implicit mappings), KVM's normal guest
> page fault handling works with minimal modification.
>
> If a memslot is not found, for explicit mappings, KVM will exit to userspace with
> the aforementioned dedicated exit reason.  For implicit _private_ mappings, KVM will
> also immediately exit with the same dedicated reason.  For implicit shared mappings,
> an additional check is required to differentiate between emulated MMIO and an
> implicit private->shared conversion[*].  If there is an existing private memslot
> for the gfn, KVM will exit to userspace, otherwise KVM will treat the access as an
> emulated MMIO access and handle the page fault accordingly.

Do you mean some kind of overlay? "Ordinary" user memory regions overlay
"private" user memory regions? So when marking something shared, you'd
leave the private user memory region alone and only create a new
"ordinary" user memory region that references shared memory in QEMU
(IOW, a different mapping)?

Reading below, I think you were not actually thinking about an overlay,
but maybe overlays might actually be a nice concept to have instead.

> Punching Holes
> ==============
>
> The expected userspace memory model is that mapping requests will be handled as
> conversions, e.g. on a shared mapping request, first unmap the private gfn range,
> then map the shared gfn range.  A new KVM ioctl() will likely be needed to allow
> userspace to punch a hole in a memslot, as expressing such an operation isn't
> possible with KVM_SET_USER_MEMORY_REGION.  While userspace could delete the
> memslot, then recreate three new memslots, doing so would be destructive to guest
> data as unmapping guest private memory (from the EPT/NPT tables) is destructive
> to the data for both TDX and SEV-SNP guests.

If you'd treat it like an overlay, you'd not actually be punching holes.
You'd only be creating/removing ordinary user memory regions when
marking something shared/unshared.

> Pros (vs. struct page)
> ======================
>
> Easy to enforce 1:1 fd:guest pairing, as well as 1:1 gfn:pfn mapping.
>
> Userspace page tables are not populated, e.g. reduced memory footprint, lower
> probability of making private memory accessible to userspace.

Agreed on the first part, although I consider that a secondary concern.
The second part I'm not sure really holds: fake swap entries are just a
marker.

> Provides line of sight to supporting page migration and swap.

Again, let's leave that out for now. I think that's a kernel internal
that will require quite some thought either way.

> Provides line of sight to mapping MMIO pages into guest private memory.

That's an interesting thought. Would it work via overlays as well? Can
you elaborate?
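To make sure I understand the intended flow, here is roughly how I
picture the VMM side with overlays -- KVM_EXIT_MEMORY_ERROR and the
exit payload are hypothetical (your proposal, my invented field names);
only KVM_SET_USER_MEMORY_REGION exists today:

/*
 * VMM-side sketch of conversion-as-overlay. The exit reason number,
 * struct kvm_memory_error and its fields are made up; the RFC would
 * add a dedicated union member to struct kvm_run, "padding" stands in
 * for it here.
 */
#include <stdint.h>
#include <sys/ioctl.h>
#include <linux/kvm.h>

#ifndef KVM_EXIT_MEMORY_ERROR
#define KVM_EXIT_MEMORY_ERROR 38	/* assumption: next free exit nr */
#endif

struct kvm_memory_error {		/* hypothetical exit payload */
	uint64_t gpa;
	uint64_t size;
	uint32_t flags;			/* e.g. "guest asked for shared" */
};

/*
 * On a shared-conversion exit, leave the private memslot in place and
 * install an "ordinary" overlay memslot for the converted range.
 */
static void handle_memory_error(int vm_fd, struct kvm_run *run,
				uint64_t ram_base, uint64_t shared_hva)
{
	if (run->exit_reason != KVM_EXIT_MEMORY_ERROR)
		return;

	struct kvm_memory_error *err = (struct kvm_memory_error *)run->padding;

	struct kvm_userspace_memory_region region = {
		.slot = 42,	/* really: allocated by the VMM */
		.guest_phys_addr = err->gpa,
		.memory_size = err->size,
		.userspace_addr = shared_hva + (err->gpa - ram_base),
	};
	ioctl(vm_fd, KVM_SET_USER_MEMORY_REGION, &region);
}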
> Cons (vs. struct page)
> ======================
>
> Significantly more churn in KVM, e.g. to plumb 'private' through where needed,
> support memslot hole punching, etc...
>
> KVM's MMU gets another method of retrieving host pfn and page size.
>
> Requires enabling in every backing store that someone wants to support.

I think in the next years we will only care about anonymous memory,
eventually with huge/gigantic pages -- just what memfd() is already
limited to. File-backed -- I don't know ... if at all, swapping ... in
a couple of years ...

> Because the NUMA APIs work on virtual addresses, new syscalls fmove_pages(),
> fbind(), etc... would be required to provide equivalents to existing NUMA
> functionality (though those syscalls would likely be useful irrespective of guest
> private memory).

Right, that's because we don't have a VMA that describes all this,
e.g., for mbind().

> Washes (vs. struct page)
> ========================
>
> A misbehaving guest that triggers a large number of shared memory mappings will
> consume a large number of memslots.  But, this is likely a wash as a similar effect
> would happen with VMAs in the struct page approach.

Just cap it to something sane then. The 32k we have right now is crazy
and only required in very special setups. You can just make QEMU
override/set the KVM default.

My wild idea after reading everything so far (full of flaws, I just
want to mention it, maybe it gives some ideas):

Introduce memfd_encrypted().

Similar to memfd_secret():
- Most system calls will just fail.
- Allow MAP_SHARED only.
- Enforce VM_DONTDUMP and skip during fork().
- File size can change exactly once, before any mmap() (IIRC).

Different from memfd_secret(), allow mapping each page of the fd
exactly one time via mmap() into a single process.

The simplest way would be requiring that only the whole file can be
mmaped, and that this can happen exactly once. memremap() and friends
will fail. Splitting the VMA will fail. munmap()/mmap(MAP_FIXED) will
fail. You'll have it mapped in a single process only, ever, and
persistent. Unmap will only work when tearing down the MM. Hole
punching via fallocate() has to be evaluated (below).

You'll end up with a VMA that corresponds to the whole file in a single
process only, and that cannot vanish, not even in parts.

Define "ordinary" user memory slots as overlays on top of "encrypted"
memory slots. Inside KVM, bail out if you encounter such a VMA inside a
normal user memory slot. When creating an "encrypted" user memory slot,
require that the whole VMA is covered at creation time. You know the
VMA can't change later.

In QEMU, allocate for each RAMBlock a MAP_PRIVATE memfd_encrypted() and
a MAP_SHARED memfd(). Make QEMU and other processes always access the
MAP_SHARED part only. Initially, create "encrypted" user memory regions
in KVM when defining the RAM layout, disallowing changes. Define the
MAP_SHARED memfd() overlay user memory regions in KVM depending on the
shared/private state.

In the actual memory backend, flag all newly allocated pages as
PG_encrypted. Pages can be faulted into a process page table via a new
fake swap entry, where KVM can look them up via a special GUP flag, as
Kirill suggested. On access via user space or another GUP user, deliver
SIGBUS instead of converting the special swap entry.
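Per RAMBlock, that would look roughly like this -- memfd_encrypted()
again being the made-up syscall from the sketch above, everything else
plain memfd/mmap, error handling trimmed:

/*
 * Sketch of the per-RAMBlock layout described above: one private,
 * never-actually-accessible mapping for the encrypted pages and one
 * ordinary MAP_SHARED memfd for the shared/converted pages.
 */
#define _GNU_SOURCE
#include <stddef.h>
#include <unistd.h>
#include <sys/mman.h>
#include <sys/syscall.h>

#define __NR_memfd_encrypted 451	/* assumption, as above */

static int memfd_encrypted(unsigned int flags)
{
	return (int)syscall(__NR_memfd_encrypted, flags);
}

struct ramblock_backing {
	int enc_fd;		/* hypothetical memfd_encrypted() fd */
	int shared_fd;		/* plain memfd for shared pages */
	void *enc_map;		/* whole file, mapped exactly once */
	void *shared_map;	/* QEMU, vhost, ... use only this one */
	size_t size;
};

static int ramblock_backing_init(struct ramblock_backing *b, size_t size)
{
	b->size = size;
	b->enc_fd = memfd_encrypted(0);
	b->shared_fd = memfd_create("shared", 0);
	if (b->enc_fd < 0 || b->shared_fd < 0)
		return -1;
	if (ftruncate(b->enc_fd, size) || ftruncate(b->shared_fd, size))
		return -1;

	/*
	 * One-shot, whole-file mapping; the kernel would reject a second
	 * one. Any user-space access would SIGBUS via the fake swap
	 * entries; only KVM's special GUP lookup succeeds.
	 */
	b->enc_map = mmap(NULL, size, PROT_READ | PROT_WRITE,
			  MAP_PRIVATE, b->enc_fd, 0);
	/* The overlay/shared part, an ordinary shared mapping. */
	b->shared_map = mmap(NULL, size, PROT_READ | PROT_WRITE,
			     MAP_SHARED, b->shared_fd, 0);
	return (b->enc_map == MAP_FAILED || b->shared_map == MAP_FAILED) ?
	       -1 : 0;
}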
Allow only a single encrypted VM per process ever in KVM for now.

All that handshake regarding swapping/migration ... can be handled
internally if ever required.

Memory hotplug: should not be an issue. Memory hotunplug would require
some thought -- maybe fallocate(FALLOC_FL_PUNCH_HOLE) will do.

Reducing memory consumption: with MADV_DONTNEED/fallocate(FALLOC_FL_PUNCH_HOLE)
we can reclaim memory at least within the shared memory part when
switching back and forth. Maybe even in the MAP_PRIVATE/memfd_encrypted()
part when marking something shared. TBD.

-- 
Thanks,

David / dhildenb