Date: Wed, 01 Sep 2021 10:09:48 -0700
From: "Andy Lutomirski"
To: "David Hildenbrand", "Sean Christopherson"
Cc: "Paolo Bonzini", "Vitaly Kuznetsov", "Wanpeng Li", "Jim Mattson", "Joerg Roedel", "kvm list", "Linux Kernel Mailing List", "Borislav Petkov", "Andrew Morton", "Andi Kleen", "David Rientjes", "Vlastimil Babka", "Tom Lendacky", "Thomas Gleixner", "Peter Zijlstra (Intel)", "Ingo Molnar", "Varad Gautam", "Dario Faggioli", "the arch/x86 maintainers", linux-mm@kvack.org, linux-coco@lists.linux.dev, "Kirill A. Shutemov", "Sathyanarayanan Kuppuswamy", "Dave Hansen", "Yu Zhang"
Subject: Re: [RFC] KVM: mm: fd-based approach for supporting KVM guest private memory

On Wed, Sep 1, 2021, at 9:16 AM, David Hildenbrand wrote:
> On 01.09.21 17:54, Andy Lutomirski wrote:
> > On Wed, Sep 1, 2021, at 1:09 AM, David Hildenbrand wrote:
> >>>> Do we have to protect from that? How would KVM protect from user space
> >>>> replacing private pages by shared pages in any of the models we discuss?
> >>>
> >>> The overarching rule is that KVM needs to guarantee a given pfn is never
> >>> mapped[*] as both private and shared, where "shared" also incorporates
> >>> any mapping from the host. Essentially it boils down to the kernel
> >>> ensuring that a pfn is unmapped before it's converted to/from private,
> >>> and KVM ensuring that it honors any unmap notifications from the kernel,
> >>> e.g. via mmu_notifier or via a direct callback as proposed in this RFC.
> >>
> >> Okay, so the fallocate(PUNCH_HOLE) from user space could trigger the
> >> respective unmapping and freeing of backing storage.
> >>
> >>> As it pertains to PUNCH_HOLE, the responsibilities are no different than
> >>> when the backing-store is destroyed; the backing-store needs to notify
> >>> downstream MMUs (a.k.a. KVM) to unmap the pfn(s) before freeing the
> >>> associated memory.
> >>
> >> Right.
> >>
> >>> [*] Whether or not the kernel's direct mapping needs to be removed is
> >>>     debatable, but my argument is that that behavior is not visible to
> >>>     userspace and thus out of scope for this discussion, e.g.
> >>>     zapping/restoring the direct map can be added/removed without
> >>>     impacting the userspace ABI.
> >>
> >> Right. Removing it shouldn't be required either, IMHO. There are other
> >> ways to teach the kernel not to read/write some online pages (filter
> >> /proc/kcore, disable hibernation, strict access checks for /dev/mem, ...).
> >>
> >>>>>> Define "ordinary" user memory slots as an overlay on top of
> >>>>>> "encrypted" memory slots. Inside KVM, bail out if you encounter such a
> >>>>>> VMA inside a normal user memory slot. When creating an "encrypted"
> >>>>>> user memory slot, require that the whole VMA is covered at creation
> >>>>>> time. You know the VMA can't change later.
> >>>>>
> >>>>> This can work for the basic use cases, but even then I'd strongly
> >>>>> prefer not to tie memslot correctness to the VMAs. KVM doesn't truly
> >>>>> care what lies behind the virtual address of a memslot, and when it
> >>>>> does care, it tends to do poorly, e.g. see the whole PFNMAP snafu. KVM
> >>>>> cares about the pfn<->gfn mappings, and that's reflected in the
> >>>>> infrastructure. E.g. KVM relies on the mmu_notifiers to handle
> >>>>> mprotect()/munmap()/etc...
> >>>>
> >>>> Right, and for the existing use cases this worked. But encrypted memory
> >>>> breaks many assumptions we once made ...
> >>>>
> >>>> I have somewhat mixed feelings about pages that are mapped into
> >>>> $WHATEVER page tables but not actually mapped into user space page
> >>>> tables. There is no way to reach these via the rmap.
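
For concreteness, here is a minimal stand-alone sketch of the ordering being
discussed above: on PUNCH_HOLE the backing store tells its KVM consumer, over
a direct channel rather than the rmap, to drop its mappings before the pages
are freed. All names here (private_mem_notifier, guest_mem_punch_hole(),
kvm_private_mem_invalidate()) are invented for illustration; this is not the
RFC's actual API.

    /*
     * Illustrative model only: the names and types below are made up for
     * this sketch and do not correspond to the RFC patches.
     */
    #include <stddef.h>
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t pfn_t;

    /* What a KVM-side consumer of the private-memory fd might register. */
    struct private_mem_notifier {
            void (*invalidate)(struct private_mem_notifier *n,
                               pfn_t start, size_t nr_pages);
    };

    /* KVM's side: zap the private mappings covering the range. */
    static void kvm_private_mem_invalidate(struct private_mem_notifier *n,
                                           pfn_t start, size_t nr_pages)
    {
            (void)n;
            printf("KVM: unmap pfns 0x%llx..0x%llx\n",
                   (unsigned long long)start,
                   (unsigned long long)(start + nr_pages - 1));
    }

    /* Backing-store side: what a PUNCH_HOLE handler might do. */
    static void guest_mem_punch_hole(struct private_mem_notifier *consumer,
                                     pfn_t start, size_t nr_pages)
    {
            /*
             * Order matters: the downstream MMU (KVM) drops its mappings
             * *before* the backing store frees the pages, so a pfn is never
             * reachable as both private and anything else.
             */
            consumer->invalidate(consumer, start, nr_pages);
            printf("backing store: free pfns 0x%llx..0x%llx\n",
                   (unsigned long long)start,
                   (unsigned long long)(start + nr_pages - 1));
    }

    int main(void)
    {
            struct private_mem_notifier kvm = {
                    .invalidate = kvm_private_mem_invalidate,
            };

            guest_mem_punch_hole(&kvm, 0x1000, 16);
            return 0;
    }

The ordering (invalidate, then free) is the whole point; whether the
notification travels via mmu_notifier or a dedicated fd callback is a
property of the channel, not of the rule.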
> >>>>
> >>>> We have something like that already via vfio. And that is fundamentally
> >>>> broken when it comes to mmu notifiers, page pinning, page migration, ...
> >>>
> >>> I'm not super familiar with VFIO internals, but the idea with the
> >>> fd-based approach is that the backing-store would be in direct
> >>> communication with KVM and would handle those operations through that
> >>> direct channel.
> >>
> >> Right. The problem I am seeing is that e.g. try_to_unmap() might not be
> >> able to actually fully unmap a page, because some non-synchronized KVM
> >> MMU still maps the page. It would be great to evaluate how the fd
> >> callbacks would fit into the whole picture, including the current rmap.
> >>
> >> I guess I'm missing the bigger picture of how it all fits together on
> >> the !KVM side.
> >
> > The big picture is fundamentally a bit nasty. Logically (ignoring the
> > implementation details of rmap, mmu_notifier, etc.), you can call
> > try_to_unmap() and end up with a page that is Just A Bunch Of Bytes (tm).
> > Then you can write it to disk, memcpy it to another node, compress it,
> > etc. When it gets faulted back in, you make sure the same bytes end up
> > somewhere and put the PTEs back.
> >
> > With guest-private memory, this doesn't work. Forget about the
> > implementation: you simply can't take a page of private memory, quiesce
> > it so the guest can't access it without faulting, and turn it into Just A
> > Bunch Of Bytes. TDX does not have that capability. (Any system with
> > integrity-protected memory won't, without having >PAGE_SIZE bytes or
> > otherwise storing metadata, but TDX can't do this at all.) SEV-ES *can*
> > (I think -- I asked the lead architect), but that's not the same thing as
> > saying it's a good idea.
> >
> > So if you want to migrate a TDX page from one NUMA node to another, you
> > need to do something different (I don't know all the details); you will
> > have to ask Intel to explain how this might work in the future (it wasn't
> > in the public docs last time I looked), but I'm fairly confident that it
> > does not resemble try_to_unmap().
> >
> > Even on SEV, where a page *can* be transformed into a Just A Bunch Of
> > Bytes, the operation doesn't really look like try_to_unmap(). As I
> > understand it, it's more of:
> >
> > - look up the one-and-only guest and GPA at which this page is mapped;
> > - tear down the NPT PTE (SEV, unlike TDX, uses paging structures in
> >   normal memory);
> > - ask the magic SEV mechanism to turn the page into swappable data;
> > - go to town.
> >
> > This doesn't really resemble the current rmap + mmu_notifier code, and
> > shoehorning this into mmu_notifier seems like it may be uglier and more
> > code than implementing it more directly in the backing store.
> >
> > If you actually just try_to_unmap() a SEV-ES page (and it worked, which
> > it currently won't), you will have data corruption or cache incoherency.
> >
> > If you want to swap a page on TDX, you can't. Sorry, go directly to jail,
> > do not collect $200.
> >
> > So I think there are literally zero code paths that currently call
> > try_to_unmap() that will actually work like that on TDX. If we run out of
> > memory on a TDX host, we can kill the guest completely and reclaim all of
> > its memory (which probably also involves killing QEMU or whatever other
> > user program is in charge), but that's really our only option.
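
To put that SEV-style sequence next to some code, here is a rough
stand-alone model of the flow sketched above. Every name in it
(lookup_owner(), zap_npt_pte(), sev_make_swappable()) is invented for
illustration; none of this is the real KVM or SEV interface.

    /*
     * Illustrative model only: every name is invented; this is just the
     * shape of the flow described above, not the real KVM or SEV API.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t gpa_t;
    typedef uint64_t pfn_t;

    struct guest { int id; };

    /* Step 1: find the one-and-only owner; no rmap walk over many VMAs. */
    static int lookup_owner(pfn_t pfn, struct guest **vm, gpa_t *gpa)
    {
            static struct guest g = { .id = 1 };

            *vm = &g;
            *gpa = pfn << 12;       /* pretend mapping for the example */
            return 0;
    }

    /* Step 2: SEV, unlike TDX, keeps its paging structures in ordinary
     * memory, so the host can simply tear down the NPT PTE. */
    static void zap_npt_pte(struct guest *vm, gpa_t gpa)
    {
            printf("guest %d: zap NPT PTE for GPA 0x%llx\n",
                   vm->id, (unsigned long long)gpa);
    }

    /* Step 3: ask the hardware/firmware to make the page swappable.
     * TDX has no equivalent of this step at all. */
    static int sev_make_swappable(struct guest *vm, gpa_t gpa)
    {
            printf("guest %d: GPA 0x%llx is now swappable\n",
                   vm->id, (unsigned long long)gpa);
            return 0;
    }

    static int reclaim_private_page(pfn_t pfn)
    {
            struct guest *vm;
            gpa_t gpa;

            if (lookup_owner(pfn, &vm, &gpa))
                    return -1;
            zap_npt_pte(vm, gpa);
            if (sev_make_swappable(vm, gpa))
                    return -1;
            /* Step 4: "go to town" -- the page is Just A Bunch Of Bytes. */
            printf("pfn 0x%llx can be swapped/migrated/compressed\n",
                   (unsigned long long)pfn);
            return 0;
    }

    int main(void)
    {
            return reclaim_private_page(0x1234);
    }

The structural difference from try_to_unmap() is that there is no rmap walk
at all: the page has exactly one consumer, the lookup goes through KVM or
the backing store, and the "make it swappable" step has no TDX equivalent.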
>
> try_to_unmap() would actually do the right thing, I think. It's the users
> that access page content (migration, swapping) that need additional care
> to special-case these pages. Completely agree that these would be broken.

I suspect that the eventual TDX NUMA migration flow will involve asking the
TD module to relocate the page without unmapping it first. TDX doesn't
really have a nondestructive unmap operation.
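
Purely as a thought experiment about what such a flow could look like
(nothing below reflects a real or documented TDX module interface;
td_module_relocate_page() is invented for the sketch):

    /*
     * Speculative sketch only: td_module_relocate_page() and everything
     * else here is invented and does not reflect any real or documented
     * TDX module interface.
     */
    #include <stdint.h>
    #include <stdio.h>

    typedef uint64_t gpa_t;
    typedef uint64_t pfn_t;

    struct tdx_guest { int id; };

    /* Hypothetical: the TD module moves the page itself, so the host never
     * sees an intermediate "unmapped bunch of bytes" state. */
    static int td_module_relocate_page(struct tdx_guest *td, gpa_t gpa,
                                       pfn_t dst)
    {
            printf("TD %d: relocate GPA 0x%llx onto pfn 0x%llx in place\n",
                   td->id, (unsigned long long)gpa, (unsigned long long)dst);
            return 0;
    }

    /* NUMA migration of a TDX private page in this model: no
     * try_to_unmap(), no host-side copy, just a relocation request. */
    static int migrate_tdx_private_page(struct tdx_guest *td, gpa_t gpa,
                                        pfn_t dst)
    {
            return td_module_relocate_page(td, gpa, dst);
    }

    int main(void)
    {
            struct tdx_guest td = { .id = 1 };

            return migrate_tdx_private_page(&td, 0x7f000000, 0x89abc);
    }

In that model the host never holds the page contents as plain bytes at any
point, which is why nothing resembling try_to_unmap() plus a copy can be
involved.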