kvm.vger.kernel.org archive mirror
 help / color / mirror / Atom feed
From: Ackerley Tng <ackerleytng@google.com>
To: Sean Christopherson <seanjc@google.com>
Cc: pbonzini@redhat.com, maz@kernel.org, oliver.upton@linux.dev,
	chenhuacai@kernel.org, mpe@ellerman.id.au, anup@brainfault.org,
	paul.walmsley@sifive.com, palmer@dabbelt.com,
	aou@eecs.berkeley.edu, willy@infradead.org,
	akpm@linux-foundation.org, paul@paul-moore.com,
	jmorris@namei.org, serge@hallyn.com, kvm@vger.kernel.org,
	linux-arm-kernel@lists.infradead.org, kvmarm@lists.linux.dev,
	linux-mips@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
	kvm-riscv@lists.infradead.org, linux-riscv@lists.infradead.org,
	linux-fsdevel@vger.kernel.org, linux-mm@kvack.org,
	linux-security-module@vger.kernel.org,
	linux-kernel@vger.kernel.org, chao.p.peng@linux.intel.com,
	tabba@google.com, jarkko@kernel.org, yu.c.zhang@linux.intel.com,
	vannapurve@google.com, mail@maciej.szmigiero.name,
	vbabka@suse.cz, david@redhat.com, qperret@google.com,
	michael.roth@amd.com, wei.w.wang@intel.com,
	liam.merwick@oracle.com, isaku.yamahata@gmail.com,
	kirill.shutemov@linux.intel.com
Subject: Re: [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory
Date: Mon, 28 Aug 2023 22:56:12 +0000	[thread overview]
Message-ID: <diqzttsiu67n.fsf@ackerleytng-ctop.c.googlers.com> (raw)
In-Reply-To: <ZOO782YGRY0YMuPu@google.com> (message from Sean Christopherson on Mon, 21 Aug 2023 12:33:07 -0700)

Sean Christopherson <seanjc@google.com> writes:

> On Mon, Aug 21, 2023, Ackerley Tng wrote:
>> Sean Christopherson <seanjc@google.com> writes:
>>
>> > On Tue, Aug 15, 2023, Ackerley Tng wrote:
>> >> Sean Christopherson <seanjc@google.com> writes:
>> >> > Nullifying the KVM pointer isn't sufficient, because without additional actions
>> >> > userspace could extract data from a VM by deleting its memslots and then binding
>> >> > the guest_memfd to an attacker controlled VM.  Or more likely with TDX and SNP,
>> >> > induce badness by coercing KVM into mapping memory into a guest with the wrong
>> >> > ASID/HKID.
>> >> >
>> >> > I can think of three ways to handle that:
>> >> >
>> >> >   (a) prevent a different VM from *ever* binding to the gmem instance
>> >> >   (b) free/zero physical pages when unbinding
>> >> >   (c) free/zero when binding to a different VM
>> >> >
>> >> > Option (a) is easy, but that pretty much defeats the purpose of decopuling
>> >> > guest_memfd from a VM.
>> >> >
>> >> > Option (b) isn't hard to implement, but it screws up the lifecycle of the memory,
>> >> > e.g. would require memory when a memslot is deleted.  That isn't necessarily a
>> >> > deal-breaker, but it runs counter to how KVM memlots currently operate.  Memslots
>> >> > are basically just weird page tables, e.g. deleting a memslot doesn't have any
>> >> > impact on the underlying data in memory.  TDX throws a wrench in this as removing
>> >> > a page from the Secure EPT is effectively destructive to the data (can't be mapped
>> >> > back in to the VM without zeroing the data), but IMO that's an oddity with TDX and
>> >> > not necessarily something we want to carry over to other VM types.
>> >> >
>> >> > There would also be performance implications (probably a non-issue in practice),
>> >> > and weirdness if/when we get to sharing, linking and/or mmap()ing gmem.  E.g. what
>> >> > should happen if the last memslot (binding) is deleted, but there outstanding userspace
>> >> > mappings?
>> >> >
>> >> > Option (c) is better from a lifecycle perspective, but it adds its own flavor of
>> >> > complexity, e.g. the performant way to reclaim TDX memory requires the TDMR
>> >> > (effectively the VM pointer), and so a deferred relcaim doesn't really work for
>> >> > TDX.  And I'm pretty sure it *can't* work for SNP, because RMP entries must not
>> >> > outlive the VM; KVM can't reuse an ASID if there are pages assigned to that ASID
>> >> > in the RMP, i.e. until all memory belonging to the VM has been fully freed.
>
> ...
>
>> I agree with you that nulling the KVM pointer is insufficient to keep
>> host userspace out of the TCB. Among the three options (a) preventing a
>> different VM (HKID/ASID) from binding to the gmem instance, or zeroing
>> the memory either (b) on unbinding, or (c) on binding to another VM
>> (HKID/ASID),
>>
>> (a) sounds like adding a check issued to TDX/SNP upon binding and this
>>     check would just return OK for software-protected VMs (line of sight
>>     to removing host userspace from TCB).
>>
>> Or, we could go further for software-protected VMs and add tracking in
>> the inode to prevent the same inode from being bound to different
>> "HKID/ASID"s, perhaps like this:
>>
>> + On first binding, store the KVM pointer in the inode - not file (but
>>   not hold a refcount)
>> + On rebinding, check that the KVM matches the pointer in the inode
>> + On intra-host migration, update the KVM pointer in the inode to allow
>>   binding to the new struct kvm
>>
>> I think you meant associating the file with a struct kvm at creation
>> time as an implementation for (a), but technically since the inode is
>> the representation of memory, tracking of struct kvm should be with the
>> inode instead of the file.
>>
>> (b) You're right that this messes up the lifecycle of the memory and
>>     wouldn't work with intra-host migration.
>>
>> (c) sounds like doing the clearing on a check similar to that of (a)
>
> Sort of, though it's much nastier, because it requires the "old" KVM instance to
> be alive enough to support various operations.  I.e. we'd have to make stronger
> guarantees about exactly when the handoff/transition could happen.
>

Good point!

>> If we track struct kvm with the inode, then I think (a), (b) and (c) can
>> be independent of the refcounting method. What do you think?
>
> No go.  Because again, the inode (physical memory) is coupled to the virtual machine
> as a thing, not to a "struct kvm".  Or more concretely, the inode is coupled to an
> ASID or an HKID, and there can be multiple "struct kvm" objects associated with a
> single ASID.  And at some point in the future, I suspect we'll have multiple KVM
> objects per HKID too.
>
> The current SEV use case is for the migration helper, where two KVM objects share
> a single ASID (the "real" VM and the helper).  I suspect TDX will end up with
> similar behavior where helper "VMs" can use the HKID of the "real" VM.  For KVM,
> that means multiple struct kvm objects being associated with a single HKID.
>
> To prevent use-after-free, KVM "just" needs to ensure the helper instances can't
> outlive the real instance, i.e. can't use the HKID/ASID after the owning virtual
> machine has been destroyed.
>
> To put it differently, "struct kvm" is a KVM software construct that _usually_,
> but not always, is associated 1:1 with a virtual machine.
>
> And FWIW, stashing the pointer without holding a reference would not be a complete
> solution, because it couldn't guard against KVM reusing a pointer.  E.g. if a
> struct kvm was unbound and then freed, KVM could reuse the same memory for a new
> struct kvm, with a different ASID/HKID, and get a false negative on the rebinding
> check.

I agree that inode (physical memory) is coupled to the virtual machine
as a more generic concept.

I was hoping that in the absence of CC hardware providing a HKID/ASID,
the struct kvm pointer could act as a representation of the "virtual
machine". You're definitely right that KVM could reuse a pointer and so
that idea doesn't stand.

I thought about generating UUIDs to represent "virtual machines" in the
absence of CC hardware, and this UUID could be transferred during
intra-host migration, but this still doesn't take host userspace out of
the TCB. A malicious host VMM could just use the migration ioctl to copy
the UUID to a malicious dumper VM, which would then pass checks with a
gmem file linked to the malicious dumper VM. This is fine for HKID/ASIDs
because the memory is encrypted; with UUIDs there's no memory
encryption.

Circling back to the original topic, was associating the file with
struct kvm at gmem file creation time meant to constrain the use of the
gmem file to one struct kvm, or one virtual machine, or something else?

Follow up questions:

1. Since the physical memory's representation is the inode and should be
   coupled to the virtual machine (as a concept, not struct kvm), should
   the binding/coupling be with the file, or the inode?

2. Should struct kvm still be bound to the file/inode at gmem file
   creation time, since

   + struct kvm isn't a good representation of a "virtual machine"
   + we currently don't have anything that really represents a "virtual
     machine" without hardware support


I'd also like to bring up another userspace use case that Google has:
re-use of gmem files for rebooting guests when the KVM instance is
destroyed and rebuilt.

When rebooting a VM there are some steps relating to gmem that are
performance-sensitive:

a.      Zeroing pages from the old VM when we close a gmem file/inode
b. Deallocating pages from the old VM when we close a gmem file/inode
c.   Allocating pages for the new VM from the new gmem file/inode
d.      Zeroing pages on page allocation

We want to reuse the gmem file to save re-allocating pages (b. and c.),
and one of the two page zeroing allocations (a. or d.).

Binding the gmem file to a struct kvm on creation time means the gmem
file can't be reused with another VM on reboot. Also, host userspace is
forced to close the gmem file to allow the old VM to be freed.

For other places where files pin KVM, like the stats fd pinning vCPUs, I
guess that matters less since there isn't much of a penalty to close and
re-open the stats fd.

  reply	other threads:[~2023-08-28 22:57 UTC|newest]

Thread overview: 140+ messages / expand[flat|nested]  mbox.gz  Atom feed  top
2023-07-18 23:44 [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes Sean Christopherson
2023-07-18 23:44 ` [RFC PATCH v11 01/29] KVM: Wrap kvm_gfn_range.pte in a per-action union Sean Christopherson
2023-07-19 13:39   ` Jarkko Sakkinen
2023-07-19 15:39     ` Sean Christopherson
2023-07-19 16:55   ` Paolo Bonzini
2023-07-26 20:22     ` Sean Christopherson
2023-07-21  6:26   ` Yan Zhao
2023-07-21 10:45     ` Xu Yilun
2023-07-25 18:05       ` Sean Christopherson
2023-07-18 23:44 ` [RFC PATCH v11 02/29] KVM: Tweak kvm_hva_range and hva_handler_t to allow reusing for gfn ranges Sean Christopherson
2023-07-19 17:12   ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 03/29] KVM: Use gfn instead of hva for mmu_notifier_retry Sean Christopherson
2023-07-19 17:12   ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 04/29] KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIER Sean Christopherson
2023-07-19 17:34   ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 05/29] KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIER Sean Christopherson
2023-07-19  7:31   ` Yuan Yao
2023-07-19 14:15     ` Sean Christopherson
2023-07-20  1:15       ` Yuan Yao
2023-07-18 23:44 ` [RFC PATCH v11 06/29] KVM: Introduce KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
2023-07-21  9:03   ` Paolo Bonzini
2023-07-28  9:25   ` Quentin Perret
2023-07-29  0:03     ` Sean Christopherson
2023-07-31  9:30       ` Quentin Perret
2023-07-31 15:58       ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 07/29] KVM: Add KVM_EXIT_MEMORY_FAULT exit Sean Christopherson
2023-07-19  7:54   ` Yuan Yao
2023-07-19 14:16     ` Sean Christopherson
2023-07-18 23:44 ` [RFC PATCH v11 08/29] KVM: Introduce per-page memory attributes Sean Christopherson
2023-07-20  8:09   ` Yuan Yao
2023-07-20 19:02     ` Isaku Yamahata
2023-07-20 20:20       ` Sean Christopherson
2023-07-21 10:57   ` Paolo Bonzini
2023-07-21 15:56   ` Xiaoyao Li
2023-07-24  4:43   ` Xu Yilun
2023-07-26 15:59     ` Sean Christopherson
2023-07-27  3:24       ` Xu Yilun
2023-08-02 20:31   ` Isaku Yamahata
2023-08-14  0:44   ` Binbin Wu
2023-08-14 21:54     ` Sean Christopherson
2023-07-18 23:44 ` [RFC PATCH v11 09/29] KVM: x86: Disallow hugepages when memory attributes are mixed Sean Christopherson
2023-07-21 11:59   ` Paolo Bonzini
2023-07-21 17:41     ` Sean Christopherson
2023-07-18 23:44 ` [RFC PATCH v11 10/29] mm: Add AS_UNMOVABLE to mark mapping as completely unmovable Sean Christopherson
2023-07-25 10:24   ` Kirill A . Shutemov
2023-07-25 12:51     ` Matthew Wilcox
2023-07-26 11:36       ` Kirill A . Shutemov
2023-07-28 16:02       ` Vlastimil Babka
2023-07-28 16:13         ` Paolo Bonzini
2023-09-01  8:23       ` Vlastimil Babka
2023-07-18 23:44 ` [RFC PATCH v11 11/29] security: Export security_inode_init_security_anon() for use by KVM Sean Christopherson
2023-07-19  2:14   ` Paul Moore
2023-07-31 10:46   ` Vlastimil Babka
2023-07-18 23:44 ` [RFC PATCH v11 12/29] KVM: Add KVM_CREATE_GUEST_MEMFD ioctl() for guest-specific backing memory Sean Christopherson
2023-07-19 17:21   ` Vishal Annapurve
2023-07-19 17:47     ` Sean Christopherson
2023-07-20 14:45   ` Xiaoyao Li
2023-07-20 15:14     ` Sean Christopherson
2023-07-20 21:28   ` Isaku Yamahata
2023-07-21  6:13   ` Yuan Yao
2023-07-21 22:27     ` Isaku Yamahata
2023-07-21 22:33       ` Sean Christopherson
2023-07-21 15:05   ` Xiaoyao Li
2023-07-21 15:42     ` Xiaoyao Li
2023-07-21 17:42       ` Sean Christopherson
2023-07-21 17:17   ` Paolo Bonzini
2023-07-21 17:50     ` Sean Christopherson
2023-07-25 15:09   ` Wang, Wei W
2023-07-25 16:03     ` Sean Christopherson
2023-07-26  1:51       ` Wang, Wei W
2023-07-31 16:23       ` Fuad Tabba
2023-07-26 17:18   ` Elliot Berman
2023-07-26 19:28     ` Sean Christopherson
2023-07-27 10:39   ` Fuad Tabba
2023-07-27 17:13     ` Sean Christopherson
2023-07-31 13:46       ` Fuad Tabba
2023-08-03 19:15   ` Ryan Afranji
2023-08-07 23:06   ` Ackerley Tng
2023-08-08 21:13     ` Sean Christopherson
2023-08-10 23:57       ` Vishal Annapurve
2023-08-11 17:44         ` Sean Christopherson
2023-08-15 18:43       ` Ackerley Tng
2023-08-15 20:03         ` Sean Christopherson
2023-08-21 17:30           ` Ackerley Tng
2023-08-21 19:33             ` Sean Christopherson
2023-08-28 22:56               ` Ackerley Tng [this message]
2023-08-29  2:53                 ` Elliot Berman
2023-09-14 19:12                   ` Sean Christopherson
2023-09-14 18:15                 ` Sean Christopherson
2023-09-14 23:19                   ` Ackerley Tng
2023-09-15  0:33                     ` Sean Christopherson
2023-08-30 15:12   ` Binbin Wu
2023-08-30 16:44     ` Ackerley Tng
2023-09-01  3:45       ` Binbin Wu
2023-09-01 16:46         ` Ackerley Tng
2023-07-18 23:44 ` [RFC PATCH v11 13/29] KVM: Add transparent hugepage support for dedicated guest memory Sean Christopherson
2023-07-21 15:07   ` Paolo Bonzini
2023-07-21 17:13     ` Sean Christopherson
2023-09-06 22:10       ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 14/29] KVM: x86/mmu: Handle page fault for private memory Sean Christopherson
2023-07-21 15:09   ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 15/29] KVM: Drop superfluous __KVM_VCPU_MULTIPLE_ADDRESS_SPACE macro Sean Christopherson
2023-07-21 15:07   ` Paolo Bonzini
2023-07-18 23:44 ` [RFC PATCH v11 16/29] KVM: Allow arch code to track number of memslot address spaces per VM Sean Christopherson
2023-07-21 15:12   ` Paolo Bonzini
2023-07-18 23:45 ` [RFC PATCH v11 17/29] KVM: x86: Add support for "protected VMs" that can utilize private memory Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 18/29] KVM: selftests: Drop unused kvm_userspace_memory_region_find() helper Sean Christopherson
2023-07-21 15:14   ` Paolo Bonzini
2023-07-18 23:45 ` [RFC PATCH v11 19/29] KVM: selftests: Convert lib's mem regions to KVM_SET_USER_MEMORY_REGION2 Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 20/29] KVM: selftests: Add support for creating private memslots Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 21/29] KVM: selftests: Add helpers to convert guest memory b/w private and shared Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 22/29] KVM: selftests: Add helpers to do KVM_HC_MAP_GPA_RANGE hypercalls (x86) Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 23/29] KVM: selftests: Introduce VM "shape" to allow tests to specify the VM type Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 24/29] KVM: selftests: Add GUEST_SYNC[1-6] macros for synchronizing more data Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 25/29] KVM: selftests: Add x86-only selftest for private memory conversions Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 26/29] KVM: selftests: Add KVM_SET_USER_MEMORY_REGION2 helper Sean Christopherson
2023-07-18 23:45 ` [RFC PATCH v11 27/29] KVM: selftests: Expand set_memory_region_test to validate guest_memfd() Sean Christopherson
2023-08-07 23:17   ` Ackerley Tng
2023-07-18 23:45 ` [RFC PATCH v11 28/29] KVM: selftests: Add basic selftest for guest_memfd() Sean Christopherson
2023-08-07 23:20   ` Ackerley Tng
2023-08-18 23:03     ` Sean Christopherson
2023-08-07 23:25   ` Ackerley Tng
2023-08-18 23:01     ` Sean Christopherson
2023-08-21 19:49       ` Ackerley Tng
2023-07-18 23:45 ` [RFC PATCH v11 29/29] KVM: selftests: Test KVM exit behavior for private memory/access Sean Christopherson
2023-07-24  6:38 ` [RFC PATCH v11 00/29] KVM: guest_memfd() and per-page attributes Nikunj A. Dadhania
2023-07-24 17:00   ` Sean Christopherson
2023-07-26 11:20     ` Nikunj A. Dadhania
2023-07-26 14:24       ` Sean Christopherson
2023-07-27  6:42         ` Nikunj A. Dadhania
2023-08-03 11:03       ` Vlastimil Babka
2023-07-24 20:16 ` Sean Christopherson
2023-08-25 17:47 ` Sean Christopherson
2023-08-29  9:12   ` Chao Peng
2023-08-31 18:29     ` Sean Christopherson
2023-09-01  1:17       ` Chao Peng
2023-09-01  8:26         ` Vlastimil Babka
2023-09-01  9:10         ` Paolo Bonzini
2023-08-30  0:00   ` Isaku Yamahata
2023-09-09  0:16   ` Sean Christopherson

Reply instructions:

You may reply publicly to this message via plain-text email
using any one of the following methods:

* Save the following mbox file, import it into your mail client,
  and reply-to-all from there: mbox

  Avoid top-posting and favor interleaved quoting:
  https://en.wikipedia.org/wiki/Posting_style#Interleaved_style

* Reply using the --to, --cc, and --in-reply-to
  switches of git-send-email(1):

  git send-email \
    --in-reply-to=diqzttsiu67n.fsf@ackerleytng-ctop.c.googlers.com \
    --to=ackerleytng@google.com \
    --cc=akpm@linux-foundation.org \
    --cc=anup@brainfault.org \
    --cc=aou@eecs.berkeley.edu \
    --cc=chao.p.peng@linux.intel.com \
    --cc=chenhuacai@kernel.org \
    --cc=david@redhat.com \
    --cc=isaku.yamahata@gmail.com \
    --cc=jarkko@kernel.org \
    --cc=jmorris@namei.org \
    --cc=kirill.shutemov@linux.intel.com \
    --cc=kvm-riscv@lists.infradead.org \
    --cc=kvm@vger.kernel.org \
    --cc=kvmarm@lists.linux.dev \
    --cc=liam.merwick@oracle.com \
    --cc=linux-arm-kernel@lists.infradead.org \
    --cc=linux-fsdevel@vger.kernel.org \
    --cc=linux-kernel@vger.kernel.org \
    --cc=linux-mips@vger.kernel.org \
    --cc=linux-mm@kvack.org \
    --cc=linux-riscv@lists.infradead.org \
    --cc=linux-security-module@vger.kernel.org \
    --cc=linuxppc-dev@lists.ozlabs.org \
    --cc=mail@maciej.szmigiero.name \
    --cc=maz@kernel.org \
    --cc=michael.roth@amd.com \
    --cc=mpe@ellerman.id.au \
    --cc=oliver.upton@linux.dev \
    --cc=palmer@dabbelt.com \
    --cc=paul.walmsley@sifive.com \
    --cc=paul@paul-moore.com \
    --cc=pbonzini@redhat.com \
    --cc=qperret@google.com \
    --cc=seanjc@google.com \
    --cc=serge@hallyn.com \
    --cc=tabba@google.com \
    --cc=vannapurve@google.com \
    --cc=vbabka@suse.cz \
    --cc=wei.w.wang@intel.com \
    --cc=willy@infradead.org \
    --cc=yu.c.zhang@linux.intel.com \
    /path/to/YOUR_REPLY

  https://kernel.org/pub/software/scm/git/docs/git-send-email.html

* If your mail client supports setting the In-Reply-To header
  via mailto: links, try the mailto: link
Be sure your reply has a Subject: header at the top and a blank line before the message body.
This is a public inbox, see mirroring instructions
for how to clone and mirror all data and code used for this inbox;
as well as URLs for NNTP newsgroup(s).